from sklearn import datasets
iris = datasets.load_iris()
type(iris)

# sklearn.utils.Bunch

 

iris.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

 

iris.data
array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.2],
       [5. , 3.2, 1.2, 0.2],
       [5.5, 3.5, 1.3, 0.2],
       [4.9, 3.6, 1.4, 0.1],
       [4.4, 3. , 1.3, 0.2],
       [5.1, 3.4, 1.5, 0.2],
       [5. , 3.5, 1.3, 0.3],
       [4.5, 2.3, 1.3, 0.3],
       [4.4, 3.2, 1.3, 0.2],
       [5. , 3.5, 1.6, 0.6],
       [5.1, 3.8, 1.9, 0.4],
       [4.8, 3. , 1.4, 0.3],
       [5.1, 3.8, 1.6, 0.2],
       [4.6, 3.2, 1.4, 0.2],
       [5.3, 3.7, 1.5, 0.2],
       [5. , 3.3, 1.4, 0.2],
       [7. , 3.2, 4.7, 1.4],
       [6.4, 3.2, 4.5, 1.5],
       [6.9, 3.1, 4.9, 1.5],
       [5.5, 2.3, 4. , 1.3],
       [6.5, 2.8, 4.6, 1.5],
       [5.7, 2.8, 4.5, 1.3],
       [6.3, 3.3, 4.7, 1.6],
       [4.9, 2.4, 3.3, 1. ],
       [6.6, 2.9, 4.6, 1.3],
       [5.2, 2.7, 3.9, 1.4],
       [5. , 2. , 3.5, 1. ],
       [5.9, 3. , 4.2, 1.5],
       [6. , 2.2, 4. , 1. ],
       [6.1, 2.9, 4.7, 1.4],
       [5.6, 2.9, 3.6, 1.3],
       [6.7, 3.1, 4.4, 1.4],
       [5.6, 3. , 4.5, 1.5],
       [5.8, 2.7, 4.1, 1. ],
       [6.2, 2.2, 4.5, 1.5],
       [5.6, 2.5, 3.9, 1.1],
       [5.9, 3.2, 4.8, 1.8],
       [6.1, 2.8, 4. , 1.3],
       [6.3, 2.5, 4.9, 1.5],
       [6.1, 2.8, 4.7, 1.2],
       [6.4, 2.9, 4.3, 1.3],
       [6.6, 3. , 4.4, 1.4],
       [6.8, 2.8, 4.8, 1.4],
       [6.7, 3. , 5. , 1.7],
       [6. , 2.9, 4.5, 1.5],
       [5.7, 2.6, 3.5, 1. ],
       [5.5, 2.4, 3.8, 1.1],
       [5.5, 2.4, 3.7, 1. ],
       [5.8, 2.7, 3.9, 1.2],
       [6. , 2.7, 5.1, 1.6],
       [5.4, 3. , 4.5, 1.5],
       [6. , 3.4, 4.5, 1.6],
       [6.7, 3.1, 4.7, 1.5],
       [6.3, 2.3, 4.4, 1.3],
       [5.6, 3. , 4.1, 1.3],
       [5.5, 2.5, 4. , 1.3],
       [5.5, 2.6, 4.4, 1.2],
       [6.1, 3. , 4.6, 1.4],
       [5.8, 2.6, 4. , 1.2],
       [5. , 2.3, 3.3, 1. ],
       [5.6, 2.7, 4.2, 1.3],
       [5.7, 3. , 4.2, 1.2],
       [5.7, 2.9, 4.2, 1.3],
       [6.2, 2.9, 4.3, 1.3],
       [5.1, 2.5, 3. , 1.1],
       [5.7, 2.8, 4.1, 1.3],
       [6.3, 3.3, 6. , 2.5],
       [5.8, 2.7, 5.1, 1.9],
       [7.1, 3. , 5.9, 2.1],
       [6.3, 2.9, 5.6, 1.8],
       [6.5, 3. , 5.8, 2.2],
       [7.6, 3. , 6.6, 2.1],
       [4.9, 2.5, 4.5, 1.7],
       [7.3, 2.9, 6.3, 1.8],
       [6.7, 2.5, 5.8, 1.8],
       [7.2, 3.6, 6.1, 2.5],
       [6.5, 3.2, 5.1, 2. ],
       [6.4, 2.7, 5.3, 1.9],
       [6.8, 3. , 5.5, 2.1],
       [5.7, 2.5, 5. , 2. ],
       [5.8, 2.8, 5.1, 2.4],
       [6.4, 3.2, 5.3, 2.3],
       [6.5, 3. , 5.5, 1.8],
       [7.7, 3.8, 6.7, 2.2],
       [7.7, 2.6, 6.9, 2.3],
       [6. , 2.2, 5. , 1.5],
       [6.9, 3.2, 5.7, 2.3],
       [5.6, 2.8, 4.9, 2. ],
       [7.7, 2.8, 6.7, 2. ],
       [6.3, 2.7, 4.9, 1.8],
       [6.7, 3.3, 5.7, 2.1],
       [7.2, 3.2, 6. , 1.8],
       [6.2, 2.8, 4.8, 1.8],
       [6.1, 3. , 4.9, 1.8],
       [6.4, 2.8, 5.6, 2.1],
       [7.2, 3. , 5.8, 1.6],
       [7.4, 2.8, 6.1, 1.9],
       [7.9, 3.8, 6.4, 2. ],
       [6.4, 2.8, 5.6, 2.2],
       [6.3, 2.8, 5.1, 1.5],
       [6.1, 2.6, 5.6, 1.4],
       [7.7, 3. , 6.1, 2.3],
       [6.3, 3.4, 5.6, 2.4],
       [6.4, 3.1, 5.5, 1.8],
       [6. , 3. , 4.8, 1.8],
       [6.9, 3.1, 5.4, 2.1],
       [6.7, 3.1, 5.6, 2.4],
       [6.9, 3.1, 5.1, 2.3],
       [5.8, 2.7, 5.1, 1.9],
       [6.8, 3.2, 5.9, 2.3],
       [6.7, 3.3, 5.7, 2.5],
       [6.7, 3. , 5.2, 2.3],
       [6.3, 2.5, 5. , 1.9],
       [6.5, 3. , 5.2, 2. ],
       [6.2, 3.4, 5.4, 2.3],
       [5.9, 3. , 5.1, 1.8]])

 

import pandas as pd
labels = pd.DataFrame(iris.target)
labels.columns = ['labels']
labels.head()

 

labels['labels'].unique()

array([0, 1, 2])

 

labels['labels'].value_counts()

2    50
1    50
0    50
Name: labels, dtype: int64

 

data = pd.DataFrame(iris.data)
data.columns = ['Sepal length', 'Sepal width','Petal length','Petal width']
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Sepal length  150 non-null    float64
 1   Sepal width   150 non-null    float64
 2   Petal length  150 non-null    float64
 3   Petal width   150 non-null    float64
dtypes: float64(4)
memory usage: 4.8 KB

 

 

feature = data[['Sepal length','Sepal width']]
feature.head()

	Sepal length	Sepal width
0	5.1	3.5
1	4.9	3.0
2	4.7	3.2
3	4.6	3.1
4	5.0	3.6

 

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns
mo = KMeans(n_clusters=3, algorithm = 'auto')
mo.fit(feature)
predict = pd.DataFrame(mo.predict(feature))
predict.columns = ['predict']
r = pd.concat([feature, predict], axis = 1)
r.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Sepal length  150 non-null    float64
 1   Sepal width   150 non-null    float64
 2   predict       150 non-null    int32  
dtypes: float64(2), int32(1)
memory usage: 3.1 KB
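Here n_clusters=3 is known in advance from the three iris species. When the cluster count is unknown, a common heuristic is to fit k-means over a range of k and look for the elbow in the inertia curve; a minimal sketch, not part of the original run:

# within-cluster sum of squares (inertia) for k = 1..9
inertias = [KMeans(n_clusters=k).fit(feature).inertia_ for k in range(1, 10)]
plt.plot(range(1, 10), inertias, marker='o')
plt.xlabel('k')
plt.ylabel('inertia')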

 

# plot of the predicted clusters
plt.scatter(r['Sepal length'], r['Sepal width'], c = r['predict'], alpha=0.5)

 

 

# plot of the actual species labels (attach them to data first)
data['labels'] = labels['labels']
plt.scatter(data['Sepal length'], data['Sepal width'], c =data['labels'], alpha=0.5)

 

from sklearn.metrics import confusion_matrix, accuracy_score
print(accuracy_score(data['labels'].values, r['predict'].values))
print(confusion_matrix(data['labels'].values, r['predict'].values))

0.08
[[ 0  0 50]
 [38 12  0]
 [15 35  0]]
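The 0.08 looks alarming, but k-means assigns arbitrary cluster IDs (the matrix above shows class 0 landing entirely in cluster 2, for instance), so scoring the raw cluster numbers against the species labels is not meaningful. A sketch that remaps each cluster to the majority true label inside it before scoring; the bincount/argmax remapping is one simple way to do this, not the only one:

import numpy as np

pred = r['predict'].values
true = data['labels'].values
mapped = np.zeros_like(pred)
for cluster in np.unique(pred) :
    mask = pred == cluster
    mapped[mask] = np.bincount(true[mask]).argmax() # majority species in this cluster
print(accuracy_score(true, mapped))
print(confusion_matrix(true, mapped))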

 


import pandas as pd
df = pd.read_csv('review_data.csv')
df

	score	review	y
0	1	예약할 때는 룸을 주기로 하고 홀을 주고, 덥고, 직원들이 정신이 없어 그 가격에 ...	0
1	5	점심식사 잘했던곳.후식커피한잔 하기도 좋고 주차가능합니다. 음식 맛있고 직원분 친절...	1
2	5	新鮮でおいしいです。	1
3	4	녹는다 녹아	1
4	4	NaN	1
...	...	...	...
75	2	이렇게 대기가 긴 맛집인줄 모르고 갔다가 엄청 기다림 예써라는 어플로 대기 하던데 ...	0
76	1	단짠의 정석. 진짜 정석으로 달고 짬. 질리는 맛. 사장님이랑 와이프로 추정되는 ...	0
77	4	만족스러움! 맛있어용	1
78	1	곱창은 없고 대창만 들어있어서 느끼한데 양념은 너무 매워서 위에 탈이나 고생했습니다ㅠㅠ	0
79	5	대창덮밥도 맛있고 곱도리탕도 맛나요 완전 소주각입니다. 자리가 쫍아서 테이블마다 ...	1
80 rows × 3 columns

 

import re
def text_cleaning(text) :
    # keep only spaces and Hangul; strip everything else
    hangul = re.compile('[^ ㄱ-ㅣ가-힣]+')
    result = hangul.sub('', text)
    return result
text_cleaning("abc가나다123 라마사아 123")

'가나다 라마사아 '

 

df['ko_text'] = df['review'].apply(lambda x : text_cleaning(str(x))) # str() guards against NaN reviews
df['ko_text']

0     예약할 때는 룸을 주기로 하고 홀을 주고 덥고 직원들이 정신이 없어 그 가격에 내가...
1     점심식사 잘했던곳후식커피한잔 하기도 좋고 주차가능합니다 음식 맛있고 직원분 친절하여...
2                                                      
3                                                녹는다 녹아
4                                                      
                            ...                        
75    이렇게 대기가 긴 맛집인줄 모르고 갔다가 엄청 기다림 예써라는 어플로 대기 하던데 ...
76    단짠의 정석 진짜 정석으로 달고 짬 질리는 맛  사장님이랑 와이프로 추정되는 서빙해...
77                                           만족스러움 맛있어용
78    곱창은 없고 대창만 들어있어서 느끼한데 양념은 너무 매워서 위에 탈이나 고생했습니다ㅠㅠ 
79    대창덮밥도 맛있고 곱도리탕도 맛나요 완전 소주각입니다  자리가 쫍아서 테이블마다 가...
Name: ko_text, Length: 80, dtype: object

 

df['review'].head()

0    예약할 때는 룸을 주기로 하고 홀을 주고, 덥고, 직원들이 정신이 없어 그 가격에 ...
1    점심식사 잘했던곳.후식커피한잔 하기도 좋고 주차가능합니다. 음식 맛있고 직원분 친절...
2                                           新鮮でおいしいです。
3                                               녹는다 녹아
4                                                  NaN
Name: review, dtype: object

 

# keep rows whose cleaned text is non-empty (65 of 80 remain);
# note the steps below continue to work on the full df
df1 = df.loc[df['ko_text'].apply(lambda x : len(x)) > 0]
df1.isnull().value_counts()

score  review  y      ko_text
False  False   False  False      65
dtype: int64

 

del df['review']
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80 entries, 0 to 79
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   score    80 non-null     int64 
 1   y        80 non-null     int64 
 2   ko_text  80 non-null     object
dtypes: int64(2), object(1)
memory usage: 2.0+ KB

 

from konlpy.tag import Okt

# extract morphemes from the text
def get_pos(x) :
    tagger = Okt()
    pos = tagger.pos(x)
    # word : token produced by the konlpy morpheme analyzer
    # tag  : its part-of-speech tag
    pos = ['{0}/{1}'.format(word, tag) for word, tag in pos]
    return pos

result = get_pos(df['ko_text'].values[0])
print(result)

['예약/Noun', '할/Verb', '때/Noun', '는/Josa', '룸/Noun', '을/Josa', '주기/Noun', '로/Josa', '하고/Verb', '홀/Noun', '을/Josa', '주고/Verb', '덥고/Adjective', '직원/Noun', '들/Suffix', '이/Josa', '정신/Noun', '이/Josa', '없어/Adjective', '그/Noun', '가격/Noun', '에/Josa', '내/Noun', '가/Josa', '직접/Noun', '구워/Verb', '먹고/Verb', '갈비살/Noun', '등심/Noun', '은/Josa', '질/Noun', '기고/Noun', '냉면/Noun', '은/Josa', '맛/Noun', '이/Josa', '없고/Adjective', '장어/Noun', '양념/Noun', '들/Suffix', '도/Josa', '제/Noun', '때/Noun', '안/Noun', '가져다/Verb', '주고/Verb', '회식/Noun', '으로/Josa', '한/Determiner', '시간/Noun', '만에/Josa', '만원/Noun', '을/Josa', '썼는데/Verb', '이런/Adjective', '경험/Noun', '처음/Noun', '입니다/Adjective']

 

from sklearn.feature_extraction.text import CountVectorizer
# build the corpus index : each 'word/POS' token becomes one feature column
index_vectorizer = CountVectorizer(tokenizer = lambda x : get_pos(x))
x = index_vectorizer.fit_transform(df['ko_text'].tolist())
x.shape

# (80, 779)
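Each of the 779 columns corresponds to one 'word/POS' token; the fitted vectorizer keeps the token-to-column mapping in vocabulary_, so a quick peek looks like:

# a few token -> column-index pairs from the fitted vocabulary
for token, idx in list(index_vectorizer.vocabulary_.items())[:5] :
    print(token, idx)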

 

from sklearn.feature_extraction.text import TfidfTransformer
# re-weight the raw token counts with TF-IDF
tfidf_vectorizer =  TfidfTransformer()
x = tfidf_vectorizer.fit_transform(x)
print(x.shape)

# (80, 779)

 

# positive/negative review classification
# split the dataset
from sklearn.model_selection import train_test_split
y = df['y']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.30)

 

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(random_state = 0)
lr.fit(x_train, y_train)
y_pred = lr.predict(x_test)

 

x_train.shape
# (56, 779)

len(lr.coef_[0])
# 779

 

import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [10, 8]
plt.bar(range(len(lr.coef_[0])), lr.coef_[0])

 

# lr.coef_[0] sorted in descending order
# top positive values : words carrying positive (favorable) weight
sorted(((value, index) for index, value in enumerate(lr.coef_[0])), reverse=True)[:5]

[(0.31921037916122147, 269),
 (0.31181674718077157, 266),
 (0.31181674718077157, 2),
 (0.22722099938708767, 778),
 (0.22499528665817484, 719)]

 

# bottom 5 : words carrying the most negative weight
sorted(((value, index) for index, value in enumerate(lr.coef_[0])), reverse=True)[-5:]

[(-0.3303223649310512, 736),
 (-0.35074686047120107, 374),
 (-0.35074686047120107, 80),
 (-0.3756982096297823, 538),
 (-0.3907079128326151, 147)]

 

coef_pos_index = sorted(((value, index) for index, value in enumerate(lr.coef_[0])), reverse=True)
# invert the vocabulary : feature index -> 'word/POS' token
invert_index_vectorizer = { v : k for k, v in index_vectorizer.vocabulary_.items()}
cnt = 0
for k, v in index_vectorizer.vocabulary_.items() :
    print(k, v)
    cnt += 1
    if cnt >= 10 :
        break
# index_vectorizer.vocabulary_ : token -> feature index
# invert_index_vectorizer      : feature index -> token ('word/POS')

예약/Noun 504
할/Verb 743
때/Noun 224
는/Josa 162
룸/Noun 236
을/Josa 538
주기/Noun 631
로/Josa 235
하고/Verb 721
홀/Noun 769

 

# map the regression coefficients back through the vocabulary to see
# which morphemes carry the weight
for coef in coef_pos_index[:20] :
    print(invert_index_vectorizer[coef[1]], coef[0]) # token, weight

맛있어요/Adjective 0.31921037916122147
맛있댜/Noun 0.31181674718077157
ㅈㅁㅌㅌㄱㄹ/KoreanParticle 0.31181674718077157
흠/Noun 0.22722099938708767
하/Suffix 0.22499528665817484
비싸다으/Adjective 0.22048773641905486
맛잇으느/Noun 0.22048773641905486
녹아/Verb 0.22048773641905486
녹는다/Verb 0.22048773641905486
탕/Noun 0.2164660184839546
도리/Noun 0.2164660184839546
아이스크림/Noun 0.21489782468607826
후식/Noun 0.2005066237398065
매번/Noun 0.19501461266793108
맛있어용/Adjective 0.19425368225749348
만족스러/Adjective 0.19425368225749348
삼겹/Noun 0.19379228613545138
떡/Noun 0.19269205269234155
맛있네요/Adjective 0.19081949184719
닭갈비/Noun 0.1884669938903386

 

for coef in coef_pos_index[-20:] :
    print(invert_index_vectorizer[coef[1]], coef[0])
    
할말은/Verb -0.24837542600818435
않습니다/Verb -0.24837542600818435
많지만/Adjective -0.24837542600818435
그냥/Noun -0.2558545368223156
내/Noun -0.2666707017208134
먹기/Noun -0.2691943240304982
불친절해요/Adjective -0.282816560343047
해줌/Verb -0.2946023898159785
편하게/Adjective -0.2946023898159785
ㅜㅜ/KoreanParticle -0.29915590661697383
요/Josa -0.30167857441633256
평범함/Adjective -0.32572092898704597
무질/Noun -0.3292976179312531
너/Modifier -0.3292976179312531
겨/Noun -0.3292976179312531
하지/Verb -0.3303223649310512
비싸긴한데/Adjective -0.35074686047120107
괜찮아요/Adjective -0.35074686047120107
을/Josa -0.3756982096297823
너무/Adverb -0.3907079128326151

 

# nouns : the most positively / negatively weighted words
noun_list=[]
for coef in coef_pos_index :
    category = invert_index_vectorizer[coef[1]].split("/")[1] # take the POS tag
    if category == 'Noun' :
        noun_list.append((invert_index_vectorizer[coef[1]], coef[0]))
noun_list[:10]

[('맛있댜/Noun', 0.31181674718077157),
 ('흠/Noun', 0.22722099938708767),
 ('맛잇으느/Noun', 0.22048773641905486),
 ('탕/Noun', 0.2164660184839546),
 ('도리/Noun', 0.2164660184839546),
 ('아이스크림/Noun', 0.21489782468607826),
 ('후식/Noun', 0.2005066237398065),
 ('매번/Noun', 0.19501461266793108),
 ('삼겹/Noun', 0.19379228613545138),
 ('떡/Noun', 0.19269205269234155)]
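coef_pos_index is sorted by descending weight, so the tail of the filtered list holds the most negative nouns:

noun_list[-10:] # the 10 nouns with the most negative weights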

 

# adjectives : the most positively / negatively weighted words
adjective_list=[]
for coef in coef_pos_index :
    category = invert_index_vectorizer[coef[1]].split("/")[1] # take the POS tag
    if category == 'Adjective' :
        adjective_list.append((invert_index_vectorizer[coef[1]], coef[0]))
adjective_list[:10]      

[('맛있어요/Adjective', 0.31921037916122147),
 ('비싸다으/Adjective', 0.22048773641905486),
 ('맛있어용/Adjective', 0.19425368225749348),
 ('만족스러/Adjective', 0.19425368225749348),
 ('맛있네요/Adjective', 0.19081949184719),
 ('맛있고/Adjective', 0.16450683695840304),
 ('맛있게/Adjective', 0.16330050009866345),
 ('좋음/Adjective', 0.1431229621617376),
 ('정갈하게/Adjective', 0.1431229621617376),
 ('비싸지만/Adjective', 0.1431229621617376)]

 


Sentiment analysis

import pandas as pd
df = pd.read_csv('review_data.csv')
df

	score	review	y
0	1	예약할 때는 룸을 주기로 하고 홀을 주고, 덥고, 직원들이 정신이 없어 그 가격에 ...	0
1	5	점심식사 잘했던곳.후식커피한잔 하기도 좋고 주차가능합니다. 음식 맛있고 직원분 친절...	1
2	5	新鮮でおいしいです。	1
3	4	녹는다 녹아	1
4	4	NaN	1
...	...	...	...
75	2	이렇게 대기가 긴 맛집인줄 모르고 갔다가 엄청 기다림 예써라는 어플로 대기 하던데 ...	0
76	1	단짠의 정석. 진짜 정석으로 달고 짬. 질리는 맛. 사장님이랑 와이프로 추정되는 ...	0
77	4	만족스러움! 맛있어용	1
78	1	곱창은 없고 대창만 들어있어서 느끼한데 양념은 너무 매워서 위에 탈이나 고생했습니다ㅠㅠ	0
79	5	대창덮밥도 맛있고 곱도리탕도 맛나요 완전 소주각입니다. 자리가 쫍아서 테이블마다 ...	1
80 rows × 3 columns

 

import re
def text_cleaning(text) :
    # keep only spaces and Hangul; strip everything else
    hangul = re.compile('[^ ㄱ-ㅣ가-힣]+')
    result = hangul.sub('', text)
    return result
text_cleaning("abc가나다123 라마사아 123")

'가나다 라마사아 '

 

df['ko_text'] = df['review'].apply(lambda x : text_cleaning(str(x))) # str() guards against NaN reviews
df['ko_text']

0     예약할 때는 룸을 주기로 하고 홀을 주고 덥고 직원들이 정신이 없어 그 가격에 내가...
1     점심식사 잘했던곳후식커피한잔 하기도 좋고 주차가능합니다 음식 맛있고 직원분 친절하여...
2                                                      
3                                                녹는다 녹아
4                                                      
                            ...                        
75    이렇게 대기가 긴 맛집인줄 모르고 갔다가 엄청 기다림 예써라는 어플로 대기 하던데 ...
76    단짠의 정석 진짜 정석으로 달고 짬 질리는 맛  사장님이랑 와이프로 추정되는 서빙해...
77                                           만족스러움 맛있어용
78    곱창은 없고 대창만 들어있어서 느끼한데 양념은 너무 매워서 위에 탈이나 고생했습니다ㅠㅠ 
79    대창덮밥도 맛있고 곱도리탕도 맛나요 완전 소주각입니다  자리가 쫍아서 테이블마다 가...
Name: ko_text, Length: 80, dtype: object

 

df['review'].head()

0    예약할 때는 룸을 주기로 하고 홀을 주고, 덥고, 직원들이 정신이 없어 그 가격에 ...
1    점심식사 잘했던곳.후식커피한잔 하기도 좋고 주차가능합니다. 음식 맛있고 직원분 친절...
2                                           新鮮でおいしいです。
3                                               녹는다 녹아
4                                                  NaN
Name: review, dtype: object

 

# keep rows whose cleaned text is non-empty (65 of 80 remain);
# note the steps below continue to work on the full df
df1 = df.loc[df['ko_text'].apply(lambda x : len(x)) > 0]
df1.isnull().value_counts()

score  review  y      ko_text
False  False   False  False      65
dtype: int64

 

del df['review']
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80 entries, 0 to 79
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   score    80 non-null     int64 
 1   y        80 non-null     int64 
 2   ko_text  80 non-null     object
dtypes: int64(2), object(1)
memory usage: 2.0+ KB

 

from konlpy.tag import Okt

 

# extract morphemes from the text
def get_pos(x) :
    tagger = Okt()
    pos = tagger.pos(x)
    # word : token produced by the konlpy morpheme analyzer
    # tag  : its part-of-speech tag
    pos = ['{0}/{1}'.format(word, tag) for word, tag in pos]
    return pos

result = get_pos(df['ko_text'].values[0])
print(result)

['예약/Noun', '할/Verb', '때/Noun', '는/Josa', '룸/Noun', '을/Josa', '주기/Noun', '로/Josa', '하고/Verb', '홀/Noun', '을/Josa', '주고/Verb', '덥고/Adjective', '직원/Noun', '들/Suffix', '이/Josa', '정신/Noun', '이/Josa', '없어/Adjective', '그/Noun', '가격/Noun', '에/Josa', '내/Noun', '가/Josa', '직접/Noun', '구워/Verb', '먹고/Verb', '갈비살/Noun', '등심/Noun', '은/Josa', '질/Noun', '기고/Noun', '냉면/Noun', '은/Josa', '맛/Noun', '이/Josa', '없고/Adjective', '장어/Noun', '양념/Noun', '들/Suffix', '도/Josa', '제/Noun', '때/Noun', '안/Noun', '가져다/Verb', '주고/Verb', '회식/Noun', '으로/Josa', '한/Determiner', '시간/Noun', '만에/Josa', '만원/Noun', '을/Josa', '썼는데/Verb', '이런/Adjective', '경험/Noun', '처음/Noun', '입니다/Adjective']

 

from sklearn.feature_extraction.text import CountVectorizer
# build the corpus index : each 'word/POS' token becomes one feature column
index_vectorizer = CountVectorizer(tokenizer = lambda x : get_pos(x))
x = index_vectorizer.fit_transform(df['ko_text'].tolist())
x.shape

# (80, 779)

 

# each row of x is itself a 1x779 sparse matrix, so every entry prints as (0, column)
for a in x[:10] :
    print(a)
    
(0, 504)	1
  (0, 743)	1
  (0, 224)	2
  (0, 162)	1
  (0, 236)	1
  (0, 538)	3
  (0, 631)	1
  (0, 235)	1
  (0, 721)	1
  (0, 769)	1
  (0, 629)	2
  (0, 189)	1
  (0, 650)	1
  (0, 210)	2
  (0, 546)	3
  (0, 609)	1
  (0, 485)	1
  (0, 97)	1
  (0, 18)	1
  (0, 491)	1
  (0, 141)	1
  (0, 13)	1
  (0, 651)	1
  (0, 87)	1
  (0, 281)	1
  (0, 34)	1
  (0, 222)	1
  (0, 537)	2
  (0, 653)	1
  (0, 107)	1
  (0, 145)	1
  (0, 258)	1
  (0, 481)	1
  (0, 588)	1
  (0, 468)	1
  (0, 192)	1
  (0, 610)	1
  (0, 453)	1
  (0, 29)	1
  (0, 772)	1
  (0, 536)	1
  (0, 738)	1
  (0, 417)	1
  (0, 250)	1
  (0, 251)	1
  (0, 439)	1
  (0, 551)	1
  (0, 61)	1
  (0, 672)	1
  (0, 573)	1
  (0, 650)	1
  (0, 13)	1
  (0, 604)	1
  (0, 585)	1
  (0, 761)	1
  (0, 79)	1
  (0, 776)	1
  (0, 691)	1
  (0, 723)	1
  (0, 618)	1
  (0, 635)	1
  (0, 22)	1
  (0, 540)	1
  (0, 261)	1
  (0, 363)	1
  (0, 689)	1
  (0, 600)	1
  (0, 321)	1
  (0, 648)	1

  (0, 154)	1
  (0, 155)	1


  (0, 162)	1
  (0, 546)	1
  (0, 491)	1
  (0, 192)	1
  (0, 251)	1
  (0, 672)	1
  (0, 529)	1
  (0, 238)	1
  (0, 451)	1
  (0, 454)	2
  (0, 443)	1
  (0, 444)	1
  (0, 506)	1
  (0, 247)	1
  (0, 516)	1
  (0, 397)	1
  (0, 49)	1
  (0, 129)	1
  (0, 641)	1
  (0, 333)	1
  (0, 66)	1
  (0, 511)	1
  (0, 116)	1
  (0, 54)	2
  (0, 318)	1
  (0, 643)	1
  (0, 509)	1
  (0, 460)	1
  (0, 547)	1
  (0, 58)	1
  (0, 409)	1
  (0, 569)	1
  (0, 71)	1
  (0, 446)	1
  (0, 301)	1
  (0, 265)	1
  (0, 649)	1
  (0, 191)	1
  (0, 168)	1
  (0, 510)	1
  (0, 48)	1
  (0, 660)	1
  (0, 389)	1
  (0, 657)	1
  (0, 186)	1
  (0, 132)	1
  (0, 538)	2
  (0, 454)	1
  (0, 66)	1
  (0, 450)	1
  (0, 300)	1
  (0, 246)	1
  (0, 527)	1
  (0, 477)	1
  (0, 237)	1
  (0, 285)	2
  (0, 62)	1
  (0, 88)	1
  (0, 337)	1
  (0, 159)	1
  (0, 314)	1
  (0, 352)	1
  (0, 652)	1
  (0, 373)	1
  (0, 437)	1
  (0, 142)	1
  (0, 663)	1
  (0, 637)	1
  (0, 382)	1
  (0, 504)	1
  (0, 546)	2
  (0, 491)	2
  (0, 536)	1
  (0, 49)	1
  (0, 531)	1
  (0, 178)	1
  (0, 599)	1
  (0, 326)	1
  (0, 628)	1
  (0, 297)	1
  (0, 577)	1
  (0, 68)	1
  (0, 457)	1
  (0, 483)	1
  (0, 746)	1
  (0, 669)	1
  (0, 597)	1
  (0, 690)	1
  (0, 494)	1
  (0, 463)	1
  (0, 632)	1
  (0, 239)	1
  (0, 165)	1
  (0, 695)	1
  (0, 213)	1
  (0, 367)	1
  (0, 296)	1
  (0, 298)	1
  (0, 475)	1
  (0, 727)	1
  (0, 713)	1
  (0, 399)	1
  (0, 702)	1
  (0, 412)	1
  (0, 182)	1
  (0, 567)	1
  (0, 255)	1
  (0, 358)	1
  (0, 346)	1
  (0, 18)	1
  (0, 13)	1
  (0, 573)	1
  (0, 397)	1
  (0, 129)	1
  (0, 182)	1
  (0, 428)	1
  (0, 774)	1
  (0, 542)	1
  (0, 147)	1
  (0, 339)	1

 

print(str(index_vectorizer.vocabulary_)[:60]+"..")

{'예약/Noun': 504, '할/Verb': 743, '때/Noun': 224, '는/Josa': 162..

# TF-IDF transformation

# TF  : term frequency. if '맛집' appears 3 times in one document, TF = 3
# IDF : inverse document frequency. the rarer a word is across the whole
#       corpus, the higher its IDF
# TF-IDF : a word that is rare across the corpus but frequent in the current
#          document is treated as important for that document
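For reference, TfidfTransformer with its defaults (smooth_idf=True, norm='l2') computes idf(t) = ln((1 + n) / (1 + df(t))) + 1 and then L2-normalizes each row. A hand-rolled sketch of the first row, assuming it is run before the transform below while x still holds the raw counts:

import numpy as np

counts = x.toarray()               # dense copy of the CountVectorizer output
n_docs = counts.shape[0]
df_t = (counts > 0).sum(axis=0)    # document frequency of each term
idf = np.log((1 + n_docs) / (1 + df_t)) + 1
row0 = counts[0] * idf
row0 = row0 / np.linalg.norm(row0) # matches TfidfTransformer(norm='l2')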

 

from sklearn.feature_extraction.text import TfidfTransformer
tfidf_vectorizer =  TfidfTransformer()
x = tfidf_vectorizer.fit_transform(x)
print(x.shape)
print(x[0])

(80, 779)
  (0, 772)	0.13918867813287145
  (0, 769)	0.13918867813287145
  (0, 743)	0.13918867813287145
  (0, 738)	0.12718431152605908
  (0, 721)	0.12718431152605908
  (0, 672)	0.10666271126619092
  (0, 653)	0.12718431152605908
  (0, 651)	0.12718431152605908
  (0, 650)	0.09465834465937853
  (0, 631)	0.13918867813287145
  (0, 629)	0.2783773562657429
  (0, 610)	0.12718431152605908
  (0, 609)	0.13918867813287145
  (0, 588)	0.13918867813287145
  (0, 573)	0.11206059819730396
  (0, 551)	0.11866707787300333
  (0, 546)	0.22748699966260583
  (0, 538)	0.31998813379857277
  (0, 537)	0.17228222201264556
  (0, 536)	0.09814547761313519
  (0, 504)	0.12718431152605908
  (0, 491)	0.07253600802895468
  (0, 485)	0.12718431152605908
  (0, 481)	0.11866707787300333
  (0, 468)	0.11866707787300333
  (0, 453)	0.11206059819730396
  (0, 439)	0.13918867813287145
  (0, 417)	0.11866707787300333
  (0, 281)	0.11206059819730396
  (0, 258)	0.07762387735326703
  (0, 251)	0.11866707787300333
  (0, 250)	0.13918867813287145
  (0, 236)	0.13918867813287145
  (0, 235)	0.10209886289470016
  (0, 224)	0.19629095522627038
  (0, 222)	0.12718431152605908
  (0, 210)	0.21332542253238185
  (0, 192)	0.07101739767756768
  (0, 189)	0.13918867813287145
  (0, 162)	0.08377133371130897
  (0, 145)	0.13918867813287145
  (0, 141)	0.12718431152605908
  (0, 107)	0.12718431152605908
  (0, 97)	0.11206059819730396
  (0, 87)	0.12718431152605908
  (0, 61)	0.13918867813287145
  (0, 34)	0.13918867813287145
  (0, 29)	0.13918867813287145
  (0, 18)	0.11866707787300333
  (0, 13)	0.08377133371130897

 

# positive/negative review classification
# split the dataset
from sklearn.model_selection import train_test_split
y = df['y']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3)

x_train.shape
# (56, 779)
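Note the split above is unseeded, so every score below will vary a little from run to run; fixing random_state makes the numbers reproducible:

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 0)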

 

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(random_state = 0)
lr.fit(x_train, y_train)
y_pred = lr.predict(x_test)

 

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print("accuracy :%.2f" %accuracy_score(y_test, y_pred)) # (TP+TN) / TP+TN+FP+FN
print("precision_score :%.2f" %precision_score(y_test, y_pred))
print("recall_score :%.2f" %recall_score(y_test, y_pred))
print("f1_score :%.2f" %f1_score(y_test, y_pred))

accuracy :0.58
precision_score :0.57
recall_score :1.00
f1_score :0.72

# accuracy = (TP+TN) / (TP+TN+FP+FN)
#   => misleading on imbalanced data: a model that predicts TRUE for
#      everything already scores ~90% when 90% of the labels are TRUE

# precision = TP / (TP+FP)
#   => of the samples predicted TRUE, how many are actually TRUE?

# recall = TP / (TP+FN)
#   => of the actually-TRUE samples, how many were caught?

# f1 = 2 * (precision * recall) / (precision + recall)

 

from sklearn.metrics import confusion_matrix
confmat = confusion_matrix(y_test, y_pred)
print(confmat)

[[ 1 10]
 [ 0 13]]

# layout of confusion_matrix(y_true, y_pred):
#             pred 0    pred 1
#  actual 0 [   TN   ] [   FP   ]
#  actual 1 [   FN   ] [   TP   ]
#
# worked example with TP=54, TN=8, FP=31, FN=1:
#   accuracy  = (54+8) / (54+8+31+1) = 62/94 = 0.659
#   precision = 54 / (54+31)         = 54/85 = 0.635
#   recall    = 54 / (54+1)          = 54/55 = 0.982
#   F1        = 2*(0.635*0.982) / (0.635+0.982) = 0.771

# specificity : of the actually-FALSE samples, the fraction predicted FALSE
#   specificity = TN / (TN+FP)
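The same scores can be read straight off the printed matrix; a small cross-check using the confmat computed above:

tn, fp, fn, tp = confmat.ravel() # sklearn's 2x2 order : TN, FP, FN, TP
print('accuracy    : %.2f' % ((tp + tn) / (tp + tn + fp + fn)))
print('precision   : %.2f' % (tp / (tp + fp)))
print('recall      : %.2f' % (tp / (tp + fn)))
print('specificity : %.2f' % (tn / (tn + fp)))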

 

# ROC curve
# y-axis : TPR (true positive rate)
# x-axis : FPR (false positive rate) = 1 - specificity
# AUC (area under the curve) : area under the ROC curve; closer to 1 is better

 

# ROC
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt
# y_pred : predicted class labels
# y_pred_probability : predicted probability of the positive class
y_pred_probability = lr.predict_proba(x_test)[:,1]
false_positive_rate, true_positive_rate, thresholds = \
            roc_curve(y_test, y_pred_probability)
roc_auc = roc_auc_score(y_test, y_pred_probability)
print('AUC : %.3f' % roc_auc)
plt.rcParams['figure.figsize'] = [5, 4]
plt.plot(false_positive_rate, true_positive_rate, \
         label='ROC  Curve(area = %0.3f)' % roc_auc, 
         color = 'red', linewidth=4.0)
plt.plot([0,1], [0,1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve of Logistic regression')
plt.legend(loc='lower right')
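roc_curve also returns the thresholds it swept, which can be used to pick an operating point; a sketch choosing the threshold that maximizes TPR - FPR (Youden's J), purely as an illustration:

import numpy as np

j = true_positive_rate - false_positive_rate # Youden's J per threshold
best = np.argmax(j)
print('best threshold %.3f : TPR=%.3f, FPR=%.3f'
      % (thresholds[best], true_positive_rate[best], false_positive_rate[best]))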

 

 

 


Map crawling

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from selenium import webdriver
from bs4 import BeautifulSoup
import re
import time

path = "C:/R/chromedriver"
source_url = "https://map.kakao.com/"
driver = webdriver.Chrome(path)
driver.get(source_url) 
# 검색창
searchbox = driver.find_element_by_xpath("//*[@id='search.keyword.query']") 
# // input 가장 처음 input 찾기 , @ 속성표시
searchbox.send_keys("강남역 고기집")
searchbutton = driver.find_element_by_xpath("//*[@id='search.keyword.submit']")

driver.execute_script("arguments[0].click();", searchbutton)
time.sleep(1)

html = driver.page_source
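This cell uses the Selenium 3 API. Selenium 4 removed the positional driver path and the find_element_by_* helpers; a rough equivalent under Selenium 4 (assuming that is the version you have installed) looks like:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

driver = webdriver.Chrome(service=Service("C:/R/chromedriver"))
driver.get("https://map.kakao.com/")
searchbox = driver.find_element(By.XPATH, "//*[@id='search.keyword.query']")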

 

soup = BeautifulSoup(html, "html.parser")
# 페이지 url 
moreviews = soup.find_all(name = "a", attrs = {"class":"moreview"})
page_urls = []
for moreview in moreviews :
    page_url = moreview.get("href")
    print(page_url)
    page_urls.append(page_url)
driver.close()


https://place.map.kakao.com/85570955
https://place.map.kakao.com/1503746075
https://place.map.kakao.com/95713992
https://place.map.kakao.com/741391811
https://place.map.kakao.com/2011092566
https://place.map.kakao.com/13573220
https://place.map.kakao.com/2062959414
https://place.map.kakao.com/1648266796
https://place.map.kakao.com/168079537
https://place.map.kakao.com/263830255
https://place.map.kakao.com/27238067
https://place.map.kakao.com/26431943
https://place.map.kakao.com/1780387311
https://place.map.kakao.com/1907052666
https://place.map.kakao.com/1052874675
https://place.map.kakao.com/1576421052

 

columns = ['score','review']
df = pd.DataFrame(columns = columns)
driver = webdriver.Chrome(path)
for page in page_urls :
    driver.get(page)
    time.sleep(1.5)
    html = driver.page_source
    soup = BeautifulSoup(html, "html.parser")
    # review area of the place page
    contents_div = soup.find(name = "div", attrs={"class":"evaluation_review"})
    # ratings
    rates = contents_div.find_all(name="em", attrs={"class":"num_rate"})
    # review texts
    reviews = contents_div.find_all(name = "p", attrs={"class":"txt_comment"})
    print(rates)
    for rate, review in zip(rates, reviews) :
        row = [rate.text[0], review.find(name="span").text]
        series = pd.Series(row, index=df.columns)
        df = df.append(series, ignore_index=True) # pandas < 2.0; use pd.concat on newer versions

    # page through review pages 2-5, stopping when a page button is missing
    for button_num in range(2, 6) :
        try :
            another_reviews = driver.find_element_by_xpath\
                ("//a[@data-page='"+str(button_num)+"']")
            another_reviews.click()
            time.sleep(1.5)
            html = driver.page_source
            soup = BeautifulSoup(html, 'html.parser')

            contents_div = soup.find\
                (name="div", attrs={"class":"evaluation_review"})
            rates = contents_div.find_all\
                (name = "em", attrs = {"class":"num_rate"})
            reviews = contents_div.find_all\
                (name = "p", attrs = {"class":"txt_comment"})

            for rate, review in zip(rates, reviews) :
                row = [rate.text[0], review.find(name="span").text]
                series = pd.Series(row, index=df.columns)
                df = df.append(series, ignore_index=True)
        except :
            break
driver.close()



[<em class="num_rate">1<span class="screen_out">점</span></em>, <em class="num_rate">5<span class="screen_out">점</span></em>, <em class="num_rate">5<span class="screen_out">점</span></em>, <em class="num_rate">4<span class="screen_out">점</span></em>, <em class="num_rate">4<span class="screen_out">점</span></em>]
[<p class="txt_comment "><span>예약할 때는 룸을 주기로 하고 홀을 주고, 덥고, 직원들이 정신이 없어 그 가격에 내가 직접 구워먹고 갈비살, 등심은 질기고 냉면은 맛이 없고 장어 양념들도 제 때 안 가져다 주고 회식으로 한시간만에 120만원을 썼는데 이런 경험 처음입니다.</span><button class="btn_fold" type="button">더보기</button></p>, <p class="txt_comment "><span>점심식사 잘했던곳.후식커피한잔 하기도 좋고 주차가능합니다. 음식 맛있고 직원분 친절하여 절로 미소가 지어졌어요. </span><button class="btn_fold" type="button">더보기</button></p>, <p class="txt_comment "><span>新鮮でおいしいです。</span><button class="btn_fold" type="button">더보기</button></p>, <p class="txt_comment "><span>녹는다 녹아</span><button class="btn_fold" type="button">더보기</button></p>, <p class="txt_comment "><span></span><button class="btn_fold" type="button">더보기</button></p>]

 

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80 entries, 0 to 79
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   score   80 non-null     object
 1   review  80 non-null     object
dtypes: object(2)
memory usage: 1.4+ KB

 

df.head()

	score	review
0	1	예약할 때는 룸을 주기로 하고 홀을 주고, 덥고, 직원들이 정신이 없어 그 가격에 ...
1	5	점심식사 잘했던곳.후식커피한잔 하기도 좋고 주차가능합니다. 음식 맛있고 직원분 친절...
2	5	新鮮でおいしいです。
3	4	녹는다 녹아
4	4

 

# label reviews : positive (1) if score > 3, else negative (0)
df['y'] = df['score'].apply(lambda x : 1 if float(x) > 3 else 0)
df

	score	review	y
0	1	예약할 때는 룸을 주기로 하고 홀을 주고, 덥고, 직원들이 정신이 없어 그 가격에 ...	0
1	5	점심식사 잘했던곳.후식커피한잔 하기도 좋고 주차가능합니다. 음식 맛있고 직원분 친절...	1
2	5	新鮮でおいしいです。	1
3	4	녹는다 녹아	1
4	4		1
...	...	...	...
75	2	이렇게 대기가 긴 맛집인줄 모르고 갔다가 엄청 기다림 예써라는 어플로 대기 하던데 ...	0
76	1	단짠의 정석. 진짜 정석으로 달고 짬. 질리는 맛. 사장님이랑 와이프로 추정되는 ...	0
77	4	만족스러움! 맛있어용	1
78	1	곱창은 없고 대창만 들어있어서 느끼한데 양념은 너무 매워서 위에 탈이나 고생했습니다ㅠㅠ	0
79	5	대창덮밥도 맛있고 곱도리탕도 맛나요 완전 소주각입니다. 자리가 쫍아서 테이블마다 ...	1
80 rows × 3 columns

 

df.y.value_counts()

1    44
0    36
Name: y, dtype: int64

 

df.to_csv('review_data.csv', index=False)

# white wine quality analysis
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"
savefile = "winequality-white.csv"
from urllib.request import urlretrieve
urlretrieve(url, savefile)

# ('winequality-white.csv', <http.client.HTTPMessage at 0x214ffabed90>)


 

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('winequality-white.csv', sep=';', encoding = 'utf-8')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         4898 non-null   float64
 1   volatile acidity      4898 non-null   float64
 2   citric acid           4898 non-null   float64
 3   residual sugar        4898 non-null   float64
 4   chlorides             4898 non-null   float64
 5   free sulfur dioxide   4898 non-null   float64
 6   total sulfur dioxide  4898 non-null   float64
 7   density               4898 non-null   float64
 8   pH                    4898 non-null   float64
 9   sulphates             4898 non-null   float64
 10  alcohol               4898 non-null   float64
 11  quality               4898 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 459.3 KB

 

df.describe()

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
count	4898.000000	4898.000000	4898.000000	4898.000000	4898.000000	4898.000000	4898.000000	4898.000000	4898.000000	4898.000000	4898.000000	4898.000000
mean	6.854788	0.278241	0.334192	6.391415	0.045772	35.308085	138.360657	0.994027	3.188267	0.489847	10.514267	5.877909
std	0.843868	0.100795	0.121020	5.072058	0.021848	17.007137	42.498065	0.002991	0.151001	0.114126	1.230621	0.885639
min	3.800000	0.080000	0.000000	0.600000	0.009000	2.000000	9.000000	0.987110	2.720000	0.220000	8.000000	3.000000
25%	6.300000	0.210000	0.270000	1.700000	0.036000	23.000000	108.000000	0.991723	3.090000	0.410000	9.500000	5.000000
50%	6.800000	0.260000	0.320000	5.200000	0.043000	34.000000	134.000000	0.993740	3.180000	0.470000	10.400000	6.000000
75%	7.300000	0.320000	0.390000	9.900000	0.050000	46.000000	167.000000	0.996100	3.280000	0.550000	11.400000	6.000000
max	14.200000	1.100000	1.660000	65.800000	0.346000	289.000000	440.000000	1.038980	3.820000	1.080000	14.200000	9.000000

 

sns.countplot(df['quality'])

plt.hist(df['quality'])

(array([  20.,  163.,    0., 1457.,    0., 2198.,  880.,    0.,  175.,
           5.]),
 array([3. , 3.6, 4.2, 4.8, 5.4, 6. , 6.6, 7.2, 7.8, 8.4, 9. ]),
 <BarContainer object of 10 artists>)

# df['quality'].value_counts()            # ordered by count
df.groupby('quality')['quality'].count()  # ordered by quality value

quality
3      20
4     163
5    1457
6    2198
7     880
8     175
9       5
Name: quality, dtype: int64

 

 

plt.plot(df.groupby('quality')['quality'].count())

 

 

# GradientBoostingClassifier
x = df.drop('quality', axis = 1)
y = df['quality']

 

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state=10)

 

from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier()
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
y_pred[:10]

# array([6, 5, 4, 5, 6, 6, 6, 6, 5, 6], dtype=int64)

 

# evaluation
from sklearn.metrics import confusion_matrix
confmat = confusion_matrix(y_true = y_test, y_pred = y_pred)
print(confmat)

[[  0   1   0   0   1   0   0]
 [  0   2  18   9   0   0   0]
 [  0   6 159 109   5   1   0]
 [  1   6  73 337  33   0   0]
 [  0   0   6 104  68   2   0]
 [  0   0   0  19   8  10   1]
 [  0   0   0   1   0   0   0]]

 

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('정확도(accuracy) : %.2f'% accuracy_score(y_test, y_pred))
# precision_score / recall_score / f1_score are binary by default and need an
# average= strategy for this 7-class target, hence commented out:
# print('정밀도(precision) : %.3f'% precision_score(y_test, y_pred))
# print('재현율(recall) : %.3f'% recall_score(y_test, y_pred))
# print('F1-score : %.3f'% f1_score(y_test, y_pred))
# f1 = 2*(precision*recall)/(precision+recall)

정확도(accuracy) : 0.59
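precision, recall and F1 can still be reported for the 7-class target by picking an averaging strategy; a sketch with macro averaging (zero_division=0 silences warnings for classes the model never predicts):

print('precision (macro) : %.3f' % precision_score(y_test, y_pred, average='macro', zero_division=0))
print('recall (macro)    : %.3f' % recall_score(y_test, y_pred, average='macro', zero_division=0))
print('F1-score (macro)  : %.3f' % f1_score(y_test, y_pred, average='macro', zero_division=0))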

 

# quality runs from 3 to 9
# regroup it into 3 grades
df.groupby('quality')['quality'].count()

quality
3      20
4     163
5    1457
6    2198
7     880
8     175
9       5
Name: quality, dtype: int64

 

y = df['quality']
newlist = []
for v in list(y) :
    if v <= 4 :
        newlist += [0]
    elif v <= 7 :
        newlist += [1]
    else :
        newlist += [2]
y = newlist
y[:10]

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
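The regrouping loop can also be written with pandas' cut; an equivalent sketch using the same boundaries (quality <= 4 -> 0, 5-7 -> 1, 8-9 -> 2):

import pandas as pd

y = pd.cut(df['quality'], bins=[0, 4, 7, 10], labels=[0, 1, 2]).astype(int).tolist()
y[:10]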

 

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state=10)
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier()
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
y_pred[:10]

array([1, 1, 0, 1, 1, 1, 1, 1, 1, 1])

 

from sklearn.metrics import confusion_matrix
confmat = confusion_matrix(y_true = y_test, y_pred = y_pred)
print(confmat)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('정확도(accuracy) : %.2f'% accuracy_score(y_test, y_pred))

[[  3  28   0]
 [  8 899   3]
 [  0  30   9]]
정확도(accuracy) : 0.93

 

 

 

 


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df_train = pd.read_csv('titanic_train.csv')
df_test = pd.read_csv('titanic_test.csv')
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 916 entries, 0 to 915
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pclass     916 non-null    int64  
 1   survived   916 non-null    int64  
 2   name       916 non-null    object 
 3   sex        916 non-null    object 
 4   age        741 non-null    float64
 5   sibsp      916 non-null    int64  
 6   parch      916 non-null    int64  
 7   ticket     916 non-null    object 
 8   fare       916 non-null    float64
 9   cabin      214 non-null    object 
 10  embarked   914 non-null    object 
 11  body       85 non-null     float64
 12  home.dest  527 non-null    object 
dtypes: float64(3), int64(4), object(6)
memory usage: 93.2+ KB

 

df_train = df_train.drop(['ticket','body','home.dest'], axis=1)
df_test = df_test.drop(['ticket','body','home.dest'], axis=1)

age_mean = df_train['age'].mean()
df_train['age'] = df_train['age'].fillna(age_mean)
df_test['age'] = df_test['age'].fillna(age_mean)

em_mode = df_train['embarked'].value_counts().index[0]
df_train['embarked'] = df_train['embarked'].fillna(em_mode)
df_test['embarked'] = df_test['embarked'].fillna(em_mode)

 

whole_df = df_train.append(df_test)
train_idx_num = len(df_train)
whole_df['cabin'].value_counts()

C23 C25 C27        6
G6                 5
B57 B59 B63 B66    5
D                  4
F2                 4
                  ..
A20                1
C128               1
D6                 1
C49                1
A10                1
Name: cabin, Length: 186, dtype: int64

 

whole_df['cabin'].isnull().value_counts()

True     1014
False     295
Name: cabin, dtype: int64

 

whole_df['cabin'] = whole_df['cabin'].fillna('X')
whole_df['cabin'].value_counts()

X                  1014
C23 C25 C27           6
G6                    5
B57 B59 B63 B66       5
F2                    4
                   ... 
A9                    1
E52                   1
C95                   1
C99                   1
A10                   1
Name: cabin, Length: 187, dtype: int64

 

whole_df['cabin'].unique()

array(['X', 'E36', 'C68', 'E24', 'C22 C26', 'D38', 'B50', 'A24', 'C111',
       'F', 'C6', 'C87', 'E8', 'B45', 'C93', 'D28', 'D36', 'C125', 'B35',
       'T', 'B73', 'B57 B59 B63 B66', 'A26', 'A18', 'B96 B98', 'G6',
       'C78', 'C101', 'D9', 'D33', 'C128', 'E50', 'B26', 'B69', 'E121',
       'C123', 'B94', 'A34', 'D', 'C39', 'D43', 'E31', 'B5', 'D17', 'F33',
       'E44', 'D7', 'A21', 'D34', 'A29', 'D35', 'A11', 'B51 B53 B55',
       'D46', 'E60', 'C30', 'D26', 'E68', 'A9', 'B71', 'D37', 'F2',
       'C55 C57', 'C89', 'C124', 'C23 C25 C27', 'C126', 'E49', 'F E46',
       'E46', 'D19', 'B58 B60', 'C82', 'B52 B54 B56', 'C92', 'E45',
       'F G73', 'C65', 'E25', 'B3', 'D40', 'C91', 'B102', 'B61', 'F G63',
       'A20', 'B36', 'C7', 'B77', 'D20', 'C148', 'C105', 'E38', 'B86',
       'C132', 'C86', 'A14', 'C54', 'A5', 'B49', 'B28', 'B24', 'C2', 'F4',
       'A6', 'C83', 'B42', 'A36', 'C52', 'D56', 'C116', 'B19', 'E77',
       'F E57', 'E101', 'B18', 'C95', 'D15', 'E33', 'B30', 'D21', 'E10',
       'C130', 'D6', 'C51', 'D30', 'E67', 'C110', 'C103', 'C90', 'C118',
       'C97', 'D47', 'E34', 'B4', 'D50', 'C62 C64', 'E17', 'B41', 'C49',
       'C85', 'B20', 'C28', 'E63', 'C99', 'D49', 'A10', 'A16', 'B37',
       'C80', 'B78', 'E12', 'C104', 'A31', 'D11', 'D48', 'D10 D12', 'B38',
       'D45', 'C50', 'C31', 'B82 B84', 'A32', 'C53', 'B10', 'C70', 'A23',
       'C106', 'C46', 'E58', 'B11', 'F E69', 'B80', 'E39 E41', 'D22',
       'E40', 'A19', 'C32', 'B79', 'C45', 'B22', 'B39', 'C47', 'B101',
       'A7', 'E52', 'F38'], dtype=object)

 

# keep only the deck letter (first character) of each cabin value
whole_df['cabin'] = [ ca[0] for ca in  whole_df['cabin'].values ]
# equivalent : whole_df['cabin'] = whole_df['cabin'].apply(lambda x : x[0])

whole_df['cabin'].value_counts()
X    1014
C      94
B      65
D      46
E      41
A      22
F      21
G       5
T       1
Name: cabin, dtype: int64

 

whole_df['cabin'] = whole_df['cabin'].replace('G', 'X')
whole_df['cabin'] = whole_df['cabin'].replace('T', 'X')

whole_df['cabin'].value_counts()

X    1020
C      94
B      65
D      46
E      41
A      22
F      21
Name: cabin, dtype: int64

 

sns.countplot(x='cabin', hue='survived', data = whole_df)

whole_df['name']

0                 Mellinger, Miss. Madeleine Violet
1                                 Wells, Miss. Joan
2                    Duran y More, Miss. Florentina
3                                Scanlan, Mr. James
4                      Bradley, Miss. Bridget Delia
                           ...                     
388               Karlsson, Mr. Julius Konrad Eugen
389    Ware, Mrs. John James (Florence Louise Long)
390                            O'Keefe, Mr. Patrick
391                                Tobin, Mr. Roger
392                            Daniels, Miss. Sarah
Name: name, Length: 1309, dtype: object

 

# extract the title between ', ' and '.' from each name
n_grade = whole_df['name'].apply(lambda  x : x.split(", ")[1].split(".")[0])
n_grade = n_grade.unique().tolist()
n_grade
# equivalent : [ na[na.find(',')+2 : na.find('.')] for na in whole_df['name'].values ]

['Miss',
 'Mr',
 'Master',
 'Mrs',
 'Dr',
 'Mlle',
 'Col',
 'Rev',
 'Ms',
 'Mme',
 'Sir',
 'the Countess',
 'Dona',
 'Jonkheer',
 'Lady',
 'Major',
 'Don',
 'Capt']

 

# map each title to a social-status grade
grade_dict = {
    'A' : ['Rev', 'Col', 'Major', 'Dr', 'Capt', 'Sir'], # honorific titles
    'B' : ['Ms', 'Mme', 'Mrs', 'Dona'],                 # married women
    'C' : ['Jonkheer', 'the Countess'],                 # nobility
    'D' : ['Mr', 'Don'],                                # men
    'E' : ['Master'],                                   # boys / young men
    'F' : ['Miss', 'Mlle', 'Lady']                      # unmarried women
}

 

print(grade_dict.values())
print(grade_dict['A'])

dict_values([['Rev', 'Col', 'Major', 'Dr', 'Capt', 'Sir'], ['Ms', 'Mme', 'Mrs', 'Dona'], ['Jonkheer', 'the Countess'], ['Mr', 'Don'], ['Master'], ['Miss', 'Mlle', 'Lady']])
['Rev', 'Col', 'Major', 'Dr', 'Capt', 'Sir']

 

def give_grade(x) : # map a raw name string to its status grade via the title
    g = x.split(", ")[1].split(".")[0]
    for k, v in grade_dict.items() :
        for title in v :
            if g == title :
                return k
    return 'G'
whole_df['name'] = whole_df['name'].apply(lambda x : give_grade(x))

 

whole_df['name'].value_counts()

D    758
F    263
B    201
E     61
A     24
C      2
Name: name, dtype: int64

 

sns.countplot(x=whole_df['name'], hue=whole_df['survived'])

# one-hot encoding
whole_df_encoded = pd.get_dummies(whole_df)
whole_df_encoded.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 392
Data columns (total 24 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   pclass      1309 non-null   int64  
 1   survived    1309 non-null   int64  
 2   age         1309 non-null   float64
 3   sibsp       1309 non-null   int64  
 4   parch       1309 non-null   int64  
 5   fare        1309 non-null   float64
 6   name_A      1309 non-null   uint8  
 7   name_B      1309 non-null   uint8  
 8   name_C      1309 non-null   uint8  
 9   name_D      1309 non-null   uint8  
 10  name_E      1309 non-null   uint8  
 11  name_F      1309 non-null   uint8  
 12  sex_female  1309 non-null   uint8  
 13  sex_male    1309 non-null   uint8  
 14  cabin_A     1309 non-null   uint8  
 15  cabin_B     1309 non-null   uint8  
 16  cabin_C     1309 non-null   uint8  
 17  cabin_D     1309 non-null   uint8  
 18  cabin_E     1309 non-null   uint8  
 19  cabin_F     1309 non-null   uint8  
 20  cabin_X     1309 non-null   uint8  
 21  embarked_C  1309 non-null   uint8  
 22  embarked_Q  1309 non-null   uint8  
 23  embarked_S  1309 non-null   uint8  
dtypes: float64(2), int64(4), uint8(18)
memory usage: 134.6 KB
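pd.get_dummies expands every object column into 0/1 indicator columns, which is where the name_*, sex_*, cabin_* and embarked_* columns above come from; a tiny illustration on a made-up Series:

import pandas as pd

pd.get_dummies(pd.Series(['S', 'C', 'Q'], name='embarked'))
#    C  Q  S
# 0  0  0  1
# 1  1  0  0
# 2  0  1  0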

 

# independent variables of the training split
x_train = whole_df_encoded[:train_idx_num]
x_train = x_train.loc[:,x_train.columns != 'survived'].values
y_train = whole_df_encoded[:train_idx_num]['survived']

x_test = whole_df_encoded[train_idx_num:]
x_test = x_test.loc[:,x_test.columns != 'survived'].values
y_test = whole_df_encoded[train_idx_num:]['survived']

 

x_train.shape

# (916, 23)

 

# aside : splitting the raw frames directly would look like the lines below,
# but df_train / df_test still hold object columns at this point, so the
# encoded split above is the one the model is fit on
# y_train = df_train['survived'].values
# x_train = df_train.loc[:,df_train.columns != 'survived'].values
# y_test = df_test['survived'].values
# x_test = df_test.loc[:,df_test.columns != 'survived'].values

 

# alternative : rewrite the titles grade-by-grade on a copy of df_train
ttt = df_train.copy()
ttt['name'] = ttt['name'].apply(lambda x : x.split(', ')[1].split('.')[0])
for i in ttt['name'] :
    if i in grade_dict['A'] :
        ttt.name.replace(i, 'A', inplace = True)
    elif i in grade_dict['B'] :
        ttt.name.replace(i, 'B', inplace = True)
    elif i in grade_dict['C'] :
        ttt.name.replace(i, 'C', inplace = True)
    elif i in grade_dict['D'] :
        ttt.name.replace(i, 'D', inplace = True)
    elif i in grade_dict['E'] :
        ttt.name.replace(i, 'E', inplace = True)
    elif i in grade_dict['F'] :
        ttt.name.replace(i, 'F', inplace = True)
    else :
        ttt.name.replace(i, 'G', inplace = True)

 

y_test

1      1
2      0
3      0
4      0
5      1
      ..
388    0
389    1
390    1
391    0
392    1
Name: survived, Length: 392, dtype: int64

 

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(random_state=0)
lr.fit(x_train, y_train)
y_pred = lr.predict(x_test)

 

# evaluation
from sklearn.metrics import confusion_matrix
confmat = confusion_matrix(y_true = y_test, y_pred = y_pred)
print(confmat)

[[208  37]
 [ 42 105]]

 

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('accuracy : %.2f'% accuracy_score(y_test, y_pred))
print('precision : %.3f'% precision_score(y_test, y_pred))
print('recall : %.3f'% recall_score(y_test, y_pred))
print('F1-score : %.3f'% f1_score(y_test, y_pred))
# F1 = 2 * (precision * recall) / (precision + recall)

accuracy : 0.80
precision : 0.739
recall : 0.714
F1-score : 0.727
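
These scores can be recomputed by hand from the confusion matrix above (TN=208, FP=37, FN=42, TP=105), which makes a useful sanity check:

tn, fp, fn, tp = 208, 37, 42, 105
accuracy = (tp + tn) / (tp + tn + fp + fn)          # 313/392 ≈ 0.80
precision = tp / (tp + fp)                          # 105/142 ≈ 0.739
recall = tp / (tp + fn)                             # 105/147 ≈ 0.714
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.727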

 


Data files: titanic_train.csv (0.07 MB), titanic_test.csv (0.03 MB)

# Classification
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df_train = pd.read_csv('titanic_train.csv')
df_test = pd.read_csv('titanic_test.csv')
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 916 entries, 0 to 915
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pclass     916 non-null    int64  
 1   survived   916 non-null    int64  
 2   name       916 non-null    object 
 3   sex        916 non-null    object 
 4   age        741 non-null    float64
 5   sibsp      916 non-null    int64  
 6   parch      916 non-null    int64  
 7   ticket     916 non-null    object 
 8   fare       916 non-null    float64
 9   cabin      214 non-null    object 
 10  embarked   914 non-null    object 
 11  body       85 non-null     float64
 12  home.dest  527 non-null    object 
dtypes: float64(3), int64(4), object(6)
memory usage: 93.2+ KB

 

df_train['survived'].value_counts()

0    563
1    353
Name: survived, dtype: int64

 

df_train['survived'].value_counts().plot.bar()

df_train[['pclass','survived']].value_counts().sort_index().plot.bar()

 

ax = sns.countplot(x='pclass', hue = 'survived', data = df_train)

df_train[['sex','survived']].value_counts().sort_index().plot.bar()

 

ax = sns.countplot(x='sex', hue = 'survived', data = df_train)

 

# Classification
# Handling missing values
- Deletion: simple to apply, but important information may be lost
# Imputation
- Replace missing values with the mean, median, or mode (a short sketch follows below)
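
A minimal sketch of both strategies on this dataset (using the df_train loaded above; which strategy fits depends on the column):

# Deletion: drop any row with a missing value
# (here this would discard most rows, since cabin and body are mostly null)
dropped = df_train.dropna()

# Imputation: fill numeric columns with the mean/median, categorical with the mode
df_imputed = df_train.copy()
df_imputed['age'] = df_imputed['age'].fillna(df_imputed['age'].median())
df_imputed['embarked'] = df_imputed['embarked'].fillna(df_imputed['embarked'].mode()[0])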

 

df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 916 entries, 0 to 915
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pclass     916 non-null    int64  
 1   survived   916 non-null    int64  
 2   name       916 non-null    object 
 3   sex        916 non-null    object 
 4   age        741 non-null    float64
 5   sibsp      916 non-null    int64  
 6   parch      916 non-null    int64  
 7   ticket     916 non-null    object 
 8   fare       916 non-null    float64
 9   cabin      214 non-null    object 
 10  embarked   914 non-null    object 
 11  body       85 non-null     float64
 12  home.dest  527 non-null    object 
dtypes: float64(3), int64(4), object(6)
memory usage: 93.2+ KB

 

age_mean = df_train['age'].mean()
age_mean

30.23144399460189

 

df_train['age'] = df_train['age'].fillna(age_mean)
df_test['age'] = df_test['age'].fillna(age_mean)
# age_mean = df_train['age'].mean(skipna = False)
df_train['age']

0      13.000000
1       4.000000
2      30.000000
3      30.231444
4      22.000000
         ...    
911     0.170000
912    30.231444
913    30.231444
914    20.000000
915    32.000000
Name: age, Length: 916, dtype: float64

 

df_train['embarked'].isnull().value_counts()

False    914
True       2
Name: embarked, dtype: int64

 

replace_embarked = df_train['embarked'].value_counts().index[0] # most frequent value

df_train['embarked'] = df_train['embarked'].fillna(replace_embarked)
df_test['embarked'] = df_test['embarked'].fillna(replace_embarked)

df_train['embarked']

0      S
1      S
2      C
3      Q
4      Q
      ..
911    S
912    S
913    Q
914    S
915    Q
Name: embarked, Length: 916, dtype: object
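
Equivalently, pandas exposes the most frequent value directly; a one-line alternative to value_counts().index[0]:

replace_embarked = df_train['embarked'].mode()[0]  # mode() returns the most frequent value(s)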

 

df_train = df_train.drop(['name','ticket','body','cabin','home.dest'], axis=1)
df_test = df_test.drop(['name','ticket','body','cabin','home.dest'], axis=1)

df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 393 entries, 0 to 392
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   pclass    393 non-null    int64  
 1   survived  393 non-null    int64  
 2   sex       393 non-null    object 
 3   age       393 non-null    float64
 4   sibsp     393 non-null    int64  
 5   parch     393 non-null    int64  
 6   fare      393 non-null    float64
 7   embarked  393 non-null    object 
dtypes: float64(2), int64(4), object(2)
memory usage: 24.7+ KB

 

whole_df = df_train.append(df_test)
whole_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 392
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   pclass    1309 non-null   int64  
 1   survived  1309 non-null   int64  
 2   sex       1309 non-null   object 
 3   age       1309 non-null   float64
 4   sibsp     1309 non-null   int64  
 5   parch     1309 non-null   int64  
 6   fare      1309 non-null   float64
 7   embarked  1309 non-null   object 
dtypes: float64(2), int64(4), object(2)
memory usage: 92.0+ KB
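
Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on newer pandas the same concatenation can be written with pd.concat:

whole_df = pd.concat([df_train, df_test])  # keeps each frame's original index, like append did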

 

train_num = len(df_train)

whole_df_encoded = pd.get_dummies(whole_df)
whole_df_encoded.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 392
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   pclass      1309 non-null   int64  
 1   survived    1309 non-null   int64  
 2   age         1309 non-null   float64
 3   sibsp       1309 non-null   int64  
 4   parch       1309 non-null   int64  
 5   fare        1309 non-null   float64
 6   sex_female  1309 non-null   uint8  
 7   sex_male    1309 non-null   uint8  
 8   embarked_C  1309 non-null   uint8  
 9   embarked_Q  1309 non-null   uint8  
 10  embarked_S  1309 non-null   uint8  
dtypes: float64(2), int64(4), uint8(5)
memory usage: 78.0 KB

 

df_train = whole_df_encoded[:train_num]
df_test = whole_df_encoded[train_num:]
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 916 entries, 0 to 915
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   pclass      916 non-null    int64  
 1   survived    916 non-null    int64  
 2   age         916 non-null    float64
 3   sibsp       916 non-null    int64  
 4   parch       916 non-null    int64  
 5   fare        916 non-null    float64
 6   sex_female  916 non-null    uint8  
 7   sex_male    916 non-null    uint8  
 8   embarked_C  916 non-null    uint8  
 9   embarked_Q  916 non-null    uint8  
 10  embarked_S  916 non-null    uint8  
dtypes: float64(2), int64(4), uint8(5)
memory usage: 54.6 KB

 

df_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 393 entries, 0 to 392
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   pclass      393 non-null    int64  
 1   survived    393 non-null    int64  
 2   age         393 non-null    float64
 3   sibsp       393 non-null    int64  
 4   parch       393 non-null    int64  
 5   fare        393 non-null    float64
 6   sex_female  393 non-null    uint8  
 7   sex_male    393 non-null    uint8  
 8   embarked_C  393 non-null    uint8  
 9   embarked_Q  393 non-null    uint8  
 10  embarked_S  393 non-null    uint8  
dtypes: float64(2), int64(4), uint8(5)
memory usage: 23.4 KB

 

y_train = df_train['survived'].values
x_train = df_train.loc[:,df_train.columns != 'survived'].values
y_test = df_test['survived'].values
x_test = df_test.loc[:,df_test.columns != 'survived'].values

 

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(random_state=0)
lr.fit(x_train, y_train)

 

y_pred = lr.predict(x_test)

 

# Evaluation
from sklearn.metrics import confusion_matrix
confmat = confusion_matrix(y_true = y_test, y_pred = y_pred)
print(confmat)

 

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('accuracy : %.2f'% accuracy_score(y_test, y_pred))
print('precision : %.3f'% precision_score(y_test, y_pred))
print('recall : %.3f'% recall_score(y_test, y_pred))
print('F1-score : %.3f'% f1_score(y_test, y_pred))
# F1 = 2 * (precision * recall) / (precision + recall)

 

 


# Boston housing price data
# http://lib.stat.cmu.edu/datasets/boston_corrected.txt

Data file: BostonHousing2.csv (0.05 MB)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

housing = pd.read_csv('BostonHousing2.csv')
housing.head()

	TOWN	LON	LAT	CMEDV	CRIM	ZN	INDUS	CHAS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT
0	Nahant	-70.955	42.2550	24.0	0.00632	18.0	2.31	0	0.538	6.575	65.2	4.0900	1	296	15.3	396.90	4.98
1	Swampscott	-70.950	42.2875	21.6	0.02731	0.0	7.07	0	0.469	6.421	78.9	4.9671	2	242	17.8	396.90	9.14
2	Swampscott	-70.936	42.2830	34.7	0.02729	0.0	7.07	0	0.469	7.185	61.1	4.9671	2	242	17.8	392.83	4.03
3	Marblehead	-70.928	42.2930	33.4	0.03237	0.0	2.18	0	0.458	6.998	45.8	6.0622	3	222	18.7	394.63	2.94
4	Marblehead	-70.922	42.2980	36.2	0.06905	0.0	2.18	0	0.458	7.147	54.2	6.0622	3	222	18.7	396.90	5.33

housing = housing.rename(columns = {'CMEDV':'y'})
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 17 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   TOWN     506 non-null    object 
 1   LON      506 non-null    float64
 2   LAT      506 non-null    float64
 3   y        506 non-null    float64
 4   CRIM     506 non-null    float64
 5   ZN       506 non-null    float64
 6   INDUS    506 non-null    float64
 7   CHAS     506 non-null    int64  
 8   NOX      506 non-null    float64
 9   RM       506 non-null    float64
 10  AGE      506 non-null    float64
 11  DIS      506 non-null    float64
 12  RAD      506 non-null    int64  
 13  TAX      506 non-null    int64  
 14  PTRATIO  506 non-null    float64
 15  B        506 non-null    float64
 16  LSTAT    506 non-null    float64
dtypes: float64(13), int64(3), object(1)
memory usage: 67.3+ KB

cols = ['y','RM','LSTAT','NOX']
sns.pairplot(housing[cols])
plt.show()
# y and RM are positively correlated; y vs LSTAT and y vs NOX are negatively correlated

# Select the independent and dependent attributes
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt

# 1. independent attributes
x = housing[['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE',
             'DIS','RAD','TAX','PTRATIO','B','LSTAT']]
# 2. dependent attribute
y = housing['y']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=33)

lr = linear_model.LinearRegression()
model = lr.fit(x_train, y_train)
print(model.score(x_train, y_train)) # R^2 on the training set
print(model.score(x_test, y_test)) # R^2 on the test set

0.7490284664199387
0.7009342135321538

print(lr.coef_)

[-1.11193551e-01  5.09415195e-02  3.25436161e-02  3.02115825e+00
 -1.54108556e+01  4.04590890e+00 -1.97595267e-03 -1.56114408e+00
  3.27038718e-01 -1.38825230e-02 -8.22151628e-01  8.74659468e-03
 -5.85060261e-01]
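
The bare coefficient array is hard to read; pairing each weight with its feature name (using the x defined above) makes the signs interpretable:

coef = pd.Series(lr.coef_, index=x.columns).sort_values()
print(coef)  # NOX carries a large negative weight, RM a large positive one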

y_pred = lr.predict(x_train)
rmse = sqrt(mean_squared_error(y_train, y_pred))
print(rmse) # RMSE on the training set

4.672162734008588

y_pred = lr.predict(x_test)
rmse = sqrt(mean_squared_error(y_test, y_pred))
print(rmse) # RMSE on the test set

4.614951784913319

y[:10]

0    24.0
1    21.6
2    34.7
3    33.4
4    36.2
5    28.7
6    22.9
7    22.1
8    16.5
9    18.9
Name: y, dtype: float64
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from fbprophet import Prophet
file_path = 'market-price.csv'
bitcoin_df = pd.read_csv(file_path, names = ['ds', 'y'], header=0)
# set the cap (upper bound)
bitcoin_df['cap'] = 20000
# set the floor (lower bound)
# bitcoin_df['floor'] = 2000
# growth='logistic' is required once a cap is set; the trend is then fit nonlinearly
prophet = Prophet(seasonality_mode = 'multiplicative',
                  growth = 'logistic', # nonlinear trend, used together with cap/floor
                 yearly_seasonality = True, # yearly seasonality
                 weekly_seasonality = True, # weekly seasonality
                 daily_seasonality = True, # daily seasonality
                 changepoint_prior_scale = 0.5) # trend flexibility (default 0.05); larger values follow changepoints more aggressively
prophet.fit(bitcoin_df) # fit the model
bitcoin_df.head()


ds	y	cap
0	2017-08-27 00:00:00	4354.308333	20000
1	2017-08-28 00:00:00	4391.673517	20000
2	2017-08-29 00:00:00	4607.985450	20000
3	2017-08-30 00:00:00	4594.987850	20000
4	2017-08-31 00:00:00	4748.255000	20000

 

# For reference, Prophet's constructor signature and default values:
Prophet(
    growth='linear',
    changepoints=None,
    n_changepoints=25,
    changepoint_range=0.8,
    yearly_seasonality='auto',
    weekly_seasonality='auto',
    daily_seasonality='auto',
    holidays=None,
    seasonality_mode='additive',
    seasonality_prior_scale=10.0,
    holidays_prior_scale=10.0,
    changepoint_prior_scale=0.05,
    mcmc_samples=0,
    interval_width=0.8,
    uncertainty_samples=1000,
    stan_backend=None,
)

 

# Forecast 5 days ahead
future_data = prophet.make_future_dataframe(periods=5, freq='d')
# the cap must be set on the future frame as well
future_data['cap'] = 20000
# future_data['floor'] = 2000
# predict
forecast_data = prophet.predict(future_data)
forecast_data

	ds	trend	cap	yhat_lower	yhat_upper	trend_lower	trend_upper	daily	daily_lower	daily_upper	...	weekly	weekly_lower	weekly_upper	yearly	yearly_lower	yearly_upper	additive_terms	additive_terms_lower	additive_terms_upper	yhat
0	2017-08-27	5621.085431	20000	4008.821488	5757.019304	5621.085431	5621.085431	0.311474	0.311474	0.311474	...	0.002289	0.002289	0.002289	-0.440095	-0.440095	-0.440095	0.0	0.0	0.0	4910.962354
1	2017-08-28	5626.023045	20000	3955.585468	5723.896533	5626.023045	5626.023045	0.311474	0.311474	0.311474	...	-0.000562	-0.000562	-0.000562	-0.449330	-0.449330	-0.449330	0.0	0.0	0.0	4847.280361
2	2017-08-29	5630.963297	20000	3911.998721	5680.326522	5630.963297	5630.963297	0.311474	0.311474	0.311474	...	-0.000493	-0.000493	-0.000493	-0.459435	-0.459435	-0.459435	0.0	0.0	0.0	4795.023355
3	2017-08-30	5635.906186	20000	3875.520852	5590.533721	5635.906186	5635.906186	0.311474	0.311474	0.311474	...	-0.007356	-0.007356	-0.007356	-0.470318	-0.470318	-0.470318	0.0	0.0	0.0	4699.215406
4	2017-08-31	5640.851711	20000	3810.130956	5423.302109	5640.851711	5640.851711	0.311474	0.311474	0.311474	...	-0.005606	-0.005606	-0.005606	-0.481860	-0.481860	-0.481860	0.0	0.0	0.0	4648.105600
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
365	2018-08-27	7108.210766	20000	5313.384266	7048.183238	7108.210766	7108.210766	0.311474	0.311474	0.311474	...	-0.000562	-0.000562	-0.000562	-0.437930	-0.437930	-0.437930	0.0	0.0	0.0	6205.334670
366	2018-08-28	7111.966217	20000	5257.614559	7014.891617	7111.966217	7111.966217	0.311474	0.311474	0.311474	...	-0.000493	-0.000493	-0.000493	-0.446936	-0.446936	-0.446936	0.0	0.0	0.0	6145.052289
367	2018-08-29	7115.722558	20000	5183.980838	6862.843499	7115.722558	7115.722558	0.311474	0.311474	0.311474	...	-0.007356	-0.007356	-0.007356	-0.456831	-0.456831	-0.456831	0.0	0.0	0.0	6029.053882
368	2018-08-30	7119.479786	20000	5142.817816	6755.514784	7119.464525	7119.496318	0.311474	0.311474	0.311474	...	-0.005606	-0.005606	-0.005606	-0.467530	-0.467530	-0.467530	0.0	0.0	0.0	5968.527240
369	2018-08-31	7123.237902	20000	5101.560913	6845.461477	7123.183774	7123.294421	0.311474	0.311474	0.311474	...	0.000310	0.000310	0.000310	-0.478920	-0.478920	-0.478920	0.0	0.0	0.0	5932.681373
370 rows × 23 columns

 

# Plot the forecast
fig = prophet.plot(forecast_data)

 

# Compare with the actual data
# forecast values
pred_y = forecast_data.yhat.values[-5:]
pred_y 

# array([6205.33466998, 6145.05228906, 6029.05388165, 5968.52723998,
#        5932.68137291])

 

# actual values
test_file_path = 'market-price-test.csv'
bitcoin_test_df = pd.read_csv(test_file_path, names = ['ds', 'y'], header=0)
test_y = bitcoin_test_df.y.values

 

# forecast lower bound
pred_y_lower = forecast_data.yhat_lower.values[-5:]

# forecast upper bound
pred_y_upper = forecast_data.yhat_upper.values[-5:]

 

plt.plot(pred_y, color = 'gold') # predicted price
plt.plot(pred_y_lower, color = 'red') # predicted lower bound
plt.plot(pred_y_upper, color = 'blue') # predicted upper bound
plt.plot(test_y, color = 'green') # actual price

# Remove outliers: mask prices above 18000 as missing
bitcoin_df = pd.read_csv(file_path, names = ['ds', 'y'], header=0)
bitcoin_df.loc[bitcoin_df['y'] > 18000, 'y'] = None
bitcoin_df.info()
# 3 values masked (362 non-null below)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 365 entries, 0 to 364
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ds      365 non-null    object 
 1   y       362 non-null    float64
dtypes: float64(1), object(1)
memory usage: 5.8+ KB

 

prophet = Prophet(seasonality_mode = 'multiplicative',
                 yearly_seasonality = True, # yearly seasonality
                 weekly_seasonality = True, # weekly seasonality
                 daily_seasonality = True, # daily seasonality
                 changepoint_prior_scale = 0.5) # trend flexibility (default 0.05)
prophet.fit(bitcoin_df) # fit the model

 

# Forecast 5 days ahead
future_data = prophet.make_future_dataframe(periods=5, freq='d')
# # no cap this time (linear growth)
# future_data['cap'] = 20000
# predict
forecast_data = prophet.predict(future_data)
forecast_data

	ds	trend	yhat_lower	yhat_upper	trend_lower	trend_upper	daily	daily_lower	daily_upper	multiplicative_terms	...	weekly	weekly_lower	weekly_upper	yearly	yearly_lower	yearly_upper	additive_terms	additive_terms_lower	additive_terms_upper	yhat
0	2017-08-27	528.085585	3766.698129	5075.496722	528.085585	528.085585	9.711762	9.711762	9.711762	7.371717	...	-0.109233	-0.109233	-0.109233	-2.230812	-2.230812	-2.230812	0.0	0.0	0.0	4420.983246
1	2017-08-28	529.776373	3961.128742	5121.531273	529.776373	529.776373	9.711762	9.711762	9.711762	7.479669	...	-0.054572	-0.054572	-0.054572	-2.177521	-2.177521	-2.177521	0.0	0.0	0.0	4492.328178
2	2017-08-29	531.467162	3978.374754	5172.827221	531.467162	531.467162	9.711762	9.711762	9.711762	7.642893	...	0.067545	0.067545	0.067545	-2.136414	-2.136414	-2.136414	0.0	0.0	0.0	4593.413651
3	2017-08-30	533.157950	4013.934349	5196.960495	533.157950	533.157950	9.711762	9.711762	9.711762	7.611159	...	0.009555	0.009555	0.009555	-2.110158	-2.110158	-2.110158	0.0	0.0	0.0	4591.108025
4	2017-08-31	534.848738	4008.172911	5178.939144	534.848738	534.848738	9.711762	9.711762	9.711762	7.647029	...	0.036189	0.036189	0.036189	-2.100922	-2.100922	-2.100922	0.0	0.0	0.0	4624.852670
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
365	2018-08-27	818.078684	6243.626452	7506.907764	818.078684	818.078684	9.711762	9.711762	9.711762	7.411506	...	-0.054572	-0.054572	-0.054572	-2.245684	-2.245684	-2.245684	0.0	0.0	0.0	6881.273946
366	2018-08-28	822.999668	6453.325229	7704.060396	822.999668	822.999668	9.711762	9.711762	9.711762	7.589491	...	0.067545	0.067545	0.067545	-2.189816	-2.189816	-2.189816	0.0	0.0	0.0	7069.148295
367	2018-08-29	827.920652	6419.908313	7727.754695	825.682414	827.920652	9.711762	9.711762	9.711762	7.575922	...	0.009555	0.009555	0.009555	-2.145395	-2.145395	-2.145395	0.0	0.0	0.0	7100.183131
368	2018-08-30	832.841636	6536.153516	7786.172133	823.752948	834.318783	9.711762	9.711762	9.711762	7.632749	...	0.036189	0.036189	0.036189	-2.115202	-2.115202	-2.115202	0.0	0.0	0.0	7189.713129
369	2018-08-31	837.762619	6550.207854	7952.370851	816.207036	850.803956	9.711762	9.711762	9.711762	7.688080	...	0.077855	0.077855	0.077855	-2.101537	-2.101537	-2.101537	0.0	0.0	0.0	7278.549033
370 rows × 22 columns

 

# Plot the forecast
fig = prophet.plot(forecast_data)

# Compare with the actual data
# forecast values
pred_y = forecast_data.yhat.values[-5:]
pred_y 

# array([6881.2739463 , 7069.14829491, 7100.18313111, 7189.71312892,
#        7278.54903332])

 

# actual values
test_file_path = 'market-price-test.csv'
bitcoin_test_df = pd.read_csv(test_file_path, names = ['ds', 'y'], header=0)
test_y = bitcoin_test_df.y.values

# forecast lower bound
pred_y_lower = forecast_data.yhat_lower.values[-5:]

# forecast upper bound
pred_y_upper = forecast_data.yhat_upper.values[-5:]

 

plt.plot(pred_y, color = 'gold') # predicted price
plt.plot(pred_y_lower, color = 'red') # predicted lower bound
plt.plot(pred_y_upper, color = 'blue') # predicted upper bound
plt.plot(test_y, color = 'green') # actual price

 

# Logistic growth with both a cap and a floor set
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from fbprophet import Prophet
file_path = 'market-price.csv'
bitcoin_df = pd.read_csv(file_path, names = ['ds', 'y'], header=0)
# set the cap (upper bound)
bitcoin_df['cap'] = 20000
# set the floor (lower bound)
bitcoin_df['floor'] = 2000
# growth='logistic' is required once a cap/floor is set; the trend is then fit nonlinearly
prophet = Prophet(seasonality_mode = 'multiplicative',
                  growth = 'logistic', # nonlinear trend, used together with cap/floor
                 yearly_seasonality = True, # yearly seasonality
                 weekly_seasonality = True, # weekly seasonality
                 daily_seasonality = True, # daily seasonality
                 changepoint_prior_scale = 0.5) # trend flexibility (default 0.05)
prophet.fit(bitcoin_df) # fit the model

 

# Forecast 5 days ahead
future_data = prophet.make_future_dataframe(periods=5, freq='d')
# cap and floor must be set on the future frame as well
future_data['cap'] = 20000
future_data['floor'] = 2000
# predict
forecast_data = prophet.predict(future_data)
forecast_data


ds	trend	cap	floor	yhat_lower	yhat_upper	trend_lower	trend_upper	daily	daily_lower	...	weekly	weekly_lower	weekly_upper	yearly	yearly_lower	yearly_upper	additive_terms	additive_terms_lower	additive_terms_upper	yhat
0	2017-08-27	5703.063125	20000	2000	3715.115421	5497.027367	5703.063125	5703.063125	0.426516	0.426516	...	0.003555	0.003555	0.003555	-0.626876	-0.626876	-0.626876	0.0	0.0	0.0	4580.676685
1	2017-08-28	5708.250950	20000	2000	3665.713289	5396.548303	5708.250950	5708.250950	0.426516	0.426516	...	-0.000994	-0.000994	-0.000994	-0.642812	-0.642812	-0.642812	0.0	0.0	0.0	4467.904518
2	2017-08-29	5713.444155	20000	2000	3501.918583	5273.509727	5713.444155	5713.444155	0.426516	0.426516	...	-0.000734	-0.000734	-0.000734	-0.659989	-0.659989	-0.659989	0.0	0.0	0.0	4375.317434
3	2017-08-30	5718.642741	20000	2000	3377.490794	5126.427715	5718.642741	5718.642741	0.426516	0.426516	...	-0.010622	-0.010622	-0.010622	-0.678244	-0.678244	-0.678244	0.0	0.0	0.0	4218.353837
4	2017-08-31	5723.846707	20000	2000	3284.264456	5033.597019	5723.846707	5723.846707	0.426516	0.426516	...	-0.008040	-0.008040	-0.008040	-0.697376	-0.697376	-0.697376	0.0	0.0	0.0	4127.470590
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
365	2018-08-27	7102.808324	20000	2000	4866.755803	6547.711110	7102.808324	7102.808324	0.426516	0.426516	...	-0.000994	-0.000994	-0.000994	-0.623101	-0.623101	-0.623101	0.0	0.0	0.0	5699.443947
366	2018-08-28	7106.076367	20000	2000	4743.519630	6436.151223	7106.076367	7106.076367	0.426516	0.426516	...	-0.000734	-0.000734	-0.000734	-0.638706	-0.638706	-0.638706	0.0	0.0	0.0	5593.025888
367	2018-08-29	7109.345673	20000	2000	4533.463676	6332.987745	7109.336045	7109.345673	0.426516	0.426516	...	-0.010622	-0.010622	-0.010622	-0.655587	-0.655587	-0.655587	0.0	0.0	0.0	5405.285348
368	2018-08-30	7112.616243	20000	2000	4408.976493	6155.807154	7112.547092	7112.630877	0.426516	0.426516	...	-0.008040	-0.008040	-0.008040	-0.673590	-0.673590	-0.673590	0.0	0.0	0.0	5298.091380
369	2018-08-31	7115.888076	20000	2000	4405.810985	6115.617608	7115.744311	7115.953331	0.426516	0.426516	...	0.000391	0.000391	0.000391	-0.692523	-0.692523	-0.692523	0.0	0.0	0.0	5225.796784
370 rows × 24 columns

 

# Plot the forecast
fig = prophet.plot(forecast_data)

 

# Compare with the actual data
# forecast values
pred_y = forecast_data.yhat.values[-5:]
pred_y 

# array([5699.44394662, 5593.02588819, 5405.28534766, 5298.09137955,
#        5225.79678411])

 

# actual values
test_file_path = 'market-price-test.csv'
bitcoin_test_df = pd.read_csv(test_file_path, names = ['ds', 'y'], header=0)
test_y = bitcoin_test_df.y.values

# forecast lower bound
pred_y_lower = forecast_data.yhat_lower.values[-5:]

# forecast upper bound
pred_y_upper = forecast_data.yhat_upper.values[-5:]

 

plt.plot(pred_y, color = 'gold') # predicted price
plt.plot(pred_y_lower, color = 'red') # predicted lower bound
plt.plot(pred_y_upper, color = 'blue') # predicted upper bound
plt.plot(test_y, color = 'green') # actual price

 

 


Data file: market-price.csv (0.01 MB)

# Time series data: values measured at successive points in time
# The ARIMA model is provided by the statsmodels module
# AR: regresses on the series' own past values
# MA: infers the current state from the errors of past predictions

 

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
file_path = 'market-price.csv'
bitcoin_df = pd.read_csv(file_path)
bitcoin_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 365 entries, 0 to 364
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Timestamp     365 non-null    object 
 1   market-price  365 non-null    float64
dtypes: float64(1), object(1)
memory usage: 5.8+ KB

 

bitcoin_df.head()

	Timestamp	market-price
0	2017-08-27 00:00:00	4354.308333
1	2017-08-28 00:00:00	4391.673517
2	2017-08-29 00:00:00	4607.985450
3	2017-08-30 00:00:00	4594.987850
4	2017-08-31 00:00:00	4748.255000

 

bitcoin_df = pd.read_csv('market-price.csv', names = ['day','price'], header=0)
bitcoin_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 365 entries, 0 to 364
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   day     365 non-null    object 
 1   price   365 non-null    float64
dtypes: float64(1), object(1)
memory usage: 5.8+ KB

 

bitcoin_df.head()

	day	price
0	2017-08-27 00:00:00	4354.308333
1	2017-08-28 00:00:00	4391.673517
2	2017-08-29 00:00:00	4607.985450
3	2017-08-30 00:00:00	4594.987850
4	2017-08-31 00:00:00	4748.255000

 

bitcoin_df.shape

# (365, 2)

 

bitcoin_df['day'] = pd.to_datetime(bitcoin_df['day'])
bitcoin_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 365 entries, 0 to 364
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   day     365 non-null    datetime64[ns]
 1   price   365 non-null    float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 5.8 KB

 

bitcoin_df.describe()

	price
count	365.000000
mean	8395.863578
std	3239.804756
min	3319.630000
25%	6396.772500
50%	7685.633333
75%	9630.136277
max	19498.683333

 

bitcoin_df.plot()
plt.show()

 

# Train the ARIMA model
# order = (2,1,2)
# 2 => AR: feed the last 2 past values into the model
# 1 => differencing order: current value minus the immediately preceding value,
#      used to capture the irregularity (non-stationarity) of the series => Bitcoin ^^
# 2 => MA: use the last 2 past prediction errors to infer the present
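
First-order differencing itself is a one-liner in pandas; a quick sketch on the price column loaded above:

bitcoin_df['price'].diff()  # each value minus the previous one; the first entry becomes NaN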

 

from statsmodels.tsa.arima_model import ARIMA # legacy API (removed in statsmodels 0.13)
model = ARIMA(bitcoin_df.price.values, order=(2,1,2))
model_fit = model.fit(trend='c', full_output=True, disp=True)
fig = model_fit.plot_predict() # plot predictions over the training data
residuals = pd.DataFrame(model_fit.resid) # visualize the residuals
residuals.plot()

# Compare with the actual data
# Forecast the next 5 days
forecast_data = model_fit.forecast(steps=5)
forecast_data
# array 1: the 5-day forecast values
# array 2: the standard error for each of the 5 days
# array 3: 5 pairs of [forecast lower bound, forecast upper bound]

(array([6676.91689529, 6685.04884511, 6690.29837254, 6697.35159419,
        6703.26452567]),
 array([ 512.41529746,  753.50414112,  914.97749885, 1061.45286959,
        1184.4382798 ]),
 array([[5672.60136715, 7681.23242343],
        [5208.20786632, 8161.8898239 ],
        [4896.97542813, 8483.62131695],
        [4616.94219851, 8777.76098987],
        [4381.80815535, 9024.720896  ]]))

 

# Load the actual (test) data
test_file_path = 'market-price-test.csv'
bitcoin_test_df = pd.read_csv(test_file_path, names = ['ds','y'], header=0)

 

# store the forecast values in pred_y as a list (forecast_data itself is a tuple of arrays)
pred_y = forecast_data[0].tolist()
pred_y

[6676.9168952924865,
 6685.048845109902,
 6690.298372539306,
 6697.35159419041,
 6703.2645256732285]

 

# store the actual values in test_y
test_y = bitcoin_test_df['y'].values
test_y
pred_y_lower = [] # forecast lower bounds
pred_y_upper = [] # forecast upper bounds

 

for low_up in forecast_data[2]:
    pred_y_lower.append(low_up[0])
    pred_y_upper.append(low_up[1])
    
pred_y_lower
[5672.601367152579,
 5208.207866318599,
 4896.975428126821,
 4616.942198505993,
 4381.808155348637]

 

pred_y_upper
[7681.232423432394,
 8161.889823901204,
 8483.62131695179,
 8777.760989874827,
 9024.72089599782]

 

# Visualization
plt.plot(pred_y, color='gold') # forecast values
plt.plot(test_y, color='green') # actual values => fairly volatile
plt.plot(pred_y_lower, color='red') # forecast lower bound
plt.plot(pred_y_upper, color='blue') # forecast upper bound

 

# Models for time series analysis

# AR (autoregressive model)

Suited to series that simply evolve over time => stock price analysis, etc.
Relates the current value of the series to its own past values.
AR(n) => uses values up to n time steps back.

 

# MA (moving average model)

Relates the current value to the errors of past predictions.

 

# Combining the two: ARMA

ARMA (autoregressive moving average model)
compares the present of the series with its own past,
and with the current prediction errors.

 

# ARIMA (autoregressive integrated moving average model)

Applies the AR and MA orders together on a differenced series,
defining the relationship between the current value and the trend.
ARMA handles regular (stationary) series but struggles with irregular ones;
ARIMA was introduced to overcome that limitation.

 

# arima(p, d, q)
p : order of the AR term
d : order of differencing
q : order of the MA term
    # rule of thumb from these notes: p+q even tends to work well (see the sketch below)
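
As a runnable sketch, the same ARIMA(p, d, q) idea with the current statsmodels API (statsmodels >= 0.12; the synthetic random-walk series is only for illustration):

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=200))  # random walk: stationary after one differencing

model = ARIMA(series, order=(2, 1, 2))    # p=2 AR lags, d=1 difference, q=2 MA lags
fitted = model.fit()
print(fitted.forecast(steps=5))           # 5-step-ahead forecast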

 

 

# Prophet: Facebook's time series library

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from fbprophet import Prophet
file_path = 'market-price.csv'
bitcoin_df = pd.read_csv(file_path, names = ['ds', 'y'], header=0)
bitcoin_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 365 entries, 0 to 364
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ds      365 non-null    object 
 1   y       365 non-null    float64
dtypes: float64(1), object(1)
memory usage: 5.8+ KB

 

 

prophet = Prophet(seasonality_mode = 'multiplicative',
                 yearly_seasonality = True, # yearly seasonality
                 weekly_seasonality = True, # weekly seasonality
                 daily_seasonality = True, # daily seasonality
                 changepoint_prior_scale = 0.5) # trend flexibility (default 0.05)
prophet.fit(bitcoin_df) # fit the model

# If fbprophet fails to install or build, upgrading pystan first may help:
# pip install pystan --upgrade

 

# Forecast 5 days ahead
future_data = prophet.make_future_dataframe(periods=5, freq='d')
# predict
forecast_data = prophet.predict(future_data)
forecast_data

	ds	trend	yhat_lower	yhat_upper	trend_lower	trend_upper	daily	daily_lower	daily_upper	multiplicative_terms	...	weekly	weekly_lower	weekly_upper	yearly	yearly_lower	yearly_upper	additive_terms	additive_terms_lower	additive_terms_upper	yhat
0	2017-08-27	473.569120	3776.764014	5104.491150	473.569120	473.569120	9.563964	9.563964	9.563964	8.356854	...	-0.038472	-0.038472	-0.038472	-1.168637	-1.168637	-1.168637	0.0	0.0	0.0	4431.117317
1	2017-08-28	476.933144	3833.197375	5183.393019	476.933144	476.933144	9.563964	9.563964	9.563964	8.436224	...	-0.006602	-0.006602	-0.006602	-1.121138	-1.121138	-1.121138	0.0	0.0	0.0	4500.447825
2	2017-08-29	480.297167	3877.729283	5211.107968	480.297167	480.297167	9.563964	9.563964	9.563964	8.494301	...	0.019974	0.019974	0.019974	-1.089637	-1.089637	-1.089637	0.0	0.0	0.0	4560.085805
3	2017-08-30	483.661190	3954.539662	5206.571586	483.661190	483.661190	9.563964	9.563964	9.563964	8.440425	...	-0.046634	-0.046634	-0.046634	-1.076905	-1.076905	-1.076905	0.0	0.0	0.0	4565.966993
4	2017-08-31	487.025213	3931.103385	5269.222802	487.025213	487.025213	9.563964	9.563964	9.563964	8.461194	...	-0.017649	-0.017649	-0.017649	-1.085122	-1.085122	-1.085122	0.0	0.0	0.0	4607.839822
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
365	2018-08-27	738.543896	6218.124910	7629.182104	738.543896	738.543896	9.563964	9.563964	9.563964	8.374726	...	-0.006602	-0.006602	-0.006602	-1.182636	-1.182636	-1.182636	0.0	0.0	0.0	6923.647020
366	2018-08-28	742.612648	6338.876532	7721.930490	742.612648	742.612648	9.563964	9.563964	9.563964	8.452304	...	0.019974	0.019974	0.019974	-1.131634	-1.131634	-1.131634	0.0	0.0	0.0	7019.400574
367	2018-08-29	746.681400	6371.510730	7768.115586	746.681400	752.202325	9.563964	9.563964	9.563964	8.421478	...	-0.046634	-0.046634	-0.046634	-1.095851	-1.095851	-1.095851	0.0	0.0	0.0	7034.842537
368	2018-08-30	750.750152	6374.620387	7883.157582	748.285679	770.190606	9.563964	9.563964	9.563964	8.468117	...	-0.017649	-0.017649	-0.017649	-1.078198	-1.078198	-1.078198	0.0	0.0	0.0	7108.190099
369	2018-08-31	754.818904	6440.287682	7941.401382	742.825041	785.906885	9.563964	9.563964	9.563964	8.518827	...	0.035872	0.035872	0.035872	-1.081008	-1.081008	-1.081008	0.0	0.0	0.0	7184.990775
370 rows × 22 columns

 

forecast_data.shape

# (370, 22)

 

# Dates, forecast values, and lower/upper bounds of the forecast
forecast_data[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail(5)

	ds	yhat	yhat_lower	yhat_upper
365	2018-08-27	6923.647020	6218.124910	7629.182104
366	2018-08-28	7019.400574	6338.876532	7721.930490
367	2018-08-29	7034.842537	6371.510730	7768.115586
368	2018-08-30	7108.190099	6374.620387	7883.157582
369	2018-08-31	7184.990775	6440.287682	7941.401382

 

# Visualize the results
fig1 = prophet.plot(forecast_data)
# black dots : actual data
# blue line  : forecast

fig2 = prophet.plot_components(forecast_data)
# four panels:
# trend
# weekly seasonality
# yearly seasonality
# daily seasonality

# Since this is a forecast, its performance should be evaluated as well
# actual values vs. predicted values
y = bitcoin_df.y.values[5:] # actual data, excluding the first 5 days
y_pred = forecast_data.yhat.values[5:-5] # forecast, excluding the first 5 days and the last 5 (future) days

 

# R^2 score and RMSE
from sklearn.metrics import mean_squared_error, r2_score
from math import sqrt
r2 = r2_score(y, y_pred)
r2
# 0.9737786665877044

 

rmse = sqrt(mean_squared_error(y, y_pred))
rmse

# 522.2899311292591

 

# Compare with the actual data
test_file_path = 'market-price-test.csv'
bitcoin_test_df = pd.read_csv(test_file_path, names = ['ds','y'], header=0)
bitcoin_test_df

	ds	y
0	2018-08-27 00:00:00	6719.266154
1	2018-08-28 00:00:00	7000.040000
2	2018-08-29 00:00:00	7054.276429
3	2018-08-30 00:00:00	6932.662500
4	2018-08-31 00:00:00	6981.946154

 

y = bitcoin_test_df.y.values
y
array([6719.26615385, 7000.04      , 7054.27642857, 6932.6625    ,
       6981.94615385])

 

y_pred = forecast_data.yhat.values[-5:]
y_pred 

array([6923.64702007, 7019.40057427, 7034.84253693, 7108.19009905,
       7184.99077545])

 

plt.plot(y_pred, color = 'gold') # forecast
plt.plot(y, color = 'green') # actual

 
