
Document Clustering with the Opinion Review Dataset

Loading the data

import pandas as pd
import glob, os

# The path below is where I extracted the archive on my machine; set it to your own directory  
path = r'C:\Users\pc\Machine Learning P Guide\data\OpinosisDataset1.0\OpinosisDataset1.0\topics'                     
# Collect the file names of all .data files under the directory given by path
all_files = glob.glob(os.path.join(path, "*.data"))    
filename_list = []
opinion_text = []

# Gather the individual file names into filename_list, 
# and the file contents into opinion_text after loading each file into a DataFrame and converting it back to a string 
for file_ in all_files:
    # Read each file into a DataFrame 
    df = pd.read_table(file_, index_col=None, header=0, encoding='latin1')
    
    # Trim the absolute file path. On Linux, change the '\\' below to '/'. Also strip the trailing .data extension
    filename_ = file_.split('\\')[-1]
    filename = filename_.split('.')[0]

    # Append the file name and the file contents to their respective lists. 
    filename_list.append(filename)
    opinion_text.append(df.to_string())

# Build a DataFrame from the file-name list and the file-content list
document_df = pd.DataFrame({'filename':filename_list, 'opinion_text':opinion_text})
document_df.head()


filename	opinion_text
0	accuracy_garmin_nuvi_255W_gps	, and is very, very acc...
1	bathroom_bestwestern_hotel_sfo	The room was not overly big, but clean and...
2	battery-life_amazon_kindle	After I plugged it in to my USB hub on my ...
3	battery-life_ipod_nano_8gb	short battery life I moved up from a...
4	battery-life_netbook_1005ha	6GHz 533FSB cpu, glossy display, 3, Cell 2...

 

Creating functions for lemmatization

from nltk.stem import WordNetLemmatizer
import nltk
import string

# Map every punctuation character's ordinal to None; used with str.translate() to strip punctuation
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
lemmar = WordNetLemmatizer()

def LemTokens(tokens):
    return [lemmar.lemmatize(token) for token in tokens]

def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
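
A quick check of the tokenizer on a throwaway sentence (my own example, not from the original post):

print(LemNormalize('The cars are parked; the children were playing.'))
# Lower-cased, punctuation-free tokens, lemmatized as nouns by default (e.g. 'cars' -> 'car', 'children' -> 'child')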

 

TF-IDF feature vectorization: apply lemmatization when tokenizing in TfidfVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english' , \
                             ngram_range=(1,2), min_df=0.05, max_df=0.85 )

# Perform feature vectorization on the opinion_text column
feature_vect = tfidf_vect.fit_transform(document_df['opinion_text'])

 

K-Means clustering into 5 clusters

from sklearn.cluster import KMeans

# Cluster into 5 groups. random_state=0 is fixed so the example reproduces the same clustering result 
km_cluster = KMeans(n_clusters=5, max_iter=10000, random_state=0)
km_cluster.fit(feature_vect)
cluster_label = km_cluster.labels_
cluster_centers = km_cluster.cluster_centers_
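
Before looking at individual clusters, it can help to check how many documents fell into each one (my own quick check, not part of the original walkthrough):

import pandas as pd

print(pd.Series(cluster_label).value_counts().sort_index())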

 

Inspecting the documents assigned to each cluster

document_df['cluster_label'] = cluster_label
document_df.head()


filename	opinion_text	cluster_label
0	accuracy_garmin_nuvi_255W_gps	, and is very, very acc...	2
1	bathroom_bestwestern_hotel_sfo	The room was not overly big, but clean and...	0
2	battery-life_amazon_kindle	After I plugged it in to my USB hub on my ...	1
3	battery-life_ipod_nano_8gb	short battery life I moved up from a...	1
4	battery-life_netbook_1005ha	6GHz 533FSB cpu, glossy display, 3, Cell 2...	1

 

document_df[document_df['cluster_label']==0].sort_values(by='filename')

filename	opinion_text	cluster_label
1	bathroom_bestwestern_hotel_sfo	The room was not overly big, but clean and...	0
32	room_holiday_inn_london	We arrived at 23,30 hours and they could n...	0
30	rooms_bestwestern_hotel_sfo	Great Location , Nice Rooms , Helpless...	0
31	rooms_swissotel_chicago	The Swissotel is one of our favorite hotel...	0

 

document_df[document_df['cluster_label']==1].sort_values(by='filename')


filename	opinion_text	cluster_label
2	battery-life_amazon_kindle	After I plugged it in to my USB hub on my ...	1
3	battery-life_ipod_nano_8gb	short battery life I moved up from a...	1
4	battery-life_netbook_1005ha	6GHz 533FSB cpu, glossy display, 3, Cell 2...	1
19	keyboard_netbook_1005ha	, I think the new keyboard rivals the gre...	1
26	performance_netbook_1005ha	The Eee Super Hybrid Engine utility lets u...	1
42	sound_ipod_nano_8gb	headphone jack i got a clear case for it a...	1
44	speed_windows7	Windows 7 is quite simply faster, more sta...	1

 

document_df[document_df['cluster_label']==2].sort_values(by='filename')

filename	opinion_text	cluster_label
0	accuracy_garmin_nuvi_255W_gps	, and is very, very acc...	2
5	buttons_amazon_kindle	I thought it would be fitting to christen ...	2
8	directions_garmin_nuvi_255W_gps	You also get upscale features like spoken ...	2
9	display_garmin_nuvi_255W_gps	3 quot widescreen display was a ...	2
10	eyesight-issues_amazon_kindle	It feels as easy to read as the K1 but doe...	2
11	features_windows7	I had to uninstall anti, virus and selecte...	2
12	fonts_amazon_kindle	Being able to change the font sizes is aw...	2
23	navigation_amazon_kindle	In fact, the entire navigation structure h...	2
33	satellite_garmin_nuvi_255W_gps	It's fast to acquire satel...	2
34	screen_garmin_nuvi_255W_gps	It is easy to read and when touching the...	2
35	screen_ipod_nano_8gb	As always, the video screen is sharp and b...	2
36	screen_netbook_1005ha	Keep in mind that once you get in a room ...	2
41	size_asus_netbook_1005ha	A few other things I'd like to point out i...	2
43	speed_garmin_nuvi_255W_gps	Another feature on the 255w is a display of...	2
48	updates_garmin_nuvi_255W_gps	Another thing to consider was that I paid $...	2
49	video_ipod_nano_8gb	I bought the 8, gig Ipod Nano that has the...	2
50	voice_garmin_nuvi_255W_gps	The voice prompts and maps are wonderful ...	2

 

document_df[document_df['cluster_label']==3].sort_values(by='filename')

filename	opinion_text	cluster_label
13	food_holiday_inn_london	The room was packed to capacity with queu...	3
14	food_swissotel_chicago	The food for our event was deli...	3
15	free_bestwestern_hotel_sfo	The wine reception is a great idea as it i...	3
20	location_bestwestern_hotel_sfo	Good Value good location , ideal ...	3
21	location_holiday_inn_london	Great location for tube and we crammed in...	3
24	parking_bestwestern_hotel_sfo	Parking was expensive but I think this is ...	3
27	price_amazon_kindle	If a case was included, as with the Kindle...	3
28	price_holiday_inn_london	All in all, a normal chain hotel on a nice...	3
38	service_bestwestern_hotel_sfo	Both of us having worked in tourism for o...	3
39	service_holiday_inn_london	not customer, oriented hotelvery low servi...	3
40	service_swissotel_hotel_chicago	Mediocre room and service for a very extr...	3
45	staff_bestwestern_hotel_sfo	Staff are friendly and hel...	3
46	staff_swissotel_chicago	The staff at Swissotel were not particula...	3

 

document_df[document_df['cluster_label']==4].sort_values(by='filename')


filename	opinion_text	cluster_label
6	comfort_honda_accord_2008	Drivers seat not comfortable, the car its...	4
7	comfort_toyota_camry_2007	Ride seems comfortable and gas mileage fa...	4
16	gas_mileage_toyota_camry_2007	Ride seems comfortable and gas mileage fa...	4
17	interior_honda_accord_2008	I love the new body style and the interior...	4
18	interior_toyota_camry_2007	First of all, the interior has way too ma...	4
22	mileage_honda_accord_2008	It's quiet, get good gas mileage and look...	4
25	performance_honda_accord_2008	Very happy with my 08 Accord, performance i...	4
29	quality_toyota_camry_2007	I previously owned a Toyota 4Runner which ...	4
37	seats_honda_accord_2008	Front seats are very uncomfor...	4
47	transmission_toyota_camry_2007	After slowing down, transmission has to b...	4

 

from sklearn.cluster import KMeans

# Cluster into 3 groups 
km_cluster = KMeans(n_clusters=3, max_iter=10000, random_state=0)
km_cluster.fit(feature_vect)
cluster_label = km_cluster.labels_


# Assign each document's cluster to the cluster_label column and sort by cluster_label
document_df['cluster_label'] = cluster_label
document_df.sort_values(by='cluster_label')

filename	opinion_text	cluster_label
0	accuracy_garmin_nuvi_255W_gps	, and is very, very acc...	0
48	updates_garmin_nuvi_255W_gps	Another thing to consider was that I paid $...	0
44	speed_windows7	Windows 7 is quite simply faster, more sta...	0
43	speed_garmin_nuvi_255W_gps	Another feature on the 255w is a display of...	0
42	sound_ipod_nano_8gb	headphone jack i got a clear case for it a...	0
41	size_asus_netbook_1005ha	A few other things I'd like to point out i...	0
36	screen_netbook_1005ha	Keep in mind that once you get in a room ...	0
35	screen_ipod_nano_8gb	As always, the video screen is sharp and b...	0
34	screen_garmin_nuvi_255W_gps	It is easy to read and when touching the...	0
33	satellite_garmin_nuvi_255W_gps	It's fast to acquire satel...	0
27	price_amazon_kindle	If a case was included, as with the Kindle...	0
26	performance_netbook_1005ha	The Eee Super Hybrid Engine utility lets u...	0
49	video_ipod_nano_8gb	I bought the 8, gig Ipod Nano that has the...	0
23	navigation_amazon_kindle	In fact, the entire navigation structure h...	0
19	keyboard_netbook_1005ha	, I think the new keyboard rivals the gre...	0
50	voice_garmin_nuvi_255W_gps	The voice prompts and maps are wonderful ...	0
9	display_garmin_nuvi_255W_gps	3 quot widescreen display was a ...	0
4	battery-life_netbook_1005ha	6GHz 533FSB cpu, glossy display, 3, Cell 2...	0
3	battery-life_ipod_nano_8gb	short battery life I moved up from a...	0
2	battery-life_amazon_kindle	After I plugged it in to my USB hub on my ...	0
8	directions_garmin_nuvi_255W_gps	You also get upscale features like spoken ...	0
10	eyesight-issues_amazon_kindle	It feels as easy to read as the K1 but doe...	0
11	features_windows7	I had to uninstall anti, virus and selecte...	0
12	fonts_amazon_kindle	Being able to change the font sizes is aw...	0
5	buttons_amazon_kindle	I thought it would be fitting to christen ...	0
13	food_holiday_inn_london	The room was packed to capacity with queu...	1
39	service_holiday_inn_london	not customer, oriented hotelvery low servi...	1
38	service_bestwestern_hotel_sfo	Both of us having worked in tourism for o...	1
1	bathroom_bestwestern_hotel_sfo	The room was not overly big, but clean and...	1
14	food_swissotel_chicago	The food for our event was deli...	1
20	location_bestwestern_hotel_sfo	Good Value good location , ideal ...	1
24	parking_bestwestern_hotel_sfo	Parking was expensive but I think this is ...	1
15	free_bestwestern_hotel_sfo	The wine reception is a great idea as it i...	1
31	rooms_swissotel_chicago	The Swissotel is one of our favorite hotel...	1
30	rooms_bestwestern_hotel_sfo	Great Location , Nice Rooms , Helpless...	1
45	staff_bestwestern_hotel_sfo	Staff are friendly and hel...	1
40	service_swissotel_hotel_chicago	Mediocre room and service for a very extr...	1
21	location_holiday_inn_london	Great location for tube and we crammed in...	1
46	staff_swissotel_chicago	The staff at Swissotel were not particula...	1
32	room_holiday_inn_london	We arrived at 23,30 hours and they could n...	1
28	price_holiday_inn_london	All in all, a normal chain hotel on a nice...	1
47	transmission_toyota_camry_2007	After slowing down, transmission has to b...	2
16	gas_mileage_toyota_camry_2007	Ride seems comfortable and gas mileage fa...	2
6	comfort_honda_accord_2008	Drivers seat not comfortable, the car its...	2
7	comfort_toyota_camry_2007	Ride seems comfortable and gas mileage fa...	2
29	quality_toyota_camry_2007	I previously owned a Toyota 4Runner which ...	2
22	mileage_honda_accord_2008	It's quiet, get good gas mileage and look...	2
18	interior_toyota_camry_2007	First of all, the interior has way too ma...	2
17	interior_honda_accord_2008	I love the new body style and the interior...	2
37	seats_honda_accord_2008	Front seats are very uncomfor...	2
25	performance_honda_accord_2008	Very happy with my 08 Accord, performance i...	2

 

Extracting the Key Words of Each Cluster

feature_vect.shape

 

The cluster_centers_ attribute of the KMeans object holds each cluster centroid's coordinate for every feature, i.e. the feature's relative position with respect to the cluster center.

Because the TF-IDF values are non-negative and normalized, these values fall between 0 and 1; the closer to 1, the more central the word feature is to that cluster.

cluster_centers = km_cluster.cluster_centers_
print('cluster_centers shape :',cluster_centers.shape)
print(cluster_centers)

cluster_centers shape : (3, 2409)
[[0.01819865 0.         0.         ... 0.         0.         0.00471073]
 [0.         0.00170335 0.0025537  ... 0.0032582  0.00349413 0.        ]
 [0.         0.00137309 0.         ... 0.         0.         0.        ]]

 

Creating a function that returns the top-n key words per cluster, their relative centroid values, and the file names belonging to the cluster

# Returns the top-n key words per cluster, their relative centroid values, and the matching file names. 
def get_cluster_details(cluster_model, cluster_data, feature_names, clusters_num, top_n_features=10):
    cluster_details = {}
    
    # Indexes of the cluster_centers array sorted in descending order of value,
    # i.e. the word features with the largest values for each cluster centroid come first.  
    centroid_feature_ordered_ind = cluster_model.cluster_centers_.argsort()[:,::-1]
    
    # Iterate over the clusters, recording the key words, their relative centroid values, and the file names
    for cluster_num in range(clusters_num):
        # Initialize the per-cluster record. 
        cluster_details[cluster_num] = {}
        cluster_details[cluster_num]['cluster'] = cluster_num
        
        # Use the indexes obtained from cluster_centers_.argsort()[:,::-1] to pick the top-n feature words. 
        top_feature_indexes = centroid_feature_ordered_ind[cluster_num, :top_n_features]
        top_features = [ feature_names[ind] for ind in top_feature_indexes ]
        
        # Use top_feature_indexes to look up each feature word's relative centroid value 
        top_feature_values = cluster_model.cluster_centers_[cluster_num, top_feature_indexes].tolist()
        
        # Record the key words, their relative centroid values, and the matching file names in the cluster_details dictionary
        cluster_details[cluster_num]['top_features'] = top_features
        cluster_details[cluster_num]['top_features_value'] = top_feature_values
        filenames = cluster_data[cluster_data['cluster_label'] == cluster_num]['filename']
        filenames = filenames.values.tolist()
        cluster_details[cluster_num]['filenames'] = filenames
        
    return cluster_details

 

Printing the top feature words and file names of each cluster

def print_cluster_details(cluster_details):
    for cluster_num, cluster_detail in cluster_details.items():
        print('####### Cluster {0}'.format(cluster_num))
        print('Top features:', cluster_detail['top_features'])
        print('Review filenames:', cluster_detail['filenames'][:7])
        print('==================================================')

 

# Note: in scikit-learn 1.0+ use tfidf_vect.get_feature_names_out() instead of get_feature_names()
feature_names = tfidf_vect.get_feature_names()

cluster_details = get_cluster_details(cluster_model=km_cluster, cluster_data=document_df,\
                                  feature_names=feature_names, clusters_num=3, top_n_features=10 )
print_cluster_details(cluster_details)


####### Cluster 0
Top features: ['screen', 'battery', 'life', 'battery life', 'keyboard', 'kindle', 'size', 'button', 'easy', 'voice']
Review filenames: ['accuracy_garmin_nuvi_255W_gps', 'battery-life_amazon_kindle', 'battery-life_ipod_nano_8gb', 'battery-life_netbook_1005ha', 'buttons_amazon_kindle', 'directions_garmin_nuvi_255W_gps', 'display_garmin_nuvi_255W_gps']
==================================================
####### Cluster 1
Top features: ['room', 'hotel', 'service', 'location', 'staff', 'food', 'clean', 'bathroom', 'parking', 'room wa']
Review filenames: ['bathroom_bestwestern_hotel_sfo', 'food_holiday_inn_london', 'food_swissotel_chicago', 'free_bestwestern_hotel_sfo', 'location_bestwestern_hotel_sfo', 'location_holiday_inn_london', 'parking_bestwestern_hotel_sfo']
==================================================
####### Cluster 2
Top features: ['interior', 'seat', 'mileage', 'comfortable', 'car', 'gas', 'transmission', 'gas mileage', 'ride', 'comfort']
Review filenames: ['comfort_honda_accord_2008', 'comfort_toyota_camry_2007', 'gas_mileage_toyota_camry_2007', 'interior_honda_accord_2008', 'interior_toyota_camry_2007', 'mileage_honda_accord_2008', 'performance_honda_accord_2008']
==================================================

 


Topic Modeling on 20 Newsgroups

Load 8 of the 20 topics and perform Count-based feature vectorization. LDA works only with Count-based vectorization

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Extract 8 categories: motorcycles, baseball, graphics, windows, mideast, christianity, electronics, and medicine. 
cats = ['rec.motorcycles', 'rec.sport.baseball', 'comp.graphics', 'comp.windows.x',
        'talk.politics.mideast', 'soc.religion.christian', 'sci.electronics', 'sci.med'  ]

# Fetch only the categories listed in cats by passing cats to the categories parameter of fetch_20newsgroups()
news_df= fetch_20newsgroups(subset='all',remove=('headers', 'footers', 'quotes'), 
                            categories=cats, random_state=0)

# LDA works only with Count-based vectorization.  
count_vect = CountVectorizer(max_df=0.95, max_features=1000, min_df=2, stop_words='english', ngram_range=(1,2))
feat_vect = count_vect.fit_transform(news_df.data)
print('CountVectorizer Shape:', feat_vect.shape)




# CountVectorizer Shape: (7862, 1000)

 

Create an LDA object and fit it on the Count-vectorized features

lda = LatentDirichletAllocation(n_components=8, random_state=0)
lda.fit(feat_vect)

# LatentDirichletAllocation(n_components=8, random_state=0)
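
As a rough sanity check (my own addition, not in the original post), scikit-learn's LatentDirichletAllocation also exposes perplexity(); lower values generally indicate a better fit to the given document-term matrix.

print('LDA perplexity:', lda.perplexity(feat_vect))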

 

Checking how strongly each word is associated with each topic

The components_ attribute of the lda object holds a normalized association score between each topic and each word.

Its shape is (number of topics) x (number of feature words).

The values in components_ are normalized counts of how often each word was assigned to each topic.

The larger the number, the more weight the word carries in that topic.

print(lda.components_.shape)
lda.components_

(8, 1000)
array([[3.60992018e+01, 1.35626798e+02, 2.15751867e+01, ...,
        3.02911688e+01, 8.66830093e+01, 6.79285199e+01],
       [1.25199920e-01, 1.44401815e+01, 1.25045596e-01, ...,
        1.81506995e+02, 1.25097844e-01, 9.39593286e+01],
       [3.34762663e+02, 1.25176265e-01, 1.46743299e+02, ...,
        1.25105772e-01, 3.63689741e+01, 1.25025218e-01],
       ...,
       [3.60204965e+01, 2.08640688e+01, 4.29606813e+00, ...,
        1.45056650e+01, 8.33854413e+00, 1.55690009e+01],
       [1.25128711e-01, 1.25247756e-01, 1.25005143e-01, ...,
        9.17278769e+01, 1.25177668e-01, 3.74575887e+01],
       [5.49258690e+01, 4.47009532e+00, 9.88524814e+00, ...,
        4.87048440e+01, 1.25034678e-01, 1.25074632e-01]])

 

Checking the top words of each topic

def display_topic_words(model, feature_names, no_top_words):
    for topic_index, topic in enumerate(model.components_):
        print('\nTopic #',topic_index)

        # Return the array indexes of components_ sorted in descending order of value. 
        topic_word_indexes = topic.argsort()[::-1]
        top_indexes=topic_word_indexes[:no_top_words]
        
        # For each index in top_indexes, pull the matching word feature from feature_names and concatenate them with join
        feature_concat = ' + '.join([str(feature_names[i])+'*'+str(round(topic[i],1)) for i in top_indexes])                
        print(feature_concat)

# Get the names of all words inside the CountVectorizer via get_feature_names( )
# (in scikit-learn 1.0+ use count_vect.get_feature_names_out() instead)
feature_names = count_vect.get_feature_names()

# Print only the 15 words with the highest association for each topic
display_topic_words(lda, feature_names, 15)

# Reminder: the 8 extracted categories were motorcycles, baseball, graphics, windows, mideast, christianity, electronics, and medicine. 


Topic # 0
year*703.2 + 10*563.6 + game*476.3 + medical*413.2 + health*377.4 + team*346.8 + 12*343.9 + 20*340.9 + disease*332.1 + cancer*319.9 + 1993*318.3 + games*317.0 + years*306.5 + patients*299.8 + good*286.3

Topic # 1
don*1454.3 + just*1392.8 + like*1190.8 + know*1178.1 + people*836.9 + said*802.5 + think*799.7 + time*754.2 + ve*676.3 + didn*675.9 + right*636.3 + going*625.4 + say*620.7 + ll*583.9 + way*570.3

Topic # 2
image*1047.7 + file*999.1 + jpeg*799.1 + program*495.6 + gif*466.0 + images*443.7 + output*442.3 + format*442.3 + files*438.5 + color*406.3 + entry*387.6 + 00*334.8 + use*308.5 + bit*308.4 + 03*258.7

Topic # 3
like*620.7 + know*591.7 + don*543.7 + think*528.4 + use*514.3 + does*510.2 + just*509.1 + good*425.8 + time*417.4 + book*410.7 + read*402.9 + information*395.2 + people*393.5 + used*388.2 + post*368.4

Topic # 4
armenian*960.6 + israel*815.9 + armenians*699.7 + jews*690.9 + turkish*686.1 + people*653.0 + israeli*476.1 + jewish*467.0 + government*464.4 + war*417.8 + dos dos*401.1 + turkey*393.5 + arab*386.1 + armenia*346.3 + 000*345.2

Topic # 5
edu*1613.5 + com*841.4 + available*761.5 + graphics*708.0 + ftp*668.1 + data*517.9 + pub*508.2 + motif*460.4 + mail*453.3 + widget*447.4 + software*427.6 + mit*421.5 + information*417.3 + version*413.7 + sun*402.4

Topic # 6
god*2013.0 + people*721.0 + jesus*688.7 + church*663.0 + believe*563.0 + christ*553.1 + does*500.1 + christian*474.8 + say*468.6 + think*446.0 + christians*443.5 + bible*422.9 + faith*420.1 + sin*396.5 + life*371.2

Topic # 7
use*685.8 + dos*635.0 + thanks*596.0 + windows*548.7 + using*486.5 + window*483.1 + does*456.2 + display*389.1 + help*385.2 + like*382.8 + problem*375.7 + server*370.2 + need*366.3 + know*355.5 + run*315.3

 

Checking the per-document topic distribution

Calling transform() on the lda object returns the topic distribution of each document.

doc_topics = lda.transform(feat_vect)
print(doc_topics.shape)
print(doc_topics[:3])


(7862, 8)
[[0.01389701 0.01394362 0.01389104 0.48221844 0.01397882 0.01389205
  0.01393501 0.43424401]
 [0.27750436 0.18151826 0.0021208  0.53037189 0.00212129 0.00212102
  0.00212113 0.00212125]
 [0.00544459 0.22166575 0.00544539 0.00544528 0.00544039 0.00544168
  0.00544182 0.74567512]]

 

Printing the per-document topic distribution

Print the document names of the 20 newsgroup documents.

The filenames attribute of the data returned by fetch_20newsgroups() holds the file name of every document.

Since filenames contains absolute paths, split on '\' and join the last two path components to form the document name

def get_filename_list(newsdata):
    filename_list=[]

    for file in newsdata.filenames:
            #print(file)
            filename_temp = file.split('\\')[-2:]
            filename = '.'.join(filename_temp)
            filename_list.append(filename)
    
    return filename_list

filename_list = get_filename_list(news_df)
print("filename 개수:",len(filename_list), "filename list 10개만:",filename_list[:10])


Number of filenames: 7862 First 10 filenames: ['soc.religion.christian.20630', 'sci.med.59422', 'comp.graphics.38765', 'comp.graphics.38810', 'sci.med.59449', 'comp.graphics.38461', 'comp.windows.x.66959', 'rec.motorcycles.104487', 'sci.electronics.53875', 'sci.electronics.53617']

 

Building a DataFrame to inspect the per-document topic distribution

import pandas as pd 

topic_names = ['Topic #'+ str(i) for i in range(0, 8)]
doc_topic_df = pd.DataFrame(data=doc_topics, columns=topic_names, index=filename_list)
doc_topic_df.head(20)


	Topic #0	Topic #1	Topic #2	Topic #3	Topic #4	Topic #5	Topic #6	Topic #7
soc.religion.christian.20630	0.013897	0.013944	0.013891	0.482218	0.013979	0.013892	0.013935	0.434244
sci.med.59422	0.277504	0.181518	0.002121	0.530372	0.002121	0.002121	0.002121	0.002121
comp.graphics.38765	0.005445	0.221666	0.005445	0.005445	0.005440	0.005442	0.005442	0.745675
comp.graphics.38810	0.005439	0.005441	0.005449	0.578959	0.005440	0.388387	0.005442	0.005442
sci.med.59449	0.006584	0.552000	0.006587	0.408485	0.006585	0.006585	0.006588	0.006585
comp.graphics.38461	0.008342	0.008352	0.182622	0.767314	0.008335	0.008341	0.008343	0.008351
comp.windows.x.66959	0.372861	0.041667	0.377020	0.041668	0.041703	0.041703	0.041667	0.041711
rec.motorcycles.104487	0.225351	0.674669	0.004814	0.075920	0.004812	0.004812	0.004812	0.004810
sci.electronics.53875	0.008944	0.836686	0.008932	0.008941	0.008935	0.109691	0.008932	0.008938
sci.electronics.53617	0.041733	0.041720	0.708081	0.041742	0.041671	0.041669	0.041699	0.041686
sci.electronics.54089	0.001647	0.512634	0.001647	0.152375	0.001645	0.001649	0.001647	0.326757
rec.sport.baseball.102713	0.982653	0.000649	0.013455	0.000649	0.000648	0.000648	0.000649	0.000649
rec.sport.baseball.104711	0.288554	0.007358	0.007364	0.596561	0.078082	0.007363	0.007360	0.007358
comp.graphics.38232	0.044939	0.138461	0.375098	0.003914	0.003909	0.003911	0.003912	0.425856
sci.electronics.52732	0.017944	0.874782	0.017869	0.017904	0.017867	0.017866	0.017884	0.017885
talk.politics.mideast.76440	0.003381	0.003385	0.003381	0.843991	0.135716	0.003380	0.003384	0.003382
sci.med.59243	0.491684	0.486865	0.003574	0.003577	0.003578	0.003574	0.003574	0.003574
talk.politics.mideast.75888	0.015639	0.499140	0.015641	0.015683	0.015640	0.406977	0.015644	0.015636
soc.religion.christian.21526	0.002455	0.164735	0.002455	0.002456	0.208655	0.002454	0.614333	0.002458
comp.windows.x.66408	0.000080	0.000080	0.809449	0.163054	0.000080	0.027097	0.000080	0.000080
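
As a compact summary (my own addition), the dominant topic of each document can be taken as the column with the highest probability:

dominant_topic = doc_topic_df.idxmax(axis=1)
print(dominant_topic.head(10))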

 


Unsupervised Sentiment Analysis

Sentiment Analysis with SentiWordNet

  • Understanding the WordNet Synset and SentiWordNet SentiSynset classes

import nltk
nltk.download('all')



[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\pc\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\abc.zip.
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     C:\Users\pc\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\alpino.zip.
[nltk_data]    | ... (the remaining packages in the 'all' collection download in the same way) ...
[nltk_data]    | Downloading package mwa_ppdb to
[nltk_data]    |     C:\Users\pc\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping misc\mwa_ppdb.zip.
[nltk_data]    | 
[nltk_data]  Done downloading collection all
True

 

from nltk.corpus import wordnet as wn

term = 'present'

# Create WordNet synsets for the word 'present'. 
synsets = wn.synsets(term)
print('synsets() return type:', type(synsets))
print('number of synsets returned:', len(synsets))
print('synsets returned:', synsets)


synsets() return type: <class 'list'>
number of synsets returned: 18
synsets returned: [Synset('present.n.01'), Synset('present.n.02'), Synset('present.n.03'), Synset('show.v.01'), Synset('present.v.02'), Synset('stage.v.01'), Synset('present.v.04'), Synset('present.v.05'), Synset('award.v.01'), Synset('give.v.08'), Synset('deliver.v.01'), Synset('introduce.v.01'), Synset('portray.v.04'), Synset('confront.v.03'), Synset('present.v.12'), Synset('salute.v.06'), Synset('present.a.01'), Synset('present.a.02')]

 

for synset in synsets :
    print('##### Synset name : ', synset.name(),'#####')
    print('POS :',synset.lexname())
    print('Definition:',synset.definition())
    print('Lemmas:',synset.lemma_names())
    
    
##### Synset name :  present.n.01 #####
POS : noun.time
Definition: the period of time that is happening now; any continuous stretch of time including the moment of speech
Lemmas: ['present', 'nowadays']
##### Synset name :  present.n.02 #####
POS : noun.possession
Definition: something presented as a gift
Lemmas: ['present']
##### Synset name :  present.n.03 #####
POS : noun.communication
Definition: a verb tense that expresses actions or states at the time of speaking
Lemmas: ['present', 'present_tense']
##### Synset name :  show.v.01 #####
POS : verb.perception
Definition: give an exhibition of to an interested audience
Lemmas: ['show', 'demo', 'exhibit', 'present', 'demonstrate']
##### Synset name :  present.v.02 #####
POS : verb.communication
Definition: bring forward and present to the mind
Lemmas: ['present', 'represent', 'lay_out']
##### Synset name :  stage.v.01 #####
POS : verb.creation
Definition: perform (a play), especially on a stage
Lemmas: ['stage', 'present', 'represent']
##### Synset name :  present.v.04 #####
POS : verb.possession
Definition: hand over formally
Lemmas: ['present', 'submit']
##### Synset name :  present.v.05 #####
POS : verb.stative
Definition: introduce
Lemmas: ['present', 'pose']
##### Synset name :  award.v.01 #####
POS : verb.possession
Definition: give, especially as an honor or reward
Lemmas: ['award', 'present']
##### Synset name :  give.v.08 #####
POS : verb.possession
Definition: give as a present; make a gift of
Lemmas: ['give', 'gift', 'present']
##### Synset name :  deliver.v.01 #####
POS : verb.communication
Definition: deliver (a speech, oration, or idea)
Lemmas: ['deliver', 'present']
##### Synset name :  introduce.v.01 #####
POS : verb.communication
Definition: cause to come to know personally
Lemmas: ['introduce', 'present', 'acquaint']
##### Synset name :  portray.v.04 #####
POS : verb.creation
Definition: represent abstractly, for example in a painting, drawing, or sculpture
Lemmas: ['portray', 'present']
##### Synset name :  confront.v.03 #####
POS : verb.communication
Definition: present somebody with something, usually to accuse or criticize
Lemmas: ['confront', 'face', 'present']
##### Synset name :  present.v.12 #####
POS : verb.communication
Definition: formally present a debutante, a representative of a country, etc.
Lemmas: ['present']
##### Synset name :  salute.v.06 #####
POS : verb.communication
Definition: recognize with a gesture prescribed by a military regulation; assume a prescribed position
Lemmas: ['salute', 'present']
##### Synset name :  present.a.01 #####
POS : adj.all
Definition: temporal sense; intermediate between past and future; now existing or happening or in consideration
Lemmas: ['present']
##### Synset name :  present.a.02 #####
POS : adj.all
Definition: being or existing in a specified place
Lemmas: ['present']

 

# Create a synset object for each word. 
tree = wn.synset('tree.n.01')
lion = wn.synset('lion.n.01')
tiger = wn.synset('tiger.n.02')
cat = wn.synset('cat.n.01')
dog = wn.synset('dog.n.01')

entities = [tree , lion , tiger , cat , dog]
similarities = []
entity_names = [ entity.name().split('.')[0] for entity in entities]

# Iterate over the synsets, measuring each word's path similarity against every other word's synset. 
for entity in entities:
    similarity = [ round(entity.path_similarity(compared_entity), 2)  for compared_entity in entities ]
    similarities.append(similarity)
    
# Store the pairwise synset similarities as a DataFrame.  
similarity_df = pd.DataFrame(similarities , columns=entity_names,index=entity_names)
similarity_df


	tree	lion	tiger	cat	dog
tree	1.00	0.07	0.07	0.08	0.12
lion	0.07	1.00	0.33	0.25	0.17
tiger	0.07	0.33	1.00	0.25	0.17
cat	0.08	0.25	0.25	1.00	0.20
dog	0.12	0.17	0.17	0.20	1.00

 

import nltk
from nltk.corpus import sentiwordnet as swn

senti_synsets = list(swn.senti_synsets('slow'))
print('senti_synsets() return type:', type(senti_synsets))
print('number of senti_synsets returned:', len(senti_synsets))
print('senti_synsets returned:', senti_synsets)



senti_synsets() return type: <class 'list'>
number of senti_synsets returned: 11
senti_synsets returned: [SentiSynset('decelerate.v.01'), SentiSynset('slow.v.02'), SentiSynset('slow.v.03'), SentiSynset('slow.a.01'), SentiSynset('slow.a.02'), SentiSynset('dense.s.04'), SentiSynset('slow.a.04'), SentiSynset('boring.s.01'), SentiSynset('dull.s.08'), SentiSynset('slowly.r.01'), SentiSynset('behind.r.03')]

 

import nltk
from nltk.corpus import sentiwordnet as swn

father = swn.senti_synset('father.n.01')
print('father positive score: ', father.pos_score())
print('father negative score: ', father.neg_score())
print('father objectivity score: ', father.obj_score())
print('\n')
fabulous = swn.senti_synset('fabulous.a.01')
print('fabulous positive score: ', fabulous.pos_score())
print('fabulous negative score: ', fabulous.neg_score())


father positive score:  0.0
father negative score:  0.0
father objectivity score:  1.0


fabulous positive score:  0.875
fabulous negative score:  0.125

 

from nltk.corpus import wordnet as wn

# Convert a (simplified) NLTK Penn Treebank POS tag into a WordNet POS tag
def penn_to_wn(tag):
    if tag.startswith('J'):
        return wn.ADJ
    elif tag.startswith('N'):
        return wn.NOUN
    elif tag.startswith('R'):
        return wn.ADV
    elif tag.startswith('V'):
        return wn.VERB
    return
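
A quick check of the mapping (my own example, not from the original post); determiners and prepositions map to None, while nouns, verbs, adjectives, and adverbs map to the WordNet constants 'n', 'v', 'a', and 'r':

from nltk import pos_tag, word_tokenize

sample_tags = pos_tag(word_tokenize('The quick brown fox jumps over the lazy dog'))
print([(word, penn_to_wn(tag)) for word, tag in sample_tags])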

 

from nltk.stem import WordNetLemmatizer
from nltk.corpus import sentiwordnet as swn
from nltk import sent_tokenize, word_tokenize, pos_tag

def swn_polarity(text):
    # Initialize the sentiment score 
    sentiment = 0.0
    tokens_count = 0
    
    lemmatizer = WordNetLemmatizer()
    raw_sentences = sent_tokenize(text)
    # For each sentence: tokenize into words -> POS-tag -> build SentiSynsets -> accumulate the sentiment score 
    for raw_sentence in raw_sentences:
        # POS-tag the tokenized sentence with NLTK  
        tagged_sentence = pos_tag(word_tokenize(raw_sentence))
        for word , tag in tagged_sentence:
            
            # Map to a WordNet POS tag and extract the lemma
            wn_tag = penn_to_wn(tag)
            if wn_tag not in (wn.NOUN , wn.ADJ, wn.ADV):
                continue                   
            lemma = lemmatizer.lemmatize(word, pos=wn_tag)
            if not lemma:
                continue
            # Create Synset objects from the lemmatized word and the WordNet POS tag. 
            synsets = wn.synsets(lemma , pos=wn_tag)
            if not synsets:
                continue
            # Look up the sentiment synset in SentiWordNet
            # For every word, add its positive score and subtract its negative score to compute the total sentiment. 
            synset = synsets[0]
            swn_synset = swn.senti_synset(synset.name())
            sentiment += (swn_synset.pos_score() - swn_synset.neg_score())           
            tokens_count += 1
    
    if not tokens_count:
        return 0
    
    # Return 1 (Positive) if the total score is 0 or higher, otherwise 0 (Negative)
    if sentiment >= 0 :
        return 1
    
    return 0
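
A quick smoke test on a made-up sentence (my own example, not from the original post); positive adjectives such as 'excellent' should push the aggregate score above zero, so the function is expected to return 1 here:

print(swn_polarity('The battery life is excellent and the screen is beautiful.'))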

 

# review_df is the IMDB review DataFrame loaded from labeledTrainData.tsv in the supervised-learning section below
review_df['preds'] = review_df['review'].apply( lambda x : swn_polarity(x) )
y_target = review_df['sentiment'].values
preds = review_df['preds'].values

 

from sklearn.metrics import accuracy_score, confusion_matrix, precision_score 
from sklearn.metrics import recall_score, f1_score, roc_auc_score

print(confusion_matrix( y_target, preds))
print("정확도:", accuracy_score(y_target , preds))
print("정밀도:", precision_score(y_target , preds))
print("재현율:", recall_score(y_target, preds))



[[7668 4832]
 [3636 8864]]
Accuracy: 0.66128
Precision: 0.647196261682243
Recall: 0.70912
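
f1_score was already imported above, so it can be printed as well for completeness (my own addition):

print("F1:", f1_score(y_target, preds))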

 

Sentiment Analysis with the VADER Lexicon

from nltk.sentiment.vader import SentimentIntensityAnalyzer

senti_analyzer = SentimentIntensityAnalyzer()
senti_scores = senti_analyzer.polarity_scores(review_df['review'][0])
print(senti_scores)


# {'neg': 0.13, 'neu': 0.743, 'pos': 0.127, 'compound': -0.7943}

 

def vader_polarity(review,threshold=0.1):
    analyzer = SentimentIntensityAnalyzer()
    scores = analyzer.polarity_scores(review)
    
    # Based on the compound value: return 1 if it is at least the threshold, otherwise 0 
    agg_score = scores['compound']
    final_sentiment = 1 if agg_score >= threshold else 0
    return final_sentiment

# Use apply with a lambda to run vader_polarity( ) on each record and store the result in 'vader_preds'
review_df['vader_preds'] = review_df['review'].apply( lambda x : vader_polarity(x, 0.1) )
y_target = review_df['sentiment'].values
vader_preds = review_df['vader_preds'].values

 

print('#### VADER prediction performance ####')
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score 
from sklearn.metrics import recall_score, f1_score, roc_auc_score

print(confusion_matrix( y_target, vader_preds))
print("정확도:", accuracy_score(y_target , vader_preds))
print("정밀도:", precision_score(y_target , vader_preds))
print("재현율:", recall_score(y_target, vader_preds))


#### VADER prediction performance ####
[[ 6736  5764]
 [ 1867 10633]]
Accuracy: 0.69476
Precision: 0.6484722815149113
Recall: 0.85064

 


Data file: labeledTrainData.tsv.zip (12.96 MB)

Supervised Sentiment Analysis in Practice

import pandas as pd

review_df = pd.read_csv('./labeledTrainData.tsv', header=0, sep="\t", quoting=3)
review_df.head(3)


id	sentiment	review
0	"5814_8"	1	"With all this stuff going down at the moment ...
1	"2381_9"	1	"\"The Classic War of the Worlds\" by Timothy ...
2	"7759_3"	0	"The film starts with a manager (Nicholas Bell...

 

print(review_df['review'][0])



"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally starts is only on for 20 minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord. Why he wants MJ dead so bad is beyond me. Because MJ overheard his plans? Nah, Joe Pesci's character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates MJ's music.<br /><br />Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence. Also, the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.<br /><br />Bottom line, this movie is for people who like MJ on one level or another (which i think is most people). If not, then stay away. It does try and give off a wholesome message and ironically MJ's bestest buddy in this movie is a girl! Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty? Well, with all the attention i've gave this subject....hmmm well i don't know because people can be different behind closed doors, i know this for a fact. He is either an extremely nice but stupid guy or one of the most sickest liars. I hope he is not the latter."

 

Preprocessing: remove HTML tags and non-alphabetic characters

import re

# Replace the <br /> HTML tags with spaces using the replace function
review_df['review'] = review_df['review'].str.replace('<br />',' ')

# Use Python's regular expression module re to replace every non-alphabetic character with a space 
review_df['review'] = review_df['review'].apply( lambda x : re.sub("[^a-zA-Z]", " ", x) )
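
A quick check that the tags and non-letter characters are gone (my own verification, not in the original post):

print(review_df['review'][0][:100])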

 

Splitting into training/test data

from sklearn.model_selection import train_test_split

class_df = review_df['sentiment']
feature_df = review_df.drop(['id','sentiment'], axis=1, inplace=False)

X_train, X_test, y_train, y_test= train_test_split(feature_df, class_df, test_size=0.3, random_state=156)

X_train.shape, X_test.shape

# ((17500, 1), (7500, 1))

 

Count-based feature vectorization plus ML training/prediction/evaluation via a Pipeline

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# Perform CountVectorization with stop_words='english' and ngram_range=(1,2). 
# Set LogisticRegression's C to 10. 
pipeline = Pipeline([
    ('cnt_vect', CountVectorizer(stop_words='english', ngram_range=(1,2) )),
    ('lr_clf', LogisticRegression(C=10))])

# Use the Pipeline object's fit() and predict() for training/prediction. predict_proba() is needed for ROC-AUC.  
pipeline.fit(X_train['review'], y_train)
pred = pipeline.predict(X_test['review'])
pred_probs = pipeline.predict_proba(X_test['review'])[:,1]

print('Accuracy: {0:.4f}, ROC-AUC: {1:.4f}'.format(accuracy_score(y_test, pred),
                                         roc_auc_score(y_test, pred_probs)))


# Accuracy: 0.8860, ROC-AUC: 0.9503

 

TF-IDF-based feature vectorization plus ML training/prediction/evaluation via a Pipeline

# Perform TF-IDF vectorization with stop_words='english' and ngram_range=(1,2). 
# Set LogisticRegression's C to 10. 
pipeline = Pipeline([
    ('tfidf_vect', TfidfVectorizer(stop_words='english', ngram_range=(1,2) )),
    ('lr_clf', LogisticRegression(C=10))])

pipeline.fit(X_train['review'], y_train)
pred = pipeline.predict(X_test['review'])
pred_probs = pipeline.predict_proba(X_test['review'])[:,1]

print('Accuracy: {0:.4f}, ROC-AUC: {1:.4f}'.format(accuracy_score(y_test, pred),
                                         roc_auc_score(y_test, pred_probs)))


# Accuracy: 0.8936, ROC-AUC: 0.9598

 

20 Newsgroup Classification

from sklearn.datasets import fetch_20newsgroups

news_data = fetch_20newsgroups(subset='all',random_state=156)

 

print(news_data.keys())

# dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR', 'description'])

 

import pandas as pd

print('Distribution of target class values \n', pd.Series(news_data.target).value_counts().sort_index())
print('Names of the target classes \n', news_data.target_names)
len(news_data.target_names), pd.Series(news_data.target).shape



Distribution of target class values 
 0     799
1     973
2     985
3     982
4     963
5     988
6     975
7     990
8     996
9     994
10    999
11    991
12    984
13    990
14    987
15    997
16    910
17    940
18    775
19    628
dtype: int64
target 클래스의 이름들 
 ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
(20, (18846,))

 

print(news_data.data[0])


From: egreen@east.sun.com (Ed Green - Pixel Cruncher)
Subject: Re: Observation re: helmets
Organization: Sun Microsystems, RTP, NC
Lines: 21
Distribution: world
Reply-To: egreen@east.sun.com
NNTP-Posting-Host: laser.east.sun.com

In article 211353@mavenry.altcit.eskimo.com, maven@mavenry.altcit.eskimo.com (Norman Hamer) writes:
> 
> The question for the day is re: passenger helmets, if you don't know for 
>certain who's gonna ride with you (like say you meet them at a .... church 
>meeting, yeah, that's the ticket)... What are some guidelines? Should I just 
>pick up another shoei in my size to have a backup helmet (XL), or should I 
>maybe get an inexpensive one of a smaller size to accomodate my likely 
>passenger? 

If your primary concern is protecting the passenger in the event of a
crash, have him or her fitted for a helmet that is their size.  If your
primary concern is complying with stupid helmet laws, carry a real big
spare (you can put a big or small head in a big helmet, but not in a
small one).

---
Ed Green, former Ninjaite |I was drinking last night with a biker,
  Ed.Green@East.Sun.COM   |and I showed him a picture of you.  I said,
DoD #0111  (919)460-8302  |"Go on, get to know her, you'll like her!"
 (The Grateful Dead) -->  |It seemed like the least I could do...

 

학습과 테스트용 데이터 생성

from sklearn.datasets import fetch_20newsgroups

# subset='train'으로 학습용(Train) 데이터만 추출, remove=('headers', 'footers', 'quotes')로 내용만 추출
train_news= fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), random_state=156)
X_train = train_news.data
y_train = train_news.target
print(type(X_train))

# subset='test'으로 테스트(Test) 데이터만 추출, remove=('headers', 'footers', 'quotes')로 내용만 추출
test_news= fetch_20newsgroups(subset='test',remove=('headers', 'footers','quotes'),random_state=156)
X_test = test_news.data
y_test = test_news.target
print('학습 데이터 크기 {0} , 테스트 데이터 크기 {1}'.format(len(train_news.data) , len(test_news.data)))



<class 'list'>
학습 데이터 크기 11314 , 테스트 데이터 크기 7532

 

Count 피처 벡터화 변환과 머신러닝 모델 학습/예측/평가 

주의: 학습 데이터에 대해 fit( )된 CountVectorizer를 이용해서 테스트 데이터를 피처 벡터화 해야 함.
테스트 데이터에 대해 CountVectorizer의 fit_transform()이나 fit()을 다시 수행하면 안 됨.
테스트 데이터에서 fit()을 수행하면 어휘(피처) 구성이 달라져 학습된 모델이 기대하는 피처 개수와 맞지 않게 되기 때문임. 아래 평가 코드 뒤에서 간단히 확인해 본다.

from sklearn.feature_extraction.text import CountVectorizer

# Count Vectorization으로 feature extraction 변환 수행. 
cnt_vect = CountVectorizer()
cnt_vect.fit(X_train)
X_train_cnt_vect = cnt_vect.transform(X_train)

# 학습 데이터로 fit( )된 CountVectorizer를 이용하여 테스트 데이터를 feature extraction 변환 수행. 
X_test_cnt_vect = cnt_vect.transform(X_test)

print('학습 데이터 Text의 CountVectorizer Shape:',X_train_cnt_vect.shape, X_test_cnt_vect.shape)



# 학습 데이터 Text의 CountVectorizer Shape: (11314, 101631) (7532, 101631)

 

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# LogisticRegression을 이용하여 학습/예측/평가 수행. 
lr_clf = LogisticRegression()
lr_clf.fit(X_train_cnt_vect , y_train)
pred = lr_clf.predict(X_test_cnt_vect)
print('CountVectorized Logistic Regression 의 예측 정확도는 {0:.3f}'.format(accuracy_score(y_test,pred)))



# CountVectorized Logistic Regression 의 예측 정확도는 0.617
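
앞서 언급한 주의 사항을 간단히 확인해 볼 수 있는 참고용 스케치이다. 테스트 데이터로 CountVectorizer를 새로 fit하면 어휘 구성이 달라져 피처 개수가 학습 시(101,631개)와 달라지고, 이 행렬로 predict()를 호출하면 피처 개수 불일치 오류가 발생한다. wrong_test_vect는 설명을 위해 임의로 붙인 변수명이다.

# 테스트 데이터에 새로 fit_transform()을 수행하면 피처 개수가 학습 시와 달라짐
wrong_test_vect = CountVectorizer().fit_transform(X_test)
print('테스트 데이터로 새로 fit한 경우 Shape:', wrong_test_vect.shape)
# lr_clf.predict(wrong_test_vect)   # 피처 개수가 달라 ValueError 발생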

 

TF-IDF 피처 변환과 머신러닝 학습/예측/평가

from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF Vectorization 적용하여 학습 데이터셋과 테스트 데이터 셋 변환. 
tfidf_vect = TfidfVectorizer()
tfidf_vect.fit(X_train)
X_train_tfidf_vect = tfidf_vect.transform(X_train)
X_test_tfidf_vect = tfidf_vect.transform(X_test)

# LogisticRegression을 이용하여 학습/예측/평가 수행. 
lr_clf = LogisticRegression()
lr_clf.fit(X_train_tfidf_vect , y_train)
pred = lr_clf.predict(X_test_tfidf_vect)
print('TF-IDF Logistic Regression 의 예측 정확도는 {0:.3f}'.format(accuracy_score(y_test ,pred)))



# TF-IDF Logistic Regression 의 예측 정확도는 0.678

 

stop words 필터링을 추가하고 ngram을 기본(1,1)에서 (1,2)로 변경하여 피처 벡터화

# stop words 필터링을 추가하고 ngram을 기본(1,1)에서 (1,2)로 변경하여 Feature Vectorization 적용.
tfidf_vect = TfidfVectorizer(stop_words='english', ngram_range=(1,2), max_df=300 )
tfidf_vect.fit(X_train)
X_train_tfidf_vect = tfidf_vect.transform(X_train)
X_test_tfidf_vect = tfidf_vect.transform(X_test)

lr_clf = LogisticRegression()
lr_clf.fit(X_train_tfidf_vect , y_train)
pred = lr_clf.predict(X_test_tfidf_vect)
print('TF-IDF Vectorized Logistic Regression 의 예측 정확도는 {0:.3f}'.format(accuracy_score(y_test ,pred)))


# TF-IDF Vectorized Logistic Regression 의 예측 정확도는 0.690
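
max_df를 정수로 지정하면 해당 문서 수보다 많은 문서에 등장하는(너무 흔한) 단어를 어휘에서 제외한다. 아래는 max_df 값에 따라 어휘 수가 어떻게 달라지는지 확인해 보는 참고용 스케치로, ngram (1,2) 어휘를 반복 생성하므로 수행 시간이 다소 걸릴 수 있다.

for max_df in [100, 300, 700]:
    vect = TfidfVectorizer(stop_words='english', ngram_range=(1,2), max_df=max_df)
    vect.fit(X_train)
    print('max_df={0} 일 때 피처(어휘) 수: {1}'.format(max_df, len(vect.vocabulary_)))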

 

GridSearchCV로 LogisticRegression C 하이퍼 파라미터 튜닝

from sklearn.model_selection import GridSearchCV

# 최적 C 값 도출 튜닝 수행. CV는 3 Fold셋으로 설정. 
params = { 'C':[0.01, 0.1, 1, 5, 10]}
grid_cv_lr = GridSearchCV(lr_clf ,param_grid=params , cv=3 , scoring='accuracy' , verbose=1 )
grid_cv_lr.fit(X_train_tfidf_vect , y_train)
print('Logistic Regression best C parameter :',grid_cv_lr.best_params_ )

# 최적 C 값으로 학습된 grid_cv로 예측 수행하고 정확도 평가. 
pred = grid_cv_lr.predict(X_test_tfidf_vect)
print('TF-IDF Vectorized Logistic Regression 의 예측 정확도는 {0:.3f}'.format(accuracy_score(y_test ,pred)))



Logistic Regression best C parameter : {'C': 10}
TF-IDF Vectorized Logistic Regression 의 예측 정확도는 0.704

 

사이킷런 파이프라인(Pipeline) 사용 및 GridSearchCV와의 결합

from sklearn.pipeline import Pipeline

# TfidfVectorizer 객체를 tfidf_vect 객체명으로, LogisticRegression객체를 lr_clf 객체명으로 생성하는 Pipeline생성
pipeline = Pipeline([
    ('tfidf_vect', TfidfVectorizer(stop_words='english', ngram_range=(1,2), max_df=300)),
    ('lr_clf', LogisticRegression(C=10))
])

# 별도의 TfidfVectorizer객체의 fit_transform( )과 LogisticRegression의 fit(), predict( )가 필요 없음. 
# pipeline의 fit( ) 과 predict( ) 만으로 한꺼번에 Feature Vectorization과 ML 학습/예측이 가능. 
pipeline.fit(X_train, y_train)
pred = pipeline.predict(X_test)
print('Pipeline을 통한 Logistic Regression 의 예측 정확도는 {0:.3f}'.format(accuracy_score(y_test ,pred)))


# Pipeline을 통한 Logistic Regression 의 예측 정확도는 0.704

 

from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('tfidf_vect', TfidfVectorizer(stop_words='english')),
    ('lr_clf', LogisticRegression())
])

# Pipeline에 기술된 각각의 객체 변수명에 언더바(_) 2개를 연달아 붙여 GridSearchCV에 사용될 
# 파라미터/하이퍼 파라미터 이름과 값을 설정.
params = { 'tfidf_vect__ngram_range': [(1,1), (1,2), (1,3)],
           'tfidf_vect__max_df': [100, 300, 700],
           'lr_clf__C': [1,5,10]
}

# GridSearchCV의 생성자에 Estimator가 아닌 Pipeline 객체 입력
grid_cv_pipe = GridSearchCV(pipeline, param_grid=params, cv=3 , scoring='accuracy',verbose=1)
grid_cv_pipe.fit(X_train , y_train)
print(grid_cv_pipe.best_params_ , grid_cv_pipe.best_score_)

pred = grid_cv_pipe.predict(X_test)
print('Pipeline을 통한 Logistic Regression 의 예측 정확도는 {0:.3f}'.format(accuracy_score(y_test ,pred)))



{'lr_clf__C': 10, 'tfidf_vect__max_df': 700, 'tfidf_vect__ngram_range': (1, 2)} 0.755524129397207
Pipeline을 통한 Logistic Regression 의 예측 정확도는 0.702
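
GridSearchCV에 전달할 파라미터 이름이 헷갈릴 때는 Pipeline의 get_params()로 전체 파라미터 키를 확인할 수 있다. 아래는 위에서 사용한 파라미터만 골라 출력해 보는 참고용 스케치이다.

print([key for key in pipeline.get_params().keys()
       if 'ngram' in key or 'max_df' in key or key.endswith('__C')])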

 


사이킷런 CountVectorizer 테스트

text_sample_01 = 'The Matrix is everywhere its all around us, here even in this room. \
                  You can see it out your window or on your television. \
                  You feel it when you go to work, or go to church or pay your taxes.'
text_sample_02 = 'You take the blue pill and the story ends.  You wake in your bed and you believe whatever you want to believe\
                  You take the red pill and you stay in Wonderland and I show you how deep the rabbit-hole goes.'
text=[]
text.append(text_sample_01); text.append(text_sample_02)
print(text,"\n", len(text))



['The Matrix is everywhere its all around us, here even in this room.                   You can see it out your window or on your television.                   You feel it when you go to work, or go to church or pay your taxes.', 'You take the blue pill and the story ends.  You wake in your bed and you believe whatever you want to believe                  You take the red pill and you stay in Wonderland and I show you how deep the rabbit-hole goes.'] 
 2

 

CountVectorizer객체 생성 후 fit(), transform()으로 텍스트에 대한 feature vectorization 수행

from sklearn.feature_extraction.text import CountVectorizer

# Count Vectorization으로 feature extraction 변환 수행. 
cnt_vect = CountVectorizer()
cnt_vect.fit(text)


CountVectorizer()

 

ftr_vect = cnt_vect.transform(text)

 

피처 벡터화 후 데이터 유형 및 여러 속성 확인

print(type(ftr_vect), ftr_vect.shape)
print(ftr_vect)


<class 'scipy.sparse.csr.csr_matrix'> (2, 51)
  (0, 0)	1
  (0, 2)	1
  (0, 6)	1
  (0, 7)	1
  (0, 10)	1
  (0, 11)	1
  (0, 12)	1
  (0, 13)	2
  (0, 15)	1
  (0, 18)	1
  (0, 19)	1
  (0, 20)	2
  (0, 21)	1
  (0, 22)	1
  (0, 23)	1
  (0, 24)	3
  (0, 25)	1
  (0, 26)	1
  (0, 30)	1
  (0, 31)	1
  (0, 36)	1
  (0, 37)	1
  (0, 38)	1
  (0, 39)	1
  (0, 40)	2
  :	:
  (1, 1)	4
  (1, 3)	1
  (1, 4)	2
  (1, 5)	1
  (1, 8)	1
  (1, 9)	1
  (1, 14)	1
  (1, 16)	1
  (1, 17)	1
  (1, 18)	2
  (1, 27)	2
  (1, 28)	1
  (1, 29)	1
  (1, 32)	1
  (1, 33)	1
  (1, 34)	1
  (1, 35)	2
  (1, 38)	4
  (1, 40)	1
  (1, 42)	1
  (1, 43)	1
  (1, 44)	1
  (1, 47)	1
  (1, 49)	7
  (1, 50)	1

 

print(cnt_vect.vocabulary_)



{'the': 38, 'matrix': 22, 'is': 19, 'everywhere': 11, 'its': 21, 'all': 0, 'around': 2, 'us': 41, 'here': 15, 'even': 10, 'in': 18, 'this': 39, 'room': 30, 'you': 49, 'can': 6, 'see': 31, 'it': 20, 'out': 25, 'your': 50, 'window': 46, 'or': 24, 'on': 23, 'television': 37, 'feel': 12, 'when': 45, 'go': 13, 'to': 40, 'work': 48, 'church': 7, 'pay': 26, 'taxes': 36, 'take': 35, 'blue': 5, 'pill': 27, 'and': 1, 'story': 34, 'ends': 9, 'wake': 42, 'bed': 3, 'believe': 4, 'whatever': 44, 'want': 43, 'red': 29, 'stay': 33, 'wonderland': 47, 'show': 32, 'how': 17, 'deep': 8, 'rabbit': 28, 'hole': 16, 'goes': 14}
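
vocabulary_는 단어를 피처 인덱스로 매핑한 딕셔너리이므로, 인덱스 순으로 정렬해 컬럼명으로 사용하면 희소 행렬을 단어별 등장 횟수 DataFrame으로 바꿔 볼 수 있다. 아래는 이를 확인하는 참고용 스케치이며 count_df는 설명을 위한 임의의 변수명이다.

import pandas as pd

# 어휘 사전을 피처 인덱스 순으로 정렬해 컬럼명으로 사용
feature_names = sorted(cnt_vect.vocabulary_, key=cnt_vect.vocabulary_.get)
count_df = pd.DataFrame(ftr_vect.toarray(), columns=feature_names)
print(count_df.iloc[:, :8])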

 

cnt_vect = CountVectorizer(max_features=5, stop_words='english')
cnt_vect.fit(text)
ftr_vect = cnt_vect.transform(text)
print(type(ftr_vect), ftr_vect.shape)
print(cnt_vect.vocabulary_)


<class 'scipy.sparse.csr.csr_matrix'> (2, 5)
{'window': 4, 'pill': 1, 'wake': 2, 'believe': 0, 'want': 3}

 

ngram_range 확인

cnt_vect = CountVectorizer(ngram_range=(1,3))
cnt_vect.fit(text)
ftr_vect = cnt_vect.transform(text)
print(type(ftr_vect), ftr_vect.shape)
print(cnt_vect.vocabulary_)




<class 'scipy.sparse.csr.csr_matrix'> (2, 201)
{'the': 129, 'matrix': 77, 'is': 66, 'everywhere': 40, 'its': 74, 'all': 0, 'around': 11, 'us': 150, 'here': 51, 'even': 37, 'in': 59, 'this': 140, 'room': 106, 'you': 174, 'can': 25, 'see': 109, 'it': 69, 'out': 90, 'your': 193, 'window': 165, 'or': 83, 'on': 80, 'television': 126, 'feel': 43, 'when': 162, 'go': 46, 'to': 143, 'work': 171, 'church': 28, 'pay': 93, 'taxes': 125, 'the matrix': 132, 'matrix is': 78, 'is everywhere': 67, 'everywhere its': 41, 'its all': 75, 'all around': 1, 'around us': 12, 'us here': 151, 'here even': 52, 'even in': 38, 'in this': 60, 'this room': 141, 'room you': 107, 'you can': 177, 'can see': 26, 'see it': 110, 'it out': 70, 'out your': 91, 'your window': 199, 'window or': 166, 'or on': 86, 'on your': 81, 'your television': 197, 'television you': 127, 'you feel': 179, 'feel it': 44, 'it when': 72, 'when you': 163, 'you go': 181, 'go to': 47, 'to work': 148, 'work or': 172, 'or go': 84, 'to church': 146, 'church or': 29, 'or pay': 88, 'pay your': 94, 'your taxes': 196, 'the matrix is': 133, 'matrix is everywhere': 79, 'is everywhere its': 68, 'everywhere its all': 42, 'its all around': 76, 'all around us': 2, 'around us here': 13, 'us here even': 152, 'here even in': 53, 'even in this': 39, 'in this room': 61, 'this room you': 142, 'room you can': 108, 'you can see': 178, 'can see it': 27, 'see it out': 111, 'it out your': 71, 'out your window': 92, 'your window or': 200, 'window or on': 167, 'or on your': 87, 'on your television': 82, 'your television you': 198, 'television you feel': 128, 'you feel it': 180, 'feel it when': 45, 'it when you': 73, 'when you go': 164, 'you go to': 182, 'go to work': 49, 'to work or': 149, 'work or go': 173, 'or go to': 85, 'go to church': 48, 'to church or': 147, 'church or pay': 30, 'or pay your': 89, 'pay your taxes': 95, 'take': 121, 'blue': 22, 'pill': 96, 'and': 3, 'story': 118, 'ends': 34, 'wake': 153, 'bed': 14, 'believe': 17, 'whatever': 159, 'want': 156, 'red': 103, 'stay': 115, 'wonderland': 168, 'show': 112, 'how': 56, 'deep': 31, 'rabbit': 100, 'hole': 54, 'goes': 50, 'you take': 187, 'take the': 122, 'the blue': 130, 'blue pill': 23, 'pill and': 97, 'and the': 6, 'the story': 138, 'story ends': 119, 'ends you': 35, 'you wake': 189, 'wake in': 154, 'in your': 64, 'your bed': 194, 'bed and': 15, 'and you': 8, 'you believe': 175, 'believe whatever': 18, 'whatever you': 160, 'you want': 191, 'want to': 157, 'to believe': 144, 'believe you': 20, 'the red': 136, 'red pill': 104, 'you stay': 185, 'stay in': 116, 'in wonderland': 62, 'wonderland and': 169, 'and show': 4, 'show you': 113, 'you how': 183, 'how deep': 57, 'deep the': 32, 'the rabbit': 134, 'rabbit hole': 101, 'hole goes': 55, 'you take the': 188, 'take the blue': 123, 'the blue pill': 131, 'blue pill and': 24, 'pill and the': 98, 'and the story': 7, 'the story ends': 139, 'story ends you': 120, 'ends you wake': 36, 'you wake in': 190, 'wake in your': 155, 'in your bed': 65, 'your bed and': 195, 'bed and you': 16, 'and you believe': 9, 'you believe whatever': 176, 'believe whatever you': 19, 'whatever you want': 161, 'you want to': 192, 'want to believe': 158, 'to believe you': 145, 'believe you take': 21, 'take the red': 124, 'the red pill': 137, 'red pill and': 105, 'pill and you': 99, 'and you stay': 10, 'you stay in': 186, 'stay in wonderland': 117, 'in wonderland and': 63, 'wonderland and show': 170, 'and show you': 5, 'show you how': 114, 'you how deep': 184, 'how deep the': 58, 'deep the rabbit': 33, 'the rabbit hole': 135, 'rabbit hole goes': 102}

 

희소 행렬 - COO 형식

import numpy as np

dense = np.array( [ [ 3, 0, 1 ], 
                    [0, 2, 0 ] ] )

 

from scipy import sparse

# 0 이 아닌 데이터 추출
data = np.array([3,1,2])

# 행 위치와 열 위치를 각각 array로 생성 
row_pos = np.array([0,0,1])
col_pos = np.array([0,2,1])

# sparse 패키지의 coo_matrix를 이용하여 COO 형식으로 희소 행렬 생성
sparse_coo = sparse.coo_matrix((data, (row_pos,col_pos)))

 

print(type(sparse_coo))
print(sparse_coo)
dense01=sparse_coo.toarray()
print(type(dense01),"\n", dense01)


<class 'scipy.sparse.coo.coo_matrix'>
  (0, 0)	3
  (0, 2)	1
  (1, 1)	2
<class 'numpy.ndarray'> 
 [[3 0 1]
 [0 2 0]]

 

희소 행렬 – CSR 형식

from scipy import sparse

dense2 = np.array([[0,0,1,0,0,5],
             [1,4,0,3,2,5],
             [0,6,0,3,0,0],
             [2,0,0,0,0,0],
             [0,0,0,7,0,8],
             [1,0,0,0,0,0]])

# 0 이 아닌 데이터 추출
data2 = np.array([1, 5, 1, 4, 3, 2, 5, 6, 3, 2, 7, 8, 1])

# 행 위치와 열 위치를 각각 array로 생성 
row_pos = np.array([0, 0, 1, 1, 1, 1, 1, 2, 2, 3, 4, 4, 5])
col_pos = np.array([2, 5, 0, 1, 3, 4, 5, 1, 3, 0, 3, 5, 0])

# COO 형식으로 변환 
sparse_coo = sparse.coo_matrix((data2, (row_pos,col_pos)))

# 행 위치 배열의 고유한 값들의 시작 위치 인덱스를 배열로 생성
row_pos_ind = np.array([0, 2, 7, 9, 10, 12, 13])

# CSR 형식으로 변환 
sparse_csr = sparse.csr_matrix((data2, col_pos, row_pos_ind))

print('COO 변환된 데이터가 제대로 되었는지 다시 Dense로 출력 확인')
print(sparse_coo.toarray())
print('CSR 변환된 데이터가 제대로 되었는지 다시 Dense로 출력 확인')
print(sparse_csr.toarray())





COO 변환된 데이터가 제대로 되었는지 다시 Dense로 출력 확인
[[0 0 1 0 0 5]
 [1 4 0 3 2 5]
 [0 6 0 3 0 0]
 [2 0 0 0 0 0]
 [0 0 0 7 0 8]
 [1 0 0 0 0 0]]
CSR 변환된 데이터가 제대로 되었는지 다시 Dense로 출력 확인
[[0 0 1 0 0 5]
 [1 4 0 3 2 5]
 [0 6 0 3 0 0]
 [2 0 0 0 0 0]
 [0 0 0 7 0 8]
 [1 0 0 0 0 0]]
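
CSR 형식에서 직접 만든 row_pos_ind는 csr_matrix의 indptr 속성과 같은 역할을 한다. 아래 참고용 스케치처럼 indptr 속성과 np.searchsorted()로 구한 값이 위의 row_pos_ind와 동일한지 확인해 볼 수 있다.

# 행 위치 배열(row_pos)에서 각 행이 시작되는 인덱스 = CSR의 indptr
print(sparse_csr.indptr)
print(np.searchsorted(row_pos, np.arange(dense2.shape[0] + 1)))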

 

print(sparse_csr)




  (0, 2)	1
  (0, 5)	5
  (1, 0)	1
  (1, 1)	4
  (1, 3)	3
  (1, 4)	2
  (1, 5)	5
  (2, 1)	6
  (2, 3)	3
  (3, 0)	2
  (4, 3)	7
  (4, 5)	8
  (5, 0)	1

 

dense3 = np.array([[0,0,1,0,0,5],
             [1,4,0,3,2,5],
             [0,6,0,3,0,0],
             [2,0,0,0,0,0],
             [0,0,0,7,0,8],
             [1,0,0,0,0,0]])

coo = sparse.coo_matrix(dense3)
csr = sparse.csr_matrix(dense3)
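
밀집 행렬에서 바로 변환한 경우에도 원본과 동일한지, 그리고 실제로 저장되는(0이 아닌) 원소 수가 얼마나 줄어드는지는 아래와 같이 확인해 볼 수 있다(참고용 스케치).

print(np.array_equal(coo.toarray(), dense3), np.array_equal(csr.toarray(), dense3))
print('저장된 원소 수:', csr.nnz, '/ 전체 원소 수:', dense3.size)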

 


Text Tokenization

문장 토큰화

from nltk import sent_tokenize
text_sample = 'The Matrix is everywhere its all around us, here even in this room.  \
              You can see it out your window or on your television. \
               You feel it when you go to work, or go to church or pay your taxes.'
sentences = sent_tokenize(text=text_sample)
print(type(sentences),len(sentences))
print(sentences)



<class 'list'> 3
['The Matrix is everywhere its all around us, here even in this room.', 'You can see it out your window or on your television.', 'You feel it when you go to work, or go to church or pay your taxes.']

 

단어 토큰화

from nltk import word_tokenize

sentence = "The Matrix is everywhere its all around us, here even in this room."
words = word_tokenize(sentence)
print(type(words), len(words))
print(words)



<class 'list'> 15
['The', 'Matrix', 'is', 'everywhere', 'its', 'all', 'around', 'us', ',', 'here', 'even', 'in', 'this', 'room', '.']

 

여러 문장들에 대한 단어 토큰화

from nltk import word_tokenize, sent_tokenize

#여러개의 문장으로 된 입력 데이터를 문장별로 단어 토큰화 만드는 함수 생성
def tokenize_text(text):
    
    # 문장별로 분리 토큰
    sentences = sent_tokenize(text)
    # 분리된 문장별 단어 토큰화
    word_tokens = [word_tokenize(sentence) for sentence in sentences]
    return word_tokens

#여러 문장들에 대해 문장별 단어 토큰화 수행. 
word_tokens = tokenize_text(text_sample)
print(type(word_tokens),len(word_tokens))
print(word_tokens)



<class 'list'> 3
[['The', 'Matrix', 'is', 'everywhere', 'its', 'all', 'around', 'us', ',', 'here', 'even', 'in', 'this', 'room', '.'], ['You', 'can', 'see', 'it', 'out', 'your', 'window', 'or', 'on', 'your', 'television', '.'], ['You', 'feel', 'it', 'when', 'you', 'go', 'to', 'work', ',', 'or', 'go', 'to', 'church', 'or', 'pay', 'your', 'taxes', '.']]

 

n-gram

from nltk import ngrams

sentence = "The Matrix is everywhere its all around us, here even in this room."
words = word_tokenize(sentence)

all_ngrams = ngrams(words, 2)
# ngrams()가 반환하는 제너레이터를 리스트로 변환 (import한 ngrams 함수명을 덮어쓰지 않도록 별도 변수 사용)
bigrams = list(all_ngrams)
print(bigrams)



[('The', 'Matrix'), ('Matrix', 'is'), ('is', 'everywhere'), ('everywhere', 'its'), ('its', 'all'), ('all', 'around'), ('around', 'us'), ('us', ','), (',', 'here'), ('here', 'even'), ('even', 'in'), ('in', 'this'), ('this', 'room'), ('room', '.')]

 

Stopwords 제거

import nltk
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\KwonChulmin\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
True

 

print('영어 stop words 갯수:',len(nltk.corpus.stopwords.words('english')))
print(nltk.corpus.stopwords.words('english')[:40])



영어 stop words 갯수: 179
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this']

 

import nltk

stopwords = nltk.corpus.stopwords.words('english')
all_tokens = []
# 위 예제의 3개의 문장별로 얻은 word_tokens list 에 대해 stop word 제거 Loop
for sentence in word_tokens:
    filtered_words=[]
    # 개별 문장별로 tokenize된 sentence list에 대해 stop word 제거 Loop
    for word in sentence:
        #소문자로 모두 변환합니다. 
        word = word.lower()
        # tokenize된 개별 word가 stop words에 포함되지 않으면 filtered_words에 추가
        if word not in stopwords:
            filtered_words.append(word)
    all_tokens.append(filtered_words)
    
print(all_tokens)


[['matrix', 'everywhere', 'around', 'us', ',', 'even', 'room', '.'], ['see', 'window', 'television', '.'], ['feel', 'go', 'work', ',', 'go', 'church', 'pay', 'taxes', '.']]
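
스톱 워드 제거 후에도 ','나 '.' 같은 구두점 토큰은 그대로 남아 있다. 영문자로만 이루어진 토큰만 남기고 싶다면 아래처럼 isalpha()로 추가 필터링해 볼 수 있다(참고용 스케치이며 alpha_tokens는 임의의 변수명).

alpha_tokens = [[word for word in sentence if word.isalpha()] for sentence in all_tokens]
print(alpha_tokens)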

 

Stemming과 Lemmatization

from nltk.stem import LancasterStemmer
stemmer = LancasterStemmer()

print(stemmer.stem('working'),stemmer.stem('works'),stemmer.stem('worked'))
print(stemmer.stem('amusing'),stemmer.stem('amuses'),stemmer.stem('amused'))
print(stemmer.stem('happier'),stemmer.stem('happiest'))
print(stemmer.stem('fancier'),stemmer.stem('fanciest'))



work work work
amus amus amus
happy happiest
fant fanciest

 

from nltk.stem import WordNetLemmatizer

lemma = WordNetLemmatizer()
print(lemma.lemmatize('amusing','v'),lemma.lemmatize('amuses','v'),lemma.lemmatize('amused','v'))
print(lemma.lemmatize('happier','a'),lemma.lemmatize('happiest','a'))
print(lemma.lemmatize('fancier','a'),lemma.lemmatize('fanciest','a'))



amuse amuse amuse
happy happy
fancy fancy
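
WordNetLemmatizer는 pos 인자를 지정하지 않으면 기본값인 명사('n') 기준으로 처리하므로, 동사나 형용사의 원형을 제대로 찾지 못할 수 있다. 아래는 이를 확인해 보는 참고용 스케치이다.

print(lemma.lemmatize('amusing'), lemma.lemmatize('happier'), lemma.lemmatize('fanciest'))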

 

고객 세그먼테이션(Customer Segmentation)

데이터 셋 로딩과 데이터 클린징

RFM 기법

import pandas as pd
import datetime
import math
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

retail_df = pd.read_excel(io='Online Retail.xlsx')
retail_df.head(3)





	InvoiceNo	StockCode	Description	Quantity	InvoiceDate	UnitPrice	CustomerID	Country
0	536365	85123A	WHITE HANGING HEART T-LIGHT HOLDER	6	2010-12-01 08:26:00	2.55	17850.0	United Kingdom
1	536365	71053	WHITE METAL LANTERN	6	2010-12-01 08:26:00	3.39	17850.0	United Kingdom
2	536365	84406B	CREAM CUPID HEARTS COAT HANGER	8	2010-12-01 08:26:00	2.75	17850.0	United Kingdom

 

retail_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
InvoiceNo      541909 non-null object
StockCode      541909 non-null object
Description    540455 non-null object
Quantity       541909 non-null int64
InvoiceDate    541909 non-null datetime64[ns]
UnitPrice      541909 non-null float64
CustomerID     406829 non-null float64
Country        541909 non-null object
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB

 

반품이나 CustomerID가 Null인 데이터는 제외, 영국 이외 국가의 데이터는 제외

retail_df = retail_df[retail_df['Quantity'] > 0]
retail_df = retail_df[retail_df['UnitPrice'] > 0]
retail_df = retail_df[retail_df['CustomerID'].notnull()]
print(retail_df.shape)
retail_df.isnull().sum()


(397884, 8)
InvoiceNo      0
StockCode      0
Description    0
Quantity       0
InvoiceDate    0
UnitPrice      0
CustomerID     0
Country        0
dtype: int64

 

retail_df['Country'].value_counts()[:5]


United Kingdom    354321
Germany             9040
France              8341
EIRE                7236
Spain               2484
Name: Country, dtype: int64

 

retail_df = retail_df[retail_df['Country']=='United Kingdom']
print(retail_df.shape)


# (354321, 8)

 

RFM 기반 데이터 가공

구매금액 생성

retail_df['sale_amount'] = retail_df['Quantity'] * retail_df['UnitPrice']
retail_df['CustomerID'] = retail_df['CustomerID'].astype(int)

 

print(retail_df['CustomerID'].value_counts().head(5))
print(retail_df.groupby('CustomerID')['sale_amount'].sum().sort_values(ascending=False)[:5])



17841    7847
14096    5111
12748    4595
14606    2700
15311    2379
Name: CustomerID, dtype: int64
CustomerID
18102    259657.30
17450    194550.79
16446    168472.50
17511     91062.38
16029     81024.84
Name: sale_amount, dtype: float64

 

retail_df.groupby(['InvoiceNo','StockCode'])['InvoiceNo'].count().mean()


# 1.028702077315023

 

고객 기준으로 Recency, Frequency, Monetary가공

# DataFrame의 groupby() 의 multiple 연산을 위해 agg() 이용
# Recency는 InvoiceDate 컬럼의 max() 에서 데이터 가공
# Frequency는 InvoiceNo 컬럼의 count() , Monetary value는 sale_amount 컬럼의 sum()
aggregations = {
    'InvoiceDate': 'max',
    'InvoiceNo': 'count',
    'sale_amount':'sum'
}
cust_df = retail_df.groupby('CustomerID').agg(aggregations)
# groupby된 결과 컬럼값을 Recency, Frequency, Monetary로 변경
cust_df = cust_df.rename(columns = {'InvoiceDate':'Recency',
                                    'InvoiceNo':'Frequency',
                                    'sale_amount':'Monetary'
                                   }
                        )
cust_df = cust_df.reset_index()
cust_df.head(3)




	CustomerID	Recency	Frequency	Monetary
0	12346	2011-01-18 10:01:00	1	77183.60
1	12747	2011-12-07 14:34:00	103	4196.01
2	12748	2011-12-09 12:20:00	4595	33719.73

 

Recency를 날짜에서 정수형으로 가공

cust_df['Recency'].max()


# Timestamp('2011-12-09 12:49:00')

 

import datetime as dt

# 데이터의 가장 최근 구매일이 2011-12-09이므로, 기준 날짜를 그 다음 날인 2011-12-10으로 설정
cust_df['Recency'] = dt.datetime(2011,12,10) - cust_df['Recency']
# Timedelta를 일(day) 단위 정수로 변환. 당일 구매 고객도 최소 1이 되도록 +1
cust_df['Recency'] = cust_df['Recency'].apply(lambda x: x.days+1)
print('cust_df 로우와 컬럼 건수는 ',cust_df.shape)
cust_df.head(3)




cust_df 로우와 컬럼 건수는  (3920, 4)
CustomerID	Recency	Frequency	Monetary
0	12346	326	1	77183.60
1	12747	3	103	4196.01
2	12748	1	4595	33719.73

 

RFM 기반 고객 세그먼테이션

Recency, Frequency, Monetary 값의 분포도 확인

fig, (ax1,ax2,ax3) = plt.subplots(figsize=(12,4), nrows=1, ncols=3)
ax1.set_title('Recency Histogram')
ax1.hist(cust_df['Recency'])

ax2.set_title('Frequency Histogram')
ax2.hist(cust_df['Frequency'])

ax3.set_title('Monetary Histogram')
ax3.hist(cust_df['Monetary'])


(Recency, Frequency, Monetary 히스토그램 출력 - 세 변수 모두 소수의 큰 값 때문에 오른쪽으로 길게 치우친 분포를 보임)

 

cust_df[['Recency','Frequency','Monetary']].describe()


	Recency	Frequency	Monetary
count	3920.000000	3920.000000	3920.000000
mean	92.742092	90.388010	1864.385601
std	99.533485	217.808385	7482.817477
min	1.000000	1.000000	3.750000
25%	18.000000	17.000000	300.280000
50%	51.000000	41.000000	652.280000
75%	143.000000	99.250000	1576.585000
max	374.000000	7847.000000	259657.300000

 

K-Means로 군집화 후에 실루엣 계수 평가

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

X_features = cust_df[['Recency','Frequency','Monetary']].values
X_features_scaled = StandardScaler().fit_transform(X_features)

kmeans = KMeans(n_clusters=3, random_state=0)
labels = kmeans.fit_predict(X_features_scaled)
cust_df['cluster_label'] = labels

print('실루엣 스코어는 : {0:.3f}'.format(silhouette_score(X_features_scaled,labels)))




# 실루엣 스코어는 : 0.592
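
군집별 고객 수와 RFM 평균값을 함께 보면 소수의 극단적인 고객이 별도 군집으로 분리되는지(데이터 쏠림)를 파악할 수 있다. 아래는 이를 확인하는 참고용 스케치이다.

print(cust_df['cluster_label'].value_counts())
print(cust_df.groupby('cluster_label')[['Recency','Frequency','Monetary']].mean())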

 

K-Means 군집화 후에 실루엣 계수 및 군집을 시각화

### 여러개의 클러스터링 갯수를 List로 입력 받아 각각의 실루엣 계수를 면적으로 시각화한 함수 작성  
def visualize_silhouette(cluster_lists, X_features): 
    
    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_samples, silhouette_score

    import matplotlib.pyplot as plt
    import matplotlib.cm as cm
    import math
    
    # 입력값으로 클러스터링 갯수들을 리스트로 받아서, 각 갯수별로 클러스터링을 적용하고 실루엣 계수를 구함
    n_cols = len(cluster_lists)
    
    # plt.subplots()으로 리스트에 기재된 클러스터링 만큼의 sub figures를 가지는 axs 생성 
    fig, axs = plt.subplots(figsize=(4*n_cols, 4), nrows=1, ncols=n_cols)
    
    # 리스트에 기재된 클러스터링 갯수들을 차례로 iteration 수행하면서 실루엣 계수 시각화
    for ind, n_cluster in enumerate(cluster_lists):
        
        # KMeans 클러스터링 수행하고, 실루엣 스코어와 개별 데이터의 실루엣 값 계산. 
        clusterer = KMeans(n_clusters = n_cluster, max_iter=500, random_state=0)
        cluster_labels = clusterer.fit_predict(X_features)
        
        sil_avg = silhouette_score(X_features, cluster_labels)
        sil_values = silhouette_samples(X_features, cluster_labels)
        
        y_lower = 10
        axs[ind].set_title('Number of Cluster : '+ str(n_cluster)+'\n' \
                          'Silhouette Score :' + str(round(sil_avg,3)) )
        axs[ind].set_xlabel("The silhouette coefficient values")
        axs[ind].set_ylabel("Cluster label")
        axs[ind].set_xlim([-0.1, 1])
        axs[ind].set_ylim([0, len(X_features) + (n_cluster + 1) * 10])
        axs[ind].set_yticks([])  # Clear the yaxis labels / ticks
        axs[ind].set_xticks([0, 0.2, 0.4, 0.6, 0.8, 1])
        
        # 클러스터링 갯수별로 fill_betweenx( )형태의 막대 그래프 표현. 
        for i in range(n_cluster):
            ith_cluster_sil_values = sil_values[cluster_labels==i]
            ith_cluster_sil_values.sort()
            
            size_cluster_i = ith_cluster_sil_values.shape[0]
            y_upper = y_lower + size_cluster_i
            
            color = cm.nipy_spectral(float(i) / n_cluster)
            axs[ind].fill_betweenx(np.arange(y_lower, y_upper), 0, ith_cluster_sil_values, \
                                facecolor=color, edgecolor=color, alpha=0.7)
            axs[ind].text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
            y_lower = y_upper + 10
            
        axs[ind].axvline(x=sil_avg, color="red", linestyle="--")

 

### 여러개의 클러스터링 갯수를 List로 입력 받아 각각의 클러스터링 결과를 시각화 
def visualize_kmeans_plot_multi(cluster_lists, X_features):
    
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    import pandas as pd
    import numpy as np
    
    # plt.subplots()으로 리스트에 기재된 클러스터링 만큼의 sub figures를 가지는 axs 생성 
    n_cols = len(cluster_lists)
    fig, axs = plt.subplots(figsize=(4*n_cols, 4), nrows=1, ncols=n_cols)
    
    # 입력 데이터의 FEATURE가 여러개일 경우 2차원 데이터 시각화가 어려우므로 PCA 변환하여 2차원 시각화
    pca = PCA(n_components=2)
    pca_transformed = pca.fit_transform(X_features)
    dataframe = pd.DataFrame(pca_transformed, columns=['PCA1','PCA2'])
    
     # 리스트에 기재된 클러스터링 갯수들을 차례로 iteration 수행하면서 KMeans 클러스터링 수행하고 시각화
    for ind, n_cluster in enumerate(cluster_lists):
        
        # KMeans 클러스터링으로 클러스터링 결과를 dataframe에 저장. 
        clusterer = KMeans(n_clusters = n_cluster, max_iter=500, random_state=0)
        cluster_labels = clusterer.fit_predict(pca_transformed)
        dataframe['cluster']=cluster_labels
        
        unique_labels = np.unique(clusterer.labels_)
        markers=['o', 's', '^', 'x', '*']
       
        # 클러스터링 결과값 별로 scatter plot 으로 시각화
        for label in unique_labels:
            label_df = dataframe[dataframe['cluster']==label]
            if label == -1:
                cluster_legend = 'Noise'
            else :
                cluster_legend = 'Cluster '+str(label)           
            axs[ind].scatter(x=label_df['PCA1'], y=label_df['PCA2'], s=70,\
                        edgecolor='k', marker=markers[label], label=cluster_legend)

        axs[ind].set_title('Number of Cluster : '+ str(n_cluster))    
        axs[ind].legend(loc='upper right')
    
    plt.show()

 

visualize_silhouette([2,3,4,5],X_features_scaled)
visualize_kmeans_plot_multi([2,3,4,5],X_features_scaled)

 

로그 변환 후 재 시각화

### Log 변환을 통해 데이터 변환
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

# Recency, Frequency, Monetary 컬럼에 np.log1p()로 Log Transformation
cust_df['Recency_log'] = np.log1p(cust_df['Recency'])
cust_df['Frequency_log'] = np.log1p(cust_df['Frequency'])
cust_df['Monetary_log'] = np.log1p(cust_df['Monetary'])

# Log Transformation 데이터에 StandardScaler 적용
X_features = cust_df[['Recency_log','Frequency_log','Monetary_log']].values
X_features_scaled = StandardScaler().fit_transform(X_features)

kmeans = KMeans(n_clusters=3, random_state=0)
labels = kmeans.fit_predict(X_features_scaled)
cust_df['cluster_label'] = labels

print('실루엣 스코어는 : {0:.3f}'.format(silhouette_score(X_features_scaled,labels)))



# 실루엣 스코어는 : 0.305

로그 변환 후 실루엣 스코어 자체는 0.592에서 0.305로 낮아지지만, 극단적으로 큰 Frequency/Monetary 값의 영향이 완화되어 특정 군집으로 데이터가 쏠리는 현상이 줄어든, 보다 고르게 분포된 군집을 기대할 수 있다.

 

visualize_silhouette([2,3,4,5],X_features_scaled)
visualize_kmeans_plot_multi([2,3,4,5],X_features_scaled)

 


DBSCAN 적용하기 – make_circles() 데이터 세트

### 클러스터 결과를 담은 DataFrame과 사이킷런의 Cluster 객체등을 인자로 받아 클러스터링 결과를 시각화하는 함수  
def visualize_cluster_plot(clusterobj, dataframe, label_name, iscenter=True):
    if iscenter :
        centers = clusterobj.cluster_centers_
        
    unique_labels = np.unique(dataframe[label_name].values)
    markers=['o', 's', '^', 'x', '*']
    isNoise=False

    for label in unique_labels:
        label_cluster = dataframe[dataframe[label_name]==label]
        if label == -1:
            cluster_legend = 'Noise'
            isNoise=True
        else :
            cluster_legend = 'Cluster '+str(label)
        
        plt.scatter(x=label_cluster['ftr1'], y=label_cluster['ftr2'], s=70,\
                    edgecolor='k', marker=markers[label], label=cluster_legend)
        
        if iscenter:
            center_x_y = centers[label]
            plt.scatter(x=center_x_y[0], y=center_x_y[1], s=250, color='white',
                        alpha=0.9, edgecolor='k', marker=markers[label])
            plt.scatter(x=center_x_y[0], y=center_x_y[1], s=70, color='k',\
                        edgecolor='k', marker='$%d$' % label)
    if isNoise:
        legend_loc='upper center'
    else: legend_loc='upper right'
    
    plt.legend(loc=legend_loc)
    plt.show()

 

from sklearn.datasets import make_circles

X, y = make_circles(n_samples=1000, shuffle=True, noise=0.05, random_state=0, factor=0.5)
clusterDF = pd.DataFrame(data=X, columns=['ftr1', 'ftr2'])
clusterDF['target'] = y

visualize_cluster_plot(None, clusterDF, 'target', iscenter=False)

 

# KMeans로 make_circles( ) 데이터 셋을 클러스터링 수행. 
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, max_iter=1000, random_state=0)
kmeans_labels = kmeans.fit_predict(X)
clusterDF['kmeans_cluster'] = kmeans_labels

visualize_cluster_plot(kmeans, clusterDF, 'kmeans_cluster', iscenter=True)

 

# GMM으로 make_circles( ) 데이터 셋을 클러스터링 수행. 
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=2, random_state=0)
gmm_label = gmm.fit(X).predict(X)
clusterDF['gmm_cluster'] = gmm_label

visualize_cluster_plot(gmm, clusterDF, 'gmm_cluster', iscenter=False)

 

# DBSCAN으로 make_circles( ) 데이터 셋을 클러스터링 수행. 
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.2, min_samples=10, metric='euclidean')
dbscan_labels = dbscan.fit_predict(X)
clusterDF['dbscan_cluster'] = dbscan_labels

visualize_cluster_plot(dbscan, clusterDF, 'dbscan_cluster', iscenter=False)
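
DBSCAN 결과에서 군집별 데이터 건수와 노이즈(-1)로 분류된 건수는 아래처럼 확인해 볼 수 있다(참고용 스케치).

print(clusterDF['dbscan_cluster'].value_counts())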


DBSCAN 적용하기 – 붓꽃 데이터 셋

from sklearn.datasets import load_iris

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline

iris = load_iris()
feature_names = ['sepal_length','sepal_width','petal_length','petal_width']

# 보다 편리한 데이터 핸들링을 위해 DataFrame으로 변환
irisDF = pd.DataFrame(data=iris.data, columns=feature_names)
irisDF['target'] = iris.target
irisDF.head()



	sepal_length	sepal_width	petal_length	petal_width	target
0	5.1	3.5	1.4	0.2	0
1	4.9	3.0	1.4	0.2	0
2	4.7	3.2	1.3	0.2	0
3	4.6	3.1	1.5	0.2	0
4	5.0	3.6	1.4	0.2	0

 

eps 0.6 min_samples=8 로 DBSCAN 군집화 적용

from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.6, min_samples=8, metric='euclidean')
dbscan_labels = dbscan.fit_predict(iris.data)

irisDF['dbscan_cluster'] = dbscan_labels

iris_result = irisDF.groupby(['target'])['dbscan_cluster'].value_counts()
print(iris_result)



target  dbscan_cluster
0        0                49
        -1                 1
1        1                46
        -1                 4
2        1                42
        -1                 8
Name: dbscan_cluster, dtype: int64

 

### 클러스터 결과를 담은 DataFrame과 사이킷런의 Cluster 객체등을 인자로 받아 클러스터링 결과를 시각화하는 함수  
def visualize_cluster_plot(clusterobj, dataframe, label_name, iscenter=True):
    if iscenter :
        centers = clusterobj.cluster_centers_
        
    unique_labels = np.unique(dataframe[label_name].values)
    markers=['o', 's', '^', 'x', '*']
    isNoise=False

    for label in unique_labels:
        label_cluster = dataframe[dataframe[label_name]==label]
        if label == -1:
            cluster_legend = 'Noise'
            isNoise=True
        else :
            cluster_legend = 'Cluster '+str(label)
        
        plt.scatter(x=label_cluster['ftr1'], y=label_cluster['ftr2'], s=70,\
                    edgecolor='k', marker=markers[label], label=cluster_legend)
        
        if iscenter:
            center_x_y = centers[label]
            plt.scatter(x=center_x_y[0], y=center_x_y[1], s=250, color='white',
                        alpha=0.9, edgecolor='k', marker=markers[label])
            plt.scatter(x=center_x_y[0], y=center_x_y[1], s=70, color='k',\
                        edgecolor='k', marker='$%d$' % label)
    if isNoise:
        legend_loc='upper center'
    else: legend_loc='upper right'
    
    plt.legend(loc=legend_loc)
    plt.show()

 

PCA 2개 컴포넌트로 기존 feature들을 차원 축소 후 시각화

from sklearn.decomposition import PCA
# 2차원으로 시각화하기 위해 PCA n_componets=2로 피처 데이터 세트 변환
pca = PCA(n_components=2, random_state=0)
pca_transformed = pca.fit_transform(iris.data)
# visualize_cluster_2d( ) 함수는 ftr1, ftr2 컬럼을 좌표에 표현하므로 PCA 변환값을 해당 컬럼으로 생성
irisDF['ftr1'] = pca_transformed[:,0]
irisDF['ftr2'] = pca_transformed[:,1]

visualize_cluster_plot(dbscan, irisDF, 'dbscan_cluster', iscenter=False)

 

eps의 크기를 증가 한 후 노이즈 확인

from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.8, min_samples=8, metric='euclidean')
dbscan_labels = dbscan.fit_predict(iris.data)

irisDF['dbscan_cluster'] = dbscan_labels
irisDF['target'] = iris.target

iris_result = irisDF.groupby(['target'])['dbscan_cluster'].value_counts()
print(iris_result)

visualize_cluster_plot(dbscan, irisDF, 'dbscan_cluster', iscenter=False)


target  dbscan_cluster
0        0                50
1        1                50
2        1                47
        -1                 3
Name: dbscan_cluster, dtype: int64

 

min_samples의 크기를 증가 후 노이즈 확인

dbscan = DBSCAN(eps=0.6, min_samples=16, metric='euclidean')
dbscan_labels = dbscan.fit_predict(iris.data)

irisDF['dbscan_cluster'] = dbscan_labels
irisDF['target'] = iris.target

iris_result = irisDF.groupby(['target'])['dbscan_cluster'].value_counts()
print(iris_result)
visualize_cluster_plot(dbscan, irisDF, 'dbscan_cluster', iscenter=False)




target  dbscan_cluster
0        0                48
        -1                 2
1        1                44
        -1                 6
2        1                36
        -1                14
Name: dbscan_cluster, dtype: int64

 
