62. Tokenizer

2021. 12. 7. 19:55


import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, LSTM, Embedding, Bidirectional

vocab_size = 15000  # vocabulary size shared with the Tokenizer below

def create_model():
    model = Sequential([
        Embedding(vocab_size, 32),
        # return_sequences=True keeps one output per time step, so the
        # Dense layers below are applied at every position in the sequence
        Bidirectional(LSTM(32, return_sequences=True)),
        Dense(32, activation='relu'),
        Dense(1, activation='sigmoid')
    ])
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

import pandas as pd
# The file is plain text, so there is nothing to extract after download
test_file = tf.keras.utils.get_file('ratings_test.txt',
            origin='https://raw.githubusercontent.com/e9t/nsmc/master/ratings_test.txt')
test = pd.read_csv(test_file, sep='\t')
test.head()


id	document	label
0	6270596	굳 ㅋ	1
1	9274899	GDNTOPCLASSINTHECLUB	0
2	8544678	뭐야 이 평점들은.... 나쁘진 않지만 10점 짜리는 더더욱 아니잖아	0
3	6825595	지루하지는 않은데 완전 막장임... 돈주고 보기에는....	0
4	6723715	3D만 아니었어도 별 다섯 개 줬을텐데.. 왜 3D로 나와서 제 심기를 불편하게 하죠??	0

test.shape 
# (50000, 3)

import konlpy
from konlpy.tag import Okt
okt = Okt()

train_file = tf.keras.utils.get_file('ratings_train.txt',
            origin='https://raw.githubusercontent.com/e9t/nsmc/master/ratings_train.txt')
train = pd.read_csv(train_file, sep='\t')
train.head()

id	document	label
0	9976970	아 더빙.. 진짜 짜증나네요 목소리	0
1	3819312	흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나	1
2	10265843	너무재밓었다그래서보는것을추천한다	0
3	9045019	교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정	0
4	6483659	사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 ...	1

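# Keep only English letters and Hangul. The character class contains no space,
# so whitespace is removed as well. regex=True is required on recent pandas.
train['document'] = train['document'].str.replace("[^A-Za-z가-힣ㄱ-ㅎㅏ-ㅣ]", "", regex=True)
train = train.dropna()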
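
To see what the cleaning pattern does to a single review, here is a minimal sketch that applies the same pattern with Python's re module instead of pandas. The helper name clean and the KEEP_SPACES variant are illustrative assumptions and are not used elsewhere in this post; the sample sentence is the first row of train.head().

```python
import re

# Same character class as in the post: anything that is not an English
# letter, a Hangul syllable, or a Hangul jamo is deleted.
PATTERN = r"[^A-Za-z가-힣ㄱ-ㅎㅏ-ㅣ]"

# Hypothetical variant that also keeps spaces (note the trailing space
# inside the brackets). The post itself does not use this pattern.
KEEP_SPACES = r"[^A-Za-z가-힣ㄱ-ㅎㅏ-ㅣ ]"

def clean(text, pattern=PATTERN):
    """Delete every character matched by the pattern."""
    return re.sub(pattern, "", text)

sample = "아 더빙.. 진짜 짜증나네요 목소리"
cleaned = clean(sample)
with_spaces = clean(sample, KEEP_SPACES)

print(cleaned)      # punctuation and spaces are both gone
print(with_spaces)  # punctuation is gone, spaces remain
```

Because the pattern in this post removes spaces, Okt has to segment each review from one unbroken string, which is why the first tokenized row further down begins with 아더, 빙.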

def word_tokenization(text):
    # Split the text into morphemes with Okt, then drop common particles and endings
    stop_words = ['는','을','를','이','가','의','던','고','하','다','은','에','들','지','게','도']
    return [word for word in okt.morphs(text) if word not in stop_words]

data = train['document'].apply(word_tokenization)
data.head()

0                              [아더, 빙, 진짜, 짜증나네요, 목소리]
1        [흠, 포스터, 보고, 초딩, 영화, 줄, 오버, 연기, 조차, 가볍지, 않구나]
2                     [너, 무재, 밓었, 다그, 래서, 보는것을, 추천, 한]
3                  [교도소, 이야기, 구먼, 솔직히, 재미, 없다, 평점, 조정]
4    [사이, 몬페, 그, 익살스런, 연기, 돋보였던, 영화, 스파이더맨, 에서, 늙어,...
Name: document, dtype: object

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

oov_tok = "<OOV>"
vocab_size = 15000
tokenizer = Tokenizer(oov_token=oov_tok, num_words=vocab_size)
tokenizer.fit_on_texts(data)
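
The way the fitted tokenizer assigns indices can be pictured with a small pure-Python sketch. This is a simplified stand-in, not the Keras implementation: build_word_index and to_sequence are hypothetical helpers and the three toy documents are made up. The sketch assumes the behavior of the Keras Tokenizer, where index 0 is reserved for padding, the OOV token takes index 1, the remaining words are ranked by frequency, and num_words is applied only when texts are converted to sequences.

```python
from collections import Counter

def build_word_index(token_lists, oov_token="<OOV>"):
    """Rank words by frequency; the OOV token gets index 1, 0 is left for padding."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # most_common() sorts by count, ties keep first-seen order
    vocab = [oov_token] + [word for word, _ in counts.most_common()]
    return {word: i for i, word in enumerate(vocab, start=1)}

def to_sequence(tokens, word_index, num_words, oov_token="<OOV>"):
    """Map tokens to indices; unknown words and indices >= num_words become OOV."""
    oov = word_index[oov_token]
    return [
        word_index[t] if t in word_index and word_index[t] < num_words else oov
        for t in tokens
    ]

docs = [["영화", "재미"], ["영화", "감동"], ["영화", "재미", "최고"]]
word_index = build_word_index(docs)
print(word_index)

# With num_words=4 only indices 1..3 survive; everything else maps to OOV (1)
seq = to_sequence(["영화", "감동", "배우"], word_index, num_words=4)
print(seq)
```

In this sketch the most frequent word receives index 2, and the full word index still lists every word even when num_words cuts the usable vocabulary short.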

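Preprocessing the test data
1. Remove every character except Korean and English letters (the pattern used here has no space in its character class, so whitespace is removed as well)
2. Drop missing values
3. Separate the test inputs from the labels
4. Remove stop words from the test inputs
5. Use the tokenizer to convert the tokens into integer sequences the model can process
6. Pad the sequences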
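
Step 6 can be sketched in plain Python. The helper pad_post below is hypothetical and only imitates what pad_sequences does with padding='post' and its default truncating='pre': short sequences get zeros appended, and long sequences lose items from the front. The toy sequences and maxlen=4 are assumptions for illustration; the post itself uses maxlen=69.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad each sequence at the end to maxlen; drop leading items if it is too long."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen items
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

toy = [[5, 3], [1, 2, 3, 4, 5, 6]]
result = pad_post(toy, maxlen=4)
print(result)
```

Every row ends up with the same length, which is what lets the sequences be stacked into a single array for the model.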

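import numpy as np

def preprocessing(df):
    df = df.copy()  # avoid modifying the caller's DataFrame
    # Steps 1-2: strip unwanted characters, then drop missing values
    df['document'] = df['document'].str.replace("[^A-Za-z가-힣ㄱ-ㅎㅏ-ㅣ]", "", regex=True)
    df = df.dropna()
    # Step 3: separate the labels from the inputs
    test_label = np.asarray(df['label'])
    # Steps 4-6: tokenize without stop words, map to indices, pad to 69 tokens
    test_data = df['document'].apply(word_tokenization)
    test_data = tokenizer.texts_to_sequences(test_data)
    test_data = pad_sequences(test_data, padding='post', maxlen=69)
    return test_data, test_label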

test_data, test_label = preprocessing(test)
test_data[2:3]
test_label[2:3]


# array([0], dtype=int64)

# Evaluate a freshly created (untrained) model
model2 = create_model()
model2.evaluate(test_data, test_label)



1563/1563 [==============================] - 6s 3ms/step - loss: 0.6931 - accuracy: 0.5039
[0.6931077837944031, 0.5039322972297668]

# Load the saved weights, then evaluate again
checkpoint_path = 'best_performed_model.ckpt'
model2.load_weights(checkpoint_path)
model2.evaluate(test_data, test_label)



1563/1563 [==============================] - 4s 3ms/step - loss: 1.1676 - accuracy: 0.4926
[1.1676305532455444, 0.49262505769729614]

print("감동 ==>>", tokenizer.word_index['감동'])
print("영화 ==>>", tokenizer.word_index['영화'])
print("나나 ==>>", tokenizer.word_index['나나'])



감동 ==>> 28
영화 ==>> 2
나나 ==>> 3533
