import lightgbm

print(lightgbm.__version__)

# 2.1.0

Applying LightGBM – Wisconsin Breast Cancer Prediction

# Import LGBMClassifier from lightgbm, the Python package for LightGBM
from lightgbm import LGBMClassifier

import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

dataset = load_breast_cancer()
ftr = dataset.data
target = dataset.target

# Split off 80% of the data for training and 20% for testing
X_train, X_test, y_train, y_test = train_test_split(ftr, target, test_size=0.2, random_state=156)

# As in the earlier XGBoost example, set n_estimators to 400.
lgbm_wrapper = LGBMClassifier(n_estimators=400)

# LightGBM supports early stopping in the same way as XGBoost.
evals = [(X_test, y_test)]
lgbm_wrapper.fit(X_train, y_train, early_stopping_rounds=100, eval_metric="logloss",
                 eval_set=evals, verbose=True)

preds = lgbm_wrapper.predict(X_test)
pred_proba = lgbm_wrapper.predict_proba(X_test)[:, 1]

[1] valid_0's binary_logloss: 0.565079
Training until validation scores don't improve for 100 rounds
[2] valid_0's binary_logloss: 0.507451
[3] valid_0's binary_logloss: 0.458489
[4] valid_0's binary_logloss: 0.417481
[5] valid_0's binary_logloss: 0.385507
[6] valid_0's binary_logloss: 0.355773
[7] valid_0's binary_logloss: 0.329587
[8] valid_0's binary_logloss: 0.308478
[9] valid_0's binary_logloss: 0.285395
[10] valid_0's binary_logloss: 0.267055
[11] valid_0's binary_logloss: 0.252013
[12] valid_0's binary_logloss: 0.237018
[13] valid_0's binary_logloss: 0.224756
[14] valid_0's binary_logloss: 0.213383
[15] valid_0's binary_logloss: 0.203058
[16] valid_0's binary_logloss: 0.194015
[17] valid_0's binary_logloss: 0.186412
[18] valid_0's binary_logloss: 0.179108
[19] valid_0's binary_logloss: 0.174004
[20] valid_0's binary_logloss: 0.167155
[21] valid_0's binary_logloss: 0.162494
[22] valid_0's binary_logloss: 0.156886
[23] valid_0's binary_logloss: 0.152855
[24] valid_0's binary_logloss: 0.151113
[25] valid_0's binary_logloss: 0.148395
[26] valid_0's binary_logloss: 0.145869
[27] valid_0's binary_logloss: 0.143036
[28] valid_0's binary_logloss: 0.14033
[29] valid_0's binary_logloss: 0.139609
[30] valid_0's binary_logloss: 0.136109
[31] valid_0's binary_logloss: 0.134867
[32] valid_0's binary_logloss: 0.134729
[33] valid_0's binary_logloss: 0.1311
[34] valid_0's binary_logloss: 0.131143
[35] valid_0's binary_logloss: 0.129435
[36] valid_0's binary_logloss: 0.128474
[37] valid_0's binary_logloss: 0.126683
[38] valid_0's binary_logloss: 0.126112
[39] valid_0's binary_logloss: 0.122831
[40] valid_0's binary_logloss: 0.123162
[41] valid_0's binary_logloss: 0.125592
[42] valid_0's binary_logloss: 0.128293
[43] valid_0's binary_logloss: 0.128123
[44] valid_0's binary_logloss: 0.12789
[45] valid_0's binary_logloss: 0.122818
[46] valid_0's binary_logloss: 0.12496
[47] valid_0's binary_logloss: 0.125578
[48] valid_0's binary_logloss: 0.127381
[49] valid_0's binary_logloss: 0.128349
[50] valid_0's binary_logloss: 0.127004
[51] valid_0's binary_logloss: 0.130288
[52] valid_0's binary_logloss: 0.131362
[53] valid_0's binary_logloss: 0.133363
[54] valid_0's binary_logloss: 0.1332
[55] valid_0's binary_logloss: 0.134543
[56] valid_0's binary_logloss: 0.130803
[57] valid_0's binary_logloss: 0.130306
[58] valid_0's binary_logloss: 0.132514
[59] valid_0's binary_logloss: 0.133278
[60] valid_0's binary_logloss: 0.134804
[61] valid_0's binary_logloss: 0.136888
[62] valid_0's binary_logloss: 0.138745
[63] valid_0's binary_logloss: 0.140497
[64] valid_0's binary_logloss: 0.141368
[65] valid_0's binary_logloss: 0.140764
[66] valid_0's binary_logloss: 0.14348
[67] valid_0's binary_logloss: 0.143418
[68] valid_0's binary_logloss: 0.143682
[69] valid_0's binary_logloss: 0.145076
[70] valid_0's binary_logloss: 0.14686
[71] valid_0's binary_logloss: 0.148051
[72] valid_0's binary_logloss: 0.147664
[73] valid_0's binary_logloss: 0.149478
[74] valid_0's binary_logloss: 0.14708
[75] valid_0's binary_logloss: 0.14545
[76] valid_0's binary_logloss: 0.148767
[77] valid_0's binary_logloss: 0.149959
[78] valid_0's binary_logloss: 0.146083
[79] valid_0's binary_logloss: 0.14638
[80] valid_0's binary_logloss: 0.148461
[81] valid_0's binary_logloss: 0.15091
[82] valid_0's binary_logloss: 0.153011
[83] valid_0's binary_logloss: 0.154807
[84] valid_0's binary_logloss: 0.156501
[85] valid_0's binary_logloss: 0.158586
[86] valid_0's binary_logloss: 0.159819
[87] valid_0's binary_logloss: 0.161745
[88] valid_0's binary_logloss: 0.162829
[89] valid_0's binary_logloss: 0.159142
[90] valid_0's binary_logloss: 0.156765
[91] valid_0's binary_logloss: 0.158625
[92] valid_0's binary_logloss: 0.156832
[93] valid_0's binary_logloss: 0.154616
[94] valid_0's binary_logloss: 0.154263
[95] valid_0's binary_logloss: 0.157156
[96] valid_0's binary_logloss: 0.158617
[97] valid_0's binary_logloss: 0.157495
[98] valid_0's binary_logloss: 0.159413
[99] valid_0's binary_logloss: 0.15847
[100] valid_0's binary_logloss: 0.160746
[101] valid_0's binary_logloss: 0.16217
[102] valid_0's binary_logloss: 0.165293
[103] valid_0's binary_logloss: 0.164749
[104] valid_0's binary_logloss: 0.167097
[105] valid_0's binary_logloss: 0.167697
[106] valid_0's binary_logloss: 0.169462
[107] valid_0's binary_logloss: 0.169947
[108] valid_0's binary_logloss: 0.171
[109] valid_0's binary_logloss: 0.16907
[110] valid_0's binary_logloss: 0.169521
[111] valid_0's binary_logloss: 0.167719
[112] valid_0's binary_logloss: 0.166648
[113] valid_0's binary_logloss: 0.169053
[114] valid_0's binary_logloss: 0.169613
[115] valid_0's binary_logloss: 0.170059
[116] valid_0's binary_logloss: 0.1723
[117] valid_0's binary_logloss: 0.174733
[118] valid_0's binary_logloss: 0.173526
[119] valid_0's binary_logloss: 0.1751
[120] valid_0's binary_logloss: 0.178254
[121] valid_0's binary_logloss: 0.182968
[122] valid_0's binary_logloss: 0.179017
[123] valid_0's binary_logloss: 0.178326
[124] valid_0's binary_logloss: 0.177149
[125] valid_0's binary_logloss: 0.179171
[126] valid_0's binary_logloss: 0.180948
[127] valid_0's binary_logloss: 0.183861
[128] valid_0's binary_logloss: 0.187579
[129] valid_0's binary_logloss: 0.188122
[130] valid_0's binary_logloss: 0.1857
[131] valid_0's binary_logloss: 0.187442
[132] valid_0's binary_logloss: 0.188578
[133] valid_0's binary_logloss: 0.189729
[134] valid_0's binary_logloss: 0.187313
[135] valid_0's binary_logloss: 0.189279
[136] valid_0's binary_logloss: 0.191068
[137] valid_0's binary_logloss: 0.192414
[138] valid_0's binary_logloss: 0.191255
[139] valid_0's binary_logloss: 0.193453
[140] valid_0's binary_logloss: 0.196969
[141] valid_0's binary_logloss: 0.196378
[142] valid_0's binary_logloss: 0.196367
[143] valid_0's binary_logloss: 0.19869
[144] valid_0's binary_logloss: 0.200352
[145] valid_0's binary_logloss: 0.19712
Early stopping, best iteration is:
[45] valid_0's binary_logloss: 0.122818
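
Training stops at iteration 145 because the validation logloss has not improved for 100 rounds since iteration 45. The fitted scikit-learn wrapper also keeps this early-stopping bookkeeping as attributes, so the best round can be read back without parsing the log; a minimal sketch, assuming the lgbm_wrapper fitted above:

# Early-stopping results stored on the fitted LGBMClassifier
# (only populated because early_stopping_rounds was used in fit()).
print('best iteration:', lgbm_wrapper.best_iteration_)   # expected: 45
print('best binary_logloss:', lgbm_wrapper.best_score_['valid_0']['binary_logloss'])

With early stopping applied, predict() and predict_proba() on the wrapper should then use the best iteration by default rather than all 400 trees.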

from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import f1_score, roc_auc_score

# Modified get_clf_eval() function
def get_clf_eval(y_test, pred=None, pred_proba=None):
    confusion = confusion_matrix(y_test, pred)
    accuracy = accuracy_score(y_test, pred)
    precision = precision_score(y_test, pred)
    recall = recall_score(y_test, pred)
    f1 = f1_score(y_test, pred)
    # ROC-AUC added
    roc_auc = roc_auc_score(y_test, pred_proba)
    print('Confusion matrix')
    print(confusion)
    # ROC-AUC added to the printout
    print('Accuracy: {0:.4f}, Precision: {1:.4f}, Recall: {2:.4f},\
    F1: {3:.4f}, AUC:{4:.4f}'.format(accuracy, precision, recall, f1, roc_auc))

get_clf_eval(y_test, preds, pred_proba)

Confusion matrix
[[33  4]
 [ 1 76]]
Accuracy: 0.9561, Precision: 0.9500, Recall: 0.9870,    F1: 0.9682, AUC:0.9905

from lightgbm import plot_importance
import matplotlib.pyplot as plt
%matplotlib inline

fig, ax = plt.subplots(figsize=(10, 12))
# Passing the scikit-learn wrapper class also works.
plot_importance(lgbm_wrapper, ax=ax)

print(dataset.feature_names)

['mean radius' 'mean texture' 'mean perimeter' 'mean area' 'mean smoothness' 'mean compactness' 'mean concavity' 'mean concave points' 'mean symmetry' 'mean fractal dimension' 'radius error' 'texture error' 'perimeter error' 'area error' 'smoothness error' 'compactness error' 'concavity error' 'concave points error' 'symmetry error' 'fractal dimension error' 'worst radius' 'worst texture' 'worst perimeter' 'worst area' 'worst smoothness' 'worst compactness' 'worst concavity' 'worst concave points' 'worst symmetry' 'worst fractal dimension']
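
The importance plot above only shows the values graphically. The same numbers are available on the fitted wrapper as feature_importances_, in the usual scikit-learn style; a minimal sketch, assuming the lgbm_wrapper and dataset objects from above:

# Pair each feature name with its split-count importance from the fitted model.
import pandas as pd

ftr_importances = pd.Series(lgbm_wrapper.feature_importances_, index=dataset.feature_names)
print(ftr_importances.sort_values(ascending=False).head(10))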
  • XGBoost version check ★
import xgboost

print(xgboost.__version__)

# 1.3.3

 

Applying the Python wrapper XGBoost – Wisconsin Breast Cancer dataset

** Loading the dataset **

import xgboost as xgb
from xgboost import plot_importance

import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

dataset = load_breast_cancer()
X_features= dataset.data
y_label = dataset.target

cancer_df = pd.DataFrame(data=X_features, columns=dataset.feature_names)
cancer_df['target']= y_label
cancer_df.head(3)


mean radius	mean texture	mean perimeter	mean area	mean smoothness	mean compactness	mean concavity	mean concave points	mean symmetry	mean fractal dimension	...	worst texture	worst perimeter	worst area	worst smoothness	worst compactness	worst concavity	worst concave points	worst symmetry	worst fractal dimension	target
0	17.99	10.38	122.8	1001.0	0.11840	0.27760	0.3001	0.14710	0.2419	0.07871	...	17.33	184.6	2019.0	0.1622	0.6656	0.7119	0.2654	0.4601	0.11890	0
1	20.57	17.77	132.9	1326.0	0.08474	0.07864	0.0869	0.07017	0.1812	0.05667	...	23.41	158.8	1956.0	0.1238	0.1866	0.2416	0.1860	0.2750	0.08902	0
2	19.69	21.25	130.0	1203.0	0.10960	0.15990	0.1974	0.12790	0.2069	0.05999	...	25.53	152.5	1709.0	0.1444	0.4245	0.4504	0.2430	0.3613	0.08758	0
3 rows × 31 columns

 

print(dataset.target_names)
print(cancer_df['target'].value_counts())

# ['malignant' 'benign']
# 1    357
# 0    212
# Name: target, dtype: int64

 

# Split off 80% of the data for training and 20% for testing
X_train, X_test, y_train, y_test=train_test_split(X_features, y_label,
                                         test_size=0.2, random_state=156 )
print(X_train.shape , X_test.shape)
# (455, 30) (114, 30)

 

Convert the training and test datasets to DMatrix

dtrain = xgb.DMatrix(data=X_train , label=y_train)
dtest = xgb.DMatrix(data=X_test , label=y_test)
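
Here the DMatrix objects are built from NumPy arrays, so internally the features are just named f0–f29. A DMatrix can also be built straight from the pandas DataFrame created earlier, which attaches the real feature names; a small sketch under that assumption (the rest of this post keeps using the NumPy-based dtrain/dtest):

# Sketch: building a DMatrix from the DataFrame preserves column names as feature names.
ftr_df = cancer_df.drop('target', axis=1)
dmat_named = xgb.DMatrix(data=ftr_df, label=cancer_df['target'])
print(dmat_named.feature_names[:3])   # expected: ['mean radius', 'mean texture', 'mean perimeter']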

 

Hyperparameter setup

params = { 'max_depth':3,
           'eta': 0.1,
           'objective':'binary:logistic',
           'eval_metric':'logloss'
        }
num_rounds = 400

 

Train by passing the given hyperparameters and the early stopping parameter to the train() function

# The train dataset is labeled 'train' and the evaluation(test) dataset 'eval'.
wlist = [(dtrain,'train'),(dtest,'eval') ]
# Pass the hyperparameters and the early stopping parameter to the train() function.
# Early stopping is not a booster parameter, so it is passed here as early_stopping_rounds
# rather than inside the params dict.
xgb_model = xgb.train(params=params, dtrain=dtrain, num_boost_round=num_rounds,
                      early_stopping_rounds=100, evals=wlist)


[0]	train-logloss:0.609688	eval-logloss:0.61352
[1]	train-logloss:0.540803	eval-logloss:0.547842
[2]	train-logloss:0.483753	eval-logloss:0.494248
[3]	train-logloss:0.434457	eval-logloss:0.447986
[4]	train-logloss:0.39055	eval-logloss:0.409109
[5]	train-logloss:0.354145	eval-logloss:0.374977
[6]	train-logloss:0.321222	eval-logloss:0.345714
[7]	train-logloss:0.292592	eval-logloss:0.320529
[8]	train-logloss:0.267467	eval-logloss:0.29721
[9]	train-logloss:0.245153	eval-logloss:0.277991
[10]	train-logloss:0.225694	eval-logloss:0.260302
[11]	train-logloss:0.207937	eval-logloss:0.246037
[12]	train-logloss:0.192184	eval-logloss:0.231556
[13]	train-logloss:0.177916	eval-logloss:0.22005
[14]	train-logloss:0.165222	eval-logloss:0.208572
[15]	train-logloss:0.153622	eval-logloss:0.199993
[16]	train-logloss:0.14333	eval-logloss:0.190118
[17]	train-logloss:0.133985	eval-logloss:0.181818
[18]	train-logloss:0.125599	eval-logloss:0.174729
[19]	train-logloss:0.117286	eval-logloss:0.167657
[20]	train-logloss:0.109688	eval-logloss:0.158202
[21]	train-logloss:0.102975	eval-logloss:0.154725
[22]	train-logloss:0.097067	eval-logloss:0.148947
[23]	train-logloss:0.091428	eval-logloss:0.143308
[24]	train-logloss:0.086335	eval-logloss:0.136344
[25]	train-logloss:0.081311	eval-logloss:0.132778
[26]	train-logloss:0.076857	eval-logloss:0.127912
[27]	train-logloss:0.072836	eval-logloss:0.125263
[28]	train-logloss:0.069248	eval-logloss:0.119978
[29]	train-logloss:0.065549	eval-logloss:0.116412
[30]	train-logloss:0.062414	eval-logloss:0.114502
[31]	train-logloss:0.059591	eval-logloss:0.112572
[32]	train-logloss:0.057096	eval-logloss:0.11154
[33]	train-logloss:0.054407	eval-logloss:0.108681
[34]	train-logloss:0.052036	eval-logloss:0.106681
[35]	train-logloss:0.049751	eval-logloss:0.104207
[36]	train-logloss:0.04775	eval-logloss:0.102962
[37]	train-logloss:0.045853	eval-logloss:0.100576
[38]	train-logloss:0.044015	eval-logloss:0.098683
[39]	train-logloss:0.042263	eval-logloss:0.096444
[40]	train-logloss:0.040649	eval-logloss:0.095869
[41]	train-logloss:0.039126	eval-logloss:0.094242
[42]	train-logloss:0.037377	eval-logloss:0.094715
[43]	train-logloss:0.036106	eval-logloss:0.094272
[44]	train-logloss:0.034941	eval-logloss:0.093894
[45]	train-logloss:0.033654	eval-logloss:0.094184
[46]	train-logloss:0.032528	eval-logloss:0.09402
[47]	train-logloss:0.031485	eval-logloss:0.09236
[48]	train-logloss:0.030389	eval-logloss:0.093012
[49]	train-logloss:0.029467	eval-logloss:0.091273
[50]	train-logloss:0.028545	eval-logloss:0.090051
[51]	train-logloss:0.027525	eval-logloss:0.089605
[52]	train-logloss:0.026555	eval-logloss:0.089577
[53]	train-logloss:0.025682	eval-logloss:0.090703
[54]	train-logloss:0.025004	eval-logloss:0.089579
[55]	train-logloss:0.024297	eval-logloss:0.090357
[56]	train-logloss:0.023574	eval-logloss:0.091587
[57]	train-logloss:0.022965	eval-logloss:0.091527
[58]	train-logloss:0.022488	eval-logloss:0.091986
[59]	train-logloss:0.021854	eval-logloss:0.091951
[60]	train-logloss:0.021316	eval-logloss:0.091939
[61]	train-logloss:0.020794	eval-logloss:0.091461
[62]	train-logloss:0.020218	eval-logloss:0.090311
[63]	train-logloss:0.019701	eval-logloss:0.089407
[64]	train-logloss:0.01918	eval-logloss:0.089719
[65]	train-logloss:0.018724	eval-logloss:0.089743
[66]	train-logloss:0.018325	eval-logloss:0.089622
[67]	train-logloss:0.017867	eval-logloss:0.088734
[68]	train-logloss:0.017598	eval-logloss:0.088621
[69]	train-logloss:0.017243	eval-logloss:0.089739
[70]	train-logloss:0.01688	eval-logloss:0.089981
[71]	train-logloss:0.016641	eval-logloss:0.089782
[72]	train-logloss:0.016287	eval-logloss:0.089584
[73]	train-logloss:0.015983	eval-logloss:0.089533
[74]	train-logloss:0.015658	eval-logloss:0.088748
[75]	train-logloss:0.015393	eval-logloss:0.088597
[76]	train-logloss:0.015151	eval-logloss:0.08812
[77]	train-logloss:0.01488	eval-logloss:0.088396
[78]	train-logloss:0.014637	eval-logloss:0.088736
[79]	train-logloss:0.014491	eval-logloss:0.088153
[80]	train-logloss:0.014185	eval-logloss:0.087577
[81]	train-logloss:0.014005	eval-logloss:0.087412
[82]	train-logloss:0.013772	eval-logloss:0.08849
[83]	train-logloss:0.013568	eval-logloss:0.088575
[84]	train-logloss:0.013414	eval-logloss:0.08807
[85]	train-logloss:0.013253	eval-logloss:0.087641
[86]	train-logloss:0.013109	eval-logloss:0.087416
[87]	train-logloss:0.012926	eval-logloss:0.087611
[88]	train-logloss:0.012714	eval-logloss:0.087065
[89]	train-logloss:0.012544	eval-logloss:0.08727
[90]	train-logloss:0.012353	eval-logloss:0.087161
[91]	train-logloss:0.012226	eval-logloss:0.086962
[92]	train-logloss:0.012065	eval-logloss:0.087166
[93]	train-logloss:0.011927	eval-logloss:0.087067
[94]	train-logloss:0.011821	eval-logloss:0.086592
[95]	train-logloss:0.011649	eval-logloss:0.086116
[96]	train-logloss:0.011482	eval-logloss:0.087139
[97]	train-logloss:0.01136	eval-logloss:0.086768
[98]	train-logloss:0.011239	eval-logloss:0.086694
[99]	train-logloss:0.011132	eval-logloss:0.086547
[100]	train-logloss:0.011002	eval-logloss:0.086498
[101]	train-logloss:0.010852	eval-logloss:0.08641
[102]	train-logloss:0.010755	eval-logloss:0.086288
[103]	train-logloss:0.010636	eval-logloss:0.086258
[104]	train-logloss:0.0105	eval-logloss:0.086835
[105]	train-logloss:0.010395	eval-logloss:0.086767
[106]	train-logloss:0.010305	eval-logloss:0.087321
[107]	train-logloss:0.010197	eval-logloss:0.087304
[108]	train-logloss:0.010072	eval-logloss:0.08728
[109]	train-logloss:0.01	eval-logloss:0.087298
[110]	train-logloss:0.009914	eval-logloss:0.087289
[111]	train-logloss:0.009798	eval-logloss:0.088002
[112]	train-logloss:0.00971	eval-logloss:0.087936
[113]	train-logloss:0.009628	eval-logloss:0.087843
[114]	train-logloss:0.009558	eval-logloss:0.088066
[115]	train-logloss:0.009483	eval-logloss:0.087649
[116]	train-logloss:0.009416	eval-logloss:0.087298
[117]	train-logloss:0.009306	eval-logloss:0.087799
[118]	train-logloss:0.009228	eval-logloss:0.087751
[119]	train-logloss:0.009154	eval-logloss:0.08768
[120]	train-logloss:0.009118	eval-logloss:0.087626
[121]	train-logloss:0.009016	eval-logloss:0.08757
[122]	train-logloss:0.008972	eval-logloss:0.087547
[123]	train-logloss:0.008904	eval-logloss:0.087156
[124]	train-logloss:0.008837	eval-logloss:0.08767
[125]	train-logloss:0.008803	eval-logloss:0.087737
[126]	train-logloss:0.008709	eval-logloss:0.088275
[127]	train-logloss:0.008645	eval-logloss:0.088309
[128]	train-logloss:0.008613	eval-logloss:0.088266
[129]	train-logloss:0.008555	eval-logloss:0.087886
[130]	train-logloss:0.008463	eval-logloss:0.088861
[131]	train-logloss:0.008416	eval-logloss:0.088675
[132]	train-logloss:0.008385	eval-logloss:0.088743
[133]	train-logloss:0.0083	eval-logloss:0.089218
[134]	train-logloss:0.00827	eval-logloss:0.089179
[135]	train-logloss:0.008218	eval-logloss:0.088821
[136]	train-logloss:0.008157	eval-logloss:0.088512
[137]	train-logloss:0.008076	eval-logloss:0.08848
[138]	train-logloss:0.008047	eval-logloss:0.088386
[139]	train-logloss:0.007973	eval-logloss:0.089145
[140]	train-logloss:0.007946	eval-logloss:0.08911
[141]	train-logloss:0.007898	eval-logloss:0.088765
[142]	train-logloss:0.007872	eval-logloss:0.088678
[143]	train-logloss:0.007847	eval-logloss:0.088389
[144]	train-logloss:0.007776	eval-logloss:0.089271
[145]	train-logloss:0.007752	eval-logloss:0.089238
[146]	train-logloss:0.007728	eval-logloss:0.089139
[147]	train-logloss:0.007689	eval-logloss:0.088907
[148]	train-logloss:0.007621	eval-logloss:0.089416
[149]	train-logloss:0.007598	eval-logloss:0.089388
[150]	train-logloss:0.007575	eval-logloss:0.089108
[151]	train-logloss:0.007521	eval-logloss:0.088735
[152]	train-logloss:0.007498	eval-logloss:0.088717
[153]	train-logloss:0.007464	eval-logloss:0.088484
[154]	train-logloss:0.00741	eval-logloss:0.088471
[155]	train-logloss:0.007389	eval-logloss:0.088545
[156]	train-logloss:0.007367	eval-logloss:0.088521
[157]	train-logloss:0.007345	eval-logloss:0.088547
[158]	train-logloss:0.007323	eval-logloss:0.088275
[159]	train-logloss:0.007303	eval-logloss:0.0883
[160]	train-logloss:0.007282	eval-logloss:0.08828
[161]	train-logloss:0.007261	eval-logloss:0.088013
[162]	train-logloss:0.007241	eval-logloss:0.087758
[163]	train-logloss:0.007221	eval-logloss:0.087784
[164]	train-logloss:0.0072	eval-logloss:0.087777
[165]	train-logloss:0.00718	eval-logloss:0.087517
[166]	train-logloss:0.007161	eval-logloss:0.087542
[167]	train-logloss:0.007142	eval-logloss:0.087642
[168]	train-logloss:0.007122	eval-logloss:0.08739
[169]	train-logloss:0.007103	eval-logloss:0.087377
[170]	train-logloss:0.007084	eval-logloss:0.087298
[171]	train-logloss:0.007065	eval-logloss:0.087368
[172]	train-logloss:0.007047	eval-logloss:0.087395
[173]	train-logloss:0.007028	eval-logloss:0.087385
[174]	train-logloss:0.007009	eval-logloss:0.087132
[175]	train-logloss:0.006991	eval-logloss:0.087159
[176]	train-logloss:0.006973	eval-logloss:0.086955
[177]	train-logloss:0.006955	eval-logloss:0.087053
[178]	train-logloss:0.006937	eval-logloss:0.08697
[179]	train-logloss:0.00692	eval-logloss:0.086973
[180]	train-logloss:0.006901	eval-logloss:0.087038
[181]	train-logloss:0.006884	eval-logloss:0.086799
[182]	train-logloss:0.006866	eval-logloss:0.086826
[183]	train-logloss:0.006849	eval-logloss:0.086582
[184]	train-logloss:0.006831	eval-logloss:0.086588
[185]	train-logloss:0.006815	eval-logloss:0.086614
[186]	train-logloss:0.006798	eval-logloss:0.086372
[187]	train-logloss:0.006781	eval-logloss:0.086369
[188]	train-logloss:0.006764	eval-logloss:0.086297
[189]	train-logloss:0.006747	eval-logloss:0.086104
[190]	train-logloss:0.00673	eval-logloss:0.086023
[191]	train-logloss:0.006714	eval-logloss:0.08605
[192]	train-logloss:0.006698	eval-logloss:0.086149
[193]	train-logloss:0.006682	eval-logloss:0.085916
[194]	train-logloss:0.006666	eval-logloss:0.085915
[195]	train-logloss:0.00665	eval-logloss:0.085984
[196]	train-logloss:0.006634	eval-logloss:0.086012
[197]	train-logloss:0.006618	eval-logloss:0.085922
[198]	train-logloss:0.006603	eval-logloss:0.085853
[199]	train-logloss:0.006587	eval-logloss:0.085874
[200]	train-logloss:0.006572	eval-logloss:0.085888
[201]	train-logloss:0.006556	eval-logloss:0.08595
[202]	train-logloss:0.006542	eval-logloss:0.08573
[203]	train-logloss:0.006527	eval-logloss:0.08573
[204]	train-logloss:0.006512	eval-logloss:0.085753
[205]	train-logloss:0.006497	eval-logloss:0.085821
[206]	train-logloss:0.006483	eval-logloss:0.08584
[207]	train-logloss:0.006469	eval-logloss:0.085776
[208]	train-logloss:0.006455	eval-logloss:0.085686
[209]	train-logloss:0.00644	eval-logloss:0.08571
[210]	train-logloss:0.006427	eval-logloss:0.085806
[211]	train-logloss:0.006413	eval-logloss:0.085593
[212]	train-logloss:0.006399	eval-logloss:0.085801
[213]	train-logloss:0.006385	eval-logloss:0.085807
[214]	train-logloss:0.006372	eval-logloss:0.085744
[215]	train-logloss:0.006359	eval-logloss:0.085658
[216]	train-logloss:0.006345	eval-logloss:0.085843
[217]	train-logloss:0.006332	eval-logloss:0.085632
[218]	train-logloss:0.006319	eval-logloss:0.085726
[219]	train-logloss:0.006306	eval-logloss:0.085783
[220]	train-logloss:0.006293	eval-logloss:0.085791
[221]	train-logloss:0.00628	eval-logloss:0.085817
[222]	train-logloss:0.006268	eval-logloss:0.085757
[223]	train-logloss:0.006255	eval-logloss:0.085674
[224]	train-logloss:0.006242	eval-logloss:0.08586
[225]	train-logloss:0.00623	eval-logloss:0.085871
[226]	train-logloss:0.006218	eval-logloss:0.085927
[227]	train-logloss:0.006206	eval-logloss:0.085954
[228]	train-logloss:0.006194	eval-logloss:0.085874
[229]	train-logloss:0.006182	eval-logloss:0.086057
[230]	train-logloss:0.00617	eval-logloss:0.086002
[231]	train-logloss:0.006158	eval-logloss:0.085922
[232]	train-logloss:0.006147	eval-logloss:0.086102
[233]	train-logloss:0.006135	eval-logloss:0.086115
[234]	train-logloss:0.006124	eval-logloss:0.086169
[235]	train-logloss:0.006112	eval-logloss:0.086263
[236]	train-logloss:0.006101	eval-logloss:0.086291
[237]	train-logloss:0.00609	eval-logloss:0.086217
[238]	train-logloss:0.006079	eval-logloss:0.086395
[239]	train-logloss:0.006068	eval-logloss:0.086342
[240]	train-logloss:0.006057	eval-logloss:0.08618
[241]	train-logloss:0.006046	eval-logloss:0.086195
[242]	train-logloss:0.006036	eval-logloss:0.086248
[243]	train-logloss:0.006025	eval-logloss:0.086263
[244]	train-logloss:0.006014	eval-logloss:0.086293
[245]	train-logloss:0.006004	eval-logloss:0.086222
[246]	train-logloss:0.005993	eval-logloss:0.086398
[247]	train-logloss:0.005983	eval-logloss:0.086347
[248]	train-logloss:0.005972	eval-logloss:0.086276
[249]	train-logloss:0.005962	eval-logloss:0.086448
[250]	train-logloss:0.005952	eval-logloss:0.086294
[251]	train-logloss:0.005942	eval-logloss:0.086312
[252]	train-logloss:0.005932	eval-logloss:0.086364
[253]	train-logloss:0.005922	eval-logloss:0.086394
[254]	train-logloss:0.005912	eval-logloss:0.08649
[255]	train-logloss:0.005903	eval-logloss:0.086441
[256]	train-logloss:0.005893	eval-logloss:0.08629
[257]	train-logloss:0.005883	eval-logloss:0.086459
[258]	train-logloss:0.005874	eval-logloss:0.086391
[259]	train-logloss:0.005864	eval-logloss:0.086441
[260]	train-logloss:0.005855	eval-logloss:0.086461
[261]	train-logloss:0.005845	eval-logloss:0.086491
[262]	train-logloss:0.005836	eval-logloss:0.086445
[263]	train-logloss:0.005827	eval-logloss:0.086466
[264]	train-logloss:0.005818	eval-logloss:0.086319
[265]	train-logloss:0.005809	eval-logloss:0.086488
[266]	train-logloss:0.0058	eval-logloss:0.086538
[267]	train-logloss:0.005791	eval-logloss:0.086471
[268]	train-logloss:0.005782	eval-logloss:0.086501
[269]	train-logloss:0.005773	eval-logloss:0.086522
[270]	train-logloss:0.005764	eval-logloss:0.086689
[271]	train-logloss:0.005755	eval-logloss:0.086738
[272]	train-logloss:0.005747	eval-logloss:0.086829
[273]	train-logloss:0.005738	eval-logloss:0.086684
[274]	train-logloss:0.005729	eval-logloss:0.08664
[275]	train-logloss:0.005721	eval-logloss:0.086496
[276]	train-logloss:0.005712	eval-logloss:0.086355
[277]	train-logloss:0.005704	eval-logloss:0.086519
[278]	train-logloss:0.005696	eval-logloss:0.086567
[279]	train-logloss:0.005687	eval-logloss:0.08659
[280]	train-logloss:0.005679	eval-logloss:0.086679
[281]	train-logloss:0.005671	eval-logloss:0.086637
[282]	train-logloss:0.005663	eval-logloss:0.086499
[283]	train-logloss:0.005655	eval-logloss:0.086356
[284]	train-logloss:0.005646	eval-logloss:0.086405
[285]	train-logloss:0.005639	eval-logloss:0.086429
[286]	train-logloss:0.005631	eval-logloss:0.086456
[287]	train-logloss:0.005623	eval-logloss:0.086504
[288]	train-logloss:0.005615	eval-logloss:0.08637
[289]	train-logloss:0.005608	eval-logloss:0.086457
[290]	train-logloss:0.0056	eval-logloss:0.086453
[291]	train-logloss:0.005593	eval-logloss:0.086322
[292]	train-logloss:0.005585	eval-logloss:0.086284
[293]	train-logloss:0.005577	eval-logloss:0.086148
[294]	train-logloss:0.00557	eval-logloss:0.086196
[295]	train-logloss:0.005563	eval-logloss:0.086221
[296]	train-logloss:0.005556	eval-logloss:0.086308
[297]	train-logloss:0.005548	eval-logloss:0.086178
[298]	train-logloss:0.005541	eval-logloss:0.086263
[299]	train-logloss:0.005534	eval-logloss:0.086131
[300]	train-logloss:0.005526	eval-logloss:0.086179
[301]	train-logloss:0.005519	eval-logloss:0.086052
[302]	train-logloss:0.005512	eval-logloss:0.086016
[303]	train-logloss:0.005505	eval-logloss:0.086101
[304]	train-logloss:0.005498	eval-logloss:0.085977
[305]	train-logloss:0.005491	eval-logloss:0.086059
[306]	train-logloss:0.005484	eval-logloss:0.085971
[307]	train-logloss:0.005478	eval-logloss:0.085998
[308]	train-logloss:0.005471	eval-logloss:0.085998
[309]	train-logloss:0.005464	eval-logloss:0.085877
[310]	train-logloss:0.005457	eval-logloss:0.085923
[311]	train-logloss:0.00545	eval-logloss:0.085948
[312]	train-logloss:0.005444	eval-logloss:0.086028
[313]	train-logloss:0.005437	eval-logloss:0.086112
[314]	train-logloss:0.00543	eval-logloss:0.085989
[315]	train-logloss:0.005424	eval-logloss:0.085903
[316]	train-logloss:0.005417	eval-logloss:0.085949
[317]	train-logloss:0.005411	eval-logloss:0.085977
[318]	train-logloss:0.005404	eval-logloss:0.086002
[319]	train-logloss:0.005398	eval-logloss:0.085883
[320]	train-logloss:0.005392	eval-logloss:0.085967
[321]	train-logloss:0.005385	eval-logloss:0.086046
[322]	train-logloss:0.005379	eval-logloss:0.086091
[323]	train-logloss:0.005373	eval-logloss:0.085977
[324]	train-logloss:0.005366	eval-logloss:0.085978
[325]	train-logloss:0.00536	eval-logloss:0.085896
[326]	train-logloss:0.005354	eval-logloss:0.08578
[327]	train-logloss:0.005348	eval-logloss:0.085857
[328]	train-logloss:0.005342	eval-logloss:0.085939
[329]	train-logloss:0.005336	eval-logloss:0.085825
[330]	train-logloss:0.00533	eval-logloss:0.085869
[331]	train-logloss:0.005324	eval-logloss:0.085893
[332]	train-logloss:0.005318	eval-logloss:0.085922
[333]	train-logloss:0.005312	eval-logloss:0.085842
[334]	train-logloss:0.005306	eval-logloss:0.085735
[335]	train-logloss:0.0053	eval-logloss:0.085816
[336]	train-logloss:0.005294	eval-logloss:0.085892
[337]	train-logloss:0.005288	eval-logloss:0.085936
[338]	train-logloss:0.005283	eval-logloss:0.08583
[339]	train-logloss:0.005277	eval-logloss:0.085909
[340]	train-logloss:0.005271	eval-logloss:0.085831
[341]	train-logloss:0.005265	eval-logloss:0.085727
[342]	train-logloss:0.00526	eval-logloss:0.085678
[343]	train-logloss:0.005254	eval-logloss:0.085721
[344]	train-logloss:0.005249	eval-logloss:0.085796
[345]	train-logloss:0.005243	eval-logloss:0.085819
[346]	train-logloss:0.005237	eval-logloss:0.085715
[347]	train-logloss:0.005232	eval-logloss:0.085793
[348]	train-logloss:0.005227	eval-logloss:0.085835
[349]	train-logloss:0.005221	eval-logloss:0.085734
[350]	train-logloss:0.005216	eval-logloss:0.085658
[351]	train-logloss:0.00521	eval-logloss:0.08573
[352]	train-logloss:0.005205	eval-logloss:0.085807
[353]	train-logloss:0.0052	eval-logloss:0.085706
[354]	train-logloss:0.005195	eval-logloss:0.085659
[355]	train-logloss:0.005189	eval-logloss:0.085701
[356]	train-logloss:0.005184	eval-logloss:0.085628
[357]	train-logloss:0.005179	eval-logloss:0.085529
[358]	train-logloss:0.005174	eval-logloss:0.085604
[359]	train-logloss:0.005169	eval-logloss:0.085676
[360]	train-logloss:0.005164	eval-logloss:0.085579
[361]	train-logloss:0.005159	eval-logloss:0.085601
[362]	train-logloss:0.005153	eval-logloss:0.085643
[363]	train-logloss:0.005149	eval-logloss:0.085713
[364]	train-logloss:0.005144	eval-logloss:0.085787
[365]	train-logloss:0.005139	eval-logloss:0.085689
[366]	train-logloss:0.005134	eval-logloss:0.08573
[367]	train-logloss:0.005129	eval-logloss:0.085684
[368]	train-logloss:0.005124	eval-logloss:0.085589
[369]	train-logloss:0.005119	eval-logloss:0.085516
[370]	train-logloss:0.005114	eval-logloss:0.085588
[371]	train-logloss:0.00511	eval-logloss:0.085495
[372]	train-logloss:0.005105	eval-logloss:0.085564
[373]	train-logloss:0.0051	eval-logloss:0.085605
[374]	train-logloss:0.005096	eval-logloss:0.085626
[375]	train-logloss:0.005091	eval-logloss:0.085535
[376]	train-logloss:0.005086	eval-logloss:0.085606
[377]	train-logloss:0.005082	eval-logloss:0.085674
[378]	train-logloss:0.005077	eval-logloss:0.085714
[379]	train-logloss:0.005073	eval-logloss:0.085624
[380]	train-logloss:0.005068	eval-logloss:0.085579
[381]	train-logloss:0.005064	eval-logloss:0.085619
[382]	train-logloss:0.00506	eval-logloss:0.085638
[383]	train-logloss:0.005055	eval-logloss:0.08555
[384]	train-logloss:0.005051	eval-logloss:0.085617
[385]	train-logloss:0.005047	eval-logloss:0.085621
[386]	train-logloss:0.005042	eval-logloss:0.085551
[387]	train-logloss:0.005038	eval-logloss:0.085463
[388]	train-logloss:0.005034	eval-logloss:0.085502
[389]	train-logloss:0.005029	eval-logloss:0.085459
[390]	train-logloss:0.005025	eval-logloss:0.085321
[391]	train-logloss:0.005021	eval-logloss:0.085389
[392]	train-logloss:0.005017	eval-logloss:0.085303
[393]	train-logloss:0.005013	eval-logloss:0.085369
[394]	train-logloss:0.005009	eval-logloss:0.085301
[395]	train-logloss:0.005005	eval-logloss:0.085368
[396]	train-logloss:0.005	eval-logloss:0.085283
[397]	train-logloss:0.004996	eval-logloss:0.08532
[398]	train-logloss:0.004992	eval-logloss:0.085279
[399]	train-logloss:0.004988	eval-logloss:0.085196

 

Return predicted probabilities with predict() and convert them to class predictions

pred_probs = xgb_model.predict(dtest)
print('Showing the first 10 results of predict(); values are predicted probabilities')
print(np.round(pred_probs[:10],3))

# If the predicted probability is greater than 0.5, predict 1, otherwise 0, and store the results in the list preds
preds = [ 1 if x > 0.5 else 0 for x in pred_probs ]
print('First 10 predicted labels:', preds[:10])


# Showing the first 10 results of predict(); values are predicted probabilities
# [0.95  0.003 0.9   0.086 0.993 1.    1.    0.999 0.998 0.   ]
# First 10 predicted labels: [1, 0, 1, 0, 1, 1, 1, 1, 1, 0]
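
The list comprehension makes the 0.5 threshold explicit; the same conversion can also be done in one vectorized call. A trivial sketch (plain NumPy, nothing XGBoost-specific):

# Vectorized alternative to the list comprehension above.
preds_np = np.where(pred_probs > 0.5, 1, 0)
print(preds_np[:10])   # expected: [1 0 1 0 1 1 1 1 1 0]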

 

Evaluate the predictions with get_clf_eval()

from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import f1_score, roc_auc_score

# Modified get_clf_eval() function
def get_clf_eval(y_test, pred=None, pred_proba=None):
    confusion = confusion_matrix(y_test, pred)
    accuracy = accuracy_score(y_test, pred)
    precision = precision_score(y_test, pred)
    recall = recall_score(y_test, pred)
    f1 = f1_score(y_test, pred)
    # ROC-AUC added
    roc_auc = roc_auc_score(y_test, pred_proba)
    print('Confusion matrix')
    print(confusion)
    # ROC-AUC added to the printout
    print('Accuracy: {0:.4f}, Precision: {1:.4f}, Recall: {2:.4f},\
    F1: {3:.4f}, AUC:{4:.4f}'.format(accuracy, precision, recall, f1, roc_auc))


get_clf_eval(y_test , preds, pred_probs)

Confusion matrix
[[35  2]
 [ 1 76]]
Accuracy: 0.9737, Precision: 0.9744, Recall: 0.9870,    F1: 0.9806, AUC:0.9951

 

Feature Importance visualization

import matplotlib.pyplot as plt
%matplotlib inline

fig, ax = plt.subplots(figsize=(10, 12))
plot_importance(xgb_model, ax=ax)
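
plot_importance() draws the importances; the underlying numbers can also be pulled directly from the trained Booster. Because dtrain was built from NumPy arrays, the keys come back as f0–f29 rather than the original column names; a minimal sketch, assuming the xgb_model trained above:

# Numeric feature importances from the Booster.
# importance_type='weight' counts how many times each feature is used in a split.
scores = xgb_model.get_score(importance_type='weight')
for feat, cnt in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(feat, cnt)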

 

Overview and application of the scikit-learn wrapper XGBoost

** Import the scikit-learn wrapper class, then train and predict **

# Import XGBClassifier, the scikit-learn wrapper XGBoost class
from xgboost import XGBClassifier

evals = [(X_test, y_test)]

xgb_wrapper = XGBClassifier(n_estimators=400, learning_rate=0.1, max_depth=3)
xgb_wrapper.fit(X_train, y_train, early_stopping_rounds=400, eval_set=evals,
                eval_metric="logloss", verbose=True)

w_preds = xgb_wrapper.predict(X_test)
w_pred_proba = xgb_wrapper.predict_proba(X_test)[:, 1]

[0]	validation_0-logloss:0.61352
Will train until validation_0-logloss hasn't improved in 400 rounds.
[1]	validation_0-logloss:0.547842
[2]	validation_0-logloss:0.494248
[3]	validation_0-logloss:0.447986
[4]	validation_0-logloss:0.409109
[5]	validation_0-logloss:0.374977
[6]	validation_0-logloss:0.345714
[7]	validation_0-logloss:0.320529
[8]	validation_0-logloss:0.29721
[9]	validation_0-logloss:0.277991
[10]	validation_0-logloss:0.260302
[11]	validation_0-logloss:0.246037
[12]	validation_0-logloss:0.231556
[13]	validation_0-logloss:0.22005
[14]	validation_0-logloss:0.208572
[15]	validation_0-logloss:0.199993
[16]	validation_0-logloss:0.190118
[17]	validation_0-logloss:0.181818
[18]	validation_0-logloss:0.174729
[19]	validation_0-logloss:0.167657
[20]	validation_0-logloss:0.158202
[21]	validation_0-logloss:0.154725
[22]	validation_0-logloss:0.148947
[23]	validation_0-logloss:0.143308
[24]	validation_0-logloss:0.136344
[25]	validation_0-logloss:0.132778
[26]	validation_0-logloss:0.127912
[27]	validation_0-logloss:0.125263
[28]	validation_0-logloss:0.119978
[29]	validation_0-logloss:0.116412
[30]	validation_0-logloss:0.114502
[31]	validation_0-logloss:0.112572
[32]	validation_0-logloss:0.11154
[33]	validation_0-logloss:0.108681
[34]	validation_0-logloss:0.106681
[35]	validation_0-logloss:0.104207
[36]	validation_0-logloss:0.102962
[37]	validation_0-logloss:0.100576
[38]	validation_0-logloss:0.098683
[39]	validation_0-logloss:0.096444
[40]	validation_0-logloss:0.095869
[41]	validation_0-logloss:0.094242
[42]	validation_0-logloss:0.094715
[43]	validation_0-logloss:0.094272
[44]	validation_0-logloss:0.093894
[45]	validation_0-logloss:0.094184
[46]	validation_0-logloss:0.09402
[47]	validation_0-logloss:0.09236
[48]	validation_0-logloss:0.093012
[49]	validation_0-logloss:0.091273
[50]	validation_0-logloss:0.090051
[51]	validation_0-logloss:0.089605
[52]	validation_0-logloss:0.089577
[53]	validation_0-logloss:0.090703
[54]	validation_0-logloss:0.089579
[55]	validation_0-logloss:0.090357
[56]	validation_0-logloss:0.091587
[57]	validation_0-logloss:0.091527
[58]	validation_0-logloss:0.091986
[59]	validation_0-logloss:0.091951
[60]	validation_0-logloss:0.091939
[61]	validation_0-logloss:0.091461
[62]	validation_0-logloss:0.090311
[63]	validation_0-logloss:0.089407
[64]	validation_0-logloss:0.089719
[65]	validation_0-logloss:0.089743
[66]	validation_0-logloss:0.089622
[67]	validation_0-logloss:0.088734
[68]	validation_0-logloss:0.088621
[69]	validation_0-logloss:0.089739
[70]	validation_0-logloss:0.089981
[71]	validation_0-logloss:0.089782
[72]	validation_0-logloss:0.089584
[73]	validation_0-logloss:0.089533
[74]	validation_0-logloss:0.088748
[75]	validation_0-logloss:0.088597
[76]	validation_0-logloss:0.08812
[77]	validation_0-logloss:0.088396
[78]	validation_0-logloss:0.088736
[79]	validation_0-logloss:0.088153
[80]	validation_0-logloss:0.087577
[81]	validation_0-logloss:0.087412
[82]	validation_0-logloss:0.08849
[83]	validation_0-logloss:0.088575
[84]	validation_0-logloss:0.08807
[85]	validation_0-logloss:0.087641
[86]	validation_0-logloss:0.087416
[87]	validation_0-logloss:0.087611
[88]	validation_0-logloss:0.087065
[89]	validation_0-logloss:0.08727
[90]	validation_0-logloss:0.087161
[91]	validation_0-logloss:0.086962
[92]	validation_0-logloss:0.087166
[93]	validation_0-logloss:0.087067
[94]	validation_0-logloss:0.086592
[95]	validation_0-logloss:0.086116
[96]	validation_0-logloss:0.087139
[97]	validation_0-logloss:0.086768
[98]	validation_0-logloss:0.086694
[99]	validation_0-logloss:0.086547
[100]	validation_0-logloss:0.086498
[101]	validation_0-logloss:0.08641
[102]	validation_0-logloss:0.086288
[103]	validation_0-logloss:0.086258
[104]	validation_0-logloss:0.086835
[105]	validation_0-logloss:0.086767
[106]	validation_0-logloss:0.087321
[107]	validation_0-logloss:0.087304
[108]	validation_0-logloss:0.08728
[109]	validation_0-logloss:0.087298
[110]	validation_0-logloss:0.087289
[111]	validation_0-logloss:0.088002
[112]	validation_0-logloss:0.087936
[113]	validation_0-logloss:0.087843
[114]	validation_0-logloss:0.088066
[115]	validation_0-logloss:0.087649
[116]	validation_0-logloss:0.087298
[117]	validation_0-logloss:0.087799
[118]	validation_0-logloss:0.087751
[119]	validation_0-logloss:0.08768
[120]	validation_0-logloss:0.087626
[121]	validation_0-logloss:0.08757
[122]	validation_0-logloss:0.087547
[123]	validation_0-logloss:0.087156
[124]	validation_0-logloss:0.08767
[125]	validation_0-logloss:0.087737
[126]	validation_0-logloss:0.088275
[127]	validation_0-logloss:0.088309
[128]	validation_0-logloss:0.088266
[129]	validation_0-logloss:0.087886
[130]	validation_0-logloss:0.088861
[131]	validation_0-logloss:0.088675
[132]	validation_0-logloss:0.088743
[133]	validation_0-logloss:0.089218
[134]	validation_0-logloss:0.089179
[135]	validation_0-logloss:0.088821
[136]	validation_0-logloss:0.088512
[137]	validation_0-logloss:0.08848
[138]	validation_0-logloss:0.088386
[139]	validation_0-logloss:0.089145
[140]	validation_0-logloss:0.08911
[141]	validation_0-logloss:0.088765
[142]	validation_0-logloss:0.088678
[143]	validation_0-logloss:0.088389
[144]	validation_0-logloss:0.089271
[145]	validation_0-logloss:0.089238
[146]	validation_0-logloss:0.089139
[147]	validation_0-logloss:0.088907
[148]	validation_0-logloss:0.089416
[149]	validation_0-logloss:0.089388
[150]	validation_0-logloss:0.089108
[151]	validation_0-logloss:0.088735
[152]	validation_0-logloss:0.088717
[153]	validation_0-logloss:0.088484
[154]	validation_0-logloss:0.088471
[155]	validation_0-logloss:0.088545
[156]	validation_0-logloss:0.088521
[157]	validation_0-logloss:0.088547
[158]	validation_0-logloss:0.088275
[159]	validation_0-logloss:0.0883
[160]	validation_0-logloss:0.08828
[161]	validation_0-logloss:0.088013
[162]	validation_0-logloss:0.087758
[163]	validation_0-logloss:0.087784
[164]	validation_0-logloss:0.087777
[165]	validation_0-logloss:0.087517
[166]	validation_0-logloss:0.087542
[167]	validation_0-logloss:0.087642
[168]	validation_0-logloss:0.08739
[169]	validation_0-logloss:0.087377
[170]	validation_0-logloss:0.087298
[171]	validation_0-logloss:0.087368
[172]	validation_0-logloss:0.087395
[173]	validation_0-logloss:0.087385
[174]	validation_0-logloss:0.087132
[175]	validation_0-logloss:0.087159
[176]	validation_0-logloss:0.086955
[177]	validation_0-logloss:0.087053
[178]	validation_0-logloss:0.08697
[179]	validation_0-logloss:0.086973
[180]	validation_0-logloss:0.087038
[181]	validation_0-logloss:0.086799
[182]	validation_0-logloss:0.086826
[183]	validation_0-logloss:0.086582
[184]	validation_0-logloss:0.086588
[185]	validation_0-logloss:0.086614
[186]	validation_0-logloss:0.086372
[187]	validation_0-logloss:0.086369
[188]	validation_0-logloss:0.086297
[189]	validation_0-logloss:0.086104
[190]	validation_0-logloss:0.086023
[191]	validation_0-logloss:0.08605
[192]	validation_0-logloss:0.086149
[193]	validation_0-logloss:0.085916
[194]	validation_0-logloss:0.085915
[195]	validation_0-logloss:0.085984
[196]	validation_0-logloss:0.086012
[197]	validation_0-logloss:0.085922
[198]	validation_0-logloss:0.085853
[199]	validation_0-logloss:0.085874
[200]	validation_0-logloss:0.085888
[201]	validation_0-logloss:0.08595
[202]	validation_0-logloss:0.08573
[203]	validation_0-logloss:0.08573
[204]	validation_0-logloss:0.085753
[205]	validation_0-logloss:0.085821
[206]	validation_0-logloss:0.08584
[207]	validation_0-logloss:0.085776
[208]	validation_0-logloss:0.085686
[209]	validation_0-logloss:0.08571
[210]	validation_0-logloss:0.085806
[211]	validation_0-logloss:0.085593
[212]	validation_0-logloss:0.085801
[213]	validation_0-logloss:0.085807
[214]	validation_0-logloss:0.085744
[215]	validation_0-logloss:0.085658
[216]	validation_0-logloss:0.085843
[217]	validation_0-logloss:0.085632
[218]	validation_0-logloss:0.085726
[219]	validation_0-logloss:0.085783
[220]	validation_0-logloss:0.085791
[221]	validation_0-logloss:0.085817
[222]	validation_0-logloss:0.085757
[223]	validation_0-logloss:0.085674
[224]	validation_0-logloss:0.08586
[225]	validation_0-logloss:0.085871
[226]	validation_0-logloss:0.085927
[227]	validation_0-logloss:0.085954
[228]	validation_0-logloss:0.085874
[229]	validation_0-logloss:0.086057
[230]	validation_0-logloss:0.086002
[231]	validation_0-logloss:0.085922
[232]	validation_0-logloss:0.086102
[233]	validation_0-logloss:0.086115
[234]	validation_0-logloss:0.086169
[235]	validation_0-logloss:0.086263
[236]	validation_0-logloss:0.086291
[237]	validation_0-logloss:0.086217
[238]	validation_0-logloss:0.086395
[239]	validation_0-logloss:0.086342
[240]	validation_0-logloss:0.08618
[241]	validation_0-logloss:0.086195
[242]	validation_0-logloss:0.086248
[243]	validation_0-logloss:0.086263
[244]	validation_0-logloss:0.086293
[245]	validation_0-logloss:0.086222
[246]	validation_0-logloss:0.086398
[247]	validation_0-logloss:0.086347
[248]	validation_0-logloss:0.086276
[249]	validation_0-logloss:0.086448
[250]	validation_0-logloss:0.086294
[251]	validation_0-logloss:0.086312
[252]	validation_0-logloss:0.086364
[253]	validation_0-logloss:0.086394
[254]	validation_0-logloss:0.08649
[255]	validation_0-logloss:0.086441
[256]	validation_0-logloss:0.08629
[257]	validation_0-logloss:0.086459
[258]	validation_0-logloss:0.086391
[259]	validation_0-logloss:0.086441
[260]	validation_0-logloss:0.086461
[261]	validation_0-logloss:0.086491
[262]	validation_0-logloss:0.086445
[263]	validation_0-logloss:0.086466
[264]	validation_0-logloss:0.086319
[265]	validation_0-logloss:0.086488
[266]	validation_0-logloss:0.086538
[267]	validation_0-logloss:0.086471
[268]	validation_0-logloss:0.086501
[269]	validation_0-logloss:0.086522
[270]	validation_0-logloss:0.086689
[271]	validation_0-logloss:0.086738
[272]	validation_0-logloss:0.086829
[273]	validation_0-logloss:0.086684
[274]	validation_0-logloss:0.08664
[275]	validation_0-logloss:0.086496
[276]	validation_0-logloss:0.086355
[277]	validation_0-logloss:0.086519
[278]	validation_0-logloss:0.086567
[279]	validation_0-logloss:0.08659
[280]	validation_0-logloss:0.086679
[281]	validation_0-logloss:0.086637
[282]	validation_0-logloss:0.086499
[283]	validation_0-logloss:0.086356
[284]	validation_0-logloss:0.086405
[285]	validation_0-logloss:0.086429
[286]	validation_0-logloss:0.086456
[287]	validation_0-logloss:0.086504
[288]	validation_0-logloss:0.08637
[289]	validation_0-logloss:0.086457
[290]	validation_0-logloss:0.086453
[291]	validation_0-logloss:0.086322
[292]	validation_0-logloss:0.086284
[293]	validation_0-logloss:0.086148
[294]	validation_0-logloss:0.086196
[295]	validation_0-logloss:0.086221
[296]	validation_0-logloss:0.086308
[297]	validation_0-logloss:0.086178
[298]	validation_0-logloss:0.086263
[299]	validation_0-logloss:0.086131
[300]	validation_0-logloss:0.086179
[301]	validation_0-logloss:0.086052
[302]	validation_0-logloss:0.086016
[303]	validation_0-logloss:0.086101
[304]	validation_0-logloss:0.085977
[305]	validation_0-logloss:0.086059
[306]	validation_0-logloss:0.085971
[307]	validation_0-logloss:0.085998
[308]	validation_0-logloss:0.085998
[309]	validation_0-logloss:0.085877
[310]	validation_0-logloss:0.085923
[311]	validation_0-logloss:0.085948
[312]	validation_0-logloss:0.086028
[313]	validation_0-logloss:0.086112
[314]	validation_0-logloss:0.085989
[315]	validation_0-logloss:0.085903
[316]	validation_0-logloss:0.085949
[317]	validation_0-logloss:0.085977
[318]	validation_0-logloss:0.086002
[319]	validation_0-logloss:0.085883
[320]	validation_0-logloss:0.085967
[321]	validation_0-logloss:0.086046
[322]	validation_0-logloss:0.086091
[323]	validation_0-logloss:0.085977
[324]	validation_0-logloss:0.085978
[325]	validation_0-logloss:0.085896
[326]	validation_0-logloss:0.08578
[327]	validation_0-logloss:0.085857
[328]	validation_0-logloss:0.085939
[329]	validation_0-logloss:0.085825
[330]	validation_0-logloss:0.085869
[331]	validation_0-logloss:0.085893
[332]	validation_0-logloss:0.085922
[333]	validation_0-logloss:0.085842
[334]	validation_0-logloss:0.085735
[335]	validation_0-logloss:0.085816
[336]	validation_0-logloss:0.085892
[337]	validation_0-logloss:0.085936
[338]	validation_0-logloss:0.08583
[339]	validation_0-logloss:0.085909
[340]	validation_0-logloss:0.085831
[341]	validation_0-logloss:0.085727
[342]	validation_0-logloss:0.085678
[343]	validation_0-logloss:0.085721
[344]	validation_0-logloss:0.085796
[345]	validation_0-logloss:0.085819
[346]	validation_0-logloss:0.085715
[347]	validation_0-logloss:0.085793
[348]	validation_0-logloss:0.085835
[349]	validation_0-logloss:0.085734
[350]	validation_0-logloss:0.085658
[351]	validation_0-logloss:0.08573
[352]	validation_0-logloss:0.085807
[353]	validation_0-logloss:0.085706
[354]	validation_0-logloss:0.085659
[355]	validation_0-logloss:0.085701
[356]	validation_0-logloss:0.085628
[357]	validation_0-logloss:0.085529
[358]	validation_0-logloss:0.085604
[359]	validation_0-logloss:0.085676
[360]	validation_0-logloss:0.085579
[361]	validation_0-logloss:0.085601
[362]	validation_0-logloss:0.085643
[363]	validation_0-logloss:0.085713
[364]	validation_0-logloss:0.085787
[365]	validation_0-logloss:0.085689
[366]	validation_0-logloss:0.08573
[367]	validation_0-logloss:0.085684
[368]	validation_0-logloss:0.085589
[369]	validation_0-logloss:0.085516
[370]	validation_0-logloss:0.085588
[371]	validation_0-logloss:0.085495
[372]	validation_0-logloss:0.085564
[373]	validation_0-logloss:0.085605
[374]	validation_0-logloss:0.085626
[375]	validation_0-logloss:0.085535
[376]	validation_0-logloss:0.085606
[377]	validation_0-logloss:0.085674
[378]	validation_0-logloss:0.085714
[379]	validation_0-logloss:0.085624
[380]	validation_0-logloss:0.085579
[381]	validation_0-logloss:0.085619
[382]	validation_0-logloss:0.085638
[383]	validation_0-logloss:0.08555
[384]	validation_0-logloss:0.085617
[385]	validation_0-logloss:0.085621
[386]	validation_0-logloss:0.085551
[387]	validation_0-logloss:0.085463
[388]	validation_0-logloss:0.085502
[389]	validation_0-logloss:0.085459
[390]	validation_0-logloss:0.085321
[391]	validation_0-logloss:0.085389
[392]	validation_0-logloss:0.085303
[393]	validation_0-logloss:0.085369
[394]	validation_0-logloss:0.085301
[395]	validation_0-logloss:0.085368
[396]	validation_0-logloss:0.085283
[397]	validation_0-logloss:0.08532
[398]	validation_0-logloss:0.085279
[399]	validation_0-logloss:0.085196

 

get_clf_eval(y_test , w_preds, w_pred_proba)

Confusion matrix
[[35  2]
 [ 1 76]]
Accuracy: 0.9737, Precision: 0.9744, Recall: 0.9870,    F1: 0.9806, AUC:0.9951

 

Set early stopping to 100 and re-train/predict/evaluate

from xgboost import XGBClassifier

xgb_wrapper = XGBClassifier(n_estimators=400, learning_rate=0.1, max_depth=3)

evals = [(X_test, y_test)]
xgb_wrapper.fit(X_train, y_train, early_stopping_rounds=100, eval_metric="logloss", 
                eval_set=evals, verbose=True)

ws100_preds = xgb_wrapper.predict(X_test)
ws100_pred_proba = xgb_wrapper.predict_proba(X_test)[:, 1]

[0]	validation_0-logloss:0.61352
Will train until validation_0-logloss hasn't improved in 100 rounds.
[1]	validation_0-logloss:0.547842
[2]	validation_0-logloss:0.494248
[3]	validation_0-logloss:0.447986
[4]	validation_0-logloss:0.409109
[5]	validation_0-logloss:0.374977
[6]	validation_0-logloss:0.345714
[7]	validation_0-logloss:0.320529
[8]	validation_0-logloss:0.29721
[9]	validation_0-logloss:0.277991
[10]	validation_0-logloss:0.260302
[11]	validation_0-logloss:0.246037
[12]	validation_0-logloss:0.231556
[13]	validation_0-logloss:0.22005
[14]	validation_0-logloss:0.208572
[15]	validation_0-logloss:0.199993
[16]	validation_0-logloss:0.190118
[17]	validation_0-logloss:0.181818
[18]	validation_0-logloss:0.174729
[19]	validation_0-logloss:0.167657
[20]	validation_0-logloss:0.158202
[21]	validation_0-logloss:0.154725
[22]	validation_0-logloss:0.148947
[23]	validation_0-logloss:0.143308
[24]	validation_0-logloss:0.136344
[25]	validation_0-logloss:0.132778
[26]	validation_0-logloss:0.127912
[27]	validation_0-logloss:0.125263
[28]	validation_0-logloss:0.119978
[29]	validation_0-logloss:0.116412
[30]	validation_0-logloss:0.114502
[31]	validation_0-logloss:0.112572
[32]	validation_0-logloss:0.11154
[33]	validation_0-logloss:0.108681
[34]	validation_0-logloss:0.106681
[35]	validation_0-logloss:0.104207
[36]	validation_0-logloss:0.102962
[37]	validation_0-logloss:0.100576
[38]	validation_0-logloss:0.098683
[39]	validation_0-logloss:0.096444
[40]	validation_0-logloss:0.095869
[41]	validation_0-logloss:0.094242
[42]	validation_0-logloss:0.094715
[43]	validation_0-logloss:0.094272
[44]	validation_0-logloss:0.093894
[45]	validation_0-logloss:0.094184
[46]	validation_0-logloss:0.09402
[47]	validation_0-logloss:0.09236
[48]	validation_0-logloss:0.093012
[49]	validation_0-logloss:0.091273
[50]	validation_0-logloss:0.090051
[51]	validation_0-logloss:0.089605
[52]	validation_0-logloss:0.089577
[53]	validation_0-logloss:0.090703
[54]	validation_0-logloss:0.089579
[55]	validation_0-logloss:0.090357
[56]	validation_0-logloss:0.091587
[57]	validation_0-logloss:0.091527
[58]	validation_0-logloss:0.091986
[59]	validation_0-logloss:0.091951
[60]	validation_0-logloss:0.091939
[61]	validation_0-logloss:0.091461
[62]	validation_0-logloss:0.090311
[63]	validation_0-logloss:0.089407
[64]	validation_0-logloss:0.089719
[65]	validation_0-logloss:0.089743
[66]	validation_0-logloss:0.089622
[67]	validation_0-logloss:0.088734
[68]	validation_0-logloss:0.088621
[69]	validation_0-logloss:0.089739
[70]	validation_0-logloss:0.089981
[71]	validation_0-logloss:0.089782
[72]	validation_0-logloss:0.089584
[73]	validation_0-logloss:0.089533
[74]	validation_0-logloss:0.088748
[75]	validation_0-logloss:0.088597
[76]	validation_0-logloss:0.08812
[77]	validation_0-logloss:0.088396
[78]	validation_0-logloss:0.088736
[79]	validation_0-logloss:0.088153
[80]	validation_0-logloss:0.087577
[81]	validation_0-logloss:0.087412
[82]	validation_0-logloss:0.08849
[83]	validation_0-logloss:0.088575
[84]	validation_0-logloss:0.08807
[85]	validation_0-logloss:0.087641
[86]	validation_0-logloss:0.087416
[87]	validation_0-logloss:0.087611
[88]	validation_0-logloss:0.087065
[89]	validation_0-logloss:0.08727
[90]	validation_0-logloss:0.087161
[91]	validation_0-logloss:0.086962
[92]	validation_0-logloss:0.087166
[93]	validation_0-logloss:0.087067
[94]	validation_0-logloss:0.086592
[95]	validation_0-logloss:0.086116
[96]	validation_0-logloss:0.087139
[97]	validation_0-logloss:0.086768
[98]	validation_0-logloss:0.086694
[99]	validation_0-logloss:0.086547
[100]	validation_0-logloss:0.086498
[101]	validation_0-logloss:0.08641
[102]	validation_0-logloss:0.086288
[103]	validation_0-logloss:0.086258
[104]	validation_0-logloss:0.086835
[105]	validation_0-logloss:0.086767
[106]	validation_0-logloss:0.087321
[107]	validation_0-logloss:0.087304
[108]	validation_0-logloss:0.08728
[109]	validation_0-logloss:0.087298
[110]	validation_0-logloss:0.087289
[111]	validation_0-logloss:0.088002
[112]	validation_0-logloss:0.087936
[113]	validation_0-logloss:0.087843
[114]	validation_0-logloss:0.088066
[115]	validation_0-logloss:0.087649
[116]	validation_0-logloss:0.087298
[117]	validation_0-logloss:0.087799
[118]	validation_0-logloss:0.087751
[119]	validation_0-logloss:0.08768
[120]	validation_0-logloss:0.087626
[121]	validation_0-logloss:0.08757
[122]	validation_0-logloss:0.087547
[123]	validation_0-logloss:0.087156
[124]	validation_0-logloss:0.08767
[125]	validation_0-logloss:0.087737
[126]	validation_0-logloss:0.088275
[127]	validation_0-logloss:0.088309
[128]	validation_0-logloss:0.088266
[129]	validation_0-logloss:0.087886
[130]	validation_0-logloss:0.088861
[131]	validation_0-logloss:0.088675
[132]	validation_0-logloss:0.088743
[133]	validation_0-logloss:0.089218
[134]	validation_0-logloss:0.089179
[135]	validation_0-logloss:0.088821
[136]	validation_0-logloss:0.088512
[137]	validation_0-logloss:0.08848
[138]	validation_0-logloss:0.088386
[139]	validation_0-logloss:0.089145
[140]	validation_0-logloss:0.08911
[141]	validation_0-logloss:0.088765
[142]	validation_0-logloss:0.088678
[143]	validation_0-logloss:0.088389
[144]	validation_0-logloss:0.089271
[145]	validation_0-logloss:0.089238
[146]	validation_0-logloss:0.089139
[147]	validation_0-logloss:0.088907
[148]	validation_0-logloss:0.089416
[149]	validation_0-logloss:0.089388
[150]	validation_0-logloss:0.089108
[151]	validation_0-logloss:0.088735
[152]	validation_0-logloss:0.088717
[153]	validation_0-logloss:0.088484
[154]	validation_0-logloss:0.088471
[155]	validation_0-logloss:0.088545
[156]	validation_0-logloss:0.088521
[157]	validation_0-logloss:0.088547
[158]	validation_0-logloss:0.088275
[159]	validation_0-logloss:0.0883
[160]	validation_0-logloss:0.08828
[161]	validation_0-logloss:0.088013
[162]	validation_0-logloss:0.087758
[163]	validation_0-logloss:0.087784
[164]	validation_0-logloss:0.087777
[165]	validation_0-logloss:0.087517
[166]	validation_0-logloss:0.087542
[167]	validation_0-logloss:0.087642
[168]	validation_0-logloss:0.08739
[169]	validation_0-logloss:0.087377
[170]	validation_0-logloss:0.087298
[171]	validation_0-logloss:0.087368
[172]	validation_0-logloss:0.087395
[173]	validation_0-logloss:0.087385
[174]	validation_0-logloss:0.087132
[175]	validation_0-logloss:0.087159
[176]	validation_0-logloss:0.086955
[177]	validation_0-logloss:0.087053
[178]	validation_0-logloss:0.08697
[179]	validation_0-logloss:0.086973
[180]	validation_0-logloss:0.087038
[181]	validation_0-logloss:0.086799
[182]	validation_0-logloss:0.086826
[183]	validation_0-logloss:0.086582
[184]	validation_0-logloss:0.086588
[185]	validation_0-logloss:0.086614
[186]	validation_0-logloss:0.086372
[187]	validation_0-logloss:0.086369
[188]	validation_0-logloss:0.086297
[189]	validation_0-logloss:0.086104
[190]	validation_0-logloss:0.086023
[191]	validation_0-logloss:0.08605
[192]	validation_0-logloss:0.086149
[193]	validation_0-logloss:0.085916
[194]	validation_0-logloss:0.085915
[195]	validation_0-logloss:0.085984
[196]	validation_0-logloss:0.086012
[197]	validation_0-logloss:0.085922
[198]	validation_0-logloss:0.085853
[199]	validation_0-logloss:0.085874
[200]	validation_0-logloss:0.085888
[201]	validation_0-logloss:0.08595
[202]	validation_0-logloss:0.08573
[203]	validation_0-logloss:0.08573
[204]	validation_0-logloss:0.085753
[205]	validation_0-logloss:0.085821
[206]	validation_0-logloss:0.08584
[207]	validation_0-logloss:0.085776
[208]	validation_0-logloss:0.085686
[209]	validation_0-logloss:0.08571
[210]	validation_0-logloss:0.085806
[211]	validation_0-logloss:0.085593
[212]	validation_0-logloss:0.085801
[213]	validation_0-logloss:0.085807
[214]	validation_0-logloss:0.085744
[215]	validation_0-logloss:0.085658
[216]	validation_0-logloss:0.085843
[217]	validation_0-logloss:0.085632
[218]	validation_0-logloss:0.085726
[219]	validation_0-logloss:0.085783
[220]	validation_0-logloss:0.085791
[221]	validation_0-logloss:0.085817
[222]	validation_0-logloss:0.085757
[223]	validation_0-logloss:0.085674
[224]	validation_0-logloss:0.08586
[225]	validation_0-logloss:0.085871
[226]	validation_0-logloss:0.085927
[227]	validation_0-logloss:0.085954
[228]	validation_0-logloss:0.085874
[229]	validation_0-logloss:0.086057
[230]	validation_0-logloss:0.086002
[231]	validation_0-logloss:0.085922
[232]	validation_0-logloss:0.086102
[233]	validation_0-logloss:0.086115
[234]	validation_0-logloss:0.086169
[235]	validation_0-logloss:0.086263
[236]	validation_0-logloss:0.086291
[237]	validation_0-logloss:0.086217
[238]	validation_0-logloss:0.086395
[239]	validation_0-logloss:0.086342
[240]	validation_0-logloss:0.08618
[241]	validation_0-logloss:0.086195
[242]	validation_0-logloss:0.086248
[243]	validation_0-logloss:0.086263
[244]	validation_0-logloss:0.086293
[245]	validation_0-logloss:0.086222
[246]	validation_0-logloss:0.086398
[247]	validation_0-logloss:0.086347
[248]	validation_0-logloss:0.086276
[249]	validation_0-logloss:0.086448
[250]	validation_0-logloss:0.086294
[251]	validation_0-logloss:0.086312
[252]	validation_0-logloss:0.086364
[253]	validation_0-logloss:0.086394
[254]	validation_0-logloss:0.08649
[255]	validation_0-logloss:0.086441
[256]	validation_0-logloss:0.08629
[257]	validation_0-logloss:0.086459
[258]	validation_0-logloss:0.086391
[259]	validation_0-logloss:0.086441
[260]	validation_0-logloss:0.086461
[261]	validation_0-logloss:0.086491
[262]	validation_0-logloss:0.086445
[263]	validation_0-logloss:0.086466
[264]	validation_0-logloss:0.086319
[265]	validation_0-logloss:0.086488
[266]	validation_0-logloss:0.086538
[267]	validation_0-logloss:0.086471
[268]	validation_0-logloss:0.086501
[269]	validation_0-logloss:0.086522
[270]	validation_0-logloss:0.086689
[271]	validation_0-logloss:0.086738
[272]	validation_0-logloss:0.086829
[273]	validation_0-logloss:0.086684
[274]	validation_0-logloss:0.08664
[275]	validation_0-logloss:0.086496
[276]	validation_0-logloss:0.086355
[277]	validation_0-logloss:0.086519
[278]	validation_0-logloss:0.086567
[279]	validation_0-logloss:0.08659
[280]	validation_0-logloss:0.086679
[281]	validation_0-logloss:0.086637
[282]	validation_0-logloss:0.086499
[283]	validation_0-logloss:0.086356
[284]	validation_0-logloss:0.086405
[285]	validation_0-logloss:0.086429
[286]	validation_0-logloss:0.086456
[287]	validation_0-logloss:0.086504
[288]	validation_0-logloss:0.08637
[289]	validation_0-logloss:0.086457
[290]	validation_0-logloss:0.086453
[291]	validation_0-logloss:0.086322
[292]	validation_0-logloss:0.086284
[293]	validation_0-logloss:0.086148
[294]	validation_0-logloss:0.086196
[295]	validation_0-logloss:0.086221
[296]	validation_0-logloss:0.086308
[297]	validation_0-logloss:0.086178
[298]	validation_0-logloss:0.086263
[299]	validation_0-logloss:0.086131
[300]	validation_0-logloss:0.086179
[301]	validation_0-logloss:0.086052
[302]	validation_0-logloss:0.086016
[303]	validation_0-logloss:0.086101
[304]	validation_0-logloss:0.085977
[305]	validation_0-logloss:0.086059
[306]	validation_0-logloss:0.085971
[307]	validation_0-logloss:0.085998
[308]	validation_0-logloss:0.085998
[309]	validation_0-logloss:0.085877
[310]	validation_0-logloss:0.085923
[311]	validation_0-logloss:0.085948
Stopping. Best iteration:
[211]	validation_0-logloss:0.085593

 

get_clf_eval(y_test , ws100_preds, ws100_pred_proba)

오차 행렬
[[35  2]
 [ 1 76]]
정확도: 0.9737, 정밀도: 0.9744, 재현율: 0.9870,    F1: 0.9806, AUC:0.9951

 

Re-train/predict/evaluate with early_stopping_rounds set to 10

# early_stopping_rounds를 10으로 설정하고 재 학습. 
xgb_wrapper.fit(X_train, y_train, early_stopping_rounds=10, 
                eval_metric="logloss", eval_set=evals,verbose=True)

ws10_preds = xgb_wrapper.predict(X_test)
ws10_pred_proba = xgb_wrapper.predict_proba(X_test)[:, 1]


[0]	validation_0-logloss:0.61352
Will train until validation_0-logloss hasn't improved in 10 rounds.
[1]	validation_0-logloss:0.547842
[2]	validation_0-logloss:0.494248
[3]	validation_0-logloss:0.447986
[4]	validation_0-logloss:0.409109
[5]	validation_0-logloss:0.374977
[6]	validation_0-logloss:0.345714
[7]	validation_0-logloss:0.320529
[8]	validation_0-logloss:0.29721
[9]	validation_0-logloss:0.277991
[10]	validation_0-logloss:0.260302
[11]	validation_0-logloss:0.246037
[12]	validation_0-logloss:0.231556
[13]	validation_0-logloss:0.22005
[14]	validation_0-logloss:0.208572
[15]	validation_0-logloss:0.199993
[16]	validation_0-logloss:0.190118
[17]	validation_0-logloss:0.181818
[18]	validation_0-logloss:0.174729
[19]	validation_0-logloss:0.167657
[20]	validation_0-logloss:0.158202
[21]	validation_0-logloss:0.154725
[22]	validation_0-logloss:0.148947
[23]	validation_0-logloss:0.143308
[24]	validation_0-logloss:0.136344
[25]	validation_0-logloss:0.132778
[26]	validation_0-logloss:0.127912
[27]	validation_0-logloss:0.125263
[28]	validation_0-logloss:0.119978
[29]	validation_0-logloss:0.116412
[30]	validation_0-logloss:0.114502
[31]	validation_0-logloss:0.112572
[32]	validation_0-logloss:0.11154
[33]	validation_0-logloss:0.108681
[34]	validation_0-logloss:0.106681
[35]	validation_0-logloss:0.104207
[36]	validation_0-logloss:0.102962
[37]	validation_0-logloss:0.100576
[38]	validation_0-logloss:0.098683
[39]	validation_0-logloss:0.096444
[40]	validation_0-logloss:0.095869
[41]	validation_0-logloss:0.094242
[42]	validation_0-logloss:0.094715
[43]	validation_0-logloss:0.094272
[44]	validation_0-logloss:0.093894
[45]	validation_0-logloss:0.094184
[46]	validation_0-logloss:0.09402
[47]	validation_0-logloss:0.09236
[48]	validation_0-logloss:0.093012
[49]	validation_0-logloss:0.091273
[50]	validation_0-logloss:0.090051
[51]	validation_0-logloss:0.089605
[52]	validation_0-logloss:0.089577
[53]	validation_0-logloss:0.090703
[54]	validation_0-logloss:0.089579
[55]	validation_0-logloss:0.090357
[56]	validation_0-logloss:0.091587
[57]	validation_0-logloss:0.091527
[58]	validation_0-logloss:0.091986
[59]	validation_0-logloss:0.091951
[60]	validation_0-logloss:0.091939
[61]	validation_0-logloss:0.091461
[62]	validation_0-logloss:0.090311
Stopping. Best iteration:
[52]	validation_0-logloss:0.089577

 

get_clf_eval(y_test , ws10_preds, ws10_pred_proba)

오차 행렬
[[34  3]
 [ 2 75]]
정확도: 0.9561, 정밀도: 0.9615, 재현율: 0.9740,    F1: 0.9677, AUC:0.9947
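
Note (my addition, not from the book): in newer xgboost releases (1.6 and later) passing early_stopping_rounds and eval_metric to fit() is deprecated, and in the 2.x series they are set on the estimator instead. A hedged sketch of that style, assuming the same hyper-parameters (n_estimators=400, learning_rate=0.1, max_depth=3) used for xgb_wrapper earlier in this post:

# Hedged sketch for newer xgboost (>= 1.6): early stopping options move to the constructor.
from xgboost import XGBClassifier

xgb_wrapper_new = XGBClassifier(n_estimators=400, learning_rate=0.1, max_depth=3,
                                early_stopping_rounds=10, eval_metric="logloss")
xgb_wrapper_new.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
print('best iteration:', xgb_wrapper_new.best_iteration)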

 

from xgboost import plot_importance
import matplotlib.pyplot as plt
%matplotlib inline

fig, ax = plt.subplots(figsize=(10, 12))
# 사이킷런 래퍼 클래스를 입력해도 무방. 
plot_importance(xgb_wrapper, ax=ax)


GBM(Gradient Boosting Machine)

GBM trains weak learners sequentially, each new learner concentrating on the samples the previous ones predicted poorly: AdaBoost does this by re-weighting the misclassified samples, while GBM fits the gradient of the loss (roughly AdaBoost's idea plus gradient descent), so the errors are corrected step by step, as the toy sketch below illustrates.
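
A toy sketch (my addition, not the book's code) of that sequential error-correcting loop for regression with squared loss: each weak tree fits the residual (the negative gradient) left by the current ensemble and is added with a small learning rate.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy).ravel() + rng.normal(scale=0.1, size=200)

learning_rate, n_rounds = 0.1, 50
F = np.zeros(200)                      # current ensemble prediction
for _ in range(n_rounds):
    residual = y_toy - F               # negative gradient of 0.5 * (y - F)^2
    tree = DecisionTreeRegressor(max_depth=2).fit(X_toy, residual)
    F += learning_rate * tree.predict(X_toy)

print('training MSE after boosting: {0:.4f}'.format(np.mean((y_toy - F) ** 2)))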

from sklearn.ensemble import GradientBoostingClassifier
import time
import warnings
warnings.filterwarnings('ignore')

X_train, X_test, y_train, y_test = get_human_dataset()

# GBM 수행 시간 측정을 위함. 시작 시간 설정.
start_time = time.time()

gb_clf = GradientBoostingClassifier(random_state=0) # 기본값은 100
gb_clf.fit(X_train , y_train)
gb_pred = gb_clf.predict(X_test)
gb_accuracy = accuracy_score(y_test, gb_pred)

print('GBM 정확도: {0:.4f}'.format(gb_accuracy))
print("GBM 수행 시간: {0:.1f} 초 ".format(time.time() - start_time))

# GBM 정확도: 0.9389
# GBM 수행 시간: 709.5 초

 

from sklearn.model_selection import GridSearchCV

params = {
    'n_estimators':[100, 500],
    'learning_rate' : [ 0.05, 0.1]
}
grid_cv = GridSearchCV(gb_clf , param_grid=params , cv=2 ,verbose=1)
grid_cv.fit(X_train , y_train)
print('최적 하이퍼 파라미터:\n', grid_cv.best_params_)
print('최고 예측 정확도: {0:.4f}'.format(grid_cv.best_score_))

 

scores_df = pd.DataFrame(grid_cv.cv_results_)
scores_df[['params', 'mean_test_score', 'rank_test_score',
'split0_test_score', 'split1_test_score']]

 

# GridSearchCV를 이용하여 최적으로 학습된 estimator로 predict 수행. 
gb_pred = grid_cv.best_estimator_.predict(X_test)
gb_accuracy = accuracy_score(y_test, gb_pred)
print('GBM 정확도: {0:.4f}'.format(gb_accuracy))

# GBM 정확도: 0.9406

 


Random Forest

Load the human activity recognition dataset used in the decision tree section. A random forest combines decision trees with bagging: bootstrap sampling draws several overlapping subsets of the data (sampling with replacement), and aggregating trees trained on them makes the ensemble more reliable, as the short sketch below illustrates.
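
A minimal sketch of bootstrap sampling (my addition, not the book's code): each tree sees a sample drawn with replacement, so the subsets overlap, some rows repeat, and some are left out.

import numpy as np

rng = np.random.RandomState(0)
n_samples = 10
boot_idx = rng.choice(n_samples, size=n_samples, replace=True)  # bootstrap sample
print('bootstrap indices:', boot_idx)
print('unique rows used :', np.unique(boot_idx).size, 'out of', n_samples)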

Revision 01 (2019-10-27)

The original data contains duplicated feature names, which makes newer versions of pandas raise a duplicate-name error.
get_new_feature_name_df() is added to append '_1' (or '_2') to each duplicated feature name.

import pandas as pd

def get_new_feature_name_df(old_feature_name_df):
    feature_dup_df = pd.DataFrame(data=old_feature_name_df.groupby('column_name').cumcount(), columns=['dup_cnt'])
    feature_dup_df = feature_dup_df.reset_index()
    new_feature_name_df = pd.merge(old_feature_name_df.reset_index(), feature_dup_df, how='outer')
    new_feature_name_df['column_name'] = new_feature_name_df[['column_name', 'dup_cnt']].apply(lambda x : x[0]+'_'+str(x[1]) 
                                                                                           if x[1] >0 else x[0] ,  axis=1)
    new_feature_name_df = new_feature_name_df.drop(['index'], axis=1)
    return new_feature_name_df

 

import pandas as pd

def get_human_dataset( ):
    
    # 각 데이터 파일들은 공백으로 분리되어 있으므로 read_csv에서 공백 문자를 sep으로 할당.
    feature_name_df = pd.read_csv('./human_activity/features.txt',sep='\s+',
                        header=None,names=['column_index','column_name'])
    
    # 중복된 feature명을 새롭게 수정하는 get_new_feature_name_df()를 이용하여 새로운 feature명 DataFrame생성. 
    new_feature_name_df = get_new_feature_name_df(feature_name_df)
    
    # DataFrame에 피처명을 컬럼으로 부여하기 위해 리스트 객체로 다시 변환
    feature_name = new_feature_name_df.iloc[:, 1].values.tolist()
    
    # 학습 피처 데이터 셋과 테스트 피처 데이터을 DataFrame으로 로딩. 컬럼명은 feature_name 적용
    X_train = pd.read_csv('./human_activity/train/X_train.txt',sep='\s+', names=feature_name )
    X_test = pd.read_csv('./human_activity/test/X_test.txt',sep='\s+', names=feature_name)
    
    # 학습 레이블과 테스트 레이블 데이터을 DataFrame으로 로딩하고 컬럼명은 action으로 부여
    y_train = pd.read_csv('./human_activity/train/y_train.txt',sep='\s+',header=None,names=['action'])
    y_test = pd.read_csv('./human_activity/test/y_test.txt',sep='\s+',header=None,names=['action'])
    
    # 로드된 학습/테스트용 DataFrame을 모두 반환 
    return X_train, X_test, y_train, y_test


X_train, X_test, y_train, y_test = get_human_dataset()

 

Split into train/test sets and train/predict/evaluate with a random forest

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# 결정 트리에서 사용한 get_human_dataset( )을 이용해 학습/테스트용 DataFrame 반환
X_train, X_test, y_train, y_test = get_human_dataset()

# 랜덤 포레스트 학습 및 별도의 테스트 셋으로 예측 성능 평가
rf_clf = RandomForestClassifier(random_state=0)
rf_clf.fit(X_train , y_train)
pred = rf_clf.predict(X_test)
accuracy = accuracy_score(y_test , pred)
print('랜덤 포레스트 정확도: {0:.4f}'.format(accuracy))

# 랜덤 포레스트 정확도: 0.9253

 

Cross-validation and hyper-parameter tuning with GridSearchCV

from sklearn.model_selection import GridSearchCV

params = {
    'n_estimators':[100],
    'max_depth' : [6, 8, 10, 12], 
    'min_samples_leaf' : [8, 12, 18 ],
    'min_samples_split' : [8, 16, 20]
}
# RandomForestClassifier 객체 생성 후 GridSearchCV 수행
rf_clf = RandomForestClassifier(random_state=0, n_jobs=-1) # 전체 cpu를 작동해라
grid_cv = GridSearchCV(rf_clf , param_grid=params , cv=2, n_jobs=-1 )
grid_cv.fit(X_train , y_train)

print('최적 하이퍼 파라미터:\n', grid_cv.best_params_)
print('최고 예측 정확도: {0:.4f}'.format(grid_cv.best_score_))

# 최적 하이퍼 파라미터:
#  {'max_depth': 10, 'min_samples_leaf': 8, 'min_samples_split': 8, 'n_estimators': 100}
# 최고 예측 정확도: 0.9180

 

Re-train and predict/evaluate with the tuned hyper-parameters, raising n_estimators to 300

rf_clf1 = RandomForestClassifier(n_estimators=300, max_depth=10, min_samples_leaf=8, \
                                 min_samples_split=8, random_state=0)
rf_clf1.fit(X_train , y_train)
pred = rf_clf1.predict(X_test)
print('예측 정확도: {0:.4f}'.format(accuracy_score(y_test , pred)))

# 예측 정확도: 0.9165

 

Visualize the importance of individual features

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

ftr_importances_values = rf_clf1.feature_importances_
ftr_importances = pd.Series(ftr_importances_values,index=X_train.columns  )
ftr_top20 = ftr_importances.sort_values(ascending=False)[:20]

plt.figure(figsize=(8,6))
plt.title('Feature importances Top 20')
sns.barplot(x=ftr_top20 , y = ftr_top20.index)
plt.show()

 


Voting Classifier

Load the Wisconsin breast cancer dataset

import pandas as pd

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

cancer = load_breast_cancer()

data_df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
data_df.head(3)

	mean radius	mean texture	mean perimeter	mean area	mean smoothness	mean compactness	mean concavity	mean concave points	mean symmetry	mean fractal dimension	...	worst radius	worst texture	worst perimeter	worst area	worst smoothness	worst compactness	worst concavity	worst concave points	worst symmetry	worst fractal dimension
0	17.99	10.38	122.8	1001.0	0.11840	0.27760	0.3001	0.14710	0.2419	0.07871	...	25.38	17.33	184.6	2019.0	0.1622	0.6656	0.7119	0.2654	0.4601	0.11890
1	20.57	17.77	132.9	1326.0	0.08474	0.07864	0.0869	0.07017	0.1812	0.05667	...	24.99	23.41	158.8	1956.0	0.1238	0.1866	0.2416	0.1860	0.2750	0.08902
2	19.69	21.25	130.0	1203.0	0.10960	0.15990	0.1974	0.12790	0.2069	0.05999	...	23.57	25.53	152.5	1709.0	0.1444	0.4245	0.4504	0.2430	0.3613	0.08758
3 rows × 30 columns

 

Combine logistic regression and KNN as the individual models in a soft-voting VotingClassifier and compare their performance (a manual soft-voting check follows the results below)

# 개별 모델은 로지스틱 회귀와 KNN 임. 
lr_clf = LogisticRegression()
knn_clf = KNeighborsClassifier(n_neighbors=8)

# 개별 모델을 소프트 보팅 기반의 앙상블 모델로 구현한 분류기 
vo_clf = VotingClassifier( estimators=[('LR',lr_clf),('KNN',knn_clf)] , voting='soft' )

X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, 
                                                    test_size=0.2 , random_state= 156)

# VotingClassifier 학습/예측/평가. 
vo_clf.fit(X_train , y_train)
pred = vo_clf.predict(X_test)
print('Voting 분류기 정확도: {0:.4f}'.format(accuracy_score(y_test , pred)))

# 개별 모델의 학습/예측/평가.
classifiers = [lr_clf, knn_clf]
for classifier in classifiers:
    classifier.fit(X_train , y_train)
    pred = classifier.predict(X_test)
    class_name= classifier.__class__.__name__
    print('{0} 정확도: {1:.4f}'.format(class_name, accuracy_score(y_test , pred)))
    
    
    
Voting 분류기 정확도: 0.9474
LogisticRegression 정확도: 0.9386
KNeighborsClassifier 정확도: 0.9386
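
As a sanity check (my addition): soft voting with equal weights is just the argmax of the averaged class probabilities of the base models, so the following sketch should reproduce the VotingClassifier result above.

import numpy as np

# average the class probabilities of the two (already fitted) base models
avg_proba = (lr_clf.predict_proba(X_test) + knn_clf.predict_proba(X_test)) / 2
manual_soft_pred = np.argmax(avg_proba, axis=1)
print('manual soft voting accuracy: {0:.4f}'.format(accuracy_score(y_test, manual_soft_pred)))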

 

 


Decision Tree Visualization

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

# DecisionTree Classifier 생성
dt_clf = DecisionTreeClassifier(random_state=156)

# 붓꽃 데이터를 로딩하고, 학습과 테스트 데이터 셋으로 분리
iris_data = load_iris()
X_train , X_test , y_train , y_test = train_test_split(iris_data.data, iris_data.target,
                                                       test_size=0.2,  random_state=11)

# DecisionTreeClassifer 학습. 
dt_clf.fit(X_train , y_train)

# DecisionTreeClassifier(random_state=156)

 

from sklearn.tree import export_graphviz

# export_graphviz()의 호출 결과로 out_file로 지정된 tree.dot 파일을 생성함. 
export_graphviz(dt_clf, out_file="tree.dot", class_names=iris_data.target_names , \
feature_names = iris_data.feature_names, impurity=True, filled=True)

 

import graphviz

# 위에서 생성된 tree.dot 파일을 Graphviz 읽어서 Jupyter Notebook상에서 시각화 
with open("tree.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)

 

import seaborn as sns
import numpy as np
%matplotlib inline

# feature importance 추출 
print("Feature importances:\n{0}".format(np.round(dt_clf.feature_importances_, 3)))

# feature별 importance 매핑
for name, value in zip(iris_data.feature_names , dt_clf.feature_importances_):
    print('{0} : {1:.3f}'.format(name, value))

# feature importance를 column 별로 시각화 하기 
sns.barplot(x=dt_clf.feature_importances_ , y=iris_data.feature_names)

Feature importances:
[0.025 0.    0.555 0.42 ]
sepal length (cm) : 0.025
sepal width (cm) : 0.000
petal length (cm) : 0.555
petal width (cm) : 0.420

 

Decision Tree Overfitting

from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
%matplotlib inline

plt.title("3 Class values with 2 Features Sample data creation")

# 2차원 시각화를 위해서 feature는 2개, 결정값 클래스는 3가지 유형의 classification 샘플 데이터 생성. 
X_features, y_labels = make_classification(n_features=2, n_redundant=0, n_informative=2,
                             n_classes=3, n_clusters_per_class=1,random_state=0)

# plot 형태로 2개의 feature로 2차원 좌표 시각화, 각 클래스값은 다른 색깔로 표시됨. 
plt.scatter(X_features[:, 0], X_features[:, 1], marker='o', c=y_labels, s=25, cmap='rainbow', edgecolor='k')

 

import numpy as np

# Classifier의 Decision Boundary를 시각화 하는 함수
def visualize_boundary(model, X, y):
    fig,ax = plt.subplots()
    
    # 학습 데이타 scatter plot으로 나타내기
    ax.scatter(X[:, 0], X[:, 1], c=y, s=25, cmap='rainbow', edgecolor='k',
               clim=(y.min(), y.max()), zorder=3)
    ax.axis('tight')
    ax.axis('off')
    xlim_start , xlim_end = ax.get_xlim()
    ylim_start , ylim_end = ax.get_ylim()
    
    # 호출 파라미터로 들어온 training 데이타로 model 학습 . 
    model.fit(X, y)
    # meshgrid 형태인 모든 좌표값으로 예측 수행. 
    xx, yy = np.meshgrid(np.linspace(xlim_start,xlim_end, num=200),np.linspace(ylim_start,ylim_end, num=200))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    
    # contourf() 를 이용하여 class boundary 를 visualization 수행. 
    n_classes = len(np.unique(y))
    contours = ax.contourf(xx, yy, Z, alpha=0.3,
                           levels=np.arange(n_classes + 1) - 0.5,
                           cmap='rainbow', clim=(y.min(), y.max()),
                           zorder=1)

 

from sklearn.tree import DecisionTreeClassifier

# 특정한 트리 생성 제약이 없는 결정 트리의 Decision Boundary 시각화.
dt_clf = DecisionTreeClassifier().fit(X_features, y_labels)
visualize_boundary(dt_clf, X_features, y_labels)

 

# min_samples_leaf=6 으로 트리 생성 조건을 제약한 Decision Boundary 시각화
dt_clf = DecisionTreeClassifier( min_samples_leaf=6).fit(X_features, y_labels)
visualize_boundary(dt_clf, X_features, y_labels)

 

Decision tree practice - Human Activity Recognition

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# features.txt 파일에는 피처 이름 index와 피처명이 공백으로 분리되어 있음. 이를 DataFrame으로 로드.
feature_name_df = pd.read_csv('./human_activity/features.txt',sep='\s+',
                        header=None,names=['column_index','column_name'])

# 피처명 index를 제거하고, 피처명만 리스트 객체로 생성한 뒤 샘플로 10개만 추출
feature_name = feature_name_df.iloc[:, 1].values.tolist()
print('전체 피처명에서 10개만 추출:', feature_name[:10])
feature_name_df.head(20)



전체 피처명에서 10개만 추출: ['tBodyAcc-mean()-X', 'tBodyAcc-mean()-Y', 'tBodyAcc-mean()-Z', 'tBodyAcc-std()-X', 'tBodyAcc-std()-Y', 'tBodyAcc-std()-Z', 'tBodyAcc-mad()-X', 'tBodyAcc-mad()-Y', 'tBodyAcc-mad()-Z', 'tBodyAcc-max()-X']
column_index	column_name
0	1	tBodyAcc-mean()-X
1	2	tBodyAcc-mean()-Y
2	3	tBodyAcc-mean()-Z
3	4	tBodyAcc-std()-X
4	5	tBodyAcc-std()-Y
5	6	tBodyAcc-std()-Z
6	7	tBodyAcc-mad()-X
7	8	tBodyAcc-mad()-Y
8	9	tBodyAcc-mad()-Z
9	10	tBodyAcc-max()-X
10	11	tBodyAcc-max()-Y
11	12	tBodyAcc-max()-Z
12	13	tBodyAcc-min()-X
13	14	tBodyAcc-min()-Y
14	15	tBodyAcc-min()-Z
15	16	tBodyAcc-sma()
16	17	tBodyAcc-energy()-X
17	18	tBodyAcc-energy()-Y
18	19	tBodyAcc-energy()-Z
19	20	tBodyAcc-iqr()-X

 

Revision 01

The original data contains duplicated feature names, which makes newer versions of pandas raise a duplicate-name error.
get_new_feature_name_df() is added to append '_1' (or '_2') to each duplicated feature name.

def get_new_feature_name_df(old_feature_name_df):
    feature_dup_df = pd.DataFrame(data=old_feature_name_df.groupby('column_name').cumcount(), columns=['dup_cnt'])
    feature_dup_df = feature_dup_df.reset_index()
    new_feature_name_df = pd.merge(old_feature_name_df.reset_index(), feature_dup_df, how='outer')
    new_feature_name_df['column_name'] = new_feature_name_df[['column_name', 'dup_cnt']].apply(lambda x : x[0]+'_'+str(x[1]) 
                                                                                           if x[1] >0 else x[0] ,  axis=1)
    new_feature_name_df = new_feature_name_df.drop(['index'], axis=1)
    return new_feature_name_df

 

pd.options.display.max_rows = 999
new_feature_name_df = get_new_feature_name_df(feature_name_df)
new_feature_name_df[new_feature_name_df['dup_cnt'] > 0]

	column_index	column_name	dup_cnt
316	317	fBodyAcc-bandsEnergy()-1,8_1	1
317	318	fBodyAcc-bandsEnergy()-9,16_1	1
318	319	fBodyAcc-bandsEnergy()-17,24_1	1
319	320	fBodyAcc-bandsEnergy()-25,32_1	1
320	321	fBodyAcc-bandsEnergy()-33,40_1	1
321	322	fBodyAcc-bandsEnergy()-41,48_1	1
322	323	fBodyAcc-bandsEnergy()-49,56_1	1
323	324	fBodyAcc-bandsEnergy()-57,64_1	1
324	325	fBodyAcc-bandsEnergy()-1,16_1	1
325	326	fBodyAcc-bandsEnergy()-17,32_1	1
326	327	fBodyAcc-bandsEnergy()-33,48_1	1
327	328	fBodyAcc-bandsEnergy()-49,64_1	1
328	329	fBodyAcc-bandsEnergy()-1,24_1	1
329	330	fBodyAcc-bandsEnergy()-25,48_1	1
330	331	fBodyAcc-bandsEnergy()-1,8_2	2
331	332	fBodyAcc-bandsEnergy()-9,16_2	2
332	333	fBodyAcc-bandsEnergy()-17,24_2	2
333	334	fBodyAcc-bandsEnergy()-25,32_2	2
334	335	fBodyAcc-bandsEnergy()-33,40_2	2
335	336	fBodyAcc-bandsEnergy()-41,48_2	2
336	337	fBodyAcc-bandsEnergy()-49,56_2	2
337	338	fBodyAcc-bandsEnergy()-57,64_2	2
338	339	fBodyAcc-bandsEnergy()-1,16_2	2
339	340	fBodyAcc-bandsEnergy()-17,32_2	2
340	341	fBodyAcc-bandsEnergy()-33,48_2	2
341	342	fBodyAcc-bandsEnergy()-49,64_2	2
342	343	fBodyAcc-bandsEnergy()-1,24_2	2
343	344	fBodyAcc-bandsEnergy()-25,48_2	2
395	396	fBodyAccJerk-bandsEnergy()-1,8_1	1
396	397	fBodyAccJerk-bandsEnergy()-9,16_1	1
397	398	fBodyAccJerk-bandsEnergy()-17,24_1	1
398	399	fBodyAccJerk-bandsEnergy()-25,32_1	1
399	400	fBodyAccJerk-bandsEnergy()-33,40_1	1
400	401	fBodyAccJerk-bandsEnergy()-41,48_1	1
401	402	fBodyAccJerk-bandsEnergy()-49,56_1	1
402	403	fBodyAccJerk-bandsEnergy()-57,64_1	1
403	404	fBodyAccJerk-bandsEnergy()-1,16_1	1
404	405	fBodyAccJerk-bandsEnergy()-17,32_1	1
405	406	fBodyAccJerk-bandsEnergy()-33,48_1	1
406	407	fBodyAccJerk-bandsEnergy()-49,64_1	1
407	408	fBodyAccJerk-bandsEnergy()-1,24_1	1
408	409	fBodyAccJerk-bandsEnergy()-25,48_1	1
409	410	fBodyAccJerk-bandsEnergy()-1,8_2	2
410	411	fBodyAccJerk-bandsEnergy()-9,16_2	2
411	412	fBodyAccJerk-bandsEnergy()-17,24_2	2
412	413	fBodyAccJerk-bandsEnergy()-25,32_2	2
413	414	fBodyAccJerk-bandsEnergy()-33,40_2	2
414	415	fBodyAccJerk-bandsEnergy()-41,48_2	2
415	416	fBodyAccJerk-bandsEnergy()-49,56_2	2
416	417	fBodyAccJerk-bandsEnergy()-57,64_2	2
417	418	fBodyAccJerk-bandsEnergy()-1,16_2	2
418	419	fBodyAccJerk-bandsEnergy()-17,32_2	2
419	420	fBodyAccJerk-bandsEnergy()-33,48_2	2
420	421	fBodyAccJerk-bandsEnergy()-49,64_2	2
421	422	fBodyAccJerk-bandsEnergy()-1,24_2	2
422	423	fBodyAccJerk-bandsEnergy()-25,48_2	2
474	475	fBodyGyro-bandsEnergy()-1,8_1	1
475	476	fBodyGyro-bandsEnergy()-9,16_1	1
476	477	fBodyGyro-bandsEnergy()-17,24_1	1
477	478	fBodyGyro-bandsEnergy()-25,32_1	1
478	479	fBodyGyro-bandsEnergy()-33,40_1	1
479	480	fBodyGyro-bandsEnergy()-41,48_1	1
480	481	fBodyGyro-bandsEnergy()-49,56_1	1
481	482	fBodyGyro-bandsEnergy()-57,64_1	1
482	483	fBodyGyro-bandsEnergy()-1,16_1	1
483	484	fBodyGyro-bandsEnergy()-17,32_1	1
484	485	fBodyGyro-bandsEnergy()-33,48_1	1
485	486	fBodyGyro-bandsEnergy()-49,64_1	1
486	487	fBodyGyro-bandsEnergy()-1,24_1	1
487	488	fBodyGyro-bandsEnergy()-25,48_1	1
488	489	fBodyGyro-bandsEnergy()-1,8_2	2
489	490	fBodyGyro-bandsEnergy()-9,16_2	2
490	491	fBodyGyro-bandsEnergy()-17,24_2	2
491	492	fBodyGyro-bandsEnergy()-25,32_2	2
492	493	fBodyGyro-bandsEnergy()-33,40_2	2
493	494	fBodyGyro-bandsEnergy()-41,48_2	2
494	495	fBodyGyro-bandsEnergy()-49,56_2	2
495	496	fBodyGyro-bandsEnergy()-57,64_2	2
496	497	fBodyGyro-bandsEnergy()-1,16_2	2
497	498	fBodyGyro-bandsEnergy()-17,32_2	2
498	499	fBodyGyro-bandsEnergy()-33,48_2	2
499	500	fBodyGyro-bandsEnergy()-49,64_2	2
500	501	fBodyGyro-bandsEnergy()-1,24_2	2
501	502	fBodyGyro-bandsEnergy()-25,48_2	2

 

The get_human_dataset() function below is updated to use get_new_feature_name_df(), which renames the duplicated feature names.

import pandas as pd

def get_human_dataset( ):
    
    # 각 데이터 파일들은 공백으로 분리되어 있으므로 read_csv에서 공백 문자를 sep으로 할당.
    feature_name_df = pd.read_csv('./human_activity/features.txt',sep='\s+',
                        header=None,names=['column_index','column_name'])
    
    # 중복된 feature명을 새롭게 수정하는 get_new_feature_name_df()를 이용하여 새로운 feature명 DataFrame생성. 
    new_feature_name_df = get_new_feature_name_df(feature_name_df)
    
    # DataFrame에 피처명을 컬럼으로 부여하기 위해 리스트 객체로 다시 변환
    feature_name = new_feature_name_df.iloc[:, 1].values.tolist()
    
    # 학습 피처 데이터 셋과 테스트 피처 데이터을 DataFrame으로 로딩. 컬럼명은 feature_name 적용
    X_train = pd.read_csv('./human_activity/train/X_train.txt',sep='\s+', names=feature_name )
    X_test = pd.read_csv('./human_activity/test/X_test.txt',sep='\s+', names=feature_name)
    
    # 학습 레이블과 테스트 레이블 데이터을 DataFrame으로 로딩하고 컬럼명은 action으로 부여
    y_train = pd.read_csv('./human_activity/train/y_train.txt',sep='\s+',header=None,names=['action'])
    y_test = pd.read_csv('./human_activity/test/y_test.txt',sep='\s+',header=None,names=['action'])
    
    # 로드된 학습/테스트용 DataFrame을 모두 반환 
    return X_train, X_test, y_train, y_test


X_train, X_test, y_train, y_test = get_human_dataset()

 

print('## 학습 피처 데이터셋 info()')
print(X_train.info())

## 학습 피처 데이터셋 info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7352 entries, 0 to 7351
Columns: 561 entries, tBodyAcc-mean()-X to angle(Z,gravityMean)
dtypes: float64(561)
memory usage: 31.5 MB
None

 

print(y_train['action'].value_counts())

6    1407
5    1374
4    1286
1    1226
2    1073
3     986
Name: action, dtype: int64

 

X_train.isna().sum().sum()

# 0

 

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 예제 반복 시 마다 동일한 예측 결과 도출을 위해 random_state 설정
dt_clf = DecisionTreeClassifier(random_state=156)
dt_clf.fit(X_train , y_train)
pred = dt_clf.predict(X_test)
accuracy = accuracy_score(y_test , pred)
print('결정 트리 예측 정확도: {0:.4f}'.format(accuracy))

# DecisionTreeClassifier의 하이퍼 파라미터 추출
print('DecisionTreeClassifier 기본 하이퍼 파라미터:\n', dt_clf.get_params())

결정 트리 예측 정확도: 0.8548
DecisionTreeClassifier 기본 하이퍼 파라미터:
 {'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'presort': 'deprecated', 'random_state': 156, 'splitter': 'best'}

 

from sklearn.model_selection import GridSearchCV

params = {
    'max_depth' : [ 6, 8 ,10, 12, 16 ,20, 24]
}

grid_cv = GridSearchCV(dt_clf, param_grid=params, scoring='accuracy', cv=5, verbose=1 )
grid_cv.fit(X_train , y_train)
print('GridSearchCV 최고 평균 정확도 수치:{0:.4f}'.format(grid_cv.best_score_))
print('GridSearchCV 최적 하이퍼 파라미터:', grid_cv.best_params_)

# Fitting 5 folds for each of 7 candidates, totalling 35 fits
# GridSearchCV 최고 평균 정확도 수치:0.8513
# GridSearchCV 최적 하이퍼 파라미터: {'max_depth': 16}

 

As scikit-learn was upgraded, mean_train_score is no longer returned by default in the cv_results_ attribute of GridSearchCV.
If your existing code raises an error, remove 'mean_train_score' as below (a sketch using return_train_score=True follows the result table).

# GridSearchCV객체의 cv_results_ 속성을 DataFrame으로 생성. 
cv_results_df = pd.DataFrame(grid_cv.cv_results_)

# max_depth 파라미터 값과 그때의 테스트(Evaluation)셋, 학습 데이터 셋의 정확도 수치 추출
# 사이킷런 버전이 업그레이드 되면서 아래의 GridSearchCV 객체의 cv_results_에서 mean_train_score는 더이상 제공되지 않습니다
# cv_results_df[['param_max_depth', 'mean_test_score', 'mean_train_score']]

# max_depth 파라미터 값과 그때의 테스트(Evaluation)셋, 학습 데이터 셋의 정확도 수치 추출
cv_results_df[['param_max_depth', 'mean_test_score']]

	param_max_depth	mean_test_score
0	6	0.850791
1	8	0.851069
2	10	0.851209
3	12	0.844135
4	16	0.851344
5	20	0.850800
6	24	0.849440
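
If the train scores are actually needed, recent scikit-learn versions can still compute them when asked explicitly. A sketch (assuming your installed version supports the return_train_score parameter, whose default became False around 0.21; variable names are mine):

grid_cv_ts = GridSearchCV(dt_clf, param_grid=params, scoring='accuracy', cv=5,
                          verbose=1, return_train_score=True)
grid_cv_ts.fit(X_train, y_train)
cv_results_ts_df = pd.DataFrame(grid_cv_ts.cv_results_)
print(cv_results_ts_df[['param_max_depth', 'mean_test_score', 'mean_train_score']])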

 

max_depths = [ 6, 8 ,10, 12, 16 ,20, 24]
# max_depth 값을 변화 시키면서 그때마다 학습과 테스트 셋에서의 예측 성능 측정
for depth in max_depths:
    dt_clf = DecisionTreeClassifier(max_depth=depth, random_state=156)
    dt_clf.fit(X_train , y_train)
    pred = dt_clf.predict(X_test)
    accuracy = accuracy_score(y_test , pred)
    print('max_depth = {0} 정확도: {1:.4f}'.format(depth , accuracy))
    
max_depth = 6 정확도: 0.8558
max_depth = 8 정확도: 0.8707
max_depth = 10 정확도: 0.8673
max_depth = 12 정확도: 0.8646
max_depth = 16 정확도: 0.8575
max_depth = 20 정확도: 0.8548
max_depth = 24 정확도: 0.8548

 

params = {
    'max_depth' : [ 8 , 12, 16 ,20], 
    'min_samples_split' : [16,24],
}

grid_cv = GridSearchCV(dt_clf, param_grid=params, scoring='accuracy', cv=5, verbose=1 )
grid_cv.fit(X_train , y_train)
print('GridSearchCV 최고 평균 정확도 수치: {0:.4f}'.format(grid_cv.best_score_))
print('GridSearchCV 최적 하이퍼 파라미터:', grid_cv.best_params_)

# GridSearchCV 최고 평균 정확도 수치: 0.8549
# GridSearchCV 최적 하이퍼 파라미터: {'max_depth': 8, 'min_samples_split': 16}

 

best_df_clf = grid_cv.best_estimator_

pred1 = best_df_clf.predict(X_test)
accuracy = accuracy_score(y_test , pred1)
print('결정 트리 예측 정확도:{0:.4f}'.format(accuracy))

# 결정 트리 예측 정확도:0.8717

 

import seaborn as sns

ftr_importances_values = best_df_clf.feature_importances_

# Top 중요도로 정렬을 쉽게 하고, 시본(Seaborn)의 막대그래프로 쉽게 표현하기 위해 Series변환
ftr_importances = pd.Series(ftr_importances_values, index=X_train.columns  )

# 중요도값 순으로 Series를 정렬
ftr_top20 = ftr_importances.sort_values(ascending=False)[:20]
plt.figure(figsize=(8,6))
plt.title('Feature importances Top 20')
sns.barplot(x=ftr_top20 , y = ftr_top20.index)
plt.show()



import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.metrics import f1_score, confusion_matrix, precision_recall_curve, roc_curve
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

diabetes_data = pd.read_csv('diabetes.csv')
print(diabetes_data['Outcome'].value_counts())
diabetes_data.head(3)


0    500
1    268
Name: Outcome, dtype: int64
Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
0	6	148	72	35	0	33.6	0.627	50	1
1	1	85	66	29	0	26.6	0.351	31	0
2	8	183	64	0	0	23.3	0.672	32	1
  • Pregnancies: number of pregnancies
  • Glucose: plasma glucose concentration (oral glucose tolerance test)
  • BloodPressure: blood pressure (mm Hg)
  • SkinThickness: triceps skin fold thickness (mm)
  • Insulin: serum insulin (mu U/ml)
  • BMI: body mass index (weight in kg / (height in m)^2)
  • DiabetesPedigreeFunction: diabetes pedigree (family history) score
  • Age: age
  • Outcome: class label (0 or 1)
diabetes_data.info( )

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
Pregnancies                 768 non-null int64
Glucose                     768 non-null int64
BloodPressure               768 non-null int64
SkinThickness               768 non-null int64
Insulin                     768 non-null int64
BMI                         768 non-null float64
DiabetesPedigreeFunction    768 non-null float64
Age                         768 non-null int64
Outcome                     768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

 

Reload get_clf_eval() and precision_recall_curve_plot()

def get_clf_eval(y_test, pred=None, pred_proba=None):
    confusion = confusion_matrix( y_test, pred)
    accuracy = accuracy_score(y_test , pred)
    precision = precision_score(y_test , pred)
    recall = recall_score(y_test , pred)
    f1 = f1_score(y_test,pred)
    # ROC-AUC 추가 
    roc_auc = roc_auc_score(y_test, pred_proba)
    print('오차 행렬')
    print(confusion)
    # ROC-AUC print 추가
    print('정확도: {0:.4f}, 정밀도: {1:.4f}, 재현율: {2:.4f},\
    F1: {3:.4f}, AUC:{4:.4f}'.format(accuracy, precision, recall, f1, roc_auc))

 

def precision_recall_curve_plot(y_test=None, pred_proba_c1=None):
    # threshold ndarray와 이 threshold에 따른 정밀도, 재현율 ndarray 추출. 
    precisions, recalls, thresholds = precision_recall_curve( y_test, pred_proba_c1)
    
    # X축을 threshold값으로, Y축은 정밀도, 재현율 값으로 각각 Plot 수행. 정밀도는 점선으로 표시
    plt.figure(figsize=(8,6))
    threshold_boundary = thresholds.shape[0]
    plt.plot(thresholds, precisions[0:threshold_boundary], linestyle='--', label='precision')
    plt.plot(thresholds, recalls[0:threshold_boundary],label='recall')
    
    # threshold 값 X 축의 Scale을 0.1 단위로 변경
    start, end = plt.xlim()
    plt.xticks(np.round(np.arange(start, end, 0.1),2))
    
    # x축, y축 label과 legend, 그리고 grid 설정
    plt.xlabel('Threshold value'); plt.ylabel('Precision and Recall value')
    plt.legend(); plt.grid()
    plt.show()

 

Train and predict with Logistic Regression

# 피처 데이터 세트 X, 레이블 데이터 세트 y를 추출. 
# 맨 끝이 Outcome 컬럼으로 레이블 값임. 컬럼 위치 -1을 이용해 추출 
X = diabetes_data.iloc[:, :-1]
y = diabetes_data.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 156, stratify=y)

# 로지스틱 회귀로 학습,예측 및 평가 수행. 
lr_clf = LogisticRegression()
lr_clf.fit(X_train , y_train)
pred = lr_clf.predict(X_test)
# roc_auc_score 수정에 따른 추가
pred_proba = lr_clf.predict_proba(X_test)[:, 1]

get_clf_eval(y_test , pred, pred_proba)

# 오차 행렬
# [[87 13]
#  [22 32]]
# 정확도: 0.7727, 정밀도: 0.7111, 재현율: 0.5926,    F1: 0.6465, AUC:0.8083

 

Plot the precision-recall curve

pred_proba_c1 = lr_clf.predict_proba(X_test)[:, 1]
precision_recall_curve_plot(y_test, pred_proba_c1)

 

Check the quartile distribution of each feature

diabetes_data.describe()

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
count	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000
mean	3.845052	120.894531	69.105469	20.536458	79.799479	31.992578	0.471876	33.240885	0.348958
std	3.369578	31.972618	19.355807	15.952218	115.244002	7.884160	0.331329	11.760232	0.476951
min	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.078000	21.000000	0.000000
25%	1.000000	99.000000	62.000000	0.000000	0.000000	27.300000	0.243750	24.000000	0.000000
50%	3.000000	117.000000	72.000000	23.000000	30.500000	32.000000	0.372500	29.000000	0.000000
75%	6.000000	140.250000	80.000000	32.000000	127.250000	36.600000	0.626250	41.000000	1.000000
max	17.000000	199.000000	122.000000	99.000000	846.000000	67.100000	2.420000	81.000000	1.000000

 

Distribution of the 'Glucose' feature

plt.hist(diabetes_data['Glucose'], bins=10)
(array([  5.,   0.,   4.,  32., 156., 211., 163.,  95.,  56.,  46.]),
 array([  0. ,  19.9,  39.8,  59.7,  79.6,  99.5, 119.4, 139.3, 159.2,
        179.1, 199. ]),
 <a list of 10 Patch objects>)

 

Count and percentage of zero values in the features that contain zeros

# 0값을 검사할 피처명 리스트 객체 설정
zero_features = ['Glucose', 'BloodPressure','SkinThickness','Insulin','BMI']

# 전체 데이터 건수
total_count = diabetes_data['Glucose'].count()

# 피처별로 반복 하면서 데이터 값이 0 인 데이터 건수 추출하고, 퍼센트 계산
for feature in zero_features:
    zero_count = diabetes_data[diabetes_data[feature] == 0][feature].count()
    print('{0} 0 건수는 {1}, 퍼센트는 {2:.2f} %'.format(feature, zero_count, 100*zero_count/total_count))
    
    
Glucose 0 건수는 5, 퍼센트는 0.65 %
BloodPressure 0 건수는 35, 퍼센트는 4.56 %
SkinThickness 0 건수는 227, 퍼센트는 29.56 %
Insulin 0 건수는 374, 퍼센트는 48.70 %
BMI 0 건수는 11, 퍼센트는 1.43 %

 

Replace zero values with the mean value

# zero_features 리스트 내부에 저장된 개별 피처들에 대해서 0값을 평균 값으로 대체
diabetes_data[zero_features]=diabetes_data[zero_features].replace(0, diabetes_data[zero_features].mean())

 

Apply StandardScaler to the whole feature set and train/predict again on the dataset whose zeros were replaced with the mean

 

X = diabetes_data.iloc[:, :-1]
y = diabetes_data.iloc[:, -1]

# StandardScaler 클래스를 이용해 피처 데이터 세트에 일괄적으로 스케일링 적용
scaler = StandardScaler( )
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.2, random_state = 156, stratify=y)

# 로지스틱 회귀로 학습, 예측 및 평가 수행. 
lr_clf = LogisticRegression()
lr_clf.fit(X_train , y_train)
pred = lr_clf.predict(X_test)
# roc_auc_score 수정에 따른 추가
pred_proba = lr_clf.predict_proba(X_test)[:, 1]
get_clf_eval(y_test , pred, pred_proba)

# 오차 행렬
# [[90 10]
#  [21 33]]
# 정확도: 0.7987, 정밀도: 0.7674, 재현율: 0.6111,    F1: 0.6804, AUC:0.8433

 

Measure performance while varying the classification decision threshold

from sklearn.preprocessing import Binarizer

def get_eval_by_threshold(y_test , pred_proba_c1, thresholds):
    # thresholds 리스트 객체내의 값을 차례로 iteration하면서 Evaluation 수행.
    for custom_threshold in thresholds:
        binarizer = Binarizer(threshold=custom_threshold).fit(pred_proba_c1) 
        custom_predict = binarizer.transform(pred_proba_c1)
        print('임곗값:',custom_threshold)
        # roc_auc_score 관련 수정
        get_clf_eval(y_test , custom_predict, pred_proba_c1)

 

thresholds = [0.3 , 0.33 ,0.36,0.39, 0.42 , 0.45 ,0.48, 0.50]
pred_proba = lr_clf.predict_proba(X_test)
get_eval_by_threshold(y_test, pred_proba[:,1].reshape(-1,1), thresholds )

임곗값: 0.3
오차 행렬
[[65 35]
 [11 43]]
정확도: 0.7013, 정밀도: 0.5513, 재현율: 0.7963,    F1: 0.6515, AUC:0.8433
임곗값: 0.33
오차 행렬
[[71 29]
 [11 43]]
정확도: 0.7403, 정밀도: 0.5972, 재현율: 0.7963,    F1: 0.6825, AUC:0.8433
임곗값: 0.36
오차 행렬
[[76 24]
 [15 39]]
정확도: 0.7468, 정밀도: 0.6190, 재현율: 0.7222,    F1: 0.6667, AUC:0.8433
임곗값: 0.39
오차 행렬
[[78 22]
 [16 38]]
정확도: 0.7532, 정밀도: 0.6333, 재현율: 0.7037,    F1: 0.6667, AUC:0.8433
임곗값: 0.42
오차 행렬
[[84 16]
 [18 36]]
정확도: 0.7792, 정밀도: 0.6923, 재현율: 0.6667,    F1: 0.6792, AUC:0.8433
임곗값: 0.45
오차 행렬
[[85 15]
 [18 36]]
정확도: 0.7857, 정밀도: 0.7059, 재현율: 0.6667,    F1: 0.6857, AUC:0.8433
임곗값: 0.48
오차 행렬
[[88 12]
 [19 35]]
정확도: 0.7987, 정밀도: 0.7447, 재현율: 0.6481,    F1: 0.6931, AUC:0.8433
임곗값: 0.5
오차 행렬
[[90 10]
 [21 33]]
정확도: 0.7987, 정밀도: 0.7674, 재현율: 0.6111,    F1: 0.6804, AUC:0.8433

 

# 임곗값을 0.48로 설정한 Binarizer 생성 (위에서 가장 좋은 성능을 보인 임곗값)
binarizer = Binarizer(threshold=0.48)

# 위에서 구한 lr_clf의 predict_proba() 예측 확률 array에서 1에 해당하는 컬럼값을 Binarizer변환. 
pred_th_048 = binarizer.fit_transform(pred_proba[:, 1].reshape(-1,1)) 

# roc_auc_score 관련 수정
get_clf_eval(y_test , pred_th_048, pred_proba[:, 1])

# 오차 행렬
# [[88 12]
#  [19 35]]
# 정확도: 0.7987, 정밀도: 0.7447, 재현율: 0.6481,    F1: 0.6931, AUC:0.8433

 


3-1 Accuracy — for binary classification (e.g., normal vs. abnormal, or survival predicted purely from sex), accuracy alone is rarely a reliable metric

Classification evaluation metrics: accuracy, confusion matrix, precision, recall, F1 score, ROC AUC

import numpy as np
from sklearn.base import BaseEstimator

class MyDummyClassifier(BaseEstimator):
    # fit( ) 메소드는 아무것도 학습하지 않음. 
    def fit(self, X , y=None):
        pass
    
    # predict( ) 메소드는 단순히 Sex feature가 1 이면 0 , 그렇지 않으면 1 로 예측함. 
    def predict(self, X):
        pred = np.zeros( ( X.shape[0], 1 ))
        for i in range (X.shape[0]) :
            if X['Sex'].iloc[i] == 1:
                pred[i] = 0
            else :
                pred[i] = 1
        
        return pred

 

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Null 처리 함수
def fillna(df):
    df['Age'].fillna(df['Age'].mean(),inplace=True)
    df['Cabin'].fillna('N',inplace=True)
    df['Embarked'].fillna('N',inplace=True)
    df['Fare'].fillna(0,inplace=True)
    return df

# 머신러닝 알고리즘에 불필요한 속성 제거
def drop_features(df):
    df.drop(['PassengerId','Name','Ticket'],axis=1,inplace=True)
    return df

# 레이블 인코딩 수행. 
def format_features(df):
    df['Cabin'] = df['Cabin'].str[:1]
    features = ['Cabin','Sex','Embarked']
    for feature in features:
        le = LabelEncoder()
        le = le.fit(df[feature])
        df[feature] = le.transform(df[feature])
    return df

# 앞에서 설정한 Data Preprocessing 함수  호출
def transform_features(df):
    df = fillna(df)
    df = drop_features(df)
    df = format_features(df)
    return df

 

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 원본 데이터를 재로딩, 데이터 가공, 학습데이터/테스트 데이터 분할. 
titanic_df = pd.read_csv('./titanic_train.csv')
y_titanic_df = titanic_df['Survived']
X_titanic_df= titanic_df.drop('Survived', axis=1)
X_titanic_df = transform_features(X_titanic_df)
X_train, X_test, y_train, y_test=train_test_split(X_titanic_df, y_titanic_df, \
                                                  test_size=0.2, random_state=0)

# 위에서 생성한 Dummy Classifier를 이용하여 학습/예측/평가 수행. 
myclf = MyDummyClassifier()
myclf.fit(X_train ,y_train)

mypredictions = myclf.predict(X_test)
print('Dummy Classifier의 정확도는: {0:.4f}'.format(accuracy_score(y_test , mypredictions)))

# Dummy Classifier의 정확도는: 0.7877

 

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator
from sklearn.metrics import accuracy_score
import numpy as np
import pandas as pd

class MyFakeClassifier(BaseEstimator):
    def fit(self,X,y):
        pass
    
    # 입력값으로 들어오는 X 데이터 셋의 크기만큼 모두 0값으로 만들어서 반환
    def predict(self,X):
        return np.zeros( (len(X), 1) , dtype=bool)

# 사이킷런의 내장 데이터 셋인 load_digits( )를 이용하여 MNIST 데이터 로딩
digits = load_digits()

print(digits.data)
print("### digits.data.shape:", digits.data.shape)
print(digits.target)
print("### digits.target.shape:", digits.target.shape)

[[ 0.  0.  5. ...  0.  0.  0.]
 [ 0.  0.  0. ... 10.  0.  0.]
 [ 0.  0.  0. ... 16.  9.  0.]
 ...
 [ 0.  0.  1. ...  6.  0.  0.]
 [ 0.  0.  2. ... 12.  0.  0.]
 [ 0.  0. 10. ... 12.  1.  0.]]
### digits.data.shape: (1797, 64)
[0 1 2 ... 8 9 8]
### digits.target.shape: (1797,)

 

digits.target == 7

# array([False, False, False, ..., False, False, False])

 

# digits번호가 7번이면 True이고 이를 astype(int)로 1로 변환, 7번이 아니면 False이고 0으로 변환. 
y = (digits.target == 7).astype(int)
X_train, X_test, y_train, y_test = train_test_split( digits.data, y, random_state=11)

 

# 불균형한 레이블 데이터 분포도 확인. 
print('레이블 테스트 세트 크기 :', y_test.shape)
print('테스트 세트 레이블 0 과 1의 분포도')
print(pd.Series(y_test).value_counts())

# Dummy Classifier로 학습/예측/정확도 평가
fakeclf = MyFakeClassifier()
fakeclf.fit(X_train , y_train)
fakepred = fakeclf.predict(X_test)
print('모든 예측을 0으로 하여도 정확도는:{:.3f}'.format(accuracy_score(y_test , fakepred)))
# 2진분류에서 정확도는 애매함

# 레이블 테스트 세트 크기 : (450,)
# 테스트 세트 레이블 0 과 1의 분포도
# 0    405
# 1     45
# dtype: int64
# 모든 예측을 0으로 하여도 정확도는:0.900

 

Confusion Matrix

The confusion matrix breaks binary-classification results into four quadrants of predicted class × actual class, showing for each negative/positive prediction whether it matches the actual label (the quadrants are unpacked right after the output below).

from sklearn.metrics import confusion_matrix

# 앞절의 예측 결과인 fakepred와 실제 결과인 y_test의 Confusion Matrix출력
confusion_matrix(y_test , fakepred)

# array([[405,   0],
#        [ 45,   0]], dtype=int64)
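
The four quadrants can be unpacked directly; scikit-learn lays the binary matrix out as [[TN, FP], [FN, TP]] (a small sketch, my addition):

tn, fp, fn, tp = confusion_matrix(y_test, fakepred).ravel()
print('TN={0}, FP={1}, FN={2}, TP={3}'.format(tn, fp, fn, tp))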

 

Precision and Recall

Precision = TP / (FP + TP): of the samples predicted positive, the fraction that are actually positive (keep FP low). Recall = TP / (FN + TP): of the actual positives, the fraction that were detected (keep FN low).
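
A quick numeric check of the two formulas (toy counts, my addition, not from this dataset):

tp, fp, fn = 30, 10, 20
precision = tp / (fp + tp)   # 30 / 40 = 0.75
recall = tp / (fn + tp)      # 30 / 50 = 0.60
print('precision={0:.2f}, recall={1:.2f}'.format(precision, recall))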

** Measure precision and recall from the MyFakeClassifier predictions **

from sklearn.metrics import accuracy_score, precision_score , recall_score

print("정밀도:", precision_score(y_test, fakepred))
print("재현율:", recall_score(y_test, fakepred))

# 정밀도: 0.0
# 재현율: 0.0

 

** Create a helper function that computes the confusion matrix, accuracy, precision, and recall in one call ** (these metrics are almost always reported together)

from sklearn.metrics import accuracy_score, precision_score , recall_score , confusion_matrix

def get_clf_eval(y_test , pred):
    confusion = confusion_matrix( y_test, pred)
    accuracy = accuracy_score(y_test , pred)
    precision = precision_score(y_test , pred)
    recall = recall_score(y_test , pred)
    print('오차 행렬')
    print(confusion)
    print('정확도: {0:.4f}, 정밀도: {1:.4f}, 재현율: {2:.4f}'.format(accuracy , precision ,recall))

 

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LogisticRegression

# 원본 데이터를 재로딩, 데이터 가공, 학습데이터/테스트 데이터 분할. 
titanic_df = pd.read_csv('./titanic_train.csv')
y_titanic_df = titanic_df['Survived']
X_titanic_df= titanic_df.drop('Survived', axis=1)
X_titanic_df = transform_features(X_titanic_df)

X_train, X_test, y_train, y_test = train_test_split(X_titanic_df, y_titanic_df, \
                                                    test_size=0.20, random_state=11)

lr_clf = LogisticRegression()

lr_clf.fit(X_train , y_train)
pred = lr_clf.predict(X_test)
get_clf_eval(y_test , pred)

# 오차 행렬
# [[104  14]
#  [ 13  48]]
# 정확도: 0.8492, 정밀도: 0.7742, 재현율: 0.7869

Precision/Recall Trade-off: improving one generally comes at the cost of the other.

Precision matters more when false positives (FP) must be kept low, e.g., spam filtering, where flagging a normal e-mail (actual negative) as spam (positive) is the costlier mistake.

Recall matters more when false negatives (FN) must be kept low, e.g., cancer diagnosis or financial fraud detection, where missing an actual positive is the costlier mistake.

With imbalanced label classes, the rare outcome is normally assigned the positive label 1.

The trade-off is controlled by the classification decision threshold: lowering it below 0.5 (e.g., to 0.45) turns more samples into positives, raising recall but lowering precision, while raising it (e.g., to 0.55) does the opposite, so the two metrics pull against each other like a tug of war.

** Checking the predict_proba() method **

The predict_proba() method of a scikit-learn estimator returns the predicted probability of each class; applying a custom decision threshold to these probabilities changes the final predictions (see the check after the output below).

pred_proba = lr_clf.predict_proba(X_test)
pred  = lr_clf.predict(X_test)
print('pred_proba()결과 Shape : {0}'.format(pred_proba.shape))
print('pred_proba array에서 앞 3개만 샘플로 추출 \n:', pred_proba[:3])

# 예측 확률 array 와 예측 결과값 array 를 concatenate 하여 예측 확률과 결과값을 한눈에 확인
pred_proba_result = np.concatenate([pred_proba , pred.reshape(-1,1)],axis=1)
print('두개의 class 중에서 더 큰 확률을 클래스 값으로 예측 \n',pred_proba_result[:3])
# pred는 1차원이기 때문에 reshape(-1, 1)2차원으로 바꿔줌

#첫번째 열이 0이 될 확률, 2번째 열이 1이 될 확률 // 임계값 몇일까??

pred_proba()결과 Shape : (179, 2)
pred_proba array에서 앞 3개만 샘플로 추출 
: [[0.4616653  0.5383347 ]
 [0.87862763 0.12137237]
 [0.87727002 0.12272998]]
두개의 class 중에서 더 큰 확률을 클래스 값으로 예측 
 [[0.4616653  0.5383347  1.        ]
 [0.87862763 0.12137237 0.        ]
 [0.87727002 0.12272998 0.        ]]
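
A small check (my addition): for this binary logistic regression, predict() is equivalent to thresholding the positive-class probability at 0.5 (ignoring exact ties at 0.5), which is exactly the knob the Binarizer examples below turn.

manual_pred = (pred_proba[:, 1] >= 0.5).astype(int)
print('matches predict():', np.array_equal(manual_pred, pred))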

 

** Using Binarizer **

from sklearn.preprocessing import Binarizer

X = [[ 1, -1,  2],
     [ 2,  0,  0],
     [ 0,  1.1, 1.2]]

# threshold (1.1) 기준값보다 같거나 작으면 0을, 크면 1을 반환
binarizer = Binarizer(threshold=1.1)                     
print(binarizer.fit_transform(X))

# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 0. 1.]]

 

Convert predictions with Binarizer at a classification decision threshold of 0.5

from sklearn.preprocessing import Binarizer

#Binarizer의 threshold 설정값. 분류 결정 임곗값임.  
custom_threshold = 0.5

# predict_proba( ) 반환값의 두번째 컬럼 , 즉 Positive 클래스 컬럼 하나만 추출하여 Binarizer를 적용
pred_proba_1 = pred_proba[:,1].reshape(-1,1)

binarizer = Binarizer(threshold=custom_threshold).fit(pred_proba_1) 
custom_predict = binarizer.transform(pred_proba_1)

get_clf_eval(y_test, custom_predict)

# 오차 행렬
# [[104  14]
#  [ 13  48]]
# 정확도: 0.8492, 정밀도: 0.7742, 재현율: 0.7869

 

Convert predictions with Binarizer at a classification decision threshold of 0.4

# Binarizer의 threshold 설정값을 0.4로 설정. 즉 분류 결정 임곗값을 0.5에서 0.4로 낮춤  
custom_threshold = 0.4
pred_proba_1 = pred_proba[:,1].reshape(-1,1)
binarizer = Binarizer(threshold=custom_threshold).fit(pred_proba_1) 
custom_predict = binarizer.transform(pred_proba_1)

get_clf_eval(y_test , custom_predict)

# 오차 행렬
# [[99 19]
#  [10 51]]
# 정확도: 0.8380, 정밀도: 0.7286, 재현율: 0.8361

 

Convert predictions with Binarizer over several classification decision thresholds

# 테스트를 수행할 모든 임곗값을 리스트 객체로 저장. 
thresholds = [0.4, 0.45, 0.50, 0.55, 0.60]

def get_eval_by_threshold(y_test , pred_proba_c1, thresholds):
    # thresholds list객체내의 값을 차례로 iteration하면서 Evaluation 수행.
    for custom_threshold in thresholds:
        binarizer = Binarizer(threshold=custom_threshold).fit(pred_proba_c1) 
        custom_predict = binarizer.transform(pred_proba_c1)
        print('임곗값:',custom_threshold)
        get_clf_eval(y_test , custom_predict)

get_eval_by_threshold(y_test ,pred_proba[:,1].reshape(-1,1), thresholds )

임곗값: 0.4
오차 행렬
[[99 19]
 [10 51]]
정확도: 0.8380, 정밀도: 0.7286, 재현율: 0.8361
임곗값: 0.45
오차 행렬
[[103  15]
 [ 12  49]]
정확도: 0.8492, 정밀도: 0.7656, 재현율: 0.8033
임곗값: 0.5
오차 행렬
[[104  14]
 [ 13  48]]
정확도: 0.8492, 정밀도: 0.7742, 재현율: 0.7869
임곗값: 0.55
오차 행렬
[[109   9]
 [ 15  46]]
정확도: 0.8659, 정밀도: 0.8364, 재현율: 0.7541
임곗값: 0.6
오차 행렬
[[112   6]
 [ 16  45]]
정확도: 0.8771, 정밀도: 0.8824, 재현율: 0.7377

 

Extract the precision and recall values per threshold with precision_recall_curve()

from sklearn.metrics import precision_recall_curve

# 레이블 값이 1일때의 예측 확률을 추출 
pred_proba_class1 = lr_clf.predict_proba(X_test)[:, 1] 

# 실제값 데이터 셋과 레이블 값이 1일 때의 예측 확률을 precision_recall_curve 인자로 입력 
precisions, recalls, thresholds = precision_recall_curve(y_test, pred_proba_class1 )
print('반환된 분류 결정 임곗값 배열의 Shape:', thresholds.shape)
print('반환된 precisions 배열의 Shape:', precisions.shape)
print('반환된 recalls 배열의 Shape:', recalls.shape)

print("thresholds 5 sample:", thresholds[:5])
print("precisions 5 sample:", precisions[:5])
print("recalls 5 sample:", recalls[:5])

# 반환된 임곗값 배열 로우가 143건이므로 샘플로 10건만 추출하되, 임곗값을 15 Step으로 추출. 
thr_index = np.arange(0, thresholds.shape[0], 15)
print('샘플 추출을 위한 임계값 배열의 index 10개:', thr_index)
print('샘플용 10개의 임곗값: ', np.round(thresholds[thr_index], 2))

# 15 step 단위로 추출된 임계값에 따른 정밀도와 재현율 값 
print('샘플 임계값별 정밀도: ', np.round(precisions[thr_index], 3))
print('샘플 임계값별 재현율: ', np.round(recalls[thr_index], 3))


# 반환된 분류 결정 임곗값 배열의 Shape: (143,)
# 반환된 precisions 배열의 Shape: (144,)
# 반환된 recalls 배열의 Shape: (144,)
# thresholds 5 sample: [0.10393302 0.10393523 0.10395998 0.10735757 0.10891579]
# precisions 5 sample: [0.38853503 0.38461538 0.38709677 0.38961039 0.38562092]
# recalls 5 sample: [1.         0.98360656 0.98360656 0.98360656 0.96721311]
# 샘플 추출을 위한 임계값 배열의 index 10개: [  0  15  30  45  60  75  90 105 120 135]
# 샘플용 10개의 임곗값:  [0.1  0.12 0.14 0.19 0.28 0.4  0.57 0.67 0.82 0.95]
# 샘플 임계값별 정밀도:  [0.389 0.44  0.466 0.539 0.647 0.729 0.836 0.949 0.958 1.   ]
# 샘플 임계값별 재현율:  [1.    0.967 0.902 0.902 0.902 0.836 0.754 0.607 0.377 0.148]

 

Plot how precision and recall change as the threshold varies

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
%matplotlib inline

def precision_recall_curve_plot(y_test , pred_proba_c1):
    # threshold ndarray와 이 threshold에 따른 정밀도, 재현율 ndarray 추출. 
    precisions, recalls, thresholds = precision_recall_curve( y_test, pred_proba_c1)
    
    # X축을 threshold값으로, Y축은 정밀도, 재현율 값으로 각각 Plot 수행. 정밀도는 점선으로 표시
    plt.figure(figsize=(8,6))
    threshold_boundary = thresholds.shape[0]
    plt.plot(thresholds, precisions[0:threshold_boundary], linestyle='--', label='precision')
    plt.plot(thresholds, recalls[0:threshold_boundary],label='recall')
    
    # threshold 값 X 축의 Scale을 0.1 단위로 변경
    start, end = plt.xlim()
    plt.xticks(np.round(np.arange(start, end, 0.1),2))
    
    # x축, y축 label과 legend, 그리고 grid 설정
    plt.xlabel('Threshold value'); plt.ylabel('Precision and Recall value')
    plt.legend(); plt.grid()
    plt.show()
    
precision_recall_curve_plot( y_test, lr_clf.predict_proba(X_test)[:, 1] )

F1 Score

Precision can be pushed to 100% by predicting positive only when absolutely certain (minimizing FP), and recall can be pushed to 100% by predicting every sample as positive, never predicting negative (minimizing FN); because each metric alone can be gamed this way, the F1 score is used: it is high only when precision and recall are both high and not skewed toward one side.

from sklearn.metrics import f1_score 
f1 = f1_score(y_test , pred)
print('F1 스코어: {0:.4f}'.format(f1))

F1 스코어: 0.7805
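
F1 is the harmonic mean of precision and recall, F1 = 2 * precision * recall / (precision + recall), so it only stays high when both are high. A quick manual check (my addition) that should match the value above:

precision = precision_score(y_test, pred)
recall = recall_score(y_test, pred)
manual_f1 = 2 * precision * recall / (precision + recall)
print('manual F1: {0:.4f}'.format(manual_f1))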

 

def get_clf_eval(y_test , pred):
    confusion = confusion_matrix( y_test, pred)
    accuracy = accuracy_score(y_test , pred)
    precision = precision_score(y_test , pred)
    recall = recall_score(y_test , pred)
    # F1 스코어 추가
    f1 = f1_score(y_test,pred)
    print('오차 행렬')
    print(confusion)
    # f1 score print 추가
    print('정확도: {0:.4f}, 정밀도: {1:.4f}, 재현율: {2:.4f}, F1:{3:.4f}'.format(accuracy, precision, recall, f1))

thresholds = [0.4 , 0.45 , 0.50 , 0.55 , 0.60]
pred_proba = lr_clf.predict_proba(X_test)
get_eval_by_threshold(y_test, pred_proba[:,1].reshape(-1,1), thresholds)

임곗값: 0.4
오차 행렬
[[99 19]
 [10 51]]
정확도: 0.8380, 정밀도: 0.7286, 재현율: 0.8361, F1:0.7786
임곗값: 0.45
오차 행렬
[[103  15]
 [ 12  49]]
정확도: 0.8492, 정밀도: 0.7656, 재현율: 0.8033, F1:0.7840
임곗값: 0.5
오차 행렬
[[104  14]
 [ 13  48]]
정확도: 0.8492, 정밀도: 0.7742, 재현율: 0.7869, F1:0.7805
임곗값: 0.55
오차 행렬
[[109   9]
 [ 15  46]]
정확도: 0.8659, 정밀도: 0.8364, 재현율: 0.7541, F1:0.7931
임곗값: 0.6
오차 행렬
[[112   6]
 [ 16  45]]
정확도: 0.8771, 정밀도: 0.8824, 재현율: 0.7377, F1:0.8036

 

ROC Curve and AUC

The ROC curve (Receiver Operating Characteristic, originally a tool for evaluating radio/radar receivers) plots the TPR (True Positive Rate, i.e., recall) against the FPR (False Positive Rate) as the decision threshold varies; the AUC (Area Under the Curve) is the area under that curve, and the closer it is to 1 the better. It is widely used in fields such as medical diagnosis.

from sklearn.metrics import roc_curve

# 레이블 값이 1일때의 예측 확률을 추출 
pred_proba_class1 = lr_clf.predict_proba(X_test)[:, 1] 

fprs , tprs , thresholds = roc_curve(y_test, pred_proba_class1)
# 반환된 임곗값 배열 로우가 47건이므로 샘플로 10건만 추출하되, 임곗값을 5 Step으로 추출. 
thr_index = np.arange(0, thresholds.shape[0], 5)
print('샘플 추출을 위한 임곗값 배열의 index 10개:', thr_index)
print('샘플용 10개의 임곗값: ', np.round(thresholds[thr_index], 2))

# 5 step 단위로 추출된 임계값에 따른 FPR, TPR 값
print('샘플 임곗값별 FPR: ', np.round(fprs[thr_index], 3))
print('샘플 임곗값별 TPR: ', np.round(tprs[thr_index], 3))

# 샘플 추출을 위한 임곗값 배열의 index 10개: [ 0  5 10 15 20 25 30 35 40 45 50]
# 샘플용 10개의 임곗값:  [1.97 0.75 0.63 0.59 0.49 0.4  0.31 0.15 0.12 0.11 0.1 ]
# 샘플 임곗값별 FPR:  [0.    0.017 0.034 0.059 0.127 0.161 0.237 0.483 0.61  0.703 0.814]
# 샘플 임곗값별 TPR:  [0.    0.475 0.672 0.754 0.787 0.852 0.885 0.902 0.934 0.967 0.984]
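
Since AUC is literally the area under the piecewise-linear (FPR, TPR) curve, integrating the arrays returned by roc_curve() with the trapezoidal rule should agree with roc_auc_score() used below (a small sketch, my addition):

manual_auc = np.sum(np.diff(fprs) * (tprs[1:] + tprs[:-1]) / 2)  # trapezoidal rule
print('trapezoidal AUC: {0:.4f}'.format(manual_auc))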

 

def roc_curve_plot(y_test , pred_proba_c1):
    # 임곗값에 따른 FPR, TPR 값을 반환 받음. 
    fprs , tprs , thresholds = roc_curve(y_test ,pred_proba_c1)

    # ROC Curve를 plot 곡선으로 그림. 
    plt.plot(fprs , tprs, label='ROC')
    # 가운데 대각선 직선을 그림. 
    plt.plot([0, 1], [0, 1], 'k--', label='Random')
    
    # FPR X 축의 Scale을 0.1 단위로 변경, X,Y 축명 설정등   
    start, end = plt.xlim()
    plt.xticks(np.round(np.arange(start, end, 0.1),2))
    plt.xlim(0,1); plt.ylim(0,1)
    plt.xlabel('FPR( 1 - Sensitivity )'); plt.ylabel('TPR( Recall )')
    plt.legend()
    plt.show()
    
roc_curve_plot(y_test, lr_clf.predict_proba(X_test)[:, 1] )

from sklearn.metrics import roc_auc_score

### 아래는 roc_auc_score()의 인자를 잘못 입력한 것으로, 책에서 수정이 필요한 부분입니다. 
### 책에서는 roc_auc_score(y_test, pred)로 예측 타겟값을 입력하였으나 
### roc_auc_score(y_test, y_score)로 y_score는 predict_proba()로 호출된 예측 확률 ndarray중 Positive 열에 해당하는 ndarray입니다. 

#pred = lr_clf.predict(X_test)
#roc_score = roc_auc_score(y_test, pred)

pred_proba = lr_clf.predict_proba(X_test)[:, 1]
roc_score = roc_auc_score(y_test, pred_proba)
print('ROC AUC 값: {0:.4f}'.format(roc_score))

# ROC AUC 값: 0.9024

 

def get_clf_eval(y_test, pred=None, pred_proba=None):
    confusion = confusion_matrix( y_test, pred)
    accuracy = accuracy_score(y_test , pred)
    precision = precision_score(y_test , pred)
    recall = recall_score(y_test , pred)
    f1 = f1_score(y_test,pred)
    # ROC-AUC 추가 
    roc_auc = roc_auc_score(y_test, pred_proba)
    print('오차 행렬')
    print(confusion)
    # ROC-AUC print 추가
    print('정확도: {0:.4f}, 정밀도: {1:.4f}, 재현율: {2:.4f},\
          F1: {3:.4f}, AUC:{4:.4f}'.format(accuracy, precision, recall, f1, roc_auc))

 


Predicting Titanic survivors with scikit-learn

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

titanic_df = pd.read_csv('./titanic_train.csv')
titanic_df.head(3) # 데이터 테이블 상태 확인

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
  • PassengerId: passenger serial number
  • Survived: survival, 0 = died, 1 = survived // target value
  • Pclass: ticket class, 1 = first, 2 = second, 3 = third
  • Sex: passenger sex
  • Name: passenger name
  • Age: passenger age
  • SibSp: number of siblings/spouses aboard
  • Parch: number of parents/children aboard
  • Ticket: ticket number
  • Fare: fare
  • Cabin: cabin number
  • Embarked: port of embarkation, C = Cherbourg, Q = Queenstown, S = Southampton
print('\n ### train 데이터 정보 ###  \n')
print(titanic_df.info())
# 데이터 정보 null 대체, object drop 전처리 

 ### train 데이터 정보 ###  

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None

 

** Handling the NULL columns **

# null 전처리
titanic_df['Age'].fillna(titanic_df['Age'].mean(),inplace=True)
titanic_df['Cabin'].fillna('N',inplace=True)
titanic_df['Embarked'].fillna('N',inplace=True)

print('데이터 세트 Null 값 갯수 ',titanic_df.isnull().sum().sum()) # 전체칼럼의 null이 있는지 확인 마무리

# 데이터 세트 Null 값 갯수  0

 

print('데이터 세트 Null 값 갯수 ',titanic_df.isnull().sum())

데이터 세트 Null 값 갯수  PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64

 

print(' Sex 값 분포 :\n',titanic_df['Sex'].value_counts())
print('\n Cabin 값 분포 :\n',titanic_df['Cabin'].value_counts())
print('\n Embarked 값 분포 :\n',titanic_df['Embarked'].value_counts())


 Sex 값 분포 :
 male      577
female    314
Name: Sex, dtype: int64

 Cabin 값 분포 :
 N              687
G6               4
C23 C25 C27      4
B96 B98          4
F33              3
F2               3
C22 C26          3
D                3
E101             3
C124             2
B49              2
C126             2
E67              2
C92              2
D20              2
D17              2
B20              2
F G73            2
C125             2
C83              2
C2               2
E24              2
C78              2
E33              2
B51 B53 B55      2
E8               2
C68              2
F4               2
B28              2
B35              2
              ... 
T                1
C70              1
B78              1
A10              1
E50              1
E34              1
A16              1
C106             1
C47              1
F38              1
C111             1
B38              1
C87              1
E46              1
A34              1
F G63            1
B42              1
D37              1
C85              1
E63              1
A31              1
D19              1
E40              1
B86              1
A5               1
C91              1
C46              1
A24              1
B80              1
D28              1
Name: Cabin, Length: 148, dtype: int64

 Embarked 값 분포 :
 S    644
C    168
Q     77
N      2
Name: Embarked, dtype: int64

 

titanic_df['Cabin'] = titanic_df['Cabin'].str[:1] # keep only the first letter; the .str accessor is required
print(titanic_df['Cabin'].head(3))
titanic_df['Cabin'].value_counts() # check the distribution

0    N
1    C
2    N
Name: Cabin, dtype: object
N    687
C     59
B     47
D     33
E     32
A     15
F     13
G      4
T      1
Name: Cabin, dtype: int64

 

titanic_df.groupby(['Sex','Survived'])['Survived'].count() # relationship between sex and survival

Sex     Survived
female  0            81
        1           233
male    0           468
        1           109
Name: Survived, dtype: int64

 

sns.barplot(x='Sex', y = 'Survived', data=titanic_df) # seaborn visualization

 

sns.barplot(x='Pclass', y='Survived', hue='Sex', data=titanic_df)
# cabin class vs. survival // hue splits each bar by Sex for comparison

 

# Define a function that returns an age-group label for a given age; used with DataFrame.apply and a lambda.
# Unlike Pclass, which is already categorical (1, 2, 3), Age is continuous, so we bin it ourselves
# (a pandas.cut alternative is sketched after this block).
def get_category(age):
    cat = ''
    if age <= -1: cat = 'Unknown'
    elif age <= 5: cat = 'Baby'
    elif age <= 12: cat = 'Child'
    elif age <= 18: cat = 'Teenager'
    elif age <= 25: cat = 'Student'
    elif age <= 35: cat = 'Young Adult'
    elif age <= 60: cat = 'Adult'
    else : cat = 'Elderly'
    
    return cat

# make the bar-plot figure larger
plt.figure(figsize=(10,6))

# fix the order in which the categories appear on the X axis
group_names = ['Unknown', 'Baby', 'Child', 'Teenager', 'Student', 'Young Adult', 'Adult', 'Elderly']

# Apply the get_category() function defined above through a lambda.
# get_category(x) receives each 'Age' value and returns the matching age-group label.
titanic_df['Age_cat'] = titanic_df['Age'].apply(lambda x : get_category(x))
sns.barplot(x='Age_cat', y = 'Survived', hue='Sex', data=titanic_df, order=group_names)
titanic_df.drop('Age_cat', axis=1, inplace=True)
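
The same binning can usually be written more declaratively with pandas.cut instead of apply/lambda. The sketch below is an alternative added here for comparison, not the approach used above; the bin edges are chosen to mirror the thresholds in get_category().

import numpy as np
import pandas as pd

# A sketch, assuming the same thresholds as get_category(); right=True matches the <= comparisons.
bins = [-np.inf, -1, 5, 12, 18, 25, 35, 60, np.inf]
labels = ['Unknown', 'Baby', 'Child', 'Teenager', 'Student', 'Young Adult', 'Adult', 'Elderly']
titanic_df['Age_cat'] = pd.cut(titanic_df['Age'], bins=bins, labels=labels, right=True)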



# Label encoding: loop over the 3 object columns, fitting a LabelEncoder for each
from sklearn import preprocessing

def encode_features(dataDF):
    features = ['Cabin', 'Sex', 'Embarked']
    for feature in features:
        le = preprocessing.LabelEncoder()
        le = le.fit(dataDF[feature])
        dataDF[feature] = le.transform(dataDF[feature])
        
    return dataDF

titanic_df = encode_features(titanic_df)
titanic_df.head()


PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	1	22.0	1	0	A/5 21171	7.2500	7	3
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	0	38.0	1	0	PC 17599	71.2833	2	0
2	3	1	3	Heikkinen, Miss. Laina	0	26.0	0	0	STON/O2. 3101282	7.9250	7	3
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	0	35.0	1	0	113803	53.1000	2	3
4	5	0	3	Allen, Mr. William Henry	1	35.0	0	0	373450	8.0500	7	3

 

from sklearn.preprocessing import LabelEncoder

# Function to fill nulls
def fillna(df):
    df['Age'].fillna(df['Age'].mean(),inplace=True)
    df['Cabin'].fillna('N',inplace=True)
    df['Embarked'].fillna('N',inplace=True)
    df['Fare'].fillna(0,inplace=True)
    return df

# Drop attributes the ML algorithms do not need
def drop_features(df):
    df.drop(['PassengerId','Name','Ticket'],axis=1,inplace=True) # id, name, ticket
    return df

# Perform label encoding.
def format_features(df):
    df['Cabin'] = df['Cabin'].str[:1]
    features = ['Cabin','Sex','Embarked']
    for feature in features:
        le = LabelEncoder()
        le = le.fit(df[feature])
        df[feature] = le.transform(df[feature])
    return df

# Call the preprocessing functions defined above
def transform_features(df):
    df = fillna(df)
    df = drop_features(df)
    df = format_features(df)
    return df
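
Note that fillna(), drop_features() and format_features() all modify the DataFrame that is passed in (they use inplace=True or assign columns), so transform_features() mutates its argument as well. If the raw DataFrame should survive repeated experiments, a copy-first variant is one option; the sketch below is a suggestion, not part of the original flow.

def transform_features_copy(df):
    # Hypothetical variant: work on a copy so the caller's DataFrame stays untouched.
    df = df.copy()
    df = fillna(df)
    df = drop_features(df)
    df = format_features(df)
    return df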

 

# Reload the original data and extract the feature set and the label set.
titanic_df = pd.read_csv('./titanic_train.csv')
y_titanic_df = titanic_df['Survived']
X_titanic_df= titanic_df.drop('Survived',axis=1)

X_titanic_df = transform_features(X_titanic_df)

 

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test=train_test_split(X_titanic_df, y_titanic_df, \
                                                  test_size=0.2, random_state=11)

 

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Create scikit-learn classifiers for a decision tree, random forest, and logistic regression
dt_clf = DecisionTreeClassifier(random_state=11)
rf_clf = RandomForestClassifier(random_state=11)
lr_clf = LogisticRegression()

# DecisionTreeClassifier: train / predict / evaluate
dt_clf.fit(X_train , y_train)
dt_pred = dt_clf.predict(X_test)
print('DecisionTreeClassifier 정확도: {0:.4f}'.format(accuracy_score(y_test, dt_pred)))

# RandomForestClassifier: train / predict / evaluate
rf_clf.fit(X_train , y_train)
rf_pred = rf_clf.predict(X_test)
print('RandomForestClassifier 정확도:{0:.4f}'.format(accuracy_score(y_test, rf_pred)))

# LogisticRegression: train / predict / evaluate
lr_clf.fit(X_train , y_train)
lr_pred = lr_clf.predict(X_test)
print('LogisticRegression 정확도: {0:.4f}'.format(accuracy_score(y_test, lr_pred)))


# DecisionTreeClassifier 정확도: 0.7877
# RandomForestClassifier 정확도:0.8324
# LogisticRegression 정확도: 0.8659
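
Since the three train/predict/evaluate blocks differ only in the estimator, they can be collapsed into a loop. This is purely a stylistic sketch of the same evaluation, not an additional experiment.

from sklearn.metrics import accuracy_score

# A sketch: run the identical fit/predict/accuracy cycle for each classifier.
for name, clf in [('DecisionTreeClassifier', dt_clf),
                  ('RandomForestClassifier', rf_clf),
                  ('LogisticRegression', lr_clf)]:
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    print('{0} accuracy: {1:.4f}'.format(name, accuracy_score(y_test, pred)))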

 

# Cross-validation
from sklearn.model_selection import KFold

def exec_kfold(clf, folds=5):
    # Create a KFold object with 5 folds and a list to hold the accuracy of each fold.
    kfold = KFold(n_splits=folds)
    scores = []
    
    # Run KFold cross-validation.
    for iter_count , (train_index, test_index) in enumerate(kfold.split(X_titanic_df)):
        # kfold.split() yields the train/validation indices into X_titanic_df for each fold
        X_train, X_test = X_titanic_df.values[train_index], X_titanic_df.values[test_index]
        y_train, y_test = y_titanic_df.values[train_index], y_titanic_df.values[test_index]
        
        # Train the classifier, predict, and compute accuracy
        clf.fit(X_train, y_train) 
        predictions = clf.predict(X_test)
        accuracy = accuracy_score(y_test, predictions)
        scores.append(accuracy)
        print("교차 검증 {0} 정확도: {1:.4f}".format(iter_count, accuracy))     
    
    # Average accuracy over the 5 folds.
    mean_score = np.mean(scores)
    print("평균 정확도: {0:.4f}".format(mean_score)) 
# call exec_kfold
exec_kfold(dt_clf , folds=5) 

교차 검증 0 정확도: 0.7542
교차 검증 1 정확도: 0.7809
교차 검증 2 정확도: 0.7865
교차 검증 3 정확도: 0.7697
교차 검증 4 정확도: 0.8202
평균 정확도: 0.7823

 

from sklearn.model_selection import cross_val_score

scores = cross_val_score(dt_clf, X_titanic_df , y_titanic_df , cv=5)
for iter_count,accuracy in enumerate(scores):
    print("교차 검증 {0} 정확도: {1:.4f}".format(iter_count, accuracy))

print("평균 정확도: {0:.4f}".format(np.mean(scores)))

교차 검증 0 정확도: 0.7430
교차 검증 1 정확도: 0.7765
교차 검증 2 정확도: 0.7809
교차 검증 3 정확도: 0.7753
교차 검증 4 정확도: 0.8418
평균 정확도: 0.7835
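
The scores differ slightly from the manual KFold loop above because, for classifiers, cross_val_score with an integer cv uses stratified folds that keep the label ratio similar in every split. A minimal sketch that makes the stratification explicit:

import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score

# A sketch: pass a StratifiedKFold object instead of relying on cv=5.
skfold = StratifiedKFold(n_splits=5)
scores = cross_val_score(dt_clf, X_titanic_df, y_titanic_df, cv=skfold)
print('Mean accuracy: {0:.4f}'.format(np.mean(scores)))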

 

from sklearn.model_selection import GridSearchCV

parameters = {'max_depth':[2,3,5,10],
              'min_samples_split':[2,3,5], 'min_samples_leaf':[1,5,8]}

grid_dclf = GridSearchCV(dt_clf , param_grid=parameters , scoring='accuracy' , cv=5)
grid_dclf.fit(X_train , y_train)

print('GridSearchCV 최적 하이퍼 파라미터 :',grid_dclf.best_params_)
print('GridSearchCV 최고 정확도: {0:.4f}'.format(grid_dclf.best_score_))
best_dclf = grid_dclf.best_estimator_

# Predict and evaluate with the estimator trained on the best hyperparameters found by GridSearchCV.
dpredictions = best_dclf.predict(X_test)
accuracy = accuracy_score(y_test , dpredictions)
print('테스트 세트에서의 DecisionTreeClassifier 정확도 : {0:.4f}'.format(accuracy))

# GridSearchCV 최적 하이퍼 파라미터 : {'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 2}
# GridSearchCV 최고 정확도: 0.7992
# 테스트 세트에서의 DecisionTreeClassifier 정확도 : 0.8715
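
Beyond best_params_ and best_score_, GridSearchCV also keeps the per-candidate scores in its cv_results_ attribute, which is useful for seeing how close the runner-up settings were. A small sketch (the column selection is only illustrative):

import pandas as pd

# A sketch: inspect the full grid-search results, sorted by rank.
cv_results_df = pd.DataFrame(grid_dclf.cv_results_)
print(cv_results_df[['params', 'mean_test_score', 'rank_test_score']]
      .sort_values('rank_test_score')
      .head())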

 


Data preprocessing

  • Cleansing: corrections and handling of missing values (null/NaN): remove them, impute the mean, or assign a separate code.
  • Encoding: ML algorithms only understand numbers (binary 0/1), so text values must be converted to numeric form.
  • Scaling: normalization and standardization, e.g. when features such as height and weight use different units.
  • Outlier removal.
  • Appropriate feature selection.
  • Feature extraction and engineering.

Data encoding

  • Label encoding: handles a category with a single column by mapping each value to a code number, e.g. mapping product names to product codes (refrigerator => code 13).
from sklearn.preprocessing import LabelEncoder

items=['TV','냉장고','전자렌지','컴퓨터','선풍기','선풍기','믹서','믹서']

# Create a LabelEncoder object, then perform label encoding with fit() and transform().
encoder = LabelEncoder() # create the encoder
encoder.fit(items) # fit to the unique values
labels = encoder.transform(items)
print('인코딩 변환값:',labels) # the actual encoded values

# 인코딩 변환값: [0 1 4 5 3 3 2 2]

 

print('인코딩 클래스:',encoder.classes_)

# 인코딩 클래스: ['TV' '냉장고' '믹서' '선풍기' '전자렌지' '컴퓨터']

 

print('디코딩 원본 값:',encoder.inverse_transform([4, 5, 2, 0, 1, 1, 3, 3])) # to recover the original values

# 디코딩 원본 값: ['전자렌지' '컴퓨터' '믹서' 'TV' '냉장고' '냉장고' '선풍기' '선풍기']

 

  • One-hot encoding: creates one column per category value and puts a 1 in the matching column for each row. Product names have no ordering among themselves, yet label-encoded numbers can be misread as ordered by an algorithm; one-hot encoding avoids this, and pandas' get_dummies() makes it easy.
from sklearn.preprocessing import OneHotEncoder 
import numpy as np

items=['TV','냉장고','전자렌지','컴퓨터','선풍기','선풍기','믹서','믹서']

# First convert the values to numbers with LabelEncoder (fit, then transform the data).
encoder = LabelEncoder() # create the encoder
encoder.fit(items) # fit to the unique values
labels = encoder.transform(items) # the actual encoded values

# The 1-D labels need one more step:
# reshape them into 2-D data.
labels = labels.reshape(-1,1) # one column; the number of rows is inferred

# Apply one-hot encoding.
oh_encoder = OneHotEncoder()
oh_encoder.fit(labels)
oh_labels = oh_encoder.transform(labels)

print('원-핫 인코딩 데이터')
print(oh_labels.toarray())
print('원-핫 인코딩 데이터 차원')
print(oh_labels.shape)

# get_dummies handles this whole process more cleanly, including the unique-value bookkeeping

원-핫 인코딩 데이터
[[1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0.]]
원-핫 인코딩 데이터 차원
(8, 6)

 

import pandas as pd

df = pd.DataFrame({'item':['TV','냉장고','전자렌지','컴퓨터','선풍기','선풍기','믹서','믹서'] })
df

item
0	TV
1	냉장고
2	전자렌지
3	컴퓨터
4	선풍기
5	선풍기
6	믹서
7	믹서

 

pd.get_dummies(df)

	item_TV	item_냉장고	item_믹서	item_선풍기	item_전자렌지	item_컴퓨터
0	1	0	0	0	0	0
1	0	1	0	0	0	0
2	0	0	0	0	1	0
3	0	0	0	0	0	1
4	0	0	0	1	0	0
5	0	0	0	1	0	0
6	0	0	1	0	0	0
7	0	0	1	0	0	0
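
One caveat: get_dummies only knows the categories present in the DataFrame it is given, so dummying train and test data separately can produce different columns. OneHotEncoder can instead be fitted on the training data and told to ignore unseen categories via handle_unknown='ignore'; the toy train/test items below are just an illustration.

from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Hypothetical toy data: '셀카봉' never appears in the training items.
train_items = np.array(['TV', '냉장고', '믹서']).reshape(-1, 1)
test_items = np.array(['TV', '셀카봉']).reshape(-1, 1)

oh = OneHotEncoder(handle_unknown='ignore')
oh.fit(train_items)                         # learn the categories from the training data only
print(oh.transform(test_items).toarray())   # the unseen category becomes an all-zero row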

 

Feature scaling and normalization

Standardization: transform each feature so that it follows a Gaussian distribution with mean 0 and variance 1. Normalization: rescale features of different magnitudes to a common scale.
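
In formula terms (a quick sketch consistent with the definitions above, on made-up numbers): standardization computes x' = (x - mean) / std, while min-max normalization computes x' = (x - min) / (max - min).

import numpy as np

x = np.array([170., 180., 150., 160.])  # hypothetical heights in cm

standardized = (x - x.mean()) / x.std()           # mean 0, variance 1
normalized = (x - x.min()) / (x.max() - x.min())  # values rescaled into [0, 1]
print(standardized, normalized)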

 

  • StandardScaler: transforms features to a distribution with mean 0 and variance 1
from sklearn.datasets import load_iris
import pandas as pd
# Load the iris dataset and convert it to a DataFrame.
iris = load_iris()
iris_data = iris.data
iris_df = pd.DataFrame(data=iris_data, columns=iris.feature_names)

print('feature 들의 평균 값')
print(iris_df.mean())
print('\nfeature 들의 분산 값')
print(iris_df.var())

feature 들의 평균 값
sepal length (cm)    5.843333
sepal width (cm)     3.054000
petal length (cm)    3.758667
petal width (cm)     1.198667
dtype: float64

feature 들의 분산 값
sepal length (cm)    0.685694
sepal width (cm)     0.188004
petal length (cm)    3.113179
petal width (cm)     0.582414
dtype: float64

 

from sklearn.preprocessing import StandardScaler

# create a StandardScaler object
scaler = StandardScaler()
# Transform the dataset with StandardScaler by calling fit() and transform().
scaler.fit(iris_df)
iris_scaled = scaler.transform(iris_df)

# transform() returns the scaled data as a numpy ndarray, so convert it back to a DataFrame
iris_df_scaled = pd.DataFrame(data=iris_scaled, columns=iris.feature_names)
print('feature 들의 평균 값')
print(iris_df_scaled.mean())
print('\nfeature 들의 분산 값')
print(iris_df_scaled.var())

feature 들의 평균 값
sepal length (cm)   -1.690315e-15
sepal width (cm)    -1.637024e-15
petal length (cm)   -1.482518e-15
petal width (cm)    -1.623146e-15
dtype: float64

feature 들의 분산 값
sepal length (cm)    1.006711
sepal width (cm)     1.006711
petal length (cm)    1.006711
petal width (cm)     1.006711
dtype: float64
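
The scaled variances print as 1.006711 rather than exactly 1 because StandardScaler divides by the population standard deviation (ddof=0) while DataFrame.var() defaults to the sample formula (ddof=1); with 150 iris rows the ratio is 150/149. A quick check of that arithmetic:

# Sample variance = population variance * N / (N - 1); here N = 150.
print(150 / 149)                   # 1.00671...
print(iris_df_scaled.var(ddof=0))  # population variance: 1.0 for every feature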

 

  • MinMaxScaler: rescales values into the 0–1 range (into -1–1 when negative values are present)
from sklearn.preprocessing import MinMaxScaler

# create a MinMaxScaler object
scaler = MinMaxScaler()
# Transform the dataset with MinMaxScaler by calling fit() and transform().
scaler.fit(iris_df)
iris_scaled = scaler.transform(iris_df)

# transform() returns the scaled data as a numpy ndarray, so convert it back to a DataFrame
iris_df_scaled = pd.DataFrame(data=iris_scaled, columns=iris.feature_names)
print('feature들의 최소 값')
print(iris_df_scaled.min())
print('\nfeature들의 최대 값')
print(iris_df_scaled.max())

feature들의 최소 값
sepal length (cm)    0.0
sepal width (cm)     0.0
petal length (cm)    0.0
petal width (cm)     0.0
dtype: float64

feature들의 최대 값
sepal length (cm)    1.0
sepal width (cm)     1.0
petal length (cm)    1.0
petal width (cm)     1.0
dtype: float64
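
A practical caution when these scalers meet a train/test split: fit the scaler on the training data only and reuse that same fit to transform the test data, otherwise the two sets end up scaled against different criteria. A minimal sketch (variable names are illustrative):

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max from the training data
X_test_scaled = scaler.transform(X_test)        # apply the SAME min/max to the test data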

 
