27. Pro Baseball Salary Prediction || OLS, Heatmap

2021. 11. 24. 15:05

picher_stats_2017.csv

batter_stats_2017.csv

# Pro baseball salary prediction
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
picher_file_path = 'picher_stats_2017.csv'
batter_file_path = 'batter_stats_2017.csv'
picher = pd.read_csv(picher_file_path)
batter = pd.read_csv(batter_file_path)
batter.columns

Index(['선수명', '팀명', '경기', '타석', '타수', '안타', '홈런', '득점', '타점', '볼넷', '삼진', '도루',
       'BABIP', '타율', '출루율', '장타율', 'OPS', 'wOBA', 'WAR', '연봉(2018)',
       '연봉(2017)'],
      dtype='object')

pi_fea_df = picher[['승','패','세','홀드','블론','경기','선발','이닝','삼진/9',
                    '볼넷/9','홈런/9','BABIP','LOB%','ERA','RA9-WAR','FIP','kFIP','WAR','연봉(2018)','연봉(2017)']]
pi_fea_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 152 entries, 0 to 151
Data columns (total 20 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   승         152 non-null    int64  
 1   패         152 non-null    int64  
 2   세         152 non-null    int64  
 3   홀드        152 non-null    int64  
 4   블론        152 non-null    int64  
 5   경기        152 non-null    int64  
 6   선발        152 non-null    int64  
 7   이닝        152 non-null    float64
 8   삼진/9      152 non-null    float64
 9   볼넷/9      152 non-null    float64
 10  홈런/9      152 non-null    float64
 11  BABIP     152 non-null    float64
 12  LOB%      152 non-null    float64
 13  ERA       152 non-null    float64
 14  RA9-WAR   152 non-null    float64
 15  FIP       152 non-null    float64
 16  kFIP      152 non-null    float64
 17  WAR       152 non-null    float64
 18  연봉(2018)  152 non-null    int64  
 19  연봉(2017)  152 non-null    int64  
dtypes: float64(11), int64(9)
memory usage: 23.9 KB

picher = picher.rename(columns = {'연봉(2018)':'y'})

team_encoding = pd.get_dummies(picher['팀명'])
team_encoding.head()

KIA	KT	LG	NC	SK	두산	롯데	삼성	한화
0	0	0	0	0	1	0	0	0	0
1	0	0	1	0	0	0	0	0	0
2	1	0	0	0	0	0	0	0	0
3	0	0	1	0	0	0	0	0	0
4	0	0	0	0	0	0	1	0	0

picher = pd.concat([picher, team_encoding], axis=1)
picher.head()

	선수명	팀명	승	패	세	홀드	블론	경기	선발	이닝	...	연봉(2017)	KIA	KT	LG	NC	SK	두산	롯데	삼성	한화
0	켈리	SK	16	7	0	0	0	30	30	190.0	...	85000	0	0	0	0	1	0	0	0	0
1	소사	LG	11	11	1	0	0	30	29	185.1	...	50000	0	0	1	0	0	0	0	0	0
2	양현종	KIA	20	6	0	0	0	31	31	193.1	...	150000	1	0	0	0	0	0	0	0	0
3	차우찬	LG	10	7	0	0	0	28	28	175.2	...	100000	0	0	1	0	0	0	0	0	0
4	레일리	롯데	13	7	0	0	0	30	30	187.1	...	85000	0	0	0	0	0	0	1	0	0
5 rows × 31 columns

picher = picher.drop('팀명', axis=1)

x = picher[picher.columns.difference(['선수명', 'y'])]
y = picher['y']

from sklearn import preprocessing as pp
x = pp.StandardScaler().fit(x).transform(x)

pd.options.mode.chained_assignment = None  # suppress pandas' SettingWithCopyWarning
# standardization (z-score) function
def standard_scaling(df, scale_columns) :
    for col in scale_columns :
        s_mean = df[col].mean()
        s_std = df[col].std()
        df[col] =  df[col].apply(lambda x : (x - s_mean)/s_std)
    return df
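
The function above divides by pandas' `.std()`, which defaults to the sample standard deviation (ddof=1), while sklearn's `StandardScaler` divides by the population standard deviation (ddof=0). The two give slightly different z-scores. A minimal stdlib-only sketch of the difference, using the '승' values of the first five pitchers shown earlier; the helper name `z_scores` is made up for this illustration:

```python
import statistics

def z_scores(values, sample=True):
    """Return (x - mean) / std for each value.

    sample=True  -> sample standard deviation (ddof=1, pandas Series.std default)
    sample=False -> population standard deviation (ddof=0, as in StandardScaler)
    """
    mean = statistics.fmean(values)
    std = statistics.stdev(values) if sample else statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# '승' (wins) of the first five pitchers in picher.head()
wins = [16, 11, 20, 10, 13]

z_sample = z_scores(wins, sample=True)
z_population = z_scores(wins, sample=False)

print([round(z, 3) for z in z_sample])
print([round(z, 3) for z in z_population])
```

With 152 rows the gap between the two is small, but it is why the hand-written scaler and `StandardScaler` will not match digit for digit.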

pi_fea_df = picher[['승','패','세','홀드','블론','경기','선발','이닝','삼진/9',
                    '볼넷/9','홈런/9','BABIP','LOB%','ERA','RA9-WAR','FIP','kFIP','WAR','연봉(2017)']]
pi_fea_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 152 entries, 0 to 151
Data columns (total 19 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   승         152 non-null    float64
 1   패         152 non-null    float64
 2   세         152 non-null    float64
 3   홀드        152 non-null    float64
 4   블론        152 non-null    float64
 5   경기        152 non-null    float64
 6   선발        152 non-null    float64
 7   이닝        152 non-null    float64
 8   삼진/9      152 non-null    float64
 9   볼넷/9      152 non-null    float64
 10  홈런/9      152 non-null    float64
 11  BABIP     152 non-null    float64
 12  LOB%      152 non-null    float64
 13  ERA       152 non-null    float64
 14  RA9-WAR   152 non-null    float64
 15  FIP       152 non-null    float64
 16  kFIP      152 non-null    float64
 17  WAR       152 non-null    float64
 18  연봉(2017)  152 non-null    int64  
dtypes: float64(18), int64(1)
memory usage: 22.7 KB

picher_df = standard_scaling(picher, pi_fea_df.columns)

# standardized x
x = picher[picher_df.columns.difference(['선수명', 'y'])]
x.head()

	BABIP	ERA	FIP	KIA	KT	LG	LOB%	NC	RA9-WAR	SK	...	삼진/9	선발	세	승	연봉(2017)	이닝	패	한화	홀드	홈런/9
0	0.016783	-0.587056	-0.971030	0	0	0	0.446615	0	3.174630	1	...	0.672099	2.452068	-0.306452	3.313623	2.734705	2.645175	1.227145	0	-0.585705	-0.442382
1	-0.241686	-0.519855	-1.061888	0	0	1	-0.122764	0	3.114968	0	...	0.134531	2.349505	-0.098502	2.019505	1.337303	2.547755	2.504721	0	-0.585705	-0.668521
2	-0.095595	-0.625456	-0.837415	1	0	0	0.308584	0	2.973948	0	...	0.109775	2.554632	-0.306452	4.348918	5.329881	2.706808	0.907751	0	-0.585705	-0.412886
3	-0.477680	-0.627856	-0.698455	0	0	1	0.558765	0	2.740722	0	...	0.350266	2.246942	-0.306452	1.760682	3.333592	2.350927	1.227145	0	-0.585705	-0.186746
4	-0.196735	-0.539055	-0.612941	0	0	0	0.481122	0	2.751570	0	...	0.155751	2.452068	-0.306452	2.537153	2.734705	2.587518	1.227145	0	-0.585705	-0.294900
5 rows × 28 columns

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=19)

picher.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 152 entries, 0 to 151
Data columns (total 30 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   선수명       152 non-null    object 
 1   승         152 non-null    float64
 2   패         152 non-null    float64
 3   세         152 non-null    float64
 4   홀드        152 non-null    float64
 5   블론        152 non-null    float64
 6   경기        152 non-null    float64
 7   선발        152 non-null    float64
 8   이닝        152 non-null    float64
 9   삼진/9      152 non-null    float64
 10  볼넷/9      152 non-null    float64
 11  홈런/9      152 non-null    float64
 12  BABIP     152 non-null    float64
 13  LOB%      152 non-null    float64
 14  ERA       152 non-null    float64
 15  RA9-WAR   152 non-null    float64
 16  FIP       152 non-null    float64
 17  kFIP      152 non-null    float64
 18  WAR       152 non-null    float64
 19  y         152 non-null    int64  
 20  연봉(2017)  152 non-null    float64
 21  KIA       152 non-null    uint8  
 22  KT        152 non-null    uint8  
 23  LG        152 non-null    uint8  
 24  NC        152 non-null    uint8  
 25  SK        152 non-null    uint8  
 26  두산        152 non-null    uint8  
 27  롯데        152 non-null    uint8  
 28  삼성        152 non-null    uint8  
 29  한화        152 non-null    uint8  
dtypes: float64(19), int64(1), object(1), uint8(9)
memory usage: 26.4+ KB

# OLS regression with statsmodels
import statsmodels.api as sm

x_train = sm.add_constant(x_train)
model = sm.OLS(y_train, x_train).fit()
model.summary()

OLS Regression Results
Dep. Variable:	y	R-squared:	0.928
Model:	OLS	Adj. R-squared:	0.907
Method:	Least Squares	F-statistic:	44.19
Date:	Tue, 20 Jul 2021	Prob (F-statistic):	7.70e-42
Time:	10:11:19	Log-Likelihood:	-1247.8
No. Observations:	121	AIC:	2552.
Df Residuals:	93	BIC:	2630.
Df Model:	27		
Covariance Type:	nonrobust		
coef	std err	t	P>|t|	[0.025	0.975]
const	1.872e+04	775.412	24.136	0.000	1.72e+04	2.03e+04
x1	-1476.1375	1289.136	-1.145	0.255	-4036.106	1083.831
x2	-415.3144	2314.750	-0.179	0.858	-5011.949	4181.320
x3	-9.383e+04	9.4e+04	-0.998	0.321	-2.8e+05	9.28e+04
x4	-485.0276	671.883	-0.722	0.472	-1819.254	849.199
x5	498.2459	695.803	0.716	0.476	-883.480	1879.972
x6	-262.5237	769.196	-0.341	0.734	-1789.995	1264.948
x7	-1371.0060	1559.650	-0.879	0.382	-4468.162	1726.150
x8	-164.7210	760.933	-0.216	0.829	-1675.784	1346.342
x9	3946.0617	2921.829	1.351	0.180	-1856.111	9748.235
x10	269.1233	721.020	0.373	0.710	-1162.679	1700.926
x11	1.024e+04	2523.966	4.057	0.000	5226.545	1.53e+04
x12	7.742e+04	7.93e+04	0.977	0.331	-8e+04	2.35e+05
x13	-2426.3684	2943.799	-0.824	0.412	-8272.169	3419.432
x14	-285.5830	781.560	-0.365	0.716	-1837.606	1266.440
x15	111.1761	758.548	0.147	0.884	-1395.150	1617.502
x16	7587.0753	6254.661	1.213	0.228	-4833.443	2e+04
x17	1266.8570	1238.036	1.023	0.309	-1191.636	3725.350
x18	-972.1837	817.114	-1.190	0.237	-2594.810	650.443
x19	5379.1903	7262.214	0.741	0.461	-9042.128	1.98e+04
x20	-4781.4961	5471.265	-0.874	0.384	-1.56e+04	6083.352
x21	-249.8717	1291.108	-0.194	0.847	-2813.757	2314.014
x22	235.2476	2207.965	0.107	0.915	-4149.333	4619.828
x23	1.907e+04	1266.567	15.055	0.000	1.66e+04	2.16e+04
x24	851.2121	6602.114	0.129	0.898	-1.23e+04	1.4e+04
x25	1297.3310	1929.556	0.672	0.503	-2534.385	5129.047
x26	1199.4709	720.099	1.666	0.099	-230.503	2629.444
x27	-931.9918	1632.526	-0.571	0.569	-4173.865	2309.882
x28	1.808e+04	1.67e+04	1.082	0.282	-1.51e+04	5.13e+04
Omnibus:	28.069	Durbin-Watson:	2.025
Prob(Omnibus):	0.000	Jarque-Bera (JB):	194.274
Skew:	-0.405	Prob(JB):	6.52e-43
Kurtosis:	9.155	Cond. No.	1.23e+16


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 5.36e-30. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

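# R-squared (coefficient of determination)
The share of the variation in the dependent variable that is explained by the independent variables.
It equals the square of the correlation between the actual and the fitted values.

# Adj. R-squared (adjusted coefficient of determination)
R-squared never decreases as independent variables are added, so the adjusted version corrects it
for the sample size and the number of independent variables. Use it when running multiple regression.
P>|t|: the p-value that shows whether each feature's test statistic (the t statistic) is significant.
If the p-value is below 0.05, the feature is statistically significant in the regression.
In this analysis three features are below 0.05: WAR, 연봉(2017), and 한화
=> these are the significant features in the regression.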
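
As a quick check of the adjusted R-squared idea, the value in the summary can be reproduced from the other numbers in the same table. This sketch uses the usual formula 1 - (1 - R²)(n - 1)/(n - p - 1); the function name is only for illustration:

```python
def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Values read from the OLS summary above: R-squared 0.928,
# 121 observations, Df Model 27 (so Df Residuals = 121 - 27 - 1 = 93).
adj = adjusted_r_squared(0.928, 121, 27)
print(round(adj, 3))  # 0.907, matching "Adj. R-squared" in the summary
```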

x_train = sm.add_constant(x_train)
model = sm.OLS(y_train, x_train).fit()
model.summary()

OLS Regression Results
Dep. Variable:	y	R-squared:	0.928
Model:	OLS	Adj. R-squared:	0.907
Method:	Least Squares	F-statistic:	44.19
Date:	Tue, 20 Jul 2021	Prob (F-statistic):	7.70e-42
Time:	10:53:15	Log-Likelihood:	-1247.8
No. Observations:	121	AIC:	2552.
Df Residuals:	93	BIC:	2630.
Df Model:	27		
Covariance Type:	nonrobust		
coef	std err	t	P>|t|	[0.025	0.975]
const	1.678e+04	697.967	24.036	0.000	1.54e+04	1.82e+04
BABIP	-1481.0173	1293.397	-1.145	0.255	-4049.448	1087.414
ERA	-416.6874	2322.402	-0.179	0.858	-5028.517	4195.143
FIP	-9.414e+04	9.43e+04	-0.998	0.321	-2.81e+05	9.31e+04
KIA	303.1852	2222.099	0.136	0.892	-4109.462	4715.833
KT	3436.0520	2133.084	1.611	0.111	-799.831	7671.935
LG	1116.9978	2403.317	0.465	0.643	-3655.513	5889.509
LOB%	-1375.5383	1564.806	-0.879	0.382	-4482.933	1731.857
NC	1340.5004	2660.966	0.504	0.616	-3943.651	6624.652
RA9-WAR	3959.1065	2931.488	1.351	0.180	-1862.247	9780.460
SK	2762.4237	2243.540	1.231	0.221	-1692.803	7217.650
WAR	1.027e+04	2532.309	4.057	0.000	5243.823	1.53e+04
kFIP	7.767e+04	7.95e+04	0.977	0.331	-8.03e+04	2.36e+05
경기	-2434.3895	2953.530	-0.824	0.412	-8299.515	3430.736
두산	971.9293	2589.849	0.375	0.708	-4170.998	6114.857
롯데	2313.9585	2566.009	0.902	0.370	-2781.627	7409.544
볼넷/9	7612.1566	6275.338	1.213	0.228	-4849.421	2.01e+04
블론	1271.0450	1242.128	1.023	0.309	-1195.576	3737.666
삼성	-946.5092	2482.257	-0.381	0.704	-5875.780	3982.762
삼진/9	5396.9728	7286.221	0.741	0.461	-9072.019	1.99e+04
선발	-4797.3028	5489.352	-0.874	0.384	-1.57e+04	6103.463
세	-250.6977	1295.377	-0.194	0.847	-2823.059	2321.663
승	236.0253	2215.264	0.107	0.915	-4163.049	4635.100
연봉(2017)	1.913e+04	1270.754	15.055	0.000	1.66e+04	2.17e+04
이닝	854.0260	6623.940	0.129	0.898	-1.23e+04	1.4e+04
패	1301.6197	1935.935	0.672	0.503	-2542.763	5146.003
한화	5477.8879	2184.273	2.508	0.014	1140.355	9815.421
홀드	-935.0728	1637.923	-0.571	0.569	-4187.663	2317.518
홈런/9	1.814e+04	1.68e+04	1.082	0.282	-1.52e+04	5.14e+04
Omnibus:	28.069	Durbin-Watson:	2.025
Prob(Omnibus):	0.000	Jarque-Bera (JB):	194.274
Skew:	-0.405	Prob(JB):	6.52e-43
Kurtosis:	9.155	Cond. No.	3.63e+16


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 6.04e-31. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

plt.rcParams['figure.figsize'] = [20, 16]
plt.rc('font', family = 'Malgun Gothic')
coefs = model.params.tolist()
coefs_series = pd.Series(coefs)
x_labels = model.params.index.tolist()
ax = coefs_series.plot(kind = 'bar')
ax.set_title('feature_coef_graph')
ax.set_xlabel('x_feature')
ax.set_ylabel('coef')
ax.set_xticklabels(x_labels)

[Text(0, 0, 'const'),
 Text(1, 0, 'BABIP'),
 Text(2, 0, 'ERA'),
 Text(3, 0, 'FIP'),
 Text(4, 0, 'KIA'),
 Text(5, 0, 'KT'),
 Text(6, 0, 'LG'),
 Text(7, 0, 'LOB%'),
 Text(8, 0, 'NC'),
 Text(9, 0, 'RA9-WAR'),
 Text(10, 0, 'SK'),
 Text(11, 0, 'WAR'),
 Text(12, 0, 'kFIP'),
 Text(13, 0, '경기'),
 Text(14, 0, '두산'),
 Text(15, 0, '롯데'),
 Text(16, 0, '볼넷/9'),
 Text(17, 0, '블론'),
 Text(18, 0, '삼성'),
 Text(19, 0, '삼진/9'),
 Text(20, 0, '선발'),
 Text(21, 0, '세'),
 Text(22, 0, '승'),
 Text(23, 0, '연봉(2017)'),
 Text(24, 0, '이닝'),
 Text(25, 0, '패'),
 Text(26, 0, '한화'),
 Text(27, 0, '홀드'),
 Text(28, 0, '홈런/9')]

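High multicollinearity means the features are too strongly correlated with one another.
Such features should be left out to keep the analysis stable.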

from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
vif['VIF Factor'] = [variance_inflation_factor(x.values, i) for i in range(x.shape[1])]
vif['features'] = x.columns
vif.round(1)

	VIF Factor	features
0	3.2	BABIP
1	10.6	ERA
2	14238.3	FIP
3	1.1	KIA
4	1.1	KT
5	1.1	LG
6	4.3	LOB%
7	1.1	NC
8	13.6	RA9-WAR
9	1.1	SK
10	10.4	WAR
11	10264.1	kFIP
12	14.6	경기
13	1.2	두산
14	1.1	롯데
15	57.8	볼넷/9
16	3.0	블론
17	1.2	삼성
18	89.5	삼진/9
19	39.6	선발
20	3.1	세
21	8.0	승
22	2.5	연봉(2017)
23	63.8	이닝
24	5.9	패
25	1.1	한화
26	3.8	홀드
27	425.6	홈런/9

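High correlation between variables hurts the analysis.
VIF (variance inflation factor) check:
    A VIF above roughly 10 to 15 is usually taken as a sign of a multicollinearity problem.
    Here that applies to 홈런/9, 이닝, 선발, 삼진/9, 볼넷/9, 경기, kFIP, and FIP.
    kFIP and FIP in particular are so similar that they inflate each other, so one of the two has to be dropped.
1. Remove the features with a high VIF; for near-duplicate features, drop one of the pair.
2. Fit the model again and re-check collinearity.
3. From the results, select the features whose p-values are significant.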
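
To make the VIF numbers less of a black box, here is a small sketch that computes VIF straight from its definition: regress one column on the others and take 1 / (1 - R²). It runs on synthetic data, not the pitcher stats, and the helper `vif` is hypothetical. Note that this sketch adds an intercept to each regression; as far as I know, statsmodels' `variance_inflation_factor` uses the columns exactly as passed, so results can differ on columns that are not centered.

```python
import numpy as np

def vif(X, i):
    """VIF of column i: regress it on the other columns (plus an intercept)
    and return 1 / (1 - R^2)."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # intercept + other columns
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - resid.var() / y.var()
    return 1 / (1 - r2)

rng = np.random.default_rng(0)            # synthetic data for illustration only
a = rng.normal(size=200)
b = rng.normal(size=200)                  # unrelated to a
c = a + 0.05 * rng.normal(size=200)       # near-duplicate of a, like FIP vs kFIP
X = np.column_stack([a, b, c])

vifs = [vif(X, i) for i in range(X.shape[1])]
print([round(v, 1) for v in vifs])
```

The two near-duplicate columns get very large VIFs while the unrelated one stays close to 1, which is the same pattern as FIP and kFIP in the table above.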

# Pick suitable features and train again
# Plot the correlation coefficients between features
scale_columns = ['승','패','세','홀드','블론','경기','선발','이닝','삼진/9',
                    '볼넷/9','홈런/9','BABIP','LOB%','ERA','RA9-WAR','FIP','kFIP','WAR','연봉(2017)']
picher_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 152 entries, 0 to 151
Data columns (total 30 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   선수명       152 non-null    object 
 1   승         152 non-null    float64
 2   패         152 non-null    float64
 3   세         152 non-null    float64
 4   홀드        152 non-null    float64
 5   블론        152 non-null    float64
 6   경기        152 non-null    float64
 7   선발        152 non-null    float64
 8   이닝        152 non-null    float64
 9   삼진/9      152 non-null    float64
 10  볼넷/9      152 non-null    float64
 11  홈런/9      152 non-null    float64
 12  BABIP     152 non-null    float64
 13  LOB%      152 non-null    float64
 14  ERA       152 non-null    float64
 15  RA9-WAR   152 non-null    float64
 16  FIP       152 non-null    float64
 17  kFIP      152 non-null    float64
 18  WAR       152 non-null    float64
 19  y         152 non-null    int64  
 20  연봉(2017)  152 non-null    float64
 21  KIA       152 non-null    uint8  
 22  KT        152 non-null    uint8  
 23  LG        152 non-null    uint8  
 24  NC        152 non-null    uint8  
 25  SK        152 non-null    uint8  
 26  두산        152 non-null    uint8  
 27  롯데        152 non-null    uint8  
 28  삼성        152 non-null    uint8  
 29  한화        152 non-null    uint8  
dtypes: float64(19), int64(1), object(1), uint8(9)
memory usage: 26.4+ KB

corr = picher_df[scale_columns].corr(method='pearson')
corr

	승	패	세	홀드	블론	경기	선발	이닝	삼진/9	볼넷/9	홈런/9	BABIP	LOB%	ERA	RA9-WAR	FIP	kFIP	WAR	연봉(2017)
승	1.000000	0.710749	0.053747	0.092872	0.105281	0.397074	0.773560	0.906093	0.078377	-0.404710	-0.116147	-0.171111	0.131178	-0.271086	0.851350	-0.303133	-0.314159	0.821420	0.629710
패	0.710749	1.000000	0.066256	0.098617	0.121283	0.343147	0.771395	0.829018	0.031755	-0.386313	-0.064467	-0.133354	-0.020994	-0.188036	0.595989	-0.233416	-0.238688	0.625641	0.429227
세	0.053747	0.066256	1.000000	0.112716	0.605229	0.434290	-0.177069	0.020278	0.170436	-0.131394	-0.073111	-0.089212	0.167557	-0.150348	0.167669	-0.199746	-0.225259	0.084151	0.262664
홀드	0.092872	0.098617	0.112716	1.000000	0.490076	0.715527	-0.285204	0.024631	0.186790	-0.146806	-0.076475	-0.104307	0.048123	-0.155712	0.003526	-0.211515	-0.237353	-0.038613	-0.001213
블론	0.105281	0.121283	0.605229	0.490076	1.000000	0.630526	-0.264160	0.014176	0.188423	-0.137019	-0.064804	-0.112480	0.100633	-0.160761	0.008766	-0.209014	-0.237815	-0.058213	0.146584
경기	0.397074	0.343147	0.434290	0.715527	0.630526	1.000000	-0.037443	0.376378	0.192487	-0.364293	-0.113545	-0.241608	0.105762	-0.320177	0.281595	-0.345351	-0.373777	0.197836	0.225357
선발	0.773560	0.771395	-0.177069	-0.285204	-0.264160	-0.037443	1.000000	0.894018	-0.055364	-0.312935	-0.058120	-0.098909	0.041819	-0.157775	0.742258	-0.151040	-0.142685	0.758846	0.488559
이닝	0.906093	0.829018	0.020278	0.024631	0.014176	0.376378	0.894018	1.000000	0.037343	-0.451101	-0.107063	-0.191514	0.103369	-0.285392	0.853354	-0.296768	-0.302288	0.832609	0.586874
삼진/9	0.078377	0.031755	0.170436	0.186790	0.188423	0.192487	-0.055364	0.037343	1.000000	0.109345	0.216017	0.457523	-0.071284	0.256840	0.102963	-0.154857	-0.317594	0.151791	0.104948
볼넷/9	-0.404710	-0.386313	-0.131394	-0.146806	-0.137019	-0.364293	-0.312935	-0.451101	0.109345	1.000000	0.302251	0.276009	-0.150837	0.521039	-0.398586	0.629833	0.605008	-0.394131	-0.332379
홈런/9	-0.116147	-0.064467	-0.073111	-0.076475	-0.064804	-0.113545	-0.058120	-0.107063	0.216017	0.302251	1.000000	0.362614	-0.274543	0.629912	-0.187210	0.831042	0.743623	-0.205014	-0.100896
BABIP	-0.171111	-0.133354	-0.089212	-0.104307	-0.112480	-0.241608	-0.098909	-0.191514	0.457523	0.276009	0.362614	1.000000	-0.505478	0.733109	-0.187058	0.251126	0.166910	-0.082995	-0.088754
LOB%	0.131178	-0.020994	0.167557	0.048123	0.100633	0.105762	0.041819	0.103369	-0.071284	-0.150837	-0.274543	-0.505478	1.000000	-0.720091	0.286893	-0.288050	-0.269536	0.144191	0.110424
ERA	-0.271086	-0.188036	-0.150348	-0.155712	-0.160761	-0.320177	-0.157775	-0.285392	0.256840	0.521039	0.629912	0.733109	-0.720091	1.000000	-0.335584	0.648004	0.582057	-0.261508	-0.203305
RA9-WAR	0.851350	0.595989	0.167669	0.003526	0.008766	0.281595	0.742258	0.853354	0.102963	-0.398586	-0.187210	-0.187058	0.286893	-0.335584	1.000000	-0.366308	-0.377679	0.917299	0.643375
FIP	-0.303133	-0.233416	-0.199746	-0.211515	-0.209014	-0.345351	-0.151040	-0.296768	-0.154857	0.629833	0.831042	0.251126	-0.288050	0.648004	-0.366308	1.000000	0.984924	-0.391414	-0.268005
kFIP	-0.314159	-0.238688	-0.225259	-0.237353	-0.237815	-0.373777	-0.142685	-0.302288	-0.317594	0.605008	0.743623	0.166910	-0.269536	0.582057	-0.377679	0.984924	1.000000	-0.408283	-0.282666
WAR	0.821420	0.625641	0.084151	-0.038613	-0.058213	0.197836	0.758846	0.832609	0.151791	-0.394131	-0.205014	-0.082995	0.144191	-0.261508	0.917299	-0.391414	-0.408283	1.000000	0.675794
연봉(2017)	0.629710	0.429227	0.262664	-0.001213	0.146584	0.225357	0.488559	0.586874	0.104948	-0.332379	-0.100896	-0.088754	0.110424	-0.203305	0.643375	-0.268005	-0.282666	0.675794	1.000000
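
Reading a 19x19 table by eye is error-prone, so one option is to list the strongly correlated pairs programmatically. The sketch below copies four columns from the table above into a small DataFrame; `high_corr_pairs` and the 0.9 threshold are assumptions made for this illustration, and the same function could be called on the full `corr`.

```python
import pandas as pd

cols = ['FIP', 'kFIP', 'WAR', 'RA9-WAR']
# Values copied from the correlation table above
corr_subset = pd.DataFrame(
    [[ 1.000000,  0.984924, -0.391414, -0.366308],
     [ 0.984924,  1.000000, -0.408283, -0.377679],
     [-0.391414, -0.408283,  1.000000,  0.917299],
     [-0.366308, -0.377679,  0.917299,  1.000000]],
    index=cols, columns=cols)

def high_corr_pairs(corr, threshold=0.9):
    """Return (feature_a, feature_b, r) for every pair with |r| >= threshold,
    strongest first. Each pair is reported once."""
    pairs = []
    names = list(corr.columns)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = corr.loc[a, b]
            if abs(r) >= threshold:
                pairs.append((a, b, float(r)))
    return sorted(pairs, key=lambda p: -abs(p[2]))

pairs = high_corr_pairs(corr_subset)
print(pairs)
```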

# heatmap visualization
import seaborn as sns
show_cols = ['win', 'lose','save','hold','blon','match','start','inning','strike3',
            'ball4','homerun','BABIP','LOB','ERA','RA9-WAR','FIP','kFIP','WAR','2017']
plt.rc('font', family = 'Nanum Gothic')
sns.set(font_scale=0.8)
hm = sns.heatmap(corr.values,
                cbar = True,
                annot = True,
                square = True,
                fmt = '.2f',
                annot_kws={'size':15},
                yticklabels = show_cols,
                xticklabels = show_cols)
plt.tight_layout()
plt.show()

x = picher_df[['FIP','WAR','볼넷/9','삼진/9','연봉(2017)']]
y = picher_df['y']

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=19)

# train the model
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
model = lr.fit(x_train, y_train)

# r2
print(model.score(x_train, y_train))
print(model.score(x_test, y_test))

# 0.9150591192570362
# 0.9038759653889865

# RMSE evaluation (root mean squared error)
# MSE: mean squared error
from math import sqrt
from sklearn.metrics import mean_squared_error
y_pred = lr.predict(x_train)
print(sqrt(mean_squared_error(y_train, y_pred)))
y_pred = lr.predict(x_test)
print(sqrt(mean_squared_error(y_test, y_pred)))

# 7893.462873347693
# 13141.86606359108
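
For reference, RMSE is just the square root of the mean of the squared errors. The sketch below computes it by hand on five rows taken from the result table later in this post (predicted values rounded to one decimal). Those rows mix training and test samples, so the number is only an illustration and is not comparable to the two scores above.

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Actual and predicted 2018 salaries of the five highest-paid pitchers
actual = [230000, 140000, 120000, 120000, 111000]
predicted = [163930.1, 120122.8, 88127.0, 108489.5, 102253.7]

top5_rmse = rmse(actual, predicted)
print(top5_rmse)
```

Because the errors are squared before averaging, the single large miss on the top salary dominates the result.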

# VIF (variance inflation factor) per feature
from statsmodels.stats.outliers_influence import variance_inflation_factor
x = picher_df[['FIP','WAR','볼넷/9','삼진/9','연봉(2017)']]
vif = pd.DataFrame()
vif['VIF Factor'] = [variance_inflation_factor(x.values, i) for i in range(x.shape[1])]
vif['features'] = x.columns
vif.round(1)


	VIF Factor	features
0	1.9	FIP
1	2.1	WAR
2	1.9	볼넷/9
3	1.1	삼진/9
4	1.9	연봉(2017)

# visual comparison
# predict on the full data set
# lr is the already-fitted model object
x = picher_df[['FIP','WAR','볼넷/9','삼진/9','연봉(2017)']]
predict_2018_salary = lr.predict(x)
predict_2018_salary[:5]
picher_df['예측연봉(2018)'] = pd.Series(predict_2018_salary)

picher = pd.read_csv(picher_file_path)
picher = picher[['선수명','연봉(2017)']]

# sort by 2018 salary, descending
result_df = picher_df.sort_values(by=['y'], ascending = False)
# drop 연봉(2017): it holds standardized values, not the real salaries
result_df.drop(['연봉(2017)'], axis=1, inplace=True, errors='ignore')
# merge the real 연봉(2017) values back in
result_df = result_df.merge(picher, on=['선수명'], how='left')
result_df = result_df[['선수명', 'y','예측연봉(2018)','연봉(2017)']]
result_df.columns = ['선수명','실제연봉(2018)','예측연봉(2018)','작년연봉(2017)']

result_df
	선수명	실제연봉(2018)	예측연봉(2018)	작년연봉(2017)
0	양현종	230000	163930.148696	150000
1	켈리	140000	120122.822204	85000
2	소사	120000	88127.019455	50000
3	정우람	120000	108489.464585	120000
4	레일리	111000	102253.697589	85000
...	...	...	...	...
147	장지훈	2800	249.850641	2700
148	차재용	2800	900.811527	2800
149	성영훈	2700	5003.619609	2700
150	정동윤	2700	2686.350884	2700
151	장민익	2700	3543.781665	2700
152 rows × 4 columns

result_df = result_df.iloc[:10,:]
plt.rc('font', family = 'Malgun Gothic')
result_df.plot(x='선수명', y=['작년연봉(2017)','예측연봉(2018)','실제연봉(2018)'], kind='bar')

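# meant to keep only players whose 2018 salary differs from 2017; note that the line below
# compares against '예측연봉(2018)' (the prediction), so rows such as 정우람 (unchanged salary) are kept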
result_df = result_df[result_df['작년연봉(2017)'] != result_df['예측연봉(2018)']]
result_df.head()
	선수명	실제연봉(2018)	예측연봉(2018)	작년연봉(2017)
0	양현종	230000	163930.148696	150000
1	켈리	140000	120122.822204	85000
2	소사	120000	88127.019455	50000
3	정우람	120000	108489.464585	120000
4	레일리	111000	102253.697589	85000

result_df = result_df.reset_index()
result_df.head()

	index	선수명	실제연봉(2018)	예측연봉(2018)	작년연봉(2017)
0	0	양현종	230000	163930.148696	150000
1	1	켈리	140000	120122.822204	85000
2	2	소사	120000	88127.019455	50000
3	3	정우람	120000	108489.464585	120000
4	4	레일리	111000	102253.697589	85000

result_df = result_df.iloc[:10, :]
result_df.plot(x='선수명', y=['작년연봉(2017)','예측연봉(2018)','실제연봉(2018)'],kind = 'bar')

26. Seoul Middle School Graduates Analysis || dbscan, folium

2021. 11. 24. 14:29

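# DBSCAN (density-based clustering) => splits the data into clusters based on how densely the points are packed in space
Points that fall in no dense region are treated as noise.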
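
A toy sketch of that behavior, on synthetic points rather than the school data: two tight groups each become a cluster, and the isolated point gets the noise label -1. The `eps` and `min_samples` values simply mirror the ones used later in this post.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic points: two tight groups of five plus one far-away point
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1], [5.05, 5.05],
    [10.0, 0.0],
])

# eps: neighborhood radius; min_samples: points (including the point itself)
# needed inside that radius for a point to count as a core point
labels = DBSCAN(eps=0.2, min_samples=5).fit(points).labels_
print(labels)  # [ 0  0  0  0  0  1  1  1  1  1 -1]
```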

2016_middle_shcool_graduates_report.xlsx

import pandas as pd
import numpy as np
import folium
file_path = '2016_middle_shcool_graduates_report.xlsx'
df = pd.read_excel(file_path, engine='openpyxl', header = 0,)
pd.set_option('display.width', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_colwidth', 20)

df.columns.values

array(['지역', '학교명', '코드', '유형', '주야', '남학생수', '여학생수', '일반고', '특성화고',
       '과학고', '외고_국제고', '예고_체고', '마이스터고', '자사고', '자공고', '기타진학', '취업',
       '미상', '위도', '경도'], dtype=object)

df.head()

	지역	학교명	코드	유형	주야	...	기타진학	취업	미상	위도	경도
0	성북구	서울대학교사범대학부설중학교	3	국립	주간	...	0.004	0	0.000	37.594942	127.038909
1	종로구	서울대학교사범대학부설여자중학교	3	국립	주간	...	0.031	0	0.000	37.577473	127.003857
2	강남구	개원중학교	3	공립	주간	...	0.009	0	0.003	37.491637	127.071744
3	강남구	개포중학교	3	공립	주간	...	0.019	0	0.000	37.480439	127.062201
4	서초구	경원중학교	3	공립	주간	...	0.010	0	0.000	37.510750	127.008900

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 415 entries, 0 to 414
Data columns (total 20 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   지역      415 non-null    object 
 1   학교명     415 non-null    object 
 2   코드      415 non-null    int64  
 3   유형      415 non-null    object 
 4   주야      415 non-null    object 
 5   남학생수    415 non-null    int64  
 6   여학생수    415 non-null    int64  
 7   일반고     415 non-null    float64
 8   특성화고    415 non-null    float64
 9   과학고     415 non-null    float64
 10  외고_국제고  415 non-null    float64
 11  예고_체고   415 non-null    float64
 12  마이스터고   415 non-null    float64
 13  자사고     415 non-null    float64
 14  자공고     415 non-null    float64
 15  기타진학    415 non-null    float64
 16  취업      415 non-null    int64  
 17  미상      415 non-null    float64
 18  위도      415 non-null    float64
 19  경도      415 non-null    float64
dtypes: float64(12), int64(4), object(4)
memory usage: 65.0+ KB

# show the middle schools on a map
import folium
import json
mschool_map = folium.Map(location=[37.55, 126.98], zoom_start=12)
for name, lat, lng in zip(df.학교명, df.위도, df.경도):
    folium.CircleMarker([lat, lng],
                       radius = 5,
                       color = 'brown',
                       fill = True,
                       fill_color = 'coral',
                       fill_opacity = 0.7,
                       popup = name,
                       tooltip=name).add_to(mschool_map)
mschool_map.save('./seoul_mschool_loca.html')

seoul_mschool_loca.html

# preprocessing: label-encode the 지역, 코드, 유형, and 주야 columns
df['코드'].unique()

# array([3, 5, 9], dtype=int64)

from sklearn import preprocessing as pp
label_encoder = pp.LabelEncoder()
# string => numeric; the size of the number carries no meaning, it only marks the category
label_location = label_encoder.fit_transform(df['지역']) 
label_code = label_encoder.fit_transform(df['코드']) 
label_type = label_encoder.fit_transform(df['유형']) 
label_day = label_encoder.fit_transform(df['주야'])
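
Since the integer codes carry no order, it can help to see label encoding next to one-hot encoding. A minimal sketch on a made-up five-row '유형' column (not read from the Excel file):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the '유형' (school type) column
school_type = pd.Series(['국립', '국립', '공립', '공립', '사립'], name='유형')

# Label encoding: one integer per category, assigned in sorted order
encoder = LabelEncoder()
codes = encoder.fit_transform(school_type)
print(list(encoder.classes_), codes.tolist())

# One-hot encoding: one indicator column per category, no implied order
one_hot = pd.get_dummies(school_type)
print(one_hot.columns.tolist())
```

Distance-based methods such as DBSCAN treat the gap between codes 0 and 2 as twice the gap between 0 and 1, so one-hot columns are often the safer choice when an encoded column is used as a clustering feature.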

# onehot_encoder = pp.OneHotEncoder()
df['location'] = label_location
df['type'] = label_type
df['code'] = label_code
df['day'] = label_day
df.head()
	지역	학교명	코드	유형	주야	...	경도	location	type	code	day
0	성북구	서울대학교사범대학부설중학교	3	국립	주간	...	127.038909	16	1	0	0
1	종로구	서울대학교사범대학부설여자중학교	3	국립	주간	...	127.003857	22	1	0	0
2	강남구	개원중학교	3	공립	주간	...	127.071744	0	0	0	0
3	강남구	개포중학교	3	공립	주간	...	127.062201	0	0	0	0
4	서초구	경원중학교	3	공립	주간	...	127.008900	14	0	0	0
5 rows × 24 columns

label_location
array([16, 22,  0,  0, 14,  0,  0,  0,  0,  0,  0,  0,  0, 14, 14, 14,  0,
       14, 14, 14, 14, 14,  0,  0,  0, 14, 14,  0, 14,  0,  0,  0, 14, 14,
        0, 14,  0,  0,  0,  0, 17, 17,  1, 17,  1,  1,  1,  1,  1, 17, 17,
       17, 17,  1, 17, 17,  1, 17,  1,  1, 17, 17,  1,  1, 17, 17, 17, 17,
       17, 17, 17, 17, 17, 17,  1,  1, 17, 17,  1,  1, 18,  3,  3,  3, 18,
        3,  3,  3,  3,  3,  3, 18, 18,  3,  3,  3, 18,  3,  3,  3, 18, 18,
       18, 18, 18,  3, 18, 18, 18, 18, 18, 18,  3, 18, 18,  3,  3,  7,  6,
        6,  6,  6,  6,  7, 19, 19, 19, 19,  7, 19,  7,  7,  7,  7,  6,  7,
       19, 19, 19, 19,  6,  6, 19,  6,  6,  6,  6, 19,  7, 10, 10, 10, 10,
       10, 24, 24, 24, 24, 10, 24, 10, 24, 24, 24, 24, 24, 10, 10, 10, 10,
       10, 24, 24, 10, 24, 24, 10, 10, 11, 11,  4,  4, 11,  4,  4,  4, 11,
        4, 11, 11, 11, 11,  4,  4,  4, 11, 11, 11,  4, 11,  4,  4,  4,  4,
       11,  4, 11, 11,  8,  8,  9,  8,  8,  8,  9,  9,  9,  9,  8,  8,  8,
        8,  8,  8,  9,  8,  9,  9,  8,  8,  8,  8,  8,  8,  9,  8,  8,  8,
        9,  9,  9,  8,  8,  8,  8, 12, 12, 21, 21, 21, 12, 13, 13, 21, 21,
       13, 12, 21, 21, 12, 12, 12, 12, 21, 12, 13, 12, 13, 21, 21, 21, 13,
       21, 21, 21, 13, 13, 13, 12, 13, 21, 21, 13, 13, 12,  5, 15,  5,  5,
        5,  5, 15,  5,  5, 15,  5, 15, 15, 15,  5, 15,  5,  5, 15, 15,  2,
       16, 16, 16, 16,  2, 16, 16,  2, 16, 16,  2,  2,  2,  2,  2, 16, 16,
        2, 16, 16,  2, 16, 16,  2, 22, 23, 23, 22, 22, 23, 22, 20, 22, 20,
       22, 20, 20, 11, 20, 20, 20, 20, 23, 23, 22, 23, 22, 20, 23, 23, 17,
        2,  8,  4, 15, 15, 16,  5,  3,  9, 12,  3, 21, 18,  2, 13, 17,  1,
        1, 21, 12,  6, 13, 16,  3, 16,  0, 17, 22, 12, 22,  3, 14,  0,  4,
        5,  8, 11,  2,  9,  4,  8,  2,  6,  6, 13,  0,  1,  1, 17,  2, 21,
       22, 16,  0,  7,  5, 23,  8])

from sklearn import cluster
# select the features to use in the analysis (과학고, 외고_국제고, 자사고)
columns_list = [9,10,13]
x = df.iloc[:,columns_list]
x = pp.StandardScaler().fit(x).transform(x)
print(x[:5])

[[ 2.02375287 -0.57972902  1.84751715]
 [-0.65047921  1.84782097 -0.48039958]
 [ 0.68663683 -0.14623795  0.11423133]
 [ 1.28091062 -0.05953974 -0.20206171]
 [ 0.38949993 -0.31963438  2.54336183]]

# DBSCAN model
# eps: neighborhood radius; min_samples: a neighborhood needs at least 5 points to count as a cluster
dbm = cluster.DBSCAN(eps=0.2, min_samples = 5)
# fit the data
dbm.fit(x)
cluster_label = dbm.labels_
print(cluster_label)

[-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  0 -1 -1 -1 -1 -1 -1  2 -1  0 -1
 -1 -1 -1 -1  0 -1 -1 -1 -1 -1  0  3 -1 -1 -1 -1 -1 -1 -1  0 -1 -1  1  0
 -1 -1 -1  0 -1 -1 -1 -1  0 -1  0  0 -1 -1  0 -1 -1 -1  0  0 -1 -1  0 -1
 -1 -1  0 -1 -1 -1  0  2  0  0  0  0  0 -1 -1 -1  0 -1  0 -1 -1  0 -1  0
 -1  0  0 -1 -1 -1 -1  1  0 -1  0  0 -1 -1 -1  0 -1 -1 -1 -1 -1  0  1 -1
 -1  0  2  0 -1 -1  1 -1 -1 -1  0  0  0 -1 -1  0 -1 -1 -1  0  0 -1 -1 -1
 -1  0 -1 -1 -1  0 -1 -1 -1  0 -1  0  0 -1 -1 -1 -1 -1  0 -1  0  0 -1 -1
 -1 -1 -1  0 -1 -1 -1  1  0  3  1 -1  0  0 -1  0 -1 -1  0  0  2 -1 -1  3
  0  0 -1 -1 -1 -1  0 -1  0  0 -1  0  0  0 -1 -1  0 -1 -1 -1 -1 -1  2  0
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  0 -1 -1 -1  0 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1  0 -1 -1 -1  0  0 -1 -1  0 -1  3  0  2 -1 -1
 -1 -1  0 -1 -1 -1  0 -1  0  0 -1 -1 -1 -1 -1  1 -1  0  1 -1  0  0  1 -1
  2 -1  0 -1 -1 -1 -1  0 -1 -1  1  0 -1  0 -1 -1  0  3  0 -1 -1 -1  2 -1
 -1 -1 -1  0  0  0  1 -1 -1 -1 -1 -1 -1 -1 -1  0 -1  0 -1  0 -1 -1  0  0
 -1 -1 -1  0 -1  0 -1 -1  0 -1 -1 -1  0  1 -1 -1 -1  0  1  1  1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  0 -1 -1 -1  0 -1  0
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  0]

df['Cluster'] = cluster_label

df.head()

	지역	학교명	코드	유형	주야	...	location	type	code	day	Cluster
0	성북구	서울대학교사범대학부설중학교	3	국립	주간	...	16	1	0	0	-1
1	종로구	서울대학교사범대학부설여자중학교	3	국립	주간	...	22	1	0	0	-1
2	강남구	개원중학교	3	공립	주간	...	0	0	0	0	-1
3	강남구	개포중학교	3	공립	주간	...	0	0	0	0	-1
4	서초구	경원중학교	3	공립	주간	...	14	0	0	0	-1

# groups by cluster
# -1 is the noise group: points that belong to no cluster
grouped = df.groupby('Cluster')
grouped.sum()

	코드	남학생수	여학생수	일반고	특성화고	...	경도	location	type	code	day
Cluster											
-1	765	38505	30866	170.996	35.234	...	32395.479457	2877	142	0	0
0	312	10314	13927	69.275	21.253	...	12956.408362	1124	64	2	0
1	211	1790	1891	9.968	3.613	...	5715.272378	489	53	34	0
2	24	1174	1069	5.268	1.157	...	1016.535379	60	2	0	0
3	15	728	459	3.071	0.862	...	634.912972	49	0	0	0
5 rows × 20 columns

for k, g in grouped:
    print("* key :", k)
    print("* g :", len(g))
    print(g.iloc[:,[0,1,3,9,10,13]].head())
    print('\n')
    
* key : -1
* g : 255
    지역               학교명  유형    과학고  외고_국제고    자사고
0  성북구    서울대학교사범대학부설중학교  국립  0.018   0.007  0.227
1  종로구  서울대학교사범대학부설여자중학교  국립  0.000   0.035  0.043
2  강남구             개원중학교  공립  0.009   0.012  0.090
3  강남구             개포중학교  공립  0.013   0.013  0.065
4  서초구             경원중학교  공립  0.007   0.010  0.282


* key : 0
* g : 102
     지역      학교명  유형  과학고  외고_국제고    자사고
13  서초구  동덕여자중학교  사립  0.0   0.022  0.038
22  강남구    수서중학교  공립  0.0   0.019  0.044
28  서초구    언남중학교  공립  0.0   0.015  0.050
34  강남구    은성중학교  사립  0.0   0.016  0.065
43  송파구    거원중학교  공립  0.0   0.021  0.054


* key : 1
* g : 45
       지역      학교명  유형  과학고  외고_국제고    자사고
46    강동구    동신중학교  사립  0.0     0.0  0.044
103   양천구    신원중학교  공립  0.0     0.0  0.006
118   구로구    개봉중학교  공립  0.0     0.0  0.012
126  영등포구    대림중학교  공립  0.0     0.0  0.050
175   중랑구  혜원여자중학교  사립  0.0     0.0  0.004


* key : 2
* g : 8
      지역    학교명  유형    과학고  외고_국제고    자사고
20   서초구  서초중학교  공립  0.003   0.013  0.085
79   강동구  한영중학교  사립  0.004   0.011  0.077
122  구로구  구일중학교  공립  0.004   0.012  0.079
188  동작구  대방중학교  공립  0.003   0.015  0.076
214  도봉구  도봉중학교  공립  0.004   0.011  0.072


* key : 3
* g : 5
       지역    학교명  유형  과학고  외고_국제고    자사고
35    서초구  이수중학교  공립  0.0   0.004  0.100
177  동대문구  휘경중학교  공립  0.0   0.004  0.094
191   동작구  문창중학교  공립  0.0   0.004  0.084
259   마포구  성사중학교  공립  0.0   0.004  0.078
305   강북구  강북중학교  공립  0.0   0.004  0.088
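DBSCAN labels every point that falls outside a dense region as -1, so it helps to separate noise from the real clusters before reading the group sizes. A minimal sketch, assuming only the group sizes printed above; the `labels` list is rebuilt from those counts for illustration and is not the original `cluster_label` array:

```python
from collections import Counter

# Group sizes taken from the printed output above (key -> len(g)).
# The label list is reconstructed from these counts for illustration only.
group_sizes = {-1: 255, 0: 102, 1: 45, 2: 8, 3: 5}
labels = [k for k, n in group_sizes.items() for _ in range(n)]

counts = Counter(labels)
n_noise = counts[-1]                                  # DBSCAN uses -1 for noise
n_clusters = len([k for k in counts if k != -1])      # real clusters only
noise_ratio = n_noise / len(labels)

print("clusters:", n_clusters)
print("noise points:", n_noise)
print("noise ratio:", round(noise_ratio, 3))
```

With more than half of the schools labeled as noise, it may be worth trying a larger `eps` or a smaller `min_samples`.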

# Marker color for each cluster label
colors = {-1 : 'gray', 0 : 'coral', 1 : 'blue', 2 : 'green', 3  : 'red',
          4 : 'purple', 5 : 'orange', 6 : 'brown', 7 : 'firebrick', 
         8 : 'yellow', 9 : 'magenta', 10 : 'cyan', 11 : 'tan' }
cluster_map = folium.Map(location=[37.55, 126.98], zoom_start=12)
for name, lat, lng, clus in zip(df.학교명, df.위도, df.경도, df.Cluster) :
    folium.CircleMarker([lat, lng],
                       radius = 5,
                       color=colors[clus],
                       fill=True,
                       fill_color=colors[clus],
                       fill_opacity = 0.7,
                        popup = name,
                        tooltip=name).add_to(cluster_map)
cluster_map.save('seoul_school_cluster.html')

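# Explanatory variables
# Science high school, foreign-language/international high school, autonomous private high school + column 22 (printed below as 'code')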
col_list2 = [9,10,13,22]
x2 = df.iloc[:, col_list2]
print(x2[:5])
x2 = pp.StandardScaler().fit(x2).transform(x2)
dbm2 = cluster.DBSCAN(eps=0.2, min_samples=5)
dbm2.fit(x2)
df['Cluster2'] = dbm2.labels_
grouped2_cols = [0,1,3] + col_list2
grouped2_cols

     과학고  외고_국제고    자사고  code
0  0.018   0.007  0.227     0
1  0.000   0.035  0.043     0
2  0.009   0.012  0.090     0
3  0.013   0.013  0.065     0
4  0.007   0.010  0.282     0

[0, 1, 3, 9, 10, 13, 22]

df['Cluster2'].value_counts()

-1    260
 0    101
 4     26
 1     15
 2      8
 3      5
Name: Cluster2, dtype: int64

grouped2 = df.groupby('Cluster2')
for k, g in grouped2:
    print("* key :", k)
    print("* g :", len(g))
    print(g.iloc[:,[0,1,3,9,10,13]].head())
    print('\n')
    
* key : -1
* g : 260
    지역               학교명  유형    과학고  외고_국제고    자사고
0  성북구    서울대학교사범대학부설중학교  국립  0.018   0.007  0.227
1  종로구  서울대학교사범대학부설여자중학교  국립  0.000   0.035  0.043
2  강남구             개원중학교  공립  0.009   0.012  0.090
3  강남구             개포중학교  공립  0.013   0.013  0.065
4  서초구             경원중학교  공립  0.007   0.010  0.282


* key : 0
* g : 101
     지역      학교명  유형  과학고  외고_국제고    자사고
13  서초구  동덕여자중학교  사립  0.0   0.022  0.038
22  강남구    수서중학교  공립  0.0   0.019  0.044
28  서초구    언남중학교  공립  0.0   0.015  0.050
34  강남구    은성중학교  사립  0.0   0.016  0.065
43  송파구    거원중학교  공립  0.0   0.021  0.054


* key : 1
* g : 15
       지역      학교명  유형  과학고  외고_국제고    자사고
46    강동구    동신중학교  사립  0.0     0.0  0.044
103   양천구    신원중학교  공립  0.0     0.0  0.006
118   구로구    개봉중학교  공립  0.0     0.0  0.012
126  영등포구    대림중학교  공립  0.0     0.0  0.050
175   중랑구  혜원여자중학교  사립  0.0     0.0  0.004


* key : 2
* g : 8
      지역    학교명  유형    과학고  외고_국제고    자사고
20   서초구  서초중학교  공립  0.003   0.013  0.085
79   강동구  한영중학교  사립  0.004   0.011  0.077
122  구로구  구일중학교  공립  0.004   0.012  0.079
188  동작구  대방중학교  공립  0.003   0.015  0.076
214  도봉구  도봉중학교  공립  0.004   0.011  0.072


* key : 3
* g : 5
       지역    학교명  유형  과학고  외고_국제고    자사고
35    서초구  이수중학교  공립  0.0   0.004  0.100
177  동대문구  휘경중학교  공립  0.0   0.004  0.094
191   동작구  문창중학교  공립  0.0   0.004  0.084
259   마포구  성사중학교  공립  0.0   0.004  0.078
305   강북구  강북중학교  공립  0.0   0.004  0.088


* key : 4
* g : 26
      지역     학교명  유형  과학고  외고_국제고  자사고
384  종로구   서울농학교  국립  0.0     0.0  0.0
385  마포구  한국우진학교  국립  0.0     0.0  0.0
386  종로구   서울맹학교  국립  0.0     0.0  0.0
387  강서구    교남학교  사립  0.0     0.0  0.0
388  서초구   다니엘학교  사립  0.0     0.0  0.0

cluster2_map = folium.Map(location = [37.55, 126.98], zoom_start=12)

for name, lat, lng, clus in zip(df.학교명, df.위도, df.경도, df.Cluster2) :
    folium.CircleMarker([lat, lng],
                       radius=5,
                       color=colors[clus],
                       fill=True,
                       fill_color=colors[clus],
                       fill_opacity=0.7,
                       popup=name,
                       tooltip=name).add_to(cluster2_map)
cluster2_map.save('./seoul_mschool_cluster2.html')


25. Sales Data Analysis || kmeans

2021. 11. 24. 14:12

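# Unsupervised learning: the dataset contains no ground-truth labels
Observations are split into several groups. With no labels available, grouping is decided by the similarity of the data => clustering
Clustering with k-means: the distance to each cluster's centroid is the measure of similarity between data points.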
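The distance-to-centroid idea can be sketched in plain Python: assign each point to its nearest centroid, then move each centroid to the mean of its members, and repeat. The toy points and the hand-picked starting centroids are assumptions for illustration; scikit-learn's `KMeans` adds k-means++ initialization and several restarts on top of this loop.

```python
import math

def assign(points, centroids):
    """Return, for each point, the index of the nearest centroid."""
    return [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points]

def update(points, labels, k):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        # Assumes every cluster keeps at least one member (true for this toy data).
        new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centroids

# Toy data (hypothetical): two obvious groups in 2-D.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids = [(0.0, 0.0), (10.0, 10.0)]   # hand-picked starting centroids

for _ in range(5):                        # a few assign/update rounds
    labels = assign(points, centroids)
    centroids = update(points, labels, k=2)

print(labels)
print(centroids)
```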

import pandas as pd
import matplotlib.pyplot as plt
# Annual purchase amounts per customer, broken down by product category
uci_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/\
00292/Wholesale%20customers%20data.csv'
df = pd.read_csv(uci_path, header = 0)
df.head()

	Channel	Region	Fresh	Milk	Grocery	Frozen	Detergents_Paper	Delicassen
0	2	3	12669	9656	7561	214	2674	1338
1	2	3	7057	9810	9568	1762	3293	1776
2	2	3	6353	8808	7684	2405	3516	7844
3	1	3	13265	1196	4221	6404	507	1788
4	2	3	22615	5410	7198	3915	1777	5185

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 440 entries, 0 to 439
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   Channel           440 non-null    int64
 1   Region            440 non-null    int64
 2   Fresh             440 non-null    int64
 3   Milk              440 non-null    int64
 4   Grocery           440 non-null    int64
 5   Frozen            440 non-null    int64
 6   Detergents_Paper  440 non-null    int64
 7   Delicassen        440 non-null    int64
dtypes: int64(8)
memory usage: 27.6 KB

x = df.iloc[:,:]
# Standardization (zero mean, unit variance per column)
from sklearn import preprocessing
x = preprocessing.StandardScaler().fit(x).transform(x)
x[:5]

array([[ 1.44865163,  0.59066829,  0.05293319,  0.52356777, -0.04111489,
        -0.58936716, -0.04356873, -0.06633906],
       [ 1.44865163,  0.59066829, -0.39130197,  0.54445767,  0.17031835,
        -0.27013618,  0.08640684,  0.08915105],
       [ 1.44865163,  0.59066829, -0.44702926,  0.40853771, -0.0281571 ,
        -0.13753572,  0.13323164,  2.24329255],
       [-0.69029709,  0.59066829,  0.10011141, -0.62401993, -0.3929769 ,
         0.6871443 , -0.49858822,  0.09341105],
       [ 1.44865163,  0.59066829,  0.84023948, -0.05239645, -0.07935618,
         0.17385884, -0.23191782,  1.29934689]])

from sklearn import cluster
kms = cluster.KMeans(init = 'k-means++', n_clusters=5, n_init=10)
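# init='k-means++': pick well-spread initial centroids automatically instead of setting them by hand
# n_clusters=5: split the data into 5 clusters
# n_init=10: run the algorithm 10 times with different initial centroids and keep the best result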
kms.fit(x)
cluster_label= kms.labels_
print(cluster_label)

[3 3 3 1 3 3 3 3 1 3 3 3 3 3 3 1 3 1 3 1 3 1 1 2 3 3 1 1 3 1 1 1 1 1 1 3 1
 3 3 1 1 1 3 3 3 3 3 4 3 3 1 1 3 3 1 1 4 3 1 1 3 4 3 3 1 4 1 3 1 1 1 2 1 3
 3 1 1 3 1 1 1 3 3 1 3 4 4 2 1 1 1 1 4 1 3 1 3 1 1 1 3 3 3 1 1 1 3 3 3 3 1
 3 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1
 1 1 1 1 1 1 1 3 3 1 3 3 3 1 1 3 3 3 3 1 1 1 3 3 1 3 1 3 1 1 1 1 1 2 1 2 1
 1 1 1 3 3 1 1 1 3 1 1 0 3 0 0 3 3 0 0 0 3 0 0 0 3 0 4 0 0 3 0 3 0 3 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 4 0 0 0 0 0 0 0
 0 0 0 0 0 3 0 3 0 3 0 0 0 0 1 1 1 1 1 1 3 1 3 1 1 1 1 1 1 1 1 1 1 1 3 0 3
 0 3 3 0 3 3 3 3 3 3 3 0 0 3 0 0 3 0 0 3 0 0 0 3 0 0 0 0 0 2 0 0 0 0 0 3 0
 4 0 3 0 0 0 0 3 3 1 3 1 1 3 3 1 3 1 3 1 3 1 1 1 3 1 1 1 1 1 1 1 3 1 1 1 1
 3 1 1 3 1 1 3 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1
 3 3 1 1 1 1 1 1 3 3 1 3 1 1 3 1 3 3 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1]

df['Cluster'] = cluster_label
df.head()

	Channel	Region	Fresh	Milk	Grocery	Frozen	Detergents_Paper	Delicassen	Cluster
0	2	3	12669	9656	7561	214	2674	1338	3
1	2	3	7057	9810	9568	1762	3293	1776	3
2	2	3	6353	8808	7684	2405	3516	7844	3
3	1	3	13265	1196	4221	6404	507	1788	1
4	2	3	22615	5410	7198	3915	1777	5185	3

df.plot(kind = 'scatter', x ='Grocery', y = 'Frozen', c = 'Cluster', cmap = 'Set1', colorbar=False, figsize=(10,10))
df.plot(kind = 'scatter', x ='Milk', y = 'Delicassen', c = 'Cluster', cmap = 'Set1', colorbar=True, figsize=(10,10))
plt.show()

24. Wisconsin Breast Cancer Data Analysis || DT

2021. 11. 24. 14:08

# Decision Tree
# Each node is a branching point: an explanatory variable being tested

from sklearn import tree
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import pandas as pd
import numpy as np

uci_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data'
df = pd.read_csv(uci_path, header=None)
df.head()


0	1	2	3	4	5	6	7	8	9	10
0	1000025	5	1	1	1	2	1	3	1	1	2
1	1002945	5	4	4	5	7	10	3	2	1	2
2	1015425	3	1	1	1	2	2	3	1	1	2
3	1016277	6	8	8	1	3	4	3	7	1	2
4	1017023	4	1	1	3	2	1	3	1	1	2

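id : sample ID number
clump : clump thickness
cell_size : uniformity of cell size
cell_shape : uniformity of cell shape
adhesion : marginal adhesion
epithlial : single epithelial cell size
bare_nuclei : bare nuclei
chromatin : bland chromatin
normal_nucleoli : normal nucleoli
mitoses : mitoses
class : 2 = benign, 4 = malignant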

df.columns = ['id','clump', 'cell_size', 'cell_shape', 'adhesion', 'epithlial', \
              'bare_nuclei','chromatin', 'normal_nucleoli', 'mitoses', 'class']
df.head()

	id	clump	cell_size	cell_shape	adhesion	epithlial	bare_nuclei	chromatin	normal_nucleoli	mitoses	class
0	1000025	5	1	1	1	2	1	3	1	1	2
1	1002945	5	4	4	5	7	10	3	2	1	2
2	1015425	3	1	1	1	2	2	3	1	1	2
3	1016277	6	8	8	1	3	4	3	7	1	2
4	1017023	4	1	1	3	2	1	3	1	1	2

df['class'].value_counts()
2    458
4    241
Name: class, dtype: int64

df['bare_nuclei'].unique()

array(['1', '10', '2', '4', '3', '9', '7', '?', '5', '8', '6'],
      dtype=object)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   id               699 non-null    int64 
 1   clump            699 non-null    int64 
 2   cell_size        699 non-null    int64 
 3   cell_shape       699 non-null    int64 
 4   adhesion         699 non-null    int64 
 5   epithlial        699 non-null    int64 
 6   bare_nuclei      699 non-null    object
 7   chromatin        699 non-null    int64 
 8   normal_nucleoli  699 non-null    int64 
 9   mitoses          699 non-null    int64 
 10  class            699 non-null    int64 
dtypes: int64(10), object(1)
memory usage: 60.2+ KB

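# Mark the '?' placeholders in bare_nuclei as missing values
df.loc[df['bare_nuclei'] == '?', 'bare_nuclei'] = np.nan

# Equivalent alternative; assigning the result avoids a chained in-place modification
df['bare_nuclei'] = df['bare_nuclei'].replace('?', np.nan)
df.dropna(subset=['bare_nuclei'], axis=0, inplace=True)   # drop rows with a missing value
df['bare_nuclei'] = df['bare_nuclei'].astype(int)
df.info()
# int64 takes 8 bytes and int32 takes 4 bytes; the values are 1-10, so either works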

<class 'pandas.core.frame.DataFrame'>
Int64Index: 683 entries, 0 to 698
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   id               683 non-null    int64
 1   clump            683 non-null    int64
 2   cell_size        683 non-null    int64
 3   cell_shape       683 non-null    int64
 4   adhesion         683 non-null    int64
 5   epithlial        683 non-null    int64
 6   bare_nuclei      683 non-null    int32
 7   chromatin        683 non-null    int64
 8   normal_nucleoli  683 non-null    int64
 9   mitoses          683 non-null    int64
 10  class            683 non-null    int64
dtypes: int32(1), int64(10)
memory usage: 61.4 KB

x = df.iloc[:,1:-1]
y = df.iloc[:,-1]
y

0      2
1      2
2      2
3      2
4      2
      ..
694    2
695    2
696    4
697    4
698    4
Name: class, Length: 683, dtype: int64

# Standardization
x = preprocessing.StandardScaler().fit(x).transform(x)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size =0.3, random_state=10)
print(x_train.shape)

[[ 1.97177486  0.6037398   0.59763519 ...  1.4522248   2.00965299
   0.22916583]
 [ 1.26222679  2.23617957  2.2718962  ...  2.67776377  2.33747554
  -0.34839971]
 [ 0.55267873 -0.70221201 -0.74177362 ... -0.18182716 -0.61292736
  -0.34839971]
 ...
 [ 0.19790469 -0.0492361  -0.74177362 ... -0.99885314 -0.61292736
  -0.34839971]
 [-0.51164337 -0.70221201 -0.74177362 ... -0.18182716 -0.61292736
  -0.34839971]
 [ 0.90745276 -0.37572406  0.26278299 ... -0.18182716  0.04271773
  -0.34839971]]

tm = tree.DecisionTreeClassifier(criterion = 'entropy', max_depth=5)
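# max_depth: maximum number of levels in the tree
# Impurity: how mixed the classes in a node are (not yet cleanly separated)
# 'entropy' is the name of the impurity measure used here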
tm.fit(x_train, y_train)
y_hat = tm.predict(x_test)
print(y_hat[:10])

[4 4 4 4 4 4 2 2 4 4]
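The `entropy` criterion scores how mixed the classes in a node are. A minimal sketch of that measure, assuming base-2 Shannon entropy over class counts; the first call uses the class counts printed by `df['class'].value_counts()` earlier, the other two are illustrative extremes:

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    # Classes with zero members contribute nothing to the sum.
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

print(entropy([458, 241]))  # class counts of the full dataset (before dropping rows)
print(entropy([50, 50]))    # perfectly mixed node: highest impurity for two classes
print(entropy([100, 0]))    # pure node: no impurity
```

A split is considered good when the weighted entropy of the child nodes is lower than the entropy of the parent.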

tmetrix = metrics.confusion_matrix(y_test, y_hat)
print(tmetrix)

# [[127   4]
#  [  2  72]]

tree_report = metrics.classification_report(y_test, y_hat)
print(tree_report)

              precision    recall  f1-score   support

           2       0.98      0.97      0.98       131
           4       0.95      0.97      0.96        74

    accuracy                           0.97       205
   macro avg       0.97      0.97      0.97       205
weighted avg       0.97      0.97      0.97       205

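Decision tree: the tree that gets built changes with the training data, so it is hard to generalize.
Performance varies widely depending on the data.
=> Random forest is an algorithm designed to make up for this weakness.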
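As a rough illustration of that remedy, the sketch below trains a single tree and a random forest on the same split. It assumes scikit-learn's bundled breast cancer dataset (the diagnostic version), not the UCI file loaded in this post, so the scores are not comparable to the report above; `n_estimators=100` and the `random_state` values are arbitrary choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled scikit-learn dataset, used only to keep the sketch self-contained.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10)

tree_model = DecisionTreeClassifier(criterion='entropy', max_depth=5, random_state=0)
# A forest averages many trees, each grown on a bootstrap sample of the training data.
forest_model = RandomForestClassifier(n_estimators=100, random_state=0)

tree_acc = tree_model.fit(X_train, y_train).score(X_test, y_test)
forest_acc = forest_model.fit(X_train, y_train).score(X_test, y_test)

print('single tree  :', round(tree_acc, 3))
print('random forest:', round(forest_acc, 3))
```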

23. Titanic Classification Prediction | KNN, SVM

2021. 11. 24. 14:04

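Categorical target

Explanatory variables => target variable
When the target variable is categorical, each observation is assigned to one of its values (classification)
Examples: disease diagnosis, spam mail filtering
KNN: k-nearest neighbors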

import pandas as pd
import seaborn as sns
df = sns.load_dataset('titanic')
pd.set_option('display.max_columns', 15)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.6+ KB

rdf = df.drop(['deck', 'embark_town'], axis = 1)
rdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   survived    891 non-null    int64   
 1   pclass      891 non-null    int64   
 2   sex         891 non-null    object  
 3   age         714 non-null    float64 
 4   sibsp       891 non-null    int64   
 5   parch       891 non-null    int64   
 6   fare        891 non-null    float64 
 7   embarked    889 non-null    object  
 8   class       891 non-null    category
 9   who         891 non-null    object  
 10  adult_male  891 non-null    bool    
 11  alive       891 non-null    object  
 12  alone       891 non-null    bool    
dtypes: bool(2), category(1), float64(2), int64(4), object(4)
memory usage: 72.4+ KB

rdf = rdf.dropna(subset=['age'], axis = 0)
rdf.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 714 entries, 0 to 890
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   survived    714 non-null    int64   
 1   pclass      714 non-null    int64   
 2   sex         714 non-null    object  
 3   age         714 non-null    float64 
 4   sibsp       714 non-null    int64   
 5   parch       714 non-null    int64   
 6   fare        714 non-null    float64 
 7   embarked    712 non-null    object  
 8   class       714 non-null    category
 9   who         714 non-null    object  
 10  adult_male  714 non-null    bool    
 11  alive       714 non-null    object  
 12  alone       714 non-null    bool    
dtypes: bool(2), category(1), float64(2), int64(4), object(4)
memory usage: 63.6+ KB

most_freq = rdf['embarked'].value_counts(dropna=True).idxmax()
# rdf.groupby('embarked')['embarked'].count().idxmax()
rdf['embarked'].fillna(most_freq, inplace = True)
rdf.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 714 entries, 0 to 890
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   survived    714 non-null    int64   
 1   pclass      714 non-null    int64   
 2   sex         714 non-null    object  
 3   age         714 non-null    float64 
 4   sibsp       714 non-null    int64   
 5   parch       714 non-null    int64   
 6   fare        714 non-null    float64 
 7   embarked    714 non-null    object  
 8   class       714 non-null    category
 9   who         714 non-null    object  
 10  adult_male  714 non-null    bool    
 11  alive       714 non-null    object  
 12  alone       714 non-null    bool    
dtypes: bool(2), category(1), float64(2), int64(4), object(4)
memory usage: 63.6+ KB

ndf = rdf[['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'embarked']]
ndf.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 714 entries, 0 to 890
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   survived  714 non-null    int64  
 1   pclass    714 non-null    int64  
 2   sex       714 non-null    object 
 3   age       714 non-null    float64
 4   sibsp     714 non-null    int64  
 5   parch     714 non-null    int64  
 6   embarked  714 non-null    object 
dtypes: float64(1), int64(4), object(2)
memory usage: 44.6+ KB

ndf.describe()

	survived	pclass	age	sibsp	parch
count	714.000000	714.000000	714.000000	714.000000	714.000000
mean	0.406162	2.236695	29.699118	0.512605	0.431373
std	0.491460	0.838250	14.526497	0.929783	0.853289
min	0.000000	1.000000	0.420000	0.000000	0.000000
25%	0.000000	1.000000	20.125000	0.000000	0.000000
50%	0.000000	2.000000	28.000000	0.000000	0.000000
75%	1.000000	3.000000	38.000000	1.000000	1.000000
max	1.000000	3.000000	80.000000	5.000000	6.000000

# One-hot encoding: convert categorical data into numeric columns the model can use

oh_set = pd.get_dummies(ndf['sex'])
oh_set.head()

	female	male
0	0	1
1	1	0
2	1	0
3	1	0
4	0	1

ndf = pd.concat([ndf, oh_set], axis = 1)
ndf.head()

	survived	pclass	sex	age	sibsp	parch	embarked	female	male
0	0	3	male	22.0	1	0	S	0	1
1	1	1	female	38.0	1	0	C	1	0
2	1	3	female	26.0	0	0	S	1	0
3	1	1	female	35.0	1	0	S	1	0
4	0	3	male	35.0	0	0	S	0	1

oh_embarked = pd.get_dummies(ndf['embarked'], prefix = 'town')
ndf = pd.concat([ndf, oh_embarked], axis = 1)
ndf

	survived	pclass	sex	age	sibsp	parch	embarked	female	male	town_C	town_Q	town_S
0	0	3	male	22.0	1	0	S	0	1	0	0	1
1	1	1	female	38.0	1	0	C	1	0	1	0	0
2	1	3	female	26.0	0	0	S	1	0	0	0	1
3	1	1	female	35.0	1	0	S	1	0	0	0	1
4	0	3	male	35.0	0	0	S	0	1	0	0	1
...	...	...	...	...	...	...	...	...	...	...	...	...
885	0	3	female	39.0	0	5	Q	1	0	0	1	0
886	0	2	male	27.0	0	0	S	0	1	0	0	1
887	1	1	female	19.0	0	0	S	1	0	0	0	1
889	1	1	male	26.0	0	0	C	0	1	1	0	0
890	0	3	male	32.0	0	0	Q	0	1	0	1	0
714 rows × 12 columns

x =ndf[['pclass','age','sibsp','parch','female','male','town_C','town_Q','town_S']]
y = ndf['survived']
x.head()

	pclass	age	sibsp	parch	female	male	town_C	town_Q	town_S
0	3	22.0	1	0	0	1	0	0	1
1	1	38.0	1	0	1	0	1	0	0
2	3	26.0	0	0	1	0	0	0	1
3	1	35.0	1	0	1	0	0	0	1
4	3	35.0	0	0	0	1	0	0	1

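# Standardize the explanatory variables
# The magnitude of the raw values affects the result of the analysis
# age spans a much wider range than the other columns, so standardization puts every feature on a common scale

from sklearn import preprocessing
import numpy as np
# Note: the result is only displayed, not assigned back to x, so the models below
# are trained on the unscaled features. Use x = preprocessing.StandardScaler().fit(x).transform(x)
# to apply the scaling; the outputs shown below would then differ.
preprocessing.StandardScaler().fit(x).transform(x)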


array([[ 0.91123237, -0.53037664,  0.52457013, ..., -0.47180795,
        -0.20203051,  0.53307848],
       [-1.47636364,  0.57183099,  0.52457013, ...,  2.11950647,
        -0.20203051, -1.87589641],
       [ 0.91123237, -0.25482473, -0.55170307, ..., -0.47180795,
        -0.20203051,  0.53307848],
       ...,
       [-1.47636364, -0.73704057, -0.55170307, ..., -0.47180795,
        -0.20203051,  0.53307848],
       [-1.47636364, -0.25482473, -0.55170307, ...,  2.11950647,
        -0.20203051, -1.87589641],
       [ 0.91123237,  0.15850313, -0.55170307, ..., -0.47180795,
         4.94974747, -1.87589641]])
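What the scaler does to each column can be sketched with the standard library: subtract the column mean and divide by the population standard deviation. The sketch uses only the five rows shown by `x.head()`, so the numbers differ from the array above, which is fitted on all 714 rows.

```python
import statistics

def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)   # population std, which StandardScaler also uses
    return [(v - mean) / std for v in values]

ages = [22.0, 38.0, 26.0, 35.0, 35.0]   # first five 'age' values from x.head()
pclass = [3, 1, 3, 1, 3]                # first five 'pclass' values

# After scaling, both columns are centered on 0 with unit spread,
# so neither dominates a distance calculation such as the one KNN relies on.
print([round(z, 3) for z in standardize(ages)])
print([round(z, 3) for z in standardize(pclass)])
```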

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state=10)
x_train.shape

# (499, 9)

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train, y_train)
y_hat = knn.predict(x_test)
print(y_hat[0:10])
print(y_test[0:10])

[0 0 1 0 0 1 0 1 0 0]
728    0
555    0
426    1
278    0
617    0
751    1
576    1
679    1
567    0
117    0
Name: survived, dtype: int64

# Performance evaluation
from sklearn import metrics
knn_matrix = metrics.confusion_matrix(y_test, y_hat)
print(knn_matrix)


# [[111  14]
#  [ 29  61]]

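Precision

Of the cases predicted True, the share that are actually True (TP)
High precision means few FP errors (predicted True, actually False)
Recall
Of the cases that are actually True, the share predicted True
High recall means few FN errors (predicted False, actually True)
F1 score
The harmonic mean of precision and recall
A summary metric of the model's predictive performance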
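These definitions can be checked by hand against the KNN confusion matrix printed above, treating class 1 (survived) as the positive class. In scikit-learn's layout the rows are actual classes and the columns are predicted classes:

```python
# KNN confusion matrix from above: rows = actual, columns = predicted.
# [[111  14]
#  [ 29  61]]
tn, fp = 111, 14
fn, tp = 29, 61

precision = tp / (tp + fp)            # of predicted positives, how many were right
recall = tp / (tp + fn)               # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.81 0.68 0.74
```

The three values match the row for class 1 in the classification report.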

knn_report = metrics.classification_report(y_test, y_hat)
print(knn_report)

              precision    recall  f1-score   support

           0       0.79      0.89      0.84       125
           1       0.81      0.68      0.74        90

    accuracy                           0.80       215
   macro avg       0.80      0.78      0.79       215
weighted avg       0.80      0.80      0.80       215

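accuracy : share of correct predictions; macro avg : unweighted mean over the classes
weighted avg : mean over the classes, weighted by the number of samples (support) in each class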
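The difference between the two averages can be sketched for the precision column, recomputing the per-class values from the KNN confusion matrix and using the support counts in the report:

```python
# Per-class precision and support, recomputed from the KNN confusion matrix above.
support = {0: 125, 1: 90}
precision = {0: 111 / (111 + 29), 1: 61 / (61 + 14)}

# Macro average: every class counts equally.
macro_avg = sum(precision.values()) / len(precision)
# Weighted average: each class counts in proportion to its support.
weighted_avg = sum(precision[c] * support[c] for c in support) / sum(support.values())

print(round(macro_avg, 4), round(weighted_avg, 4))
```

Both round to 0.80 here because the two classes have similar precision; the averages drift apart when a small class scores very differently from a large one.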

# SVM: support vector machine
from sklearn import svm
# Use kernel='rbf'
# Kernel: a function that maps the data into a (higher-dimensional) vector space
# rbf = radial basis function
# Other options:
# linear
# poly (polynomial)
# sigmoid
svm_model = svm.SVC(kernel='rbf')
svm_model.fit(x_train, y_train)
y_hat = svm_model.predict(x_test)
print(y_hat[0:10])

# [0 0 0 0 0 1 0 0 0 0]
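For intuition about the `rbf` kernel, the sketch below evaluates the usual radial basis function, K(a, b) = exp(-gamma * ||a - b||^2), for a pair of points. The `gamma=0.5` value and the sample points are arbitrary assumptions; `svm.SVC` chooses its own default `gamma`.

```python
import math

def rbf_kernel(a, b, gamma=0.5):
    """K(a, b) = exp(-gamma * ||a - b||^2); gamma=0.5 is an arbitrary choice here."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)

print(rbf_kernel([0.0, 0.0], [0.0, 0.0]))   # identical points give the maximum similarity
print(rbf_kernel([0.0, 0.0], [1.0, 0.0]))   # nearby points
print(rbf_kernel([0.0, 0.0], [3.0, 4.0]))   # distant points give a value close to 0
```

Because the kernel depends on squared distances between raw feature values, unscaled columns with large ranges can swamp the others, which may be one reason the SVM results in this post are weak.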

from sklearn import metrics
svm_matrix = metrics.confusion_matrix(y_test, y_hat)
print(svm_matrix)

# [[118   7]
#  [ 79  11]]

svm_report = metrics.classification_report(y_test, y_hat)
print(svm_report)

              precision    recall  f1-score   support

           0       0.60      0.94      0.73       125
           1       0.61      0.12      0.20        90

    accuracy                           0.60       215
   macro avg       0.61      0.53      0.47       215
weighted avg       0.60      0.60      0.51       215

22. auto-mpg || Regression Analysis

2021. 11. 24. 13:44

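Machine learning: the process of finding relationships among variables

Prediction: regression analysis
Classification: KNN
Clustering: k-means
Machine learning process -> split the data -> prepare the algorithm -> train the model -> predict -> evaluate -> apply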

# Regression analysis: algorithms for predicting continuous data such as prices, sales, and stock prices
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('auto-mpg.csv', header = None)
df.columns = ['mpg', 'cylinders', 'displacement', 'horsepower','weight','acceleration', 'model year', 'origin', 'name']
df

	mpg	cylinders	displacement	horsepower	weight	acceleration	model year	origin	name
0	18.0	8	307.0	130.0	3504.0	12.0	70	1	chevrolet chevelle malibu
1	15.0	8	350.0	165.0	3693.0	11.5	70	1	buick skylark 320
2	18.0	8	318.0	150.0	3436.0	11.0	70	1	plymouth satellite
3	16.0	8	304.0	150.0	3433.0	12.0	70	1	amc rebel sst
4	17.0	8	302.0	140.0	3449.0	10.5	70	1	ford torino
...	...	...	...	...	...	...	...	...	...
393	27.0	4	140.0	86.00	2790.0	15.6	82	1	ford mustang gl
394	44.0	4	97.0	52.00	2130.0	24.6	82	2	vw pickup
395	32.0	4	135.0	84.00	2295.0	11.6	82	1	dodge rampage
396	28.0	4	120.0	79.00	2625.0	18.6	82	1	ford ranger
397	31.0	4	119.0	82.00	2720.0	19.4	82	1	chevy s-10
398 rows × 9 columns

print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    398 non-null    object 
 4   weight        398 non-null    float64
 5   acceleration  398 non-null    float64
 6   model year    398 non-null    int64  
 7   origin        398 non-null    int64  
 8   name          398 non-null    object 
dtypes: float64(4), int64(3), object(2)
memory usage: 28.1+ KB
None

print(df.describe())

              mpg   cylinders  displacement       weight  acceleration  \
count  398.000000  398.000000    398.000000   398.000000    398.000000   
mean    23.514573    5.454774    193.425879  2970.424623     15.568090   
std      7.815984    1.701004    104.269838   846.841774      2.757689   
min      9.000000    3.000000     68.000000  1613.000000      8.000000   
25%     17.500000    4.000000    104.250000  2223.750000     13.825000   
50%     23.000000    4.000000    148.500000  2803.500000     15.500000   
75%     29.000000    8.000000    262.000000  3608.000000     17.175000   
max     46.600000    8.000000    455.000000  5140.000000     24.800000   

       model year      origin  
count  398.000000  398.000000  
mean    76.010050    1.572864  
std      3.697627    0.802055  
min     70.000000    1.000000  
25%     73.000000    1.000000  
50%     76.000000    1.000000  
75%     79.000000    2.000000  
max     82.000000    3.000000  

print(df.horsepower.unique())

['130.0' '165.0' '150.0' '140.0' '198.0' '220.0' '215.0' '225.0' '190.0'
 '170.0' '160.0' '95.00' '97.00' '85.00' '88.00' '46.00' '87.00' '90.00'
 '113.0' '200.0' '210.0' '193.0' '?' '100.0' '105.0' '175.0' '153.0'
 '180.0' '110.0' '72.00' '86.00' '70.00' '76.00' '65.00' '69.00' '60.00'
 '80.00' '54.00' '208.0' '155.0' '112.0' '92.00' '145.0' '137.0' '158.0'
 '167.0' '94.00' '107.0' '230.0' '49.00' '75.00' '91.00' '122.0' '67.00'
 '83.00' '78.00' '52.00' '61.00' '93.00' '148.0' '129.0' '96.00' '71.00'
 '98.00' '115.0' '53.00' '81.00' '79.00' '120.0' '152.0' '102.0' '108.0'
 '68.00' '58.00' '149.0' '89.00' '63.00' '48.00' '66.00' '139.0' '103.0'
 '125.0' '133.0' '138.0' '135.0' '142.0' '77.00' '62.00' '132.0' '84.00'
 '64.00' '74.00' '116.0' '82.00']

# df.loc[df['horsepower']=='?','horsepower']
df['horsepower'].replace('?', np.nan, inplace = True)
print(df.horsepower.unique())

['130.0' '165.0' '150.0' '140.0' '198.0' '220.0' '215.0' '225.0' '190.0'
 '170.0' '160.0' '95.00' '97.00' '85.00' '88.00' '46.00' '87.00' '90.00'
 '113.0' '200.0' '210.0' '193.0' nan '100.0' '105.0' '175.0' '153.0'
 '180.0' '110.0' '72.00' '86.00' '70.00' '76.00' '65.00' '69.00' '60.00'
 '80.00' '54.00' '208.0' '155.0' '112.0' '92.00' '145.0' '137.0' '158.0'
 '167.0' '94.00' '107.0' '230.0' '49.00' '75.00' '91.00' '122.0' '67.00'
 '83.00' '78.00' '52.00' '61.00' '93.00' '148.0' '129.0' '96.00' '71.00'
 '98.00' '115.0' '53.00' '81.00' '79.00' '120.0' '152.0' '102.0' '108.0'
 '68.00' '58.00' '149.0' '89.00' '63.00' '48.00' '66.00' '139.0' '103.0'
 '125.0' '133.0' '138.0' '135.0' '142.0' '77.00' '62.00' '132.0' '84.00'
 '64.00' '74.00' '116.0' '82.00']

# Drop rows with missing values
df['horsepower'].isnull().sum()
df.dropna( subset = ['horsepower'], axis = 0, inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 392 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           392 non-null    float64
 1   cylinders     392 non-null    int64  
 2   displacement  392 non-null    float64
 3   horsepower    392 non-null    object 
 4   weight        392 non-null    float64
 5   acceleration  392 non-null    float64
 6   model year    392 non-null    int64  
 7   origin        392 non-null    int64  
 8   name          392 non-null    object 
dtypes: float64(4), int64(3), object(2)
memory usage: 30.6+ KB

pd.set_option('display.max_columns', 10)
print(df.describe())

              mpg   cylinders  displacement       weight  acceleration  \
count  392.000000  392.000000    392.000000   392.000000    392.000000   
mean    23.445918    5.471939    194.411990  2977.584184     15.541327   
std      7.805007    1.705783    104.644004   849.402560      2.758864   
min      9.000000    3.000000     68.000000  1613.000000      8.000000   
25%     17.000000    4.000000    105.000000  2225.250000     13.775000   
50%     22.750000    4.000000    151.000000  2803.500000     15.500000   
75%     29.000000    8.000000    275.750000  3614.750000     17.025000   
max     46.600000    8.000000    455.000000  5140.000000     24.800000   

       model year      origin  
count  392.000000  392.000000  
mean    75.979592    1.576531  
std      3.683737    0.805518  
min     70.000000    1.000000  
25%     73.000000    1.000000  
50%     76.000000    1.000000  
75%     79.000000    2.000000  
max     82.000000    3.000000  

# Convert the strings to floats
df['horsepower'] = df['horsepower'].astype(float)
# Select the attributes used in the analysis (mpg, cylinders, etc.)
ndf = df[['mpg','cylinders','horsepower','weight']]
ndf.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 392 entries, 0 to 397
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   mpg         392 non-null    float64
 1   cylinders   392 non-null    int64  
 2   horsepower  392 non-null    float64
 3   weight      392 non-null    float64
dtypes: float64(3), int64(1)
memory usage: 15.3 KB

# Scatter plot with the pandas/matplotlib plot API
ndf.plot(kind='scatter', x = 'weight', y='mpg', c= 'coral', s=10, figsize= (10,5))
plt.show()

# Scatter plots with seaborn
fig = plt.figure(figsize = (10,5))
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)
sns.regplot(x='weight', y='mpg', data =ndf, ax=ax1)
sns.regplot(x='weight', y='mpg', data =ndf, ax=ax2, fit_reg=False) # no regression line
plt.show()

# jointplot
sns.jointplot(x='weight', y='mpg', data =ndf)
sns.jointplot(x='weight', y='mpg', kind = 'reg', data =ndf)
plt.show()

# seaborn pairplot
sns.pairplot(ndf, kind = 'reg')
plt.show()

# Independent variable (a DataFrame, so it can hold several columns)
x = ndf[['weight']]
print(type(x))
# Dependent variable (a single Series)
y = ndf['mpg']
print(type(y))

# <class 'pandas.core.frame.DataFrame'>
# <class 'pandas.core.series.Series'>

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state = 10)
print(len(x_train)) # 274
print(len(y_train)) # 274

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
# Fit on the training independent variable and the dependent variable (the answers)
lr.fit(x_train, y_train)

LinearRegression()

r_square = lr.score(x_test, y_test)
print(r_square)

# 0.6822458558299325
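For a regressor, `score` returns the coefficient of determination R². A minimal sketch of that formula, 1 - SS_res / SS_tot, on a handful of hypothetical values chosen only to exercise the calculation:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))   # unexplained variation
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)                # total variation
    return 1 - ss_res / ss_tot

# Hypothetical values, not taken from the fitted model.
y_true = [18.0, 15.0, 18.0, 16.0, 17.0]
y_pred = [17.5, 15.5, 18.0, 16.5, 17.0]
print(r_squared(y_true, y_pred))
```

A value of 1 means the predictions match exactly, and a value near 0 means the model does no better than always predicting the mean.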

# Slope and intercept
print('slope a', lr.coef_)
print('intercept b', lr.intercept_)

slope a [-0.00775343]
intercept b 46.7103662572801
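With the slope and intercept printed above, a prediction is just the line equation mpg = b + a * weight. A small sketch with the coefficients copied from that output; the example weights are chosen for illustration:

```python
# Coefficients copied from the fitted model's printed output above.
slope = -0.00775343
intercept = 46.7103662572801

def predict_mpg(weight):
    """Simple linear regression: mpg = intercept + slope * weight."""
    return intercept + slope * weight

for weight in (2000, 3000, 4000):   # example weights, chosen for illustration
    print(weight, round(predict_mpg(weight), 2))
```

The negative slope means each additional 1000 units of weight lowers the predicted mpg by about 7.75.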

y_hat = lr.predict(x)
plt.figure(figsize=(10,5))
ax1 = sns.kdeplot(y, label='y')
ax2 = sns.kdeplot(y_hat, label='y_hat', ax=ax1)
plt.legend()
plt.show()

# Simple regression: models the relationship between two variables as a straight line
# Polynomial regression: fits a curve instead, which can give higher accuracy

from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
x_train_poly = poly.fit_transform(x_train)
print('original data :', x_train.shape)
print('degree-2 transformed data :', x_train_poly.shape)

# original data : (274, 1)
# degree-2 transformed data : (274, 3)

pr = LinearRegression()
pr.fit(x_train_poly, y_train)
x_test_poly = poly.transform(x_test)   # reuse the transformer fitted on the training data
r_square = pr.score(x_test_poly, y_test)
y_hat_test = pr.predict(x_test_poly)
print('slope a', pr.coef_)
print('intercept b', pr.intercept_)

# slope a [ 0.00000000e+00 -1.85768289e-02 1.70491223e-06]
# intercept b 62.58071221576951
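The three coefficients line up with the three columns that a degree-2 expansion produces for a single input: 1, w, and w². A sketch of that expansion and the resulting curve, with the coefficients copied from the output above; the weight of 3000 is an arbitrary example:

```python
# Coefficients copied from pr.coef_ and pr.intercept_ above.
# For one input w, PolynomialFeatures(degree=2) yields the columns [1, w, w^2].
coef = [0.0, -1.85768289e-02, 1.70491223e-06]
intercept = 62.58071221576951

def poly_features(w):
    """Degree-2 expansion of a single feature: [1, w, w^2]."""
    return [1.0, w, w ** 2]

def predict_mpg(w):
    """mpg = b + a0*1 + a1*w + a2*w^2."""
    return intercept + sum(c * f for c, f in zip(coef, poly_features(w)))

print(poly_features(3.0))
print(round(predict_mpg(3000), 2))
```

The first coefficient is 0 because the constant column duplicates the intercept that `LinearRegression` already fits.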

# Scatter plot
fig = plt.figure(figsize=(10,5))
ax = fig.add_subplot(1,1,1)
ax.plot(x_train, y_train, 'o', label = 'Train Data')
ax.plot(x_test, y_hat_test, 'r+', label = 'Predicted Value')
ax.legend(loc = 'best')
plt.xlabel('weight')
plt.ylabel('mpg')
plt.show()

x_poly = poly.transform(x)   # the transformer was already fitted on x_train
y_hat = pr.predict(x_poly)
plt.figure(figsize = (10,5))
ax1 = sns.kdeplot(y, label='y')
ax2 = sns.kdeplot(y_hat, label='y_hat', ax = ax1)
plt.legend()
plt.show()

# Simple regression: one independent variable and one dependent variable
# Multiple regression: several independent variables
# y = b + a1*x1 + a2*x2 +...+an*xn

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
x = ndf[['cylinders', 'horsepower', 'weight']] # multiple regression
y = ndf['mpg']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state = 10)

lr.fit(x_train, y_train)
r_square = lr.score(x_test, y_test)
print('R-squared :', r_square)

# R-squared : 0.6939048496695597

print('기울기 a', lr.coef_) print('절편 b', lr.intercept_) # 기울기 a [-0.60691288 -0.03714088 -0.00522268] # 절편 b 46.41435126963407

y_hat = lr.predict(x_test)

plt.figure(figsize = (10,5)) ax1 = sns.kdeplot(y_test, label='y_test') ax2 = sns.kdeplot(y_hat, label='y_hat', ax = ax1) plt.legend() plt.show()

y_hat = lr.predict(x) plt.figure(figsize = (10,5)) ax1 = sns.kdeplot(y, label='y') ax2 = sns.kdeplot(y_hat, label='y_hat', ax = ax1) plt.legend() plt.show()

x = [[10], [5], [9], [7]] y = [100, 50, 90, 77] lr = LinearRegression() lr.fit(x,y) result =lr.predict([[7]]) plt.figure(figsize = (10,5)) ax1 = sns.kdeplot(y, label='y')


21. Seoul Crime Rate Analysis || MinMaxScaler

2021. 11. 24. 13:24


02. crime_in_Seoul


# crime analysis by district (gu)
# columns: 관서명 = police station; 살인/강도/강간/절도/폭력 = murder/robbery/rape/theft/violence;
#          발생 = reported cases, 검거 = arrests
import numpy as np
import pandas as pd

crime_police = pd.read_csv('02. crime_in_Seoul.csv', thousands=',', encoding='euc-kr')
crime_police.head()

	관서명	살인 발생	살인 검거	강도 발생	강도 검거	강간 발생	강간 검거	절도 발생	절도 검거	폭력 발생	폭력 검거
0	중부서	2	2	3	2	105	65	1395	477	1355	1170
1	종로서	3	3	6	5	115	98	1070	413	1278	1070
2	남대문서	1	0	6	4	65	46	1153	382	869	794
3	서대문서	2	2	5	4	154	124	1812	738	2056	1711
4	혜화서	3	2	5	4	96	63	1114	424	1015	861

crime_police.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31 entries, 0 to 30
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   관서명     31 non-null     object
 1   살인 발생   31 non-null     int64 
 2   살인 검거   31 non-null     int64 
 3   강도 발생   31 non-null     int64 
 4   강도 검거   31 non-null     int64 
 5   강간 발생   31 non-null     int64 
 6   강간 검거   31 non-null     int64 
 7   절도 발생   31 non-null     int64 
 8   절도 검거   31 non-null     int64 
 9   폭력 발생   31 non-null     int64 
 10  폭력 검거   31 non-null     int64 
dtypes: int64(10), object(1)
memory usage: 2.8+ KB

# columns: 청 = regional police agency, 서 = police station, 지구대파출소 = precinct or substation,
#          X좌표/Y좌표 = longitude/latitude, 주소 = address
police_state = pd.read_csv('경찰청_경찰관서 위치, 주소_20200409.csv', thousands=',', encoding='euc-kr')
police_state.head()

	청	서	지구대파출소	X좌표	Y좌표	주소
0	강원청	강릉경찰서	강동파출소	128.978300	37.727760	강원도 강릉시 강동면 안인리 764-1
1	강원청	강릉경찰서	강릉경찰서	128.906763	37.768700	강릉시 포남동 1113
2	강원청	강릉경찰서	남부지구대	128.897125	37.748968	강릉시 노암동 715-16
3	강원청	강릉경찰서	동부지구대	128.926315	37.774032	강릉시 송정동 740-3
4	강원청	강릉경찰서	북부지구대	128.875237	37.835265	강릉시 주문진읍 주문리 312-7

police_state[police_state['지구대파출소'] == '서울중부경찰서']

	청	서	지구대파출소	X좌표	Y좌표	주소
1464	서울청	서울중부경찰서	서울중부경찰서	126.989614	37.563514	중구 저동2가 62-1

# keep only the rows that belong to the Seoul agency
police_seoul = police_state[police_state['청'] == '서울청']
police_seoul.head()

	청	서	지구대파출소	X좌표	Y좌표	주소
1207	서울청	서울강남경찰서	논현1파출소	127.029316	37.513791	강남구 논현동 58-13
1208	서울청	서울강남경찰서	논현2파출소	127.033875	37.515220	강남구 논현동 89-13
1209	서울청	서울강남경찰서	삼성1파출소	127.060440	37.514852	강남구 삼성동 107-3
1210	서울청	서울강남경찰서	삼성2파출소	127.048063	37.511289	강남구 삼성동 114-6
1211	서울청	서울강남경찰서	서울강남경찰서	127.067177	37.509036	강남구 대치동 998

# convert station names to the '서울xxx경찰서' form, e.g. 중부서 -> 서울중부경찰서
police_name = []
for name in crime_police['관서명']:
    police_name.append('서울' + str(name[:-1]) + '경찰서')
print(police_name)

['서울중부경찰서', '서울종로경찰서', '서울남대문경찰서', '서울서대문경찰서', '서울혜화경찰서', '서울용산경찰서', '서울성북경찰서', '서울동대문경찰서', '서울마포경찰서', '서울영등포경찰서', '서울성동경찰서', '서울동작경찰서', '서울광진경찰서', '서울서부경찰서', '서울강북경찰서', '서울금천경찰서', '서울중랑경찰서', '서울강남경찰서', '서울관악경찰서', '서울강서경찰서', '서울강동경찰서', '서울종암경찰서', '서울구로경찰서', '서울서초경찰서', '서울양천경찰서', '서울송파경찰서', '서울노원경찰서', '서울방배경찰서', '서울은평경찰서', '서울도봉경찰서', '서울수서경찰서']

# collect each station's address; the address starts with the district (gu) name
police_address = []
for name in police_name:
    select_police = police_seoul[police_seoul['지구대파출소'] == name]
    # print(select_police['주소'])
    police_address.append(select_police.loc[:, '주소'])
police_address

[1464    중구 저동2가 62-1
 Name: 주소, dtype: object,
 1439    종로구 경운동 90-18
 Name: 주소, dtype: object,
 1286    중구 남대문로5가 561
 Name: 주소, dtype: object,
 1345    서대문구 미근동 165
 Name: 주소, dtype: object,
 1475    종로구 인의동 48-57
 Name: 주소, dtype: object,
 1421    용산구 원효로1가 12-12
 Name: 주소, dtype: object,
 1377    성북구 삼선동5가 301
 Name: 주소, dtype: object,
 1310    동대문구 청량1리동 229
 Name: 주소, dtype: object,
 1333    마포구 아현동 618-1
 Name: 주소, dtype: object,
 1413    영등포구 당산동3가 2-11
 Name: 주소, dtype: object,
 1366    성동구 행당동 192-8
 Name: 주소, dtype: object,
 1326    동작구 노량진1동 72
 Name: 주소, dtype: object,
 1259    광진구 구의동 254-32
 Name: 주소, dtype: object,
 1353    은평구 녹번동 177-15
 Name: 주소, dtype: object,
 1229    강북구 번1동 415-15
 Name: 주소, dtype: object,
 1282    금천구 신림8동 544
 Name: 주소, dtype: object,
 1460    중랑구 묵2동 249-2
 Name: 주소, dtype: object,
 1211    강남구 대치동 998
 Name: 주소, dtype: object,
 1253    관악구 봉천4동 산177-3
 Name: 주소, dtype: object,
 1242    강서구 화곡6동 980-27
 Name: 주소, dtype: object,
 1222    강동구 성내1동 540-1
 Name: 주소, dtype: object,
 1449    성북구 종암1동 3-1260
 Name: 주소, dtype: object,
 1273    구로구 구로2동 436
 Name: 주소, dtype: object,
 1359    서초구 서초3동 1726
 Name: 주소, dtype: object,
 1402    양천구 신정6동 321
 Name: 주소, dtype: object,
 1388    송파구 가락본동 9
 Name: 주소, dtype: object,
 1297    노원구 하계동 250
 Name: 주소, dtype: object,
 1341    서초구 방배동 455-10
 Name: 주소, dtype: object,
 1431    은평구 불광2동 산24
 Name: 주소, dtype: object,
 1303    도봉구 창4동 17
 Name: 주소, dtype: object,
 1397    강남구 개포동 14
 Name: 주소, dtype: object]

# the first whitespace-separated token of each address is the district name
gu_name = []
for name in police_address:
    tmp = name.str.split()
    tmp = tmp.tolist()
    tmp_gu = tmp[0]
    gu_name.append(tmp_gu[0])
gu_name

['중구', '종로구', '중구', '서대문구', '종로구', '용산구', '성북구', '동대문구', '마포구', '영등포구', '성동구', '동작구', '광진구', '은평구', '강북구', '금천구', '중랑구', '강남구', '관악구', '강서구', '강동구', '성북구', '구로구', '서초구', '양천구', '송파구', '노원구', '서초구', '은평구', '도봉구', '강남구']
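The loop above takes the first token of each address. When the addresses are already in a single Series, the same step can be written with pandas string accessors. This is a sketch on three addresses copied from the output above, not the author's code; the variable names are made up.

```python
import pandas as pd

# three addresses in the same "<district> <neighbourhood> <lot>" shape as the 주소 column
address = pd.Series(["중구 저동2가 62-1", "종로구 경운동 90-18", "강남구 대치동 998"])

# split each row on whitespace, then keep the first token of every row
gu = address.str.split().str[0].tolist()
print(gu)  # ['중구', '종로구', '강남구']
```

The loop in the post works on a list of one-row Series, so it would need the rows concatenated first before this form applies.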

# 구별 = district (gu) that each station belongs to
crime_police['구별'] = gu_name
crime_police.head()

	관서명	살인 발생	살인 검거	강도 발생	강도 검거	강간 발생	강간 검거	절도 발생	절도 검거	폭력 발생	폭력 검거	구별
0	중부서	2	2	3	2	105	65	1395	477	1355	1170	중구
1	종로서	3	3	6	5	115	98	1070	413	1278	1070	종로구
2	남대문서	1	0	6	4	65	46	1153	382	869	794	중구
3	서대문서	2	2	5	4	154	124	1812	738	2056	1711	서대문구
4	혜화서	3	2	5	4	96	63	1114	424	1015	861	종로구

# total crime counts per district
crime_police.groupby('구별').sum()
# drop the text column first so only the numeric columns are summed, whatever the pandas version
crime_sum = pd.pivot_table(crime_police.drop(columns='관서명'), index='구별', aggfunc='sum')
crime_sum

	강간 검거	강간 발생	강도 검거	강도 발생	살인 검거	살인 발생	절도 검거	절도 발생	폭력 검거	폭력 발생
구별										
강남구	349	449	18	21	10	13	1650	3850	3705	4284
강동구	123	156	8	6	3	4	789	2366	2248	2712
강북구	126	153	13	14	8	7	618	1434	2348	2649
강서구	191	262	13	13	8	7	1260	2096	2718	3207
관악구	221	320	14	12	8	9	827	2706	2642	3298
광진구	220	240	26	14	4	4	1277	3026	2180	2625
구로구	164	281	11	15	6	8	889	2335	2432	3007
금천구	122	151	6	6	4	3	888	1567	1776	2054
노원구	121	197	7	7	10	10	801	2193	2329	2723
도봉구	106	102	10	9	3	3	478	1063	1303	1487
동대문구	146	173	13	13	5	5	814	1981	2227	2548
동작구	139	285	5	9	5	5	661	1865	1587	1910
마포구	247	294	10	14	8	8	813	2555	2519	2983
서대문구	124	154	4	5	2	2	738	1812	1711	2056
서초구	249	393	6	9	6	8	1091	2635	2098	2399
성동구	119	126	8	9	4	4	597	1607	1395	1612
성북구	124	150	4	5	5	5	741	1785	1855	2209
송파구	178	220	10	13	10	11	1129	3239	2786	3295
양천구	105	120	3	6	5	3	672	1890	2030	2509
영등포구	183	295	20	22	12	14	978	2964	2961	3572
용산구	173	194	14	14	5	5	587	1557	1704	2050
은평구	141	166	6	9	3	3	711	1914	2306	2653
종로구	161	211	9	11	5	6	837	2184	1931	2293
중구	111	170	6	9	2	3	859	2548	1964	2224
중랑구	148	187	9	11	12	13	829	2135	2407	2847

# arrest rate (검거율) in percent = arrests / reported cases * 100
crime_sum['강간검거율'] = crime_sum['강간 검거'] / crime_sum['강간 발생'] * 100
crime_sum['강도검거율'] = crime_sum['강도 검거'] / crime_sum['강도 발생'] * 100
crime_sum['살인검거율'] = crime_sum['살인 검거'] / crime_sum['살인 발생'] * 100
crime_sum['절도검거율'] = crime_sum['절도 검거'] / crime_sum['절도 발생'] * 100
crime_sum['폭력검거율'] = crime_sum['폭력 검거'] / crime_sum['폭력 발생'] * 100
crime_sum[['살인 검거', '살인 발생', '살인검거율']]

# the arrest counts are no longer needed once the rates exist
del crime_sum['강간 검거']
del crime_sum['강도 검거']
del crime_sum['살인 검거']
del crime_sum['절도 검거']
del crime_sum['폭력 검거']
crime_sum[['살인 발생', '살인검거율']].head()

	살인 발생	살인검거율
구별		
강남구	13	76.923077
강동구	4	75.000000
강북구	7	114.285714
강서구	7	114.285714
관악구	9	88.888889

# some rates exceed 100% (presumably arrests for cases reported earlier); cap them at 100
crime_sum.loc[crime_sum['강간검거율'] > 100, '강간검거율'] = 100
col_list = ['강간검거율', '강도검거율', '살인검거율', '절도검거율', '폭력검거율']
for column in col_list:
    crime_sum.loc[crime_sum[column] > 100, column] = 100
crime_sum.head()

# shorten the "reported cases" column names to the crime type only
crime_sum.rename(columns={'강간 발생': '강간', '강도 발생': '강도', '살인 발생': '살인',
                          '절도 발생': '절도', '폭력 발생': '폭력'}, inplace=True)
crime_sum

	강간	강도	살인	절도	폭력	강간검거율	강도검거율	살인검거율	절도검거율	폭력검거율
구별										
강남구	449	21	13	3850	4284	77.728285	85.714286	76.923077	42.857143	86.484594
강동구	156	6	4	2366	2712	78.846154	100.000000	75.000000	33.347422	82.890855
강북구	153	14	7	1434	2649	82.352941	92.857143	100.000000	43.096234	88.637222
강서구	262	13	7	2096	3207	72.900763	100.000000	100.000000	60.114504	84.752105
관악구	320	12	9	2706	3298	69.062500	100.000000	88.888889	30.561715	80.109157
광진구	240	14	4	3026	2625	91.666667	100.000000	100.000000	42.200925	83.047619
구로구	281	15	8	2335	3007	58.362989	73.333333	75.000000	38.072805	80.877951
금천구	151	6	3	1567	2054	80.794702	100.000000	100.000000	56.668794	86.465433
노원구	197	7	10	2193	2723	61.421320	100.000000	100.000000	36.525308	85.530665
도봉구	102	9	3	1063	1487	100.000000	100.000000	100.000000	44.967074	87.626093
동대문구	173	13	5	1981	2548	84.393064	100.000000	100.000000	41.090358	87.401884
동작구	285	9	5	1865	1910	48.771930	55.555556	100.000000	35.442359	83.089005
마포구	294	14	8	2555	2983	84.013605	71.428571	100.000000	31.819961	84.445189
서대문구	154	5	2	1812	2056	80.519481	80.000000	100.000000	40.728477	83.219844
서초구	393	9	8	2635	2399	63.358779	66.666667	75.000000	41.404175	87.453105
성동구	126	9	4	1607	1612	94.444444	88.888889	100.000000	37.149969	86.538462
성북구	150	5	5	1785	2209	82.666667	80.000000	100.000000	41.512605	83.974649
송파구	220	13	11	3239	3295	80.909091	76.923077	90.909091	34.856437	84.552352
양천구	120	6	3	1890	2509	87.500000	50.000000	100.000000	35.555556	80.908729
영등포구	295	22	14	2964	3572	62.033898	90.909091	85.714286	32.995951	82.894737
용산구	194	14	5	1557	2050	89.175258	100.000000	100.000000	37.700706	83.121951
은평구	166	9	3	1914	2653	84.939759	66.666667	100.000000	37.147335	86.920467
종로구	211	11	6	2184	2293	76.303318	81.818182	83.333333	38.324176	84.212822
중구	170	9	3	2548	2224	65.294118	66.666667	66.666667	33.712716	88.309353
중랑구	187	11	13	2135	2847	79.144385	81.818182	92.307692	38.829040	84.545135
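Capping the rates column by column, as above, can also be done in one vectorized call. The sketch below uses a small hypothetical frame (the column names and values are invented) to show `DataFrame.clip` with an upper bound; it is an alternative form, not what the post ran.

```python
import pandas as pd

# hypothetical arrest rates in percent; two of the values exceed 100
rates = pd.DataFrame(
    {"murder_rate": [76.9, 114.3, 88.9], "robbery_rate": [85.7, 100.0, 130.0]},
    index=["A", "B", "C"],
)

# replace every value above 100 with 100, leaving the others unchanged
capped = rates.clip(upper=100)
print(capped)
```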

from sklearn import preprocessing as pp

col = ['강간', '강도', '살인', '절도', '폭력']
x = crime_sum[col].values

# min-max normalization to the range 0 to 1, so that columns with very different ranges become comparable
min_max_scaler = pp.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x.astype(float))
crime_norm = pd.DataFrame(x_scaled, columns=col, index=crime_sum.index)
crime_norm

	강간	강도	살인	절도	폭력
구별					
강남구	1.000000	0.941176	0.916667	1.000000	1.000000
강동구	0.155620	0.058824	0.166667	0.467528	0.437969
강북구	0.146974	0.529412	0.416667	0.133118	0.415445
강서구	0.461095	0.470588	0.416667	0.370649	0.614945
관악구	0.628242	0.411765	0.583333	0.589523	0.647479
광진구	0.397695	0.529412	0.166667	0.704342	0.406864
구로구	0.515850	0.588235	0.500000	0.456405	0.543439
금천구	0.141210	0.058824	0.083333	0.180840	0.202717
노원구	0.273775	0.117647	0.666667	0.405454	0.441902
도봉구	0.000000	0.235294	0.083333	0.000000	0.000000
동대문구	0.204611	0.470588	0.250000	0.329386	0.379335
동작구	0.527378	0.235294	0.250000	0.287765	0.151233
마포구	0.553314	0.529412	0.500000	0.535343	0.534859
서대문구	0.149856	0.000000	0.000000	0.268748	0.203432
서초구	0.838617	0.235294	0.500000	0.564047	0.326064
성동구	0.069164	0.235294	0.166667	0.195192	0.044691
성북구	0.138329	0.000000	0.250000	0.259060	0.258134
송파구	0.340058	0.470588	0.750000	0.780768	0.646407
양천구	0.051873	0.058824	0.083333	0.296735	0.365391
영등포구	0.556196	1.000000	1.000000	0.682095	0.745442
용산구	0.265130	0.529412	0.250000	0.177252	0.201287
은평구	0.184438	0.235294	0.083333	0.305346	0.416875
종로구	0.314121	0.352941	0.333333	0.402225	0.288166
중구	0.195965	0.235294	0.083333	0.532831	0.263497
중랑구	0.244957	0.352941	0.916667	0.384643	0.486235
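Min-max scaling maps each column through (value - min) / (max - min). As a hand check, the sketch below applies that formula to three rape-case counts taken from the crime_sum table: 강남구 (the column maximum, 449), 강동구 (156) and 도봉구 (the column minimum, 102). Only NumPy is used; the array name is made up.

```python
import numpy as np

# counts for 강남구, 강동구, 도봉구; the first and last are the column's max and min
counts = np.array([449.0, 156.0, 102.0])

# (value - min) / (max - min) sends the minimum to 0 and the maximum to 1
scaled = (counts - counts.min()) / (counts.max() - counts.min())
print(scaled.round(6))
```

The middle value comes out at about 0.15562, which agrees with the 강동구 entry in the normalized table.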


20. Seoul Population Analysis || Multiple Regression

2021. 11. 23. 20:58


01. CCTV_in_Seoul


# columns: 구별 = district (renamed from 기관명), 소계 = total CCTV count,
#          2013년도 이전 = installed up to 2013, 2014년/2015년/2016년 = installed in that year
import pandas as pd

cctv_seoul = pd.read_csv('01. CCTV_in_Seoul.csv', encoding='utf-8')
cctv_seoul.rename(columns={'기관명': '구별'}, inplace=True)
cctv_seoul

	구별	소계	2013년도 이전	2014년	2015년	2016년
0	강남구	2780	1292	430	584	932
1	강동구	773	379	99	155	377
2	강북구	748	369	120	138	204
3	강서구	884	388	258	184	81
4	관악구	1496	846	260	390	613
5	광진구	707	573	78	53	174
6	구로구	1561	1142	173	246	323
7	금천구	1015	674	51	269	354
8	노원구	1265	542	57	451	516
9	도봉구	485	238	159	42	386
10	동대문구	1294	1070	23	198	579
11	동작구	1091	544	341	103	314
12	마포구	574	314	118	169	379
13	서대문구	962	844	50	68	292
14	서초구	1930	1406	157	336	398
15	성동구	1062	730	91	241	265
16	성북구	1464	1009	78	360	204
17	송파구	618	529	21	68	463
18	양천구	2034	1843	142	30	467
19	영등포구	904	495	214	195	373
20	용산구	1624	1368	218	112	398
21	은평구	1873	1138	224	278	468
22	종로구	1002	464	314	211	630
23	중구	671	413	190	72	348
24	중랑구	660	509	121	177	109

# columns: 인구수 = population, 한국인 = Korean nationals, 외국인 = foreigners, 고령자 = elderly
pop_seoul = pd.read_excel('01. population_in_Seoul.xls', header=2, usecols='B, D, G, J, N')
pop_seoul.rename(columns={pop_seoul.columns[0]: '구별',
                          pop_seoul.columns[1]: '인구수',
                          pop_seoul.columns[2]: '한국인',
                          pop_seoul.columns[3]: '외국인',
                          pop_seoul.columns[4]: '고령자'}, inplace=True)
pop_seoul.head()

	구별	인구수	한국인	외국인	고령자
0	합계	10197604.0	9926968.0	270636.0	1321458.0
1	종로구	162820.0	153589.0	9231.0	25425.0
2	중구	133240.0	124312.0	8928.0	20764.0
3	용산구	244203.0	229456.0	14747.0	36231.0
4	성동구	311244.0	303380.0	7864.0	39997.0

pop_seoul.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27 entries, 0 to 26
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   구별      26 non-null     object 
 1   인구수     26 non-null     float64
 2   한국인     26 non-null     float64
 3   외국인     26 non-null     float64
 4   고령자     26 non-null     float64
dtypes: float64(4), object(1)
memory usage: 1.2+ KB

# 최근증가율 = recent growth rate (%): CCTVs added in 2014-2016 relative to those installed up to 2013
cctv_seoul['최근증가율'] = (cctv_seoul['2014년'] + cctv_seoul['2015년'] + cctv_seoul['2016년']) / \
                          cctv_seoul['2013년도 이전'] * 100
cctv_seoul.sort_values(by='최근증가율', ascending=False).head()

	구별	소계	2013년도 이전	2014년	2015년	2016년	최근증가율
22	종로구	1002	464	314	211	630	248.922414
9	도봉구	485	238	159	42	386	246.638655
12	마포구	574	314	118	169	379	212.101911
8	노원구	1265	542	57	451	516	188.929889
1	강동구	773	379	99	155	377	166.490765

# 외국인비율 = share of foreigners, 고령자비율 = share of elderly residents
pop_seoul['외국인비율'] = pop_seoul['외국인'] / pop_seoul['인구수']
pop_seoul['고령자비율'] = pop_seoul['고령자'] / pop_seoul['인구수']
pop_seoul.head()

	구별	인구수	한국인	외국인	고령자	외국인비율	고령자비율
0	합계	10197604.0	9926968.0	270636.0	1321458.0	0.026539	0.129585
1	종로구	162820.0	153589.0	9231.0	25425.0	0.056695	0.156154
2	중구	133240.0	124312.0	8928.0	20764.0	0.067007	0.155839
3	용산구	244203.0	229456.0	14747.0	36231.0	0.060388	0.148364
4	성동구	311244.0	303380.0	7864.0	39997.0	0.025266	0.128507

# top 5 districts by share of foreigners
pop_seoul.sort_values(by='외국인비율', ascending=False).head()

	구별	인구수	한국인	외국인	고령자	외국인비율	고령자비율
19	영등포구	402985.0	368072.0	34913.0	52413.0	0.086636	0.130062
18	금천구	255082.0	236353.0	18729.0	32970.0	0.073423	0.129253
17	구로구	447874.0	416487.0	31387.0	56833.0	0.070080	0.126895
2	중구	133240.0	124312.0	8928.0	20764.0	0.067007	0.155839
3	용산구	244203.0	229456.0	14747.0	36231.0	0.060388	0.148364

# join the CCTV and population tables on the district name
data_result = pd.merge(cctv_seoul, pop_seoul, on='구별')
data_result.head()

	구별	소계	2013년도 이전	2014년	2015년	2016년	최근증가율	인구수	한국인	외국인	고령자	외국인비율	고령자비율
0	강남구	2780	1292	430	584	932	150.619195	570500.0	565550.0	4950.0	63167.0	0.008677	0.110722
1	강동구	773	379	99	155	377	166.490765	453233.0	449019.0	4214.0	54622.0	0.009298	0.120516
2	강북구	748	369	120	138	204	125.203252	330192.0	326686.0	3506.0	54813.0	0.010618	0.166003
3	강서구	884	388	258	184	81	134.793814	603772.0	597248.0	6524.0	72548.0	0.010805	0.120158
4	관악구	1496	846	260	390	613	149.290780	525515.0	507203.0	18312.0	68082.0	0.034846	0.129553

# drop the per-year installation columns
del data_result['2013년도 이전']
del data_result['2014년']
del data_result['2015년']
del data_result['2016년']
data_result.head()

	구별	소계	최근증가율	인구수	한국인	외국인	고령자	외국인비율	고령자비율
0	강남구	2780	150.619195	570500.0	565550.0	4950.0	63167.0	0.008677	0.110722
1	강동구	773	166.490765	453233.0	449019.0	4214.0	54622.0	0.009298	0.120516
2	강북구	748	125.203252	330192.0	326686.0	3506.0	54813.0	0.010618	0.166003
3	강서구	884	134.793814	603772.0	597248.0	6524.0	72548.0	0.010805	0.120158
4	관악구	1496	149.290780	525515.0	507203.0	18312.0	68082.0	0.034846	0.129553

data_result.set_index('구별', inplace=True)
data_result.head()

	소계	최근증가율	인구수	한국인	외국인	고령자	외국인비율	고령자비율
구별								
강남구	2780	150.619195	570500.0	565550.0	4950.0	63167.0	0.008677	0.110722
강동구	773	166.490765	453233.0	449019.0	4214.0	54622.0	0.009298	0.120516
강북구	748	125.203252	330192.0	326686.0	3506.0	54813.0	0.010618	0.166003
강서구	884	134.793814	603772.0	597248.0	6524.0	72548.0	0.010805	0.120158
관악구	1496	149.290780	525515.0	507203.0	18312.0	68082.0	0.034846	0.129553

data_result.corr(method='pearson')

	소계	최근증가율	인구수	한국인	외국인	고령자	외국인비율	고령자비율
소계	1.000000	-0.343016	0.306342	0.304287	-0.023786	0.255196	-0.136074	-0.280786
최근증가율	-0.343016	1.000000	-0.093068	-0.082511	-0.150463	-0.070969	-0.044042	0.185089
인구수	0.306342	-0.093068	1.000000	0.998061	-0.153371	0.932667	-0.591939	-0.669462
한국인	0.304287	-0.082511	0.998061	1.000000	-0.214576	0.931636	-0.637911	-0.660812
외국인	-0.023786	-0.150463	-0.153371	-0.214576	1.000000	-0.155381	0.838904	-0.014055
고령자	0.255196	-0.070969	0.932667	0.931636	-0.155381	1.000000	-0.606088	-0.380468
외국인비율	-0.136074	-0.044042	-0.591939	-0.637911	0.838904	-0.606088	1.000000	0.267348
고령자비율	-0.280786	0.185089	-0.669462	-0.660812	-0.014055	-0.380468	0.267348	1.000000

import numpy as np
np.corrcoef(data_result['고령자비율'], data_result['소계'])

array([[ 1.        , -0.28078554],
       [-0.28078554,  1.        ]])

np.corrcoef(data_result['외국인비율'], data_result['소계'])

array([[ 1.        , -0.13607433],
       [-0.13607433,  1.        ]])

np.corrcoef(data_result['인구수'], data_result['소계'])

array([[1.        , 0.30634228],
       [0.30634228, 1.        ]])
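Both `corr(method='pearson')` and `np.corrcoef` report the Pearson coefficient: the sum of products of the centered values divided by the square root of the product of their sums of squares. The sketch below computes it by hand on made-up paired data and compares the result with `np.corrcoef`; the arrays are illustrative, not taken from the Seoul data.

```python
import numpy as np

# hypothetical paired observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# center both variables, then apply the Pearson formula
dx = x - x.mean()
dy = y - y.mean()
r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# the off-diagonal entry of np.corrcoef is the same coefficient
print(r, np.corrcoef(x, y)[0, 1])
```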

# matplotlib
import matplotlib.pyplot as plt
from matplotlib import font_manager, rc

plt.figure()
plt.rc('font', family='Malgun Gothic')  # font with Korean glyphs for the labels
data_result['소계'].sort_values(ascending=True).plot(kind='barh', grid=True, figsize=(10, 10))
plt.show()

# cctv비율: CCTV count relative to population, in percent
data_result['cctv비율'] = data_result['소계'] / data_result['인구수'] * 100
data_result.head()

	소계	최근증가율	인구수	한국인	외국인	고령자	외국인비율	고령자비율	cctv비율
구별									
강남구	2780	150.619195	570500.0	565550.0	4950.0	63167.0	0.008677	0.110722	0.487292
강동구	773	166.490765	453233.0	449019.0	4214.0	54622.0	0.009298	0.120516	0.170552
강북구	748	125.203252	330192.0	326686.0	3506.0	54813.0	0.010618	0.166003	0.226535
강서구	884	134.793814	603772.0	597248.0	6524.0	72548.0	0.010805	0.120158	0.146413
관악구	1496	149.290780	525515.0	507203.0	18312.0	68082.0	0.034846	0.129553	0.284673

data_result['cctv비율'].sort_values().plot(kind='barh', grid=True, figsize=(10, 10))
plt.show()

# scatter plot of population against CCTV count
plt.figure(figsize=(6, 6))
plt.scatter(data_result['인구수'], data_result['소계'], s=50)
plt.xlabel('인구수')
plt.ylabel('cctv갯수')
plt.grid()
plt.show()

# scatter plot of population against CCTV count, with a regression line
# np.polyfit returns least-squares polynomial coefficients; the last argument is the degree
fpl = np.polyfit(data_result['인구수'], data_result['소계'], 1)  # degree 1: straight line
f1 = np.poly1d(fpl)
fx = np.linspace(100000, 700000, 100)

plt.figure(figsize=(10, 10))
plt.scatter(data_result['인구수'], data_result['소계'], s=50)
plt.plot(fx, f1(fx), ls='dashed', lw=3, color='g')
plt.xlabel('인구수')
plt.ylabel('cctv')
plt.grid()
plt.show()

# the same plot with a degree-4 polynomial instead of a straight line
fpl = np.polyfit(data_result['인구수'], data_result['소계'], 4)  # degree 4: curve
f1 = np.poly1d(fpl)  # callable that returns the fitted y for a given population
fx = np.linspace(100000, 700000, 100)

plt.figure(figsize=(10, 10))
plt.scatter(data_result['인구수'], data_result['소계'], s=50)
plt.plot(fx, f1(fx), ls='dashed', lw=3, color='g')
plt.xlabel('인구수')
plt.ylabel('cctv')
plt.grid()
plt.show()

import numpy as np

x = np.array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.])
y = np.array([4.23620563, 6.18696492, 2.83930821, 5.00923197, 11.51299327,
              12.91581993, 14.51838241, 14.34881875, 18.13566499, 20.1408104,
              21.9872241])

fit1 = np.polyfit(x, y, 1)  # 2 coefficients (slope and intercept)
fit2 = np.polyfit(x, y, 2)  # 3 coefficients (including the intercept)
fit3 = np.polyfit(x, y, 3)  # 4 coefficients (including the intercept)
print(fit1)
print(fit2)
print(fit3)
# [1.92858279 2.34176099]
# [0.05915413 1.33704154 3.22907288]
# [-0.02808825 0.48047788 -0.26960637 4.24024989]

# evaluate each fitted polynomial at x using the printed coefficients
# (the expressions are vectorized over x, so no loop is needed)
fit1 = 1.92858279*x + 2.34176099
fit2 = 0.05915413*x**2 + 1.33704154*x + 3.22907288
fit3 = - 0.02808825*x**3 + 0.48047788*x**2 - 0.26960637*x + 4.24024989
print(fit3)
# [ 4.24024989 4.42303315 5.39824267 6.99734895 9.05182249 11.39313379
# 13.85275335 16.26215167 18.45279925 20.25616659 21.50372419]
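Typing the printed coefficients back in by hand works, but NumPy can evaluate a fitted polynomial directly. The sketch below reuses the x and y arrays from the post and evaluates the degree-3 fit with `np.polyval` and with `np.poly1d`, which give the same values; the names `y_fit3`, `f3`, `sse1` and `sse3` are made up for this example.

```python
import numpy as np

x = np.arange(11, dtype=float)
y = np.array([4.23620563, 6.18696492, 2.83930821, 5.00923197, 11.51299327,
              12.91581993, 14.51838241, 14.34881875, 18.13566499, 20.1408104,
              21.9872241])

fit1 = np.polyfit(x, y, 1)
fit3 = np.polyfit(x, y, 3)

# np.polyval evaluates the coefficients (highest degree first) at every x
y_fit3 = np.polyval(fit3, x)

# np.poly1d wraps the same coefficients in a callable
f3 = np.poly1d(fit3)

# sum of squared residuals: the higher-degree fit can never be worse on the training points
sse1 = ((y - np.polyval(fit1, x)) ** 2).sum()
sse3 = ((y - y_fit3) ** 2).sum()
print(sse1, sse3)
```

A lower training error for the cubic does not by itself mean it generalizes better, which is why the regression posts evaluate on a held-out split.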

# scatter plot of x, y with the degree-1 fit
plt.scatter(x, y)
plt.plot(x, y)
plt.plot(x, fit1)
plt.show()

# scatter plot of x, y with the degree-3 fit
plt.scatter(x, y)
plt.plot(x, y)
plt.plot(x, fit3)
plt.show()

# scatter plot + regression curve, with each point colored by its distance from the curve
# coefficients of the degree-2 least-squares fit
fpl = np.polyfit(data_result['인구수'], data_result['소계'], 2)
# callable that computes y from the fitted coefficients
f1 = np.poly1d(fpl)
# x axis: 100 evenly spaced values from 100,000 to 700,000
fx = np.linspace(100000, 700000, 100)

# 오차 = absolute residual: |actual CCTV count - fitted value at that district's population|
data_result['오차'] = np.abs(data_result['소계'] - f1(data_result['인구수']))
df_sort = data_result.sort_values(by='오차', ascending=False)

# draw the plot
plt.figure(figsize=(14, 10))
plt.scatter(data_result['인구수'], data_result['소계'], c=data_result['오차'], s=50)
plt.plot(fx, f1(fx), ls='dashed', lw=3, color='g')

# label the 10 districts with the largest absolute residual
for n in range(10):
    # .iloc selects by position; the index holds district names, not integers
    plt.text(df_sort['인구수'].iloc[n] * 1.02, df_sort['소계'].iloc[n] * 0.98,  # nudge right and slightly down
             df_sort.index[n], fontsize=15)

plt.xlabel('인구수')
plt.ylabel('cctv갯수')
plt.colorbar()
plt.grid()
plt.show()


19. World Alcohol Consumption Data 2

2021. 11. 23. 20:04


drinks.csv


from scipy import stats
import pandas as pd
drinks = pd.read_csv('drinks.csv')
drinks['continent'] = drinks['continent'].fillna('OT')
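drinks.info  # no parentheses, so this shows the bound method (output below); drinks.info() gives the column summary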

<bound method DataFrame.info of          country  beer_servings  spirit_servings  wine_servings  \
0    Afghanistan              0                0              0   
1        Albania             89              132             54   
2        Algeria             25                0             14   
3        Andorra            245              138            312   
4         Angola            217               57             45   
..           ...            ...              ...            ...   
188    Venezuela            333              100              3   
189      Vietnam            111                2              1   
190        Yemen              6                0              0   
191       Zambia             32               19              4   
192     Zimbabwe             64               18              4   

     total_litres_of_pure_alcohol continent  
0                             0.0        AS  
1                             4.9        EU  
2                             0.7        AF  
3                            12.4        EU  
4                             5.9        AF  
..                            ...       ...  
188                           7.7        SA  
189                           2.0        AS  
190                           0.1        AS  
191                           2.5        AF  
192                           4.7        AF  

[193 rows x 6 columns]>

africa = drinks.loc[drinks['continent']=='AF']
europe = drinks.loc[drinks['continent']=='EU']
# test whether the mean beer servings of the two groups differ
tTestResult = stats.ttest_ind(africa['beer_servings'], europe['beer_servings'])
tTestResultDiffVar = stats.ttest_ind(africa['beer_servings'], europe['beer_servings'], equal_var = False)

# assuming the two groups have equal variances (Student's t-test)
print(tTestResult)

# Ttest_indResult(statistic=-7.267986335644365, pvalue=9.719556422442453e-11)

# assuming the two groups have different variances (Welch's t-test)
print(tTestResultDiffVar)

# Ttest_indResult(statistic=-7.143520192189803, pvalue=2.9837787864303205e-10)

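- t-statistic: the test statistic, i.e. the difference between the two group means scaled by its standard error. A negative value means the second group (here Europe) has the larger mean.
- p-value: the probability of seeing a difference at least this large if the null hypothesis were true. Here it is about 1e-10, effectively 0 and far below 0.05, so the null hypothesis is rejected.
- Null hypothesis: the default assumption that the test tries to reject; here, that the two groups have the same mean.
- Alternative hypothesis: the opposite of the null hypothesis; here, that the two means differ.
- Beer consumption in Africa and Europe differs by more than chance alone would explain
- => the difference is statistically significant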
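To make the t-statistic concrete, the sketch below computes the unequal-variance (Welch) statistic by hand on two small made-up samples. It follows the textbook formula, the difference of the means divided by sqrt(var_a/n_a + var_b/n_b) with sample variances, which is the quantity `stats.ttest_ind(..., equal_var=False)` reports as `statistic`. The sample values and names are assumptions for illustration.

```python
import numpy as np

# hypothetical samples; the second group has the larger mean
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([4.0, 5.0, 6.0, 7.0, 8.0, 9.0])

# sample variances use ddof=1 (divide by n - 1)
var_a = a.var(ddof=1)
var_b = b.var(ddof=1)

# standard error of the difference between the two means
se = np.sqrt(var_a / len(a) + var_b / len(b))

# Welch t-statistic: negative because mean(a) < mean(b)
t = (a.mean() - b.mean()) / se
print(t)
```

The sign convention matches the post: Africa is passed first and Europe second, so a negative statistic means Europe's mean is larger.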

# How strong is the alcohol that South Korea drinks?
drinks['total_servings'] =  drinks['beer_servings'] + drinks['spirit_servings']+drinks['wine_servings']
drinks.head()

	country	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol	continent	total_servings
0	Afghanistan	0	0	0	0.0	AS	0
1	Albania	89	132	54	4.9	EU	275
2	Algeria	25	0	14	0.7	AF	39
3	Andorra	245	138	312	12.4	EU	695
4	Angola	217	57	45	5.9	AF	319

drinks['alcohol_rate'] = drinks['total_litres_of_pure_alcohol'] / drinks['total_servings']
# alcohol_rate: rows whose denominator (total_servings) is 0 end up as missing values (NaN)
drinks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 8 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   country                       193 non-null    object 
 1   beer_servings                 193 non-null    int64  
 2   spirit_servings               193 non-null    int64  
 3   wine_servings                 193 non-null    int64  
 4   total_litres_of_pure_alcohol  193 non-null    float64
 5   continent                     193 non-null    object 
 6   total_servings                193 non-null    int64  
 7   alcohol_rate                  180 non-null    float64
dtypes: float64(2), int64(4), object(2)
memory usage: 12.2+ KB
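The 13 missing values in alcohol_rate come from division by a zero denominator. The sketch below, on made-up numbers, shows what pandas returns in the two possible cases: 0/0 gives NaN, while a non-zero numerator over 0 gives inf, which `fillna` does not touch. The Series names are hypothetical.

```python
import numpy as np
import pandas as pd

# hypothetical litres of pure alcohol and total servings for three countries
litres = pd.Series([0.0, 4.9, 1.0])
servings = pd.Series([0, 275, 0])

rate = litres / servings
print(rate)

# fillna(0) replaces the NaN from 0/0 but leaves the inf from 1.0/0 in place
filled = rate.fillna(0)
print(filled)
```

Since the post's `fillna(0)` brings the non-null count to 193, checking for inf separately (for example with `np.isinf`) is a cheap extra safeguard before ranking.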

drinks['alcohol_rate'].fillna(0, inplace = True)
drinks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 8 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   country                       193 non-null    object 
 1   beer_servings                 193 non-null    int64  
 2   spirit_servings               193 non-null    int64  
 3   wine_servings                 193 non-null    int64  
 4   total_litres_of_pure_alcohol  193 non-null    float64
 5   continent                     193 non-null    object 
 6   total_servings                193 non-null    int64  
 7   alcohol_rate                  193 non-null    float64
dtypes: float64(2), int64(4), object(2)
memory usage: 12.2+ KB

country_alcohol_rank = drinks[['country', 'alcohol_rate']]
country_alcohol_rank = country_alcohol_rank.sort_values(by = ['alcohol_rate'], ascending = False)
country_alcohol_rank.head()

	country	alcohol_rate
63	Gambia	0.266667
153	Sierra Leone	0.223333
124	Nigeria	0.185714
179	Uganda	0.153704
142	Rwanda	0.151111

import numpy as np
import matplotlib.pyplot as plt
country_list = country_alcohol_rank.country.tolist()
x_pos = np.arange(len(country_list))
rank = country_alcohol_rank.alcohol_rate.tolist()
country_list.index("South Korea")

bar_list = plt.bar(x_pos, rank)
bar_list[country_list.index('South Korea')].set_color('r')
plt.ylabel('alcohol rate')
plt.title('liquor drink rank by country')
plt.axis([0, 200, 0, 0.3])

korea_rank = country_list.index('South Korea')
korea_alc_rate = country_alcohol_rank[country_alcohol_rank['country'] == 'South Korea']['alcohol_rate'].values[0]
plt.annotate('South korea :' + str(korea_rank + 1), xy = (korea_rank, korea_alc_rate), 
            xytext = (korea_rank + 10, korea_alc_rate + 0.05),
            arrowprops = dict(facecolor = 'red', shrink = 0.05))
plt.show()

# bar chart of total consumption (total servings)
country_serving_rank = drinks[['country','total_servings']]
country_serving_rank = country_serving_rank.sort_values(by=['total_servings'], ascending=False)
country_serving_rank.head()

	country	total_servings
3	Andorra	695
68	Grenada	665
45	Czech Republic	665
61	France	648
141	Russian Federation	646

# draw the bar chart
country_list = country_serving_rank.country.tolist()
x_pos = np.arange(len(country_list))
rank = country_serving_rank.total_servings.tolist()

bar_list = plt.bar(x_pos, rank)
bar_list[country_list.index('South Korea')].set_color('r')
plt.ylabel('total servings')
plt.title('total servings rank by country')
plt.axis([0, 200, 0, 700])

korea_rank = country_list.index('South Korea')
korea_serving_rate = country_serving_rank[country_serving_rank['country'] == 'South Korea']['total_servings'].values[0]
plt.annotate('South korea :' + str(korea_rank + 1), xy = (korea_rank, korea_serving_rate), 
            xytext = (korea_rank + 10, korea_serving_rate + 50),  # offset sized for the 0-700 y axis
            arrowprops = dict(facecolor = 'red', shrink = 0.05))
plt.show()


18. World Alcohol Consumption Data Analysis

2021. 11. 3. 00:30


drinks.csv


import pandas as pd
drinks = pd.read_csv('drinks.csv')
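drinks.info  # no parentheses, so this shows the bound method (output below); drinks.info() gives the column summary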

<bound method DataFrame.info of          country  beer_servings  spirit_servings  wine_servings  \
0    Afghanistan              0                0              0   
1        Albania             89              132             54   
2        Algeria             25                0             14   
3        Andorra            245              138            312   
4         Angola            217               57             45   
..           ...            ...              ...            ...   
188    Venezuela            333              100              3   
189      Vietnam            111                2              1   
190        Yemen              6                0              0   
191       Zambia             32               19              4   
192     Zimbabwe             64               18              4   

     total_litres_of_pure_alcohol continent  
0                             0.0        AS  
1                             4.9        EU  
2                             0.7        AF  
3                            12.4        EU  
4                             5.9        AF  
..                            ...       ...  
188                           7.7        SA  
189                           2.0        AS  
190                           0.1        AS  
191                           2.5        AF  
192                           4.7        AF  

[193 rows x 6 columns]>

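drinks.head  # no parentheses, so this shows the bound method (output below); drinks.head() returns the first 5 rows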

<bound method NDFrame.head of          country  beer_servings  spirit_servings  wine_servings  \
0    Afghanistan              0                0              0   
1        Albania             89              132             54   
2        Algeria             25                0             14   
3        Andorra            245              138            312   
4         Angola            217               57             45   
..           ...            ...              ...            ...   
188    Venezuela            333              100              3   
189      Vietnam            111                2              1   
190        Yemen              6                0              0   
191       Zambia             32               19              4   
192     Zimbabwe             64               18              4   

     total_litres_of_pure_alcohol continent  
0                             0.0        AS  
1                             4.9        EU  
2                             0.7        AF  
3                            12.4        EU  
4                             5.9        AF  
..                            ...       ...  
188                           7.7        SA  
189                           2.0        AS  
190                           0.1        AS  
191                           2.5        AF  
192                           4.7        AF  

[193 rows x 6 columns]>

# correlation between features
# Pearson correlation coefficient
# 'beer_servings', 'wine_servings'
corr = drinks[['beer_servings', 'wine_servings']].corr(method = 'pearson')
corr


	beer_servings	wine_servings
beer_servings	1.000000	0.527172
wine_servings	0.527172	1.000000

corr = drinks[['beer_servings','spirit_servings', 'wine_servings','total_litres_of_pure_alcohol']].corr(method = 'pearson')
corr

	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol
beer_servings	1.000000	0.458819	0.527172	0.835839
spirit_servings	0.458819	1.000000	0.194797	0.654968
wine_servings	0.527172	0.194797	1.000000	0.667598
total_litres_of_pure_alcohol	0.835839	0.654968	0.667598	1.000000

# visualize the correlation coefficients as a heatmap
import matplotlib.pyplot as plt
import seaborn as sns
cols_view = ['beer','spirit', 'wine', 'alcohol']
sns.set(font_scale = 1.5)
hm = sns.heatmap(corr.values, cbar = True, annot = True, square=True,
                fmt = '.2f', annot_kws = {'size':15},
                yticklabels = cols_view, xticklabels = cols_view)
plt.show()

hm = sns.pairplot(drinks)
plt.show()

drinks.isnull().sum()
# drinks.info()

country                          0
beer_servings                    0
spirit_servings                  0
wine_servings                    0
total_litres_of_pure_alcohol     0
continent                       23
dtype: int64

drinks['continent'] = drinks['continent'].fillna('OT')

drinks['continent'].value_counts()

AF    53
EU    45
AS    44
OT    23
OC    16
SA    12
Name: continent, dtype: int64
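
One possible explanation for the 23 missing `continent` values, offered as an assumption because the post does not show how `drinks` was loaded: by default `pd.read_csv` treats the literal string `NA` as a missing value, which would swallow a continent code such as `NA` for North America. The sketch below uses a made-up three-row CSV (the rows and numbers are hypothetical) to show the default behavior and the `keep_default_na=False` alternative.

```python
import io
import pandas as pd

# Hypothetical three-row CSV; the numbers are made up for illustration.
csv_text = """country,beer_servings,continent
Canada,100,NA
France,200,EU
Japan,50,AS
"""

# Default parsing: the string "NA" is read as a missing value (NaN).
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df['continent'].isnull().sum())

# keep_default_na=False keeps "NA" as an ordinary string.
kept_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(kept_df['continent'].value_counts())
```

If that is what happened here, filling with `'OT'` as above still works, but the label then stands for one specific continent rather than a mixed "other" group.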

# Print the number of countries per continent
print(drinks.groupby('continent').count()['country'])

continent
AF    53
AS    44
EU    45
OC    16
OT    23
SA    12
Name: country, dtype: int64

plt.pie(drinks['continent'].value_counts(),
        labels = drinks['continent'].value_counts().index.tolist(),
       autopct='%.0f%%',
       explode = (0,0,0,0.2,0,0),
       shadow=True)
plt.title('null data to "OT"')
plt.show()

drinks.groupby('continent')['spirit_servings'].max()

continent
AF    152
AS    326
EU    373
OC    254
OT    438
SA    302
Name: spirit_servings, dtype: int64

drinks.groupby('continent')['spirit_servings'].mean()

continent
AF     16.339623
AS     60.840909
EU    132.555556
OC     58.437500
OT    165.739130
SA    114.750000
Name: spirit_servings, dtype: float64

drinks.groupby('continent')['spirit_servings'].agg(['mean','min','max','sum'])

	mean	min	max	sum
continent				
AF	16.339623	0	152	866
AS	60.840909	0	326	2677
EU	132.555556	0	373	5965
OC	58.437500	0	254	935
OT	165.739130	68	438	3812
SA	114.750000	25	302	1377

dm = drinks['total_litres_of_pure_alcohol'].mean()
con_mean = drinks.groupby('continent')['total_litres_of_pure_alcohol'].mean()
con_mean[con_mean >= dm]

continent
EU    8.617778
OT    5.995652
SA    6.308333
Name: total_litres_of_pure_alcohol, dtype: float64

dmax = drinks.groupby('continent')['beer_servings'].mean()
dmax[dmax == dmax.max()]

continent
EU    193.777778
Name: beer_servings, dtype: float64

drinks.groupby('continent')['beer_servings'].mean().idxmax()

# 'EU'

drinks.groupby('continent')['beer_servings'].mean().idxmin()

# 'AS'

result

	mean	min	max	sum
continent				
AF	16.339623	0	152	866
AS	60.840909	0	326	2677
EU	132.555556	0	373	5965
OC	58.437500	0	254	935
OT	165.739130	68	438	3812
SA	114.750000	25	302	1377

result.index
# Index(['AF', 'AS', 'EU', 'OC', 'OT', 'SA'], dtype='object', name='continent')

import numpy as np
# result = drinks.groupby('continent')['spirit_servings'].agg(['mean', 'min', 'max', 'sum'])
means = result['mean'].tolist()
mins = result['min'].tolist()
maxs = result['max'].tolist()
sums = result['sum'].tolist()
index = np.arange(len(result.index))
bar_width = 0.1
# Shift each series by one bar width so the four bars sit side by side instead of overlapping
rects1 = plt.bar(index, means, bar_width, color = 'r', label = 'Mean')
rects2 = plt.bar(index + bar_width, mins, bar_width, color = 'g', label = 'Min')
rects3 = plt.bar(index + bar_width * 2, maxs, bar_width, color = 'b', label = 'Max')
rects4 = plt.bar(index + bar_width * 3, sums, bar_width, color = 'y', label = 'Sum')
plt.xticks(index, result.index.tolist())
plt.legend(loc="best")
plt.show()

# Visualize the mean total_litres_of_pure_alcohol consumption per continent
import numpy as np
continent_mean = drinks.groupby('continent')['total_litres_of_pure_alcohol'].mean()
total_mean = drinks.total_litres_of_pure_alcohol.mean()

continents = continent_mean.index.tolist()
continents.append('Mean')

x_pos = np.arange(len(continents))
alcohol = continent_mean.tolist()
alcohol.append(total_mean)

bar_list = plt.bar(x_pos, alcohol, align = 'center', alpha = 0.5)
bar_list[len(continents)-1].set_color('r')
plt.plot([0., 6], [total_mean, total_mean], "k--")
plt.xticks(x_pos, continents)
plt.ylabel('total_litres_of_pure_alcohol')
plt.title('total_litres_of_pure_alcohol by continent')
plt.show()

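# Visualize the per-continent beer_servings totals as a bar chart
# Change the EU bar's color to red
# Compute the mean of the per-continent beer totals and add it to the bar chart
# Draw the mean line; make the mean bar yellow
# Draw the mean line in black ("k--")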
beer_sum = drinks.groupby('continent')['beer_servings'].sum()
beer_sum

continent
AF    3258
AS    1630
EU    8720
OC    1435
OT    3345
SA    2101
Name: beer_servings, dtype: int64

beer_mean = beer_sum.mean()
beer_mean
# 3414.8333333333335

continents = beer_sum.index.tolist()
continents.append("Mean")
continents

# ['AF', 'AS', 'EU', 'OC', 'OT', 'SA', 'Mean']

x_pos = np.arange(len(continents))
alcohol = beer_sum.tolist()
alcohol.append(beer_mean)
alcohol

[3258, 1630, 8720, 1435, 3345, 2101, 3414.8333333333335]

bar_list = plt.bar(x_pos, alcohol, align='center', alpha = 0.5)
bar_list[2].set_color("r")
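
The task comments above also ask for a yellow mean bar and a dashed black mean line, but the listing stops after coloring the EU bar. A minimal sketch of how the remaining steps might look, modeled on the `total_litres_of_pure_alcohol` chart earlier in the post; the totals are copied from the `beer_sum` output above so the block runs without the CSV, and the axis label and title are assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Per-continent totals copied from the beer_sum output shown above.
beer_sum = pd.Series(
    [3258, 1630, 8720, 1435, 3345, 2101],
    index=['AF', 'AS', 'EU', 'OC', 'OT', 'SA'],
    name='beer_servings',
)
beer_mean = beer_sum.mean()

continents = beer_sum.index.tolist() + ['Mean']
alcohol = beer_sum.tolist() + [beer_mean]
x_pos = np.arange(len(continents))

bar_list = plt.bar(x_pos, alcohol, align='center', alpha=0.5)
bar_list[continents.index('EU')].set_color('r')  # EU bar in red
bar_list[len(continents) - 1].set_color('y')     # mean bar in yellow
# Dashed black line at the mean, spanning the first to the last bar
plt.plot([0., len(continents) - 1], [beer_mean, beer_mean], 'k--')
plt.xticks(x_pos, continents)
plt.ylabel('beer_servings')                      # assumed label
plt.title('beer_servings by continent')          # assumed title
plt.show()
```

Looking up the EU position with `continents.index('EU')` rather than the literal `2` keeps the highlight on the right bar if the continent order ever changes.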
