Project & Kaggle

Predicting Match Outcomes from European Soccer Match and Player Features

판교데싸 2021. 4. 2. 12:48
European Soccer Match Outcome Prediction

Dataset Overview

- This project uses the European Soccer Database dataset.

- A single SQLite database file is used:
database.sqlite

- Seven tables from the database are used:
Country: country information
League: league information
Match: match information (the main dataset)
Player: player information
Player_Attributes: player attributes
Team: team information
Team_Attributes: team attributes

Step 0. Databases and SQL

SQL and Queries
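
Before loading anything into pandas, it helps to see what a raw SQL query against the SQLite file looks like. A minimal sketch, assuming database.sqlite sits in the working directory (check_conn is just an illustrative name):

import sqlite3

# Open the database file and run a single query directly;
# the notebook below opens its own connection for pandas.
check_conn = sqlite3.connect('database.sqlite')
rows = check_conn.execute('SELECT id, name FROM Country LIMIT 5').fetchall()
print(rows)
check_conn.close()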

Step 1. Preparing the Dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Problem 3. Reading the SQLite database with sqlite3 and pandas

In [3]:
import sqlite3
In [4]:
# Connect to the SQLite database with sqlite3.connect() and load each table with pd.read_sql_query()
conn = sqlite3.connect('database.sqlite')
 
In [90]:
df_country = pd.read_sql_query('SELECT * from Country', conn)
df_league = pd.read_sql_query('SELECT * from League', conn)
df_match = pd.read_sql_query('SELECT * from Match', conn)
df_player = pd.read_sql_query('SELECT * from Player', conn)
df_player_att = pd.read_sql_query('SELECT * from Player_Attributes', conn)
df_team = pd.read_sql_query('SELECT * from Team', conn)
df_team_att = pd.read_sql_query('SELECT * from Team_Attributes', conn)
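
As a quick sanity check, the table names can be listed straight from SQLite's sqlite_master catalog, using the conn opened above (a sketch):

pd.read_sql_query("SELECT name FROM sqlite_master WHERE type='table'", conn)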

Step 2. EDA and Basic Statistical Analysis

Problem 4. Examining the structure of each DataFrame

In [91]:
# Examine each DataFrame's structure with the methods pandas provides (head(), info(), describe())
df_country.head()
Out[91]:
id name
0 1 Belgium
1 1729 England
2 4769 France
3 7809 Germany
4 10257 Italy
In [92]:
df_country.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      11 non-null     int64 
 1   name    11 non-null     object
dtypes: int64(1), object(1)
memory usage: 304.0+ bytes
In [93]:
df_league.head()
Out[93]:
id country_id name
0 1 1 Belgium Jupiler League
1 1729 1729 England Premier League
2 4769 4769 France Ligue 1
3 7809 7809 Germany 1. Bundesliga
4 10257 10257 Italy Serie A
In [94]:
df_league.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          11 non-null     int64 
 1   country_id  11 non-null     int64 
 2   name        11 non-null     object
dtypes: int64(2), object(1)
memory usage: 392.0+ bytes
In [95]:
df_league['id'].unique()
Out[95]:
array([    1,  1729,  4769,  7809, 10257, 13274, 15722, 17642, 19694,
       21518, 24558], dtype=int64)
In [96]:
df_league['country_id'].unique()
Out[96]:
array([    1,  1729,  4769,  7809, 10257, 13274, 15722, 17642, 19694,
       21518, 24558], dtype=int64)
In [97]:
df_match.head()
Out[97]:
id country_id league_id season stage date match_api_id home_team_api_id away_team_api_id home_team_goal ... SJA VCH VCD VCA GBH GBD GBA BSH BSD BSA
0 1 1 1 2008/2009 1 2008-08-17 00:00:00 492473 9987 9993 1 ... 4.00 1.65 3.40 4.50 1.78 3.25 4.00 1.73 3.40 4.20
1 2 1 1 2008/2009 1 2008-08-16 00:00:00 492474 10000 9994 0 ... 3.80 2.00 3.25 3.25 1.85 3.25 3.75 1.91 3.25 3.60
2 3 1 1 2008/2009 1 2008-08-16 00:00:00 492475 9984 8635 0 ... 2.50 2.35 3.25 2.65 2.50 3.20 2.50 2.30 3.20 2.75
3 4 1 1 2008/2009 1 2008-08-17 00:00:00 492476 9991 9998 5 ... 7.50 1.45 3.75 6.50 1.50 3.75 5.50 1.44 3.75 6.50
4 5 1 1 2008/2009 1 2008-08-16 00:00:00 492477 7947 9985 1 ... 1.73 4.50 3.40 1.65 4.50 3.50 1.65 4.75 3.30 1.67

5 rows × 115 columns

Filtering down to a single league (league_id 1729 is the England Premier League)

In [98]:
df_match.loc[df_match['league_id']==1729]
Out[98]:
id country_id league_id season stage date match_api_id home_team_api_id away_team_api_id home_team_goal ... SJA VCH VCD VCA GBH GBD GBA BSH BSD BSA
1728 1729 1729 1729 2008/2009 1 2008-08-17 00:00:00 489042 10260 10261 1 ... 10.00 1.28 5.5 12.00 1.30 4.75 10.0 1.29 4.50 11.00
1729 1730 1729 1729 2008/2009 1 2008-08-16 00:00:00 489043 9825 8659 1 ... 12.00 1.25 6.0 13.00 1.22 5.50 13.0 1.22 5.00 13.00
1730 1731 1729 1729 2008/2009 1 2008-08-16 00:00:00 489044 8472 8650 0 ... 1.73 5.50 3.8 1.65 5.00 3.40 1.7 4.50 3.40 1.73
1731 1732 1729 1729 2008/2009 1 2008-08-16 00:00:00 489045 8654 8528 2 ... 3.75 1.90 3.5 4.35 1.91 3.25 4.0 1.91 3.25 3.80
1732 1733 1729 1729 2008/2009 1 2008-08-17 00:00:00 489046 10252 8456 4 ... 3.75 1.90 3.5 4.35 1.91 3.25 4.0 1.91 3.30 3.75
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4763 4764 1729 1729 2015/2016 9 2015-10-17 00:00:00 1988795 8466 8197 2 ... NaN 1.75 3.9 5.00 NaN NaN NaN NaN NaN NaN
4764 4765 1729 1729 2015/2016 9 2015-10-19 00:00:00 1988796 10003 10194 0 ... NaN 2.05 3.4 4.10 NaN NaN NaN NaN NaN NaN
4765 4766 1729 1729 2015/2016 9 2015-10-17 00:00:00 1988797 8586 8650 0 ... NaN 2.45 3.5 3.00 NaN NaN NaN NaN NaN NaN
4766 4767 1729 1729 2015/2016 9 2015-10-17 00:00:00 1988798 9817 9825 0 ... NaN 6.25 4.2 1.60 NaN NaN NaN NaN NaN NaN
4767 4768 1729 1729 2015/2016 9 2015-10-17 00:00:00 1988799 8659 8472 1 ... NaN 2.05 3.3 4.20 NaN NaN NaN NaN NaN NaN

3040 rows × 115 columns

In [99]:
df_match['season'].unique()
Out[99]:
array(['2008/2009', '2009/2010', '2010/2011', '2011/2012', '2012/2013',
       '2013/2014', '2014/2015', '2015/2016'], dtype=object)
In [100]:
for c in df_match.columns:
    print(c)
id
country_id
league_id
season
stage
date
match_api_id
home_team_api_id
away_team_api_id
home_team_goal
away_team_goal
home_player_X1
home_player_X2
home_player_X3
home_player_X4
home_player_X5
home_player_X6
home_player_X7
home_player_X8
home_player_X9
home_player_X10
home_player_X11
away_player_X1
away_player_X2
away_player_X3
away_player_X4
away_player_X5
away_player_X6
away_player_X7
away_player_X8
away_player_X9
away_player_X10
away_player_X11
home_player_Y1
home_player_Y2
home_player_Y3
home_player_Y4
home_player_Y5
home_player_Y6
home_player_Y7
home_player_Y8
home_player_Y9
home_player_Y10
home_player_Y11
away_player_Y1
away_player_Y2
away_player_Y3
away_player_Y4
away_player_Y5
away_player_Y6
away_player_Y7
away_player_Y8
away_player_Y9
away_player_Y10
away_player_Y11
home_player_1
home_player_2
home_player_3
home_player_4
home_player_5
home_player_6
home_player_7
home_player_8
home_player_9
home_player_10
home_player_11
away_player_1
away_player_2
away_player_3
away_player_4
away_player_5
away_player_6
away_player_7
away_player_8
away_player_9
away_player_10
away_player_11
goal
shoton
shotoff
foulcommit
card
cross
corner
possession
B365H
B365D
B365A
BWH
BWD
BWA
IWH
IWD
IWA
LBH
LBD
LBA
PSH
PSD
PSA
WHH
WHD
WHA
SJH
SJD
SJA
VCH
VCD
VCA
GBH
GBD
GBA
BSH
BSD
BSA
In [101]:
for c,num in zip(df_match.columns,df_match.isna().sum()):
    print(c,num)
id 0
country_id 0
league_id 0
season 0
stage 0
date 0
match_api_id 0
home_team_api_id 0
away_team_api_id 0
home_team_goal 0
away_team_goal 0
home_player_X1 1821
home_player_X2 1821
home_player_X3 1832
home_player_X4 1832
home_player_X5 1832
home_player_X6 1832
home_player_X7 1832
home_player_X8 1832
home_player_X9 1832
home_player_X10 1832
home_player_X11 1832
away_player_X1 1832
away_player_X2 1832
away_player_X3 1832
away_player_X4 1832
away_player_X5 1832
away_player_X6 1832
away_player_X7 1832
away_player_X8 1832
away_player_X9 1833
away_player_X10 1833
away_player_X11 1839
home_player_Y1 1821
home_player_Y2 1821
home_player_Y3 1832
home_player_Y4 1832
home_player_Y5 1832
home_player_Y6 1832
home_player_Y7 1832
home_player_Y8 1832
home_player_Y9 1832
home_player_Y10 1832
home_player_Y11 1832
away_player_Y1 1832
away_player_Y2 1832
away_player_Y3 1832
away_player_Y4 1832
away_player_Y5 1832
away_player_Y6 1832
away_player_Y7 1832
away_player_Y8 1832
away_player_Y9 1833
away_player_Y10 1833
away_player_Y11 1839
home_player_1 1224
home_player_2 1315
home_player_3 1281
home_player_4 1323
home_player_5 1316
home_player_6 1325
home_player_7 1227
home_player_8 1309
home_player_9 1273
home_player_10 1436
home_player_11 1555
away_player_1 1234
away_player_2 1278
away_player_3 1293
away_player_4 1321
away_player_5 1335
away_player_6 1313
away_player_7 1235
away_player_8 1341
away_player_9 1328
away_player_10 1441
away_player_11 1554
goal 11762
shoton 11762
shotoff 11762
foulcommit 11762
card 11762
cross 11762
corner 11762
possession 11762
B365H 3387
B365D 3387
B365A 3387
BWH 3404
BWD 3404
BWA 3404
IWH 3459
IWD 3459
IWA 3459
LBH 3423
LBD 3423
LBA 3423
PSH 14811
PSD 14811
PSA 14811
WHH 3408
WHD 3408
WHA 3408
SJH 8882
SJD 8882
SJA 8882
VCH 3411
VCD 3411
VCA 3411
GBH 11817
GBD 11817
GBA 11817
BSH 11818
BSD 11818
BSA 11818
In [102]:
df_match.drop(df_match.columns[-38:],axis=1,inplace=True)   # drop the 38 trailing columns: the XML event columns (goal ... possession) and the bookmaker odds (B365H ... BSA), which are largely missing
In [103]:
df_match.head()
Out[103]:
id country_id league_id season stage date match_api_id home_team_api_id away_team_api_id home_team_goal ... away_player_2 away_player_3 away_player_4 away_player_5 away_player_6 away_player_7 away_player_8 away_player_9 away_player_10 away_player_11
0 1 1 1 2008/2009 1 2008-08-17 00:00:00 492473 9987 9993 1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2 1 1 2008/2009 1 2008-08-16 00:00:00 492474 10000 9994 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 3 1 1 2008/2009 1 2008-08-16 00:00:00 492475 9984 8635 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 4 1 1 2008/2009 1 2008-08-17 00:00:00 492476 9991 9998 5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 5 1 1 2008/2009 1 2008-08-16 00:00:00 492477 7947 9985 1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 77 columns

In [104]:
df_player_att.head()
Out[104]:
id player_fifa_api_id player_api_id date overall_rating potential preferred_foot attacking_work_rate defensive_work_rate crossing ... vision penalties marking standing_tackle sliding_tackle gk_diving gk_handling gk_kicking gk_positioning gk_reflexes
0 1 218353 505942 2016-02-18 00:00:00 67.0 71.0 right medium medium 49.0 ... 54.0 48.0 65.0 69.0 69.0 6.0 11.0 10.0 8.0 8.0
1 2 218353 505942 2015-11-19 00:00:00 67.0 71.0 right medium medium 49.0 ... 54.0 48.0 65.0 69.0 69.0 6.0 11.0 10.0 8.0 8.0
2 3 218353 505942 2015-09-21 00:00:00 62.0 66.0 right medium medium 49.0 ... 54.0 48.0 65.0 66.0 69.0 6.0 11.0 10.0 8.0 8.0
3 4 218353 505942 2015-03-20 00:00:00 61.0 65.0 right medium medium 48.0 ... 53.0 47.0 62.0 63.0 66.0 5.0 10.0 9.0 7.0 7.0
4 5 218353 505942 2007-02-22 00:00:00 61.0 65.0 right medium medium 48.0 ... 53.0 47.0 62.0 63.0 66.0 5.0 10.0 9.0 7.0 7.0

5 rows × 42 columns

Problem 5. Understanding the relationships between the DataFrames

In [105]:
# Check whether columns are shared across DataFrames and identify the useful ones
# Hint) compare values with unique(), or use map() to inspect how columns relate
df_player_att['player_api_id'].value_counts()
Out[105]:
210278    56
41269     56
42116     55
26472     54
179795    53
          ..
163254     2
266280     2
37366      2
238766     2
26568      2
Name: player_api_id, Length: 11060, dtype: int64

Player_Attributes contains many rows per player (one per rating date), so each player's records are averaged.

In [106]:
df_match
Out[106]:
id country_id league_id season stage date match_api_id home_team_api_id away_team_api_id home_team_goal ... away_player_2 away_player_3 away_player_4 away_player_5 away_player_6 away_player_7 away_player_8 away_player_9 away_player_10 away_player_11
0 1 1 1 2008/2009 1 2008-08-17 00:00:00 492473 9987 9993 1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2 1 1 2008/2009 1 2008-08-16 00:00:00 492474 10000 9994 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 3 1 1 2008/2009 1 2008-08-16 00:00:00 492475 9984 8635 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 4 1 1 2008/2009 1 2008-08-17 00:00:00 492476 9991 9998 5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 5 1 1 2008/2009 1 2008-08-16 00:00:00 492477 7947 9985 1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
25974 25975 24558 24558 2015/2016 9 2015-09-22 00:00:00 1992091 10190 10191 1 ... 563066.0 8800.0 67304.0 158253.0 133126.0 186524.0 93223.0 121115.0 232110.0 289732.0
25975 25976 24558 24558 2015/2016 9 2015-09-23 00:00:00 1992092 9824 10199 1 ... 114792.0 150007.0 178119.0 27232.0 570830.0 260708.0 201704.0 36382.0 34082.0 95257.0
25976 25977 24558 24558 2015/2016 9 2015-09-23 00:00:00 1992093 9956 10179 2 ... 67349.0 202663.0 32597.0 114794.0 188114.0 25840.0 482200.0 95230.0 451335.0 275122.0
25977 25978 24558 24558 2015/2016 9 2015-09-22 00:00:00 1992094 7896 10243 0 ... 121080.0 197757.0 260964.0 231614.0 113235.0 41116.0 462608.0 42262.0 92252.0 194532.0
25978 25979 24558 24558 2015/2016 9 2015-09-23 00:00:00 1992095 10192 9931 4 ... 95216.0 172768.0 22834.0 458806.0 207234.0 25772.0 40274.0 34035.0 41726.0 527103.0

25979 rows × 77 columns

In [107]:
df_match['home_player_1'].dropna().apply(int).map(df_player_att.groupby('player_api_id').mean()['overall_rating']).isna().sum()
Out[107]:
0
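
The one-liner above confirms that every home_player_1 id in Match has a record in Player_Attributes. The same logic split into steps, for readability (a sketch; player_rating and home_p1_rating are just illustrative names):

player_rating = df_player_att.groupby('player_api_id')['overall_rating'].mean()
home_p1_rating = df_match['home_player_1'].dropna().astype(int).map(player_rating)
print(home_p1_rating.isna().sum())   # 0 -> every listed starter has an attributes record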

Each team appears between one and six times in Team_Attributes, so groupby is used again to collapse the duplicates.

In [108]:
df_team_att['team_api_id'].value_counts()
Out[108]:
8191    6
8668    6
8661    6
8659    6
8658    6
       ..
8613    1
6631    1
6351    1
8614    1
9824    1
Name: team_api_id, Length: 288, dtype: int64

(For the player ids above, the NaNs were dropped and the floats converted to int before mapping.)

In [109]:
df_match['away_team_api_id'].map(df_team_att.groupby('team_api_id').mean()['buildUpPlaySpeed']).isna().sum()
Out[109]:
178
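
The 178 NaNs most likely come from away teams that have no row in Team_Attributes at all. A quick way to check (a sketch; missing_teams is an illustrative name):

missing_teams = set(df_match['away_team_api_id'].unique()) - set(df_team_att['team_api_id'].unique())
print(len(missing_teams))   # away teams in Match that never appear in Team_Attributes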

Problem 6. Analyzing correlations between player attributes

In [110]:
df_player_att['overall_rating']
Out[110]:
0         67.0
1         67.0
2         62.0
3         61.0
4         61.0
          ... 
183973    83.0
183974    78.0
183975    77.0
183976    78.0
183977    80.0
Name: overall_rating, Length: 183978, dtype: float64
In [111]:
fig=plt.figure(figsize=(5,15))
sns.heatmap(df_player_att.drop(['id', 'player_fifa_api_id', 'player_api_id'], axis=1).corr()[['overall_rating']], annot=True)
Out[111]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c410364940>
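
The heatmap shows how strongly each attribute correlates with overall_rating. The same information as a sorted list of numbers (a sketch; rating_corr is an illustrative name):

rating_corr = df_player_att.drop(['id', 'player_fifa_api_id', 'player_api_id'], axis=1).corr()['overall_rating']
print(rating_corr.drop('overall_rating').sort_values(ascending=False).head(10))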

Problem 7. Merging the team attributes into the match DataFrame

In [112]:
# Merge the DataFrames using the map() method

df_team_att.columns
Out[112]:
Index(['id', 'team_fifa_api_id', 'team_api_id', 'date', 'buildUpPlaySpeed',
       'buildUpPlaySpeedClass', 'buildUpPlayDribbling',
       'buildUpPlayDribblingClass', 'buildUpPlayPassing',
       'buildUpPlayPassingClass', 'buildUpPlayPositioningClass',
       'chanceCreationPassing', 'chanceCreationPassingClass',
       'chanceCreationCrossing', 'chanceCreationCrossingClass',
       'chanceCreationShooting', 'chanceCreationShootingClass',
       'chanceCreationPositioningClass', 'defencePressure',
       'defencePressureClass', 'defenceAggression', 'defenceAggressionClass',
       'defenceTeamWidth', 'defenceTeamWidthClass',
       'defenceDefenderLineClass'],
      dtype='object')
In [113]:
df_team_att.drop('buildUpPlayDribbling', axis=1, inplace=True)   # drop buildUpPlayDribbling, which is mostly missing
In [114]:
df_team_att.columns
Out[114]:
Index(['id', 'team_fifa_api_id', 'team_api_id', 'date', 'buildUpPlaySpeed',
       'buildUpPlaySpeedClass', 'buildUpPlayDribblingClass',
       'buildUpPlayPassing', 'buildUpPlayPassingClass',
       'buildUpPlayPositioningClass', 'chanceCreationPassing',
       'chanceCreationPassingClass', 'chanceCreationCrossing',
       'chanceCreationCrossingClass', 'chanceCreationShooting',
       'chanceCreationShootingClass', 'chanceCreationPositioningClass',
       'defencePressure', 'defencePressureClass', 'defenceAggression',
       'defenceAggressionClass', 'defenceTeamWidth', 'defenceTeamWidthClass',
       'defenceDefenderLineClass'],
      dtype='object')
In [118]:
# Return the most frequent value (mode) of a Series
def most(x):
  return x.value_counts().index[0]
In [119]:
df_team_att['buildUpPlayPassingClass'].value_counts().index[0]
Out[119]:
'Mixed'
In [122]:
team_map = df_team_att.groupby('team_api_id').aggregate(
    {
       'buildUpPlaySpeed': 'mean',
       'buildUpPlaySpeedClass': most,
       'buildUpPlayDribblingClass': most,
       'buildUpPlayPassing': 'mean',
       'buildUpPlayPassingClass': most,
       'buildUpPlayPositioningClass': most,
       'chanceCreationPassing': 'mean',
       'chanceCreationPassingClass': most,
       'chanceCreationCrossing': 'mean',
       'chanceCreationCrossingClass': most,
       'chanceCreationShooting': 'mean',
       'chanceCreationShootingClass': most,
       'chanceCreationPositioningClass': most,
       'defencePressure': 'mean',
       'defencePressureClass': most,
       'defenceAggression': 'mean',
       'defenceAggressionClass': most,
       'defenceTeamWidth': 'mean',
       'defenceTeamWidthClass': most,
       'defenceDefenderLineClass': most
    }
)
In [123]:
team_map
Out[123]:
buildUpPlaySpeed buildUpPlaySpeedClass buildUpPlayDribblingClass buildUpPlayPassing buildUpPlayPassingClass buildUpPlayPositioningClass chanceCreationPassing chanceCreationPassingClass chanceCreationCrossing chanceCreationCrossingClass chanceCreationShooting chanceCreationShootingClass chanceCreationPositioningClass defencePressure defencePressureClass defenceAggression defenceAggressionClass defenceTeamWidth defenceTeamWidthClass defenceDefenderLineClass
team_api_id
1601 47.333333 Balanced Little 46.166667 Mixed Organised 55.500000 Normal 60.333333 Normal 53.500000 Normal Organised 47.166667 Medium 47.333333 Press 49.333333 Normal Cover
1773 52.000000 Balanced Little 50.000000 Mixed Organised 50.000000 Normal 53.000000 Normal 54.000000 Normal Organised 43.000000 Medium 44.000000 Press 50.000000 Normal Cover
1957 47.166667 Balanced Little 52.666667 Mixed Organised 50.833333 Normal 36.333333 Normal 67.000000 Normal Organised 49.333333 Medium 56.333333 Press 53.666667 Normal Cover
2033 49.000000 Balanced Little 48.400000 Mixed Organised 51.800000 Normal 44.200000 Normal 48.200000 Normal Free Form 42.200000 Medium 33.800000 Contain 45.400000 Normal Cover
2182 60.333333 Fast Little 49.000000 Mixed Organised 56.666667 Normal 57.333333 Normal 68.000000 Lots Organised 51.666667 Medium 48.500000 Press 54.500000 Normal Cover
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10281 52.833333 Balanced Little 53.666667 Mixed Organised 50.333333 Normal 59.166667 Lots 50.333333 Normal Organised 45.500000 Medium 51.166667 Press 52.666667 Normal Cover
108893 43.166667 Balanced Little 53.166667 Mixed Organised 46.000000 Normal 38.333333 Normal 39.500000 Normal Organised 38.333333 Medium 46.000000 Press 46.333333 Normal Cover
158085 67.000000 Balanced Normal 39.000000 Mixed Organised 55.000000 Normal 59.000000 Normal 46.000000 Normal Organised 36.000000 Medium 38.000000 Press 37.000000 Normal Cover
208931 56.000000 Slow Normal 48.000000 Short Organised 58.500000 Risky 38.000000 Normal 52.000000 Normal Organised 35.500000 Deep 49.500000 Press 43.000000 Normal Cover
274581 50.000000 Balanced Normal 50.000000 Mixed Organised 50.000000 Normal 50.000000 Normal 50.000000 Normal Organised 45.000000 Medium 45.000000 Press 50.000000 Normal Cover

288 rows × 20 columns

Create a new DataFrame and add the team information to it

In [124]:
df = df_match[['home_team_goal', 'away_team_goal']].copy()
In [125]:
# For each side, map that side's team id onto every aggregated team attribute,
# creating home_*/away_* versions of each column
for team in ['home_', 'away_']:
  team_map.index.name = team + 'team_api_id'   # e.g. 'home_team_api_id'
  for col in team_map.columns:
    df[team + col] = df_match[team_map.index.name].map(team_map[col])
In [126]:
df
Out[126]:
home_team_goal away_team_goal home_buildUpPlaySpeed home_buildUpPlaySpeedClass home_buildUpPlayDribblingClass home_buildUpPlayPassing home_buildUpPlayPassingClass home_buildUpPlayPositioningClass home_chanceCreationPassing home_chanceCreationPassingClass ... away_chanceCreationShooting away_chanceCreationShootingClass away_chanceCreationPositioningClass away_defencePressure away_defencePressureClass away_defenceAggression away_defenceAggressionClass away_defenceTeamWidth away_defenceTeamWidthClass away_defenceDefenderLineClass
0 1 1 56.333333 Balanced Little 44.333333 Mixed Organised 55.666667 Normal ... 53.750000 Normal Organised 48.250000 Medium 49.000000 Press 58.750000 Wide Cover
1 0 0 55.500000 Balanced Little 52.666667 Mixed Organised 50.000000 Normal ... 54.333333 Normal Organised 48.833333 Medium 55.166667 Press 55.500000 Normal Cover
2 0 3 53.666667 Balanced Little 44.166667 Mixed Organised 59.000000 Normal ... 51.333333 Normal Organised 59.666667 Medium 47.666667 Press 63.833333 Normal Cover
3 5 0 54.166667 Balanced Little 46.333333 Mixed Organised 44.500000 Normal ... 46.000000 Normal Organised 46.000000 Medium 45.000000 Press 47.000000 Normal Cover
4 1 3 NaN NaN NaN NaN NaN NaN NaN NaN ... 61.333333 Normal Organised 51.500000 Medium 57.500000 Press 56.500000 Normal Cover
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
25974 1 0 47.600000 Balanced Little 51.800000 Mixed Organised 39.800000 Normal ... 59.000000 Normal Organised 45.000000 Medium 42.800000 Press 51.600000 Normal Cover
25975 1 2 53.000000 Balanced Little 56.000000 Mixed Organised 38.000000 Normal ... 61.000000 Normal Organised 48.000000 Medium 48.333333 Press 50.333333 Normal Cover
25976 2 0 49.500000 Balanced Little 53.166667 Mixed Organised 53.000000 Normal ... 54.333333 Normal Organised 42.666667 Medium 43.833333 Press 50.000000 Normal Cover
25977 0 0 NaN NaN NaN NaN NaN NaN NaN NaN ... 53.500000 Normal Organised 48.666667 Medium 46.000000 Press 55.666667 Normal Cover
25978 4 3 53.833333 Balanced Little 63.000000 Mixed Organised 46.000000 Normal ... 52.500000 Normal Organised 47.666667 Medium 57.500000 Press 54.500000 Normal Cover

25979 rows × 42 columns

In [127]:
df.dropna(inplace=True)
In [128]:
df.head()
Out[128]:
home_team_goal away_team_goal home_buildUpPlaySpeed home_buildUpPlaySpeedClass home_buildUpPlayDribblingClass home_buildUpPlayPassing home_buildUpPlayPassingClass home_buildUpPlayPositioningClass home_chanceCreationPassing home_chanceCreationPassingClass ... away_chanceCreationShooting away_chanceCreationShootingClass away_chanceCreationPositioningClass away_defencePressure away_defencePressureClass away_defenceAggression away_defenceAggressionClass away_defenceTeamWidth away_defenceTeamWidthClass away_defenceDefenderLineClass
0 1 1 56.333333 Balanced Little 44.333333 Mixed Organised 55.666667 Normal ... 53.750000 Normal Organised 48.250000 Medium 49.000000 Press 58.750000 Wide Cover
1 0 0 55.500000 Balanced Little 52.666667 Mixed Organised 50.000000 Normal ... 54.333333 Normal Organised 48.833333 Medium 55.166667 Press 55.500000 Normal Cover
2 0 3 53.666667 Balanced Little 44.166667 Mixed Organised 59.000000 Normal ... 51.333333 Normal Organised 59.666667 Medium 47.666667 Press 63.833333 Normal Cover
3 5 0 54.166667 Balanced Little 46.333333 Mixed Organised 44.500000 Normal ... 46.000000 Normal Organised 46.000000 Medium 45.000000 Press 47.000000 Normal Cover
5 1 1 56.666667 Balanced Little 47.333333 Mixed Organised 47.500000 Normal ... 47.666667 Normal Organised 53.666667 Medium 53.833333 Press 54.666667 Normal Cover

5 rows × 42 columns

Problem 8. Converting home and away goal counts into win/draw/loss categories

In [129]:
# Convert the goal counts into a categorical target (0: home win, 1: draw, 2: away win)
df['matchResult'] = df[['home_team_goal', 'away_team_goal']].aggregate(lambda x: 0 if x[0] > x[1] else 1 if x[0] == x[1] else 2, axis=1)
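
The row-wise lambda works but is slow on ~26,000 rows; an equivalent vectorized version with np.select (a sketch) produces the same labels:

conditions = [df['home_team_goal'] > df['away_team_goal'],
              df['home_team_goal'] == df['away_team_goal']]
df['matchResult'] = np.select(conditions, [0, 1], default=2)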
In [130]:
df.drop(['home_team_goal','away_team_goal'],axis=1,inplace=True)
In [131]:
df['matchResult']
Out[131]:
0        1
1        1
2        2
3        0
5        1
        ..
25972    1
25974    0
25975    2
25976    0
25978    0
Name: matchResult, Length: 25629, dtype: int64

Step 3. Preprocessing the Data for Model Training

Problem 9. Preprocessing the categorical columns with get_dummies

In [132]:
col_cats=list(filter(lambda s: s.find('Class')>=0,df.columns))
col_cats
Out[132]:
['home_buildUpPlaySpeedClass',
 'home_buildUpPlayDribblingClass',
 'home_buildUpPlayPassingClass',
 'home_buildUpPlayPositioningClass',
 'home_chanceCreationPassingClass',
 'home_chanceCreationCrossingClass',
 'home_chanceCreationShootingClass',
 'home_chanceCreationPositioningClass',
 'home_defencePressureClass',
 'home_defenceAggressionClass',
 'home_defenceTeamWidthClass',
 'home_defenceDefenderLineClass',
 'away_buildUpPlaySpeedClass',
 'away_buildUpPlayDribblingClass',
 'away_buildUpPlayPassingClass',
 'away_buildUpPlayPositioningClass',
 'away_chanceCreationPassingClass',
 'away_chanceCreationCrossingClass',
 'away_chanceCreationShootingClass',
 'away_chanceCreationPositioningClass',
 'away_defencePressureClass',
 'away_defenceAggressionClass',
 'away_defenceTeamWidthClass',
 'away_defenceDefenderLineClass']
In [133]:
df_cats = pd.get_dummies(df[col_cats], drop_first=True)
In [134]:
df_cats
Out[134]:
home_buildUpPlaySpeedClass_Fast home_buildUpPlaySpeedClass_Slow home_buildUpPlayDribblingClass_Lots home_buildUpPlayDribblingClass_Normal home_buildUpPlayPassingClass_Mixed home_buildUpPlayPassingClass_Short home_buildUpPlayPositioningClass_Organised home_chanceCreationPassingClass_Risky home_chanceCreationPassingClass_Safe home_chanceCreationCrossingClass_Lots ... away_chanceCreationShootingClass_Lots away_chanceCreationShootingClass_Normal away_chanceCreationPositioningClass_Organised away_defencePressureClass_High away_defencePressureClass_Medium away_defenceAggressionClass_Double away_defenceAggressionClass_Press away_defenceTeamWidthClass_Normal away_defenceTeamWidthClass_Wide away_defenceDefenderLineClass_Offside Trap
0 0 0 0 0 1 0 1 0 0 0 ... 0 1 1 0 1 0 1 0 1 0
1 0 0 0 0 1 0 1 0 0 0 ... 0 1 1 0 1 0 1 1 0 0
2 0 0 0 0 1 0 1 0 0 0 ... 0 1 1 0 1 0 1 1 0 0
3 0 0 0 0 1 0 1 0 0 0 ... 0 1 1 0 1 0 1 1 0 0
5 0 0 0 0 1 0 1 0 0 0 ... 0 1 1 0 1 0 1 1 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
25972 0 0 0 0 1 0 1 0 0 0 ... 0 1 1 0 1 0 1 1 0 0
25974 0 0 0 0 1 0 1 0 0 0 ... 0 1 1 0 1 0 1 1 0 0
25975 0 0 0 0 1 0 1 0 0 0 ... 0 1 1 0 1 0 1 1 0 0
25976 0 0 0 0 1 0 1 0 0 0 ... 0 1 1 0 1 0 1 1 0 0
25978 0 0 0 0 1 0 1 0 0 1 ... 0 1 1 0 1 0 1 1 0 0

25629 rows × 42 columns

Problem 10. Standardizing the numerical columns with StandardScaler

In [135]:
from sklearn.preprocessing import StandardScaler
In [136]:
# Standardize the numerical columns with StandardScaler
# Hint) drop redundant columns to avoid multicollinearity

X_num = df.drop(['matchResult'] + col_cats, axis=1)
scaler = StandardScaler()
scaler.fit(X_num)
X_scaled = scaler.transform(X_num)
X_scaled = pd.DataFrame(data=X_scaled, index=X_num.index, columns=X_num.columns)

X_cat = df_cats
X = pd.concat([X_scaled, X_cat], axis=1)
y = df['matchResult']
In [137]:
df['matchResult'].value_counts()
Out[137]:
0    11756
2     7363
1     6510
Name: matchResult, dtype: int64

Problem 11. Splitting into training and test sets

In [138]:
from sklearn.model_selection import train_test_split
In [139]:
# Split the data into training and test sets with train_test_split()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
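
Because the classes are imbalanced (home wins make up almost half the matches), passing stratify=y keeps the class proportions roughly equal in both splits; a possible variant (a sketch):

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)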

Step 4. Training Classification Models

Problem 12. Creating and training a Logistic Regression model

In [140]:
from sklearn.linear_model import LogisticRegression
In [141]:
# Create and train a LogisticRegression model
model_lr = LogisticRegression(max_iter=10000)
model_lr.fit(X_train,y_train)
Out[141]:
LogisticRegression(max_iter=10000)

Problem 13. Evaluating the model

In [142]:
from sklearn.metrics import classification_report
In [143]:
# Run predict and print the classification_report()
pred = model_lr.predict(X_test)
print(classification_report(y_test, pred))
              precision    recall  f1-score   support

           0       0.50      0.87      0.63      3484
           1       0.38      0.00      0.01      1972
           2       0.45      0.31      0.37      2233

    accuracy                           0.49      7689
   macro avg       0.44      0.40      0.34      7689
weighted avg       0.45      0.49      0.40      7689

Comparing against the class proportions in the test set (computed below) shows how much the model adds over a naive baseline: always predicting a home win would already be right about 45% of the time, so the 49% accuracy is only a modest gain, and the per-class F1 scores give the fairer picture.

In [144]:
print(sum((y_test == 0)) / len(y_test))
print(sum((y_test == 1)) / len(y_test))
print(sum((y_test == 2)) / len(y_test))
0.4531148393809338
0.2564702822213552
0.290414878397711

Problem 14. Creating and training an XGBoost model

The model still gets roughly half of the matches right; its overall accuracy is close to the logistic model's, but the draw and away-win classes are predicted noticeably better.

In [145]:
from xgboost import XGBClassifier
In [146]:
# Create and train an XGBClassifier model
model_xgb = XGBClassifier()
model_xgb.fit(X_train, y_train)
C:\Users\Administrator\anaconda3\lib\site-packages\xgboost\sklearn.py:888: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
  warnings.warn(label_encoder_deprecation_msg, UserWarning)
[11:13:19] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.3.0/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
Out[146]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=8, num_parallel_tree=1,
              objective='multi:softprob', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

Problem 15. Evaluating the model

In [148]:
# Run predict and print the classification_report()
pred = model_xgb.predict(X_test)
print(classification_report(y_test, pred))
              precision    recall  f1-score   support

           0       0.54      0.72      0.61      3484
           1       0.29      0.15      0.20      1972
           2       0.45      0.40      0.42      2233

    accuracy                           0.48      7689
   macro avg       0.42      0.42      0.41      7689
weighted avg       0.45      0.48      0.45      7689

Step 5. Digging Deeper into the Model Results

Problem 16. Reading feature relationships from the Logistic Regression coefficients

When the away team defends aggressively, the home team becomes more likely to win -> attack really is the best defence..

In [149]:
# Plot the Logistic Regression model's coef_ attribute
fig = plt.figure(figsize=(15, 5))
plt.plot(X.columns, model_lr.coef_[0])
plt.xticks(rotation=90)
plt.title('What makes Home Team Win?')
plt.grid()
plt.show()
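
To read off the strongest effects by name rather than from the plot, the coefficients can be sorted as a Series (a sketch; coef_[0] is the row of coefficients for class 0, i.e. a home win, and coef_home is an illustrative name):

coef_home = pd.Series(model_lr.coef_[0], index=X.columns)
print(coef_home.sort_values(ascending=False).head(5))   # features pushing toward a home win
print(coef_home.sort_values().head(5))                  # features pushing against one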

Problem 17. Checking feature importances with the XGBoost model

The categorical (Class) features matter more than the raw numeric ratings -> stylistic matchups seem to have a large influence.

The numeric features turn out to be less important than expected.

In [150]:
# Plot the XGBoost model's feature_importances_ attribute
fig = plt.figure(figsize=(15, 6))
plt.bar(X.columns, model_xgb.feature_importances_)
plt.xticks(rotation=90)
plt.show()
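
As with the coefficients, the importances are easier to read as a sorted Series (a sketch; importance_type is 'gain' per the fitted parameters above):

print(pd.Series(model_xgb.feature_importances_, index=X.columns).nlargest(10))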