Project & Kaggle

유럽 축구 경기 및 선수 특징 추출을 통한 승부 예측

robin0309 2021. 4. 2. 12:48
유럽 축구 승부 예측

데이터 소개

- 이번 주제는 European Soccer Database 데이터셋을 사용합니다.

- 다음 1개의 sqlite 데이터베이스를 사용합니다.
database.sqlite

- 데이터 베이스 내 총 7개의 Table을 사용합니다.
Country: 국가 정보
League: 리그 정보
Match: 경기 정보 (주 데이터셋)
Player: 플레이어 정보
Player_Attributes: 플레이어의 특성
Team: 팀 정보
Team_Attributes: 팀의 특성

Step 0. 데이터베이스와 SQL

SQL과 Query

Step 1. 데이터셋 준비하기

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

문제 3. sqlite3와 Pandas로 sqlite 데이터베이스 읽어들이기

In [3]:
import sqlite3
In [4]:
# sqlite3.connect()와 pd.read_sql_query()로 csv파일 읽어들이기
conn = sqlite3.connect('database.sqlite')
 
In [90]:
df_country = pd.read_sql_query('SELECT * from Country', conn)
df_league = pd.read_sql_query('SELECT * from League', conn)
df_match = pd.read_sql_query('SELECT * from Match', conn)
df_player = pd.read_sql_query('SELECT * from Player', conn)
df_player_att = pd.read_sql_query('SELECT * from Player_Attributes', conn)
df_team = pd.read_sql_query('SELECT * from Team', conn)
df_team_att = pd.read_sql_query('SELECT * from Team_Attributes', conn)

Step 2. EDA 및 데이터 기초 통계 분석

문제 4. 각 데이터프레임의 구조 파악하기

In [91]:
# DataFrame에서 제공하는 메소드를 이용하여 각 데이터프레임의 구조 분석하기 (head(), info(), describe())
df_country.head()
Out[91]:
id name
0 1 Belgium
1 1729 England
2 4769 France
3 7809 Germany
4 10257 Italy
In [92]:
df_country.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      11 non-null     int64 
 1   name    11 non-null     object
dtypes: int64(1), object(1)
memory usage: 304.0+ bytes
In [93]:
df_league.head()
Out[93]:
id country_id name
0 1 1 Belgium Jupiler League
1 1729 1729 England Premier League
2 4769 4769 France Ligue 1
3 7809 7809 Germany 1. Bundesliga
4 10257 10257 Italy Serie A
In [94]:
df_league.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          11 non-null     int64 
 1   country_id  11 non-null     int64 
 2   name        11 non-null     object
dtypes: int64(2), object(1)
memory usage: 392.0+ bytes
In [95]:
df_league['id'].unique()
Out[95]:
array([    1,  1729,  4769,  7809, 10257, 13274, 15722, 17642, 19694,
       21518, 24558], dtype=int64)
In [96]:
df_league['country_id'].unique()
Out[96]:
array([    1,  1729,  4769,  7809, 10257, 13274, 15722, 17642, 19694,
       21518, 24558], dtype=int64)
In [97]:
df_match.head()
Out[97]:
id country_id league_id season stage date match_api_id home_team_api_id away_team_api_id home_team_goal ... SJA VCH VCD VCA GBH GBD GBA BSH BSD BSA
0 1 1 1 2008/2009 1 2008-08-17 00:00:00 492473 9987 9993 1 ... 4.00 1.65 3.40 4.50 1.78 3.25 4.00 1.73 3.40 4.20
1 2 1 1 2008/2009 1 2008-08-16 00:00:00 492474 10000 9994 0 ... 3.80 2.00 3.25 3.25 1.85 3.25 3.75 1.91 3.25 3.60
2 3 1 1 2008/2009 1 2008-08-16 00:00:00 492475 9984 8635 0 ... 2.50 2.35 3.25 2.65 2.50 3.20 2.50 2.30 3.20 2.75
3 4 1 1 2008/2009 1 2008-08-17 00:00:00 492476 9991 9998 5 ... 7.50 1.45 3.75 6.50 1.50 3.75 5.50 1.44 3.75 6.50
4 5 1 1 2008/2009 1 2008-08-16 00:00:00 492477 7947 9985 1 ... 1.73 4.50 3.40 1.65 4.50 3.50 1.65 4.75 3.30 1.67

5 rows × 115 columns

원하는 리그만 뽑아서 보기

In [98]:
df_match.loc[df_match['league_id']==1729]
Out[98]:
id country_id league_id season stage date match_api_id home_team_api_id away_team_api_id home_team_goal ... SJA VCH VCD VCA GBH GBD GBA BSH BSD BSA
1728 1729 1729 1729 2008/2009 1 2008-08-17 00:00:00 489042 10260 10261 1 ... 10.00 1.28 5.5 12.00 1.30 4.75 10.0 1.29 4.50 11.00
1729 1730 1729 1729 2008/2009 1 2008-08-16 00:00:00 489043 9825 8659 1 ... 12.00 1.25 6.0 13.00 1.22 5.50 13.0 1.22 5.00 13.00
1730 1731 1729 1729 2008/2009 1 2008-08-16 00:00:00 489044 8472 8650 0 ... 1.73 5.50 3.8 1.65 5.00 3.40 1.7 4.50 3.40 1.73
1731 1732 1729 1729 2008/2009 1 2008-08-16 00:00:00 489045 8654 8528 2 ... 3.75 1.90 3.5 4.35 1.91 3.25 4.0 1.91 3.25 3.80
1732 1733 1729 1729 2008/2009 1 2008-08-17 00:00:00 489046 10252 8456 4 ... 3.75 1.90 3.5 4.35 1.91 3.25 4.0 1.91 3.30 3.75
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4763 4764 1729 1729 2015/2016 9 2015-10-17 00:00:00 1988795 8466 8197 2 ... NaN 1.75 3.9 5.00 NaN NaN NaN NaN NaN NaN
4764 4765 1729 1729 2015/2016 9 2015-10-19 00:00:00 1988796 10003 10194 0 ... NaN 2.05 3.4 4.10 NaN NaN NaN NaN NaN NaN
4765 4766 1729 1729 2015/2016 9 2015-10-17 00:00:00 1988797 8586 8650 0 ... NaN 2.45 3.5 3.00 NaN NaN NaN NaN NaN NaN
4766 4767 1729 1729 2015/2016 9 2015-10-17 00:00:00 1988798 9817 9825 0 ... NaN 6.25 4.2 1.60 NaN NaN NaN NaN NaN NaN
4767 4768 1729 1729 2015/2016 9 2015-10-17 00:00:00 1988799 8659 8472 1 ... NaN 2.05 3.3 4.20 NaN NaN NaN NaN NaN NaN

3040 rows × 115 columns

In [99]:
df_match['season'].unique()
Out[99]:
array(['2008/2009', '2009/2010', '2010/2011', '2011/2012', '2012/2013',
       '2013/2014', '2014/2015', '2015/2016'], dtype=object)
In [100]:
for c in df_match.columns:
    print(c)
id
country_id
league_id
season
stage
date
match_api_id
home_team_api_id
away_team_api_id
home_team_goal
away_team_goal
home_player_X1
home_player_X2
home_player_X3
home_player_X4
home_player_X5
home_player_X6
home_player_X7
home_player_X8
home_player_X9
home_player_X10
home_player_X11
away_player_X1
away_player_X2
away_player_X3
away_player_X4
away_player_X5
away_player_X6
away_player_X7
away_player_X8
away_player_X9
away_player_X10
away_player_X11
home_player_Y1
home_player_Y2
home_player_Y3
home_player_Y4
home_player_Y5
home_player_Y6
home_player_Y7
home_player_Y8
home_player_Y9
home_player_Y10
home_player_Y11
away_player_Y1
away_player_Y2
away_player_Y3
away_player_Y4
away_player_Y5
away_player_Y6
away_player_Y7
away_player_Y8
away_player_Y9
away_player_Y10
away_player_Y11
home_player_1
home_player_2
home_player_3
home_player_4
home_player_5
home_player_6
home_player_7
home_player_8
home_player_9
home_player_10
home_player_11
away_player_1
away_player_2
away_player_3
away_player_4
away_player_5
away_player_6
away_player_7
away_player_8
away_player_9
away_player_10
away_player_11
goal
shoton
shotoff
foulcommit
card
cross
corner
possession
B365H
B365D
B365A
BWH
BWD
BWA
IWH
IWD
IWA
LBH
LBD
LBA
PSH
PSD
PSA
WHH
WHD
WHA
SJH
SJD
SJA
VCH
VCD
VCA
GBH
GBD
GBA
BSH
BSD
BSA
In [101]:
for c,num in zip(df_match.columns,df_match.isna().sum()):
    print(c,num)
id 0
country_id 0
league_id 0
season 0
stage 0
date 0
match_api_id 0
home_team_api_id 0
away_team_api_id 0
home_team_goal 0
away_team_goal 0
home_player_X1 1821
home_player_X2 1821
home_player_X3 1832
home_player_X4 1832
home_player_X5 1832
home_player_X6 1832
home_player_X7 1832
home_player_X8 1832
home_player_X9 1832
home_player_X10 1832
home_player_X11 1832
away_player_X1 1832
away_player_X2 1832
away_player_X3 1832
away_player_X4 1832
away_player_X5 1832
away_player_X6 1832
away_player_X7 1832
away_player_X8 1832
away_player_X9 1833
away_player_X10 1833
away_player_X11 1839
home_player_Y1 1821
home_player_Y2 1821
home_player_Y3 1832
home_player_Y4 1832
home_player_Y5 1832
home_player_Y6 1832
home_player_Y7 1832
home_player_Y8 1832
home_player_Y9 1832
home_player_Y10 1832
home_player_Y11 1832
away_player_Y1 1832
away_player_Y2 1832
away_player_Y3 1832
away_player_Y4 1832
away_player_Y5 1832
away_player_Y6 1832
away_player_Y7 1832
away_player_Y8 1832
away_player_Y9 1833
away_player_Y10 1833
away_player_Y11 1839
home_player_1 1224
home_player_2 1315
home_player_3 1281
home_player_4 1323
home_player_5 1316
home_player_6 1325
home_player_7 1227
home_player_8 1309
home_player_9 1273
home_player_10 1436
home_player_11 1555
away_player_1 1234
away_player_2 1278
away_player_3 1293
away_player_4 1321
away_player_5 1335
away_player_6 1313
away_player_7 1235
away_player_8 1341
away_player_9 1328
away_player_10 1441
away_player_11 1554
goal 11762
shoton 11762
shotoff 11762
foulcommit 11762
card 11762
cross 11762
corner 11762
possession 11762
B365H 3387
B365D 3387
B365A 3387
BWH 3404
BWD 3404
BWA 3404
IWH 3459
IWD 3459
IWA 3459
LBH 3423
LBD 3423
LBA 3423
PSH 14811
PSD 14811
PSA 14811
WHH 3408
WHD 3408
WHA 3408
SJH 8882
SJD 8882
SJA 8882
VCH 3411
VCD 3411
VCA 3411
GBH 11817
GBD 11817
GBA 11817
BSH 11818
BSD 11818
BSA 11818
In [102]:
df_match.drop(df_match.columns[-38:],axis=1,inplace=True)
In [103]:
df_match.head()
Out[103]:
id country_id league_id season stage date match_api_id home_team_api_id away_team_api_id home_team_goal ... away_player_2 away_player_3 away_player_4 away_player_5 away_player_6 away_player_7 away_player_8 away_player_9 away_player_10 away_player_11
0 1 1 1 2008/2009 1 2008-08-17 00:00:00 492473 9987 9993 1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2 1 1 2008/2009 1 2008-08-16 00:00:00 492474 10000 9994 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 3 1 1 2008/2009 1 2008-08-16 00:00:00 492475 9984 8635 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 4 1 1 2008/2009 1 2008-08-17 00:00:00 492476 9991 9998 5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 5 1 1 2008/2009 1 2008-08-16 00:00:00 492477 7947 9985 1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 77 columns

In [104]:
df_player_att.head()
Out[104]:
id player_fifa_api_id player_api_id date overall_rating potential preferred_foot attacking_work_rate defensive_work_rate crossing ... vision penalties marking standing_tackle sliding_tackle gk_diving gk_handling gk_kicking gk_positioning gk_reflexes
0 1 218353 505942 2016-02-18 00:00:00 67.0 71.0 right medium medium 49.0 ... 54.0 48.0 65.0 69.0 69.0 6.0 11.0 10.0 8.0 8.0
1 2 218353 505942 2015-11-19 00:00:00 67.0 71.0 right medium medium 49.0 ... 54.0 48.0 65.0 69.0 69.0 6.0 11.0 10.0 8.0 8.0
2 3 218353 505942 2015-09-21 00:00:00 62.0 66.0 right medium medium 49.0 ... 54.0 48.0 65.0 66.0 69.0 6.0 11.0 10.0 8.0 8.0
3 4 218353 505942 2015-03-20 00:00:00 61.0 65.0 right medium medium 48.0 ... 53.0 47.0 62.0 63.0 66.0 5.0 10.0 9.0 7.0 7.0
4 5 218353 505942 2007-02-22 00:00:00 61.0 65.0 right medium medium 48.0 ... 53.0 47.0 62.0 63.0 66.0 5.0 10.0 9.0 7.0 7.0

5 rows × 42 columns

문제 5. 데이터프레임간의 관계 파악하기

In [105]:
# 데이터프레임 간 중복되는 Column이 있는지 확인하고 유용한 Column 식별하기
# Hint) unique()로 값을 비교하거나, map() 등을 활용하여 Column 관계를 확인
df_player_att['player_api_id'].value_counts()
Out[105]:
210278    56
41269     56
42116     55
26472     54
179795    53
          ..
163254     2
266280     2
37366      2
238766     2
26568      2
Name: player_api_id, Length: 11060, dtype: int64

att에 중복되는 컬럼(선수)이 많으므로 선수들의 평균치를 내기

In [106]:
df_match
Out[106]:
id country_id league_id season stage date match_api_id home_team_api_id away_team_api_id home_team_goal ... away_player_2 away_player_3 away_player_4 away_player_5 away_player_6 away_player_7 away_player_8 away_player_9 away_player_10 away_player_11
0 1 1 1 2008/2009 1 2008-08-17 00:00:00 492473 9987 9993 1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2 1 1 2008/2009 1 2008-08-16 00:00:00 492474 10000 9994 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 3 1 1 2008/2009 1 2008-08-16 00:00:00 492475 9984 8635 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 4 1 1 2008/2009 1 2008-08-17 00:00:00 492476 9991 9998 5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 5 1 1 2008/2009 1 2008-08-16 00:00:00 492477 7947 9985 1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
25974 25975 24558 24558 2015/2016 9 2015-09-22 00:00:00 1992091 10190 10191 1 ... 563066.0 8800.0 67304.0 158253.0 133126.0 186524.0 93223.0 121115.0 232110.0 289732.0
25975 25976 24558 24558 2015/2016 9 2015-09-23 00:00:00 1992092 9824 10199 1 ... 114792.0 150007.0 178119.0 27232.0 570830.0 260708.0 201704.0 36382.0 34082.0 95257.0
25976 25977 24558 24558 2015/2016 9 2015-09-23 00:00:00 1992093 9956 10179 2 ... 67349.0 202663.0 32597.0 114794.0 188114.0 25840.0 482200.0 95230.0 451335.0 275122.0
25977 25978 24558 24558 2015/2016 9 2015-09-22 00:00:00 1992094 7896 10243 0 ... 121080.0 197757.0 260964.0 231614.0 113235.0 41116.0 462608.0 42262.0 92252.0 194532.0
25978 25979 24558 24558 2015/2016 9 2015-09-23 00:00:00 1992095 10192 9931 4 ... 95216.0 172768.0 22834.0 458806.0 207234.0 25772.0 40274.0 34035.0 41726.0 527103.0

25979 rows × 77 columns

In [107]:
df_match['home_player_1'].dropna().apply(int).map(df_player_att.groupby('player_api_id').mean()['overall_rating']).isna().sum()
Out[107]:
0

1개에서 6개라 그룹바이로 중복 없애기

In [108]:
df_team_att['team_api_id'].value_counts()
Out[108]:
8191    6
8668    6
8661    6
8659    6
8658    6
       ..
8613    1
6631    1
6351    1
8614    1
9824    1
Name: team_api_id, Length: 288, dtype: int64

nan 을 다 날리고 float 을 int로바꿈

In [109]:
df_match['away_team_api_id'].map(df_team_att.groupby('team_api_id').mean()['buildUpPlaySpeed']).isna().sum()
Out[109]:
178
In [ ]:
 

문제 6. 선수 특ፑ#49457; 사이의 상관성 파악하기

In [110]:
df_player_att['overall_rating']
Out[110]:
0         67.0
1         67.0
2         62.0
3         61.0
4         61.0
          ... 
183973    83.0
183974    78.0
183975    77.0
183976    78.0
183977    80.0
Name: overall_rating, Length: 183978, dtype: float64
In [111]:
fig=plt.figure(figsize=(5,15))
sns.heatmap(df_player_att.drop(['id', 'player_fifa_api_id', 'player_api_id'], axis=1).corr()[['overall_rating']], annot=True)
Out[111]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c410364940>

문제 7. 매치 데이터프레임에 팀 특성 데이터프레임 통합하기

In [112]:
# DataFrame의 map() 메소드를 활용하여 데이터프레임 통합하기

df_team_att.columns
Out[112]:
Index(['id', 'team_fifa_api_id', 'team_api_id', 'date', 'buildUpPlaySpeed',
       'buildUpPlaySpeedClass', 'buildUpPlayDribbling',
       'buildUpPlayDribblingClass', 'buildUpPlayPassing',
       'buildUpPlayPassingClass', 'buildUpPlayPositioningClass',
       'chanceCreationPassing', 'chanceCreationPassingClass',
       'chanceCreationCrossing', 'chanceCreationCrossingClass',
       'chanceCreationShooting', 'chanceCreationShootingClass',
       'chanceCreationPositioningClass', 'defencePressure',
       'defencePressureClass', 'defenceAggression', 'defenceAggressionClass',
       'defenceTeamWidth', 'defenceTeamWidthClass',
       'defenceDefenderLineClass'],
      dtype='object')
In [113]:
df_team_att.drop('buildUpPlayDribbling', axis=1, inplace=True)
In [114]:
df_team_att.columns
Out[114]:
Index(['id', 'team_fifa_api_id', 'team_api_id', 'date', 'buildUpPlaySpeed',
       'buildUpPlaySpeedClass', 'buildUpPlayDribblingClass',
       'buildUpPlayPassing', 'buildUpPlayPassingClass',
       'buildUpPlayPositioningClass', 'chanceCreationPassing',
       'chanceCreationPassingClass', 'chanceCreationCrossing',
       'chanceCreationCrossingClass', 'chanceCreationShooting',
       'chanceCreationShootingClass', 'chanceCreationPositioningClass',
       'defencePressure', 'defencePressureClass', 'defenceAggression',
       'defenceAggressionClass', 'defenceTeamWidth', 'defenceTeamWidthClass',
       'defenceDefenderLineClass'],
      dtype='object')
In [118]:
def most(x):
  return x.value_counts().index[0]
In [119]:
df_team_att['buildUpPlayPassingClass'].value_counts().index[0]
Out[119]:
'Mixed'
In [121]:
team_map=df_team_att.groupby
In [122]:
team_map = df_team_att.groupby('team_api_id').aggregate(
    {
       'buildUpPlaySpeed': 'mean',
       'buildUpPlaySpeedClass': most,
       'buildUpPlayDribblingClass': most,
       'buildUpPlayPassing': 'mean',
       'buildUpPlayPassingClass': most,
       'buildUpPlayPositioningClass': most,
       'chanceCreationPassing': 'mean',
       'chanceCreationPassingClass': most,
       'chanceCreationCrossing': 'mean',
       'chanceCreationCrossingClass': most,
       'chanceCreationShooting': 'mean',
       'chanceCreationShootingClass': most,
       'chanceCreationPositioningClass': most,
       'defencePressure': 'mean',
       'defencePressureClass': most,
       'defenceAggression': 'mean',
       'defenceAggressionClass': most,
       'defenceTeamWidth': 'mean',
       'defenceTeamWidthClass': most,
       'defenceDefenderLineClass': most
    }
)
In [ ]:
 
In [123]:
team_map
Out[123]:
buildUpPlaySpeed buildUpPlaySpeedClass buildUpPlayDribblingClass buildUpPlayPassing buildUpPlayPassingClass buildUpPlayPositioningClass chanceCreationPassing chanceCreationPassingClass chanceCreationCrossing chanceCreationCrossingClass chanceCreationShooting chanceCreationShootingClass chanceCreationPositioningClass defencePressure defencePressureClass defenceAggression defenceAggressionClass defenceTeamWidth defenceTeamWidthClass defenceDefenderLineClass
team_api_id
1601 47.333333 Balanced Little 46.166667 Mixed Organised 55.500000 Normal 60.333333 Normal 53.500000 Normal Organised 47.166667 Medium 47.333333 Press 49.333333 Normal Cover
1773 52.000000 Balanced Little 50.000000 Mixed Organised 50.000000 Normal 53.000000 Normal 54.000000 Normal Organised 43.000000 Medium 44.000000 Press 50.000000 Normal Cover
1957 47.166667 Balanced Little 52.666667 Mixed Organised 50.833333 Normal 36.333333 Normal 67.000000 Normal Organised 49.333333 Medium 56.333333 Press 53.666667 Normal Cover
2033 49.000000 Balanced Little 48.400000 Mixed Organised 51.800000 Normal 44.200000 Normal 48.200000 Normal Free Form 42.200000 Medium 33.800000 Contain 45.400000 Normal Cover
2182 60.333333 Fast Little 49.000000 Mixed Organised 56.666667 Normal 57.333333 Normal 68.000000 Lots Organised 51.666667 Medium 48.500000 Press 54.500000 Normal Cover
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10281 52.833333 Balanced Little 53.666667 Mixed Organised 50.333333 Normal 59.166667 Lots 50.333333 Normal Organised 45.500000 Medium 51.166667 Press 52.666667 Normal Cover
108893 43.166667 Balanced Little 53.166667 Mixed Organised 46.000000 Normal 38.333333 Normal 39.500000 Normal Organised 38.333333 Medium 46.000000 Press 46.333333 Normal Cover
158085 67.000000 Balanced Normal 39.000000 Mixed Organised 55.000000 Normal 59.000000 Normal 46.000000 Normal Organised 36.000000 Medium 38.000000 Press 37.000000 Normal Cover
208931 56.000000 Slow Normal 48.000000 Short Organised 58.500000 Risky 38.000000 Normal 52.000000 Normal Organised 35.500000 Deep 49.500000 Press 43.000000 Normal Cover
274581 50.000000 Balanced Normal 50.000000 Mixed Organised 50.000000 Normal 50.000000 Normal 50.000000 Normal Organised 45.000000 Medium 45.000000 Press 50.000000 Normal Cover

288 rows × 20 columns

새로운 df 만들고 팀 정보 추가하기

In [124]:
df = df_match[['home_team_goal', 'away_team_goal']].copy()
In [125]:
for team in ['home_', 'away_']:
  team_map.index.name = team + 'team_api_id'
  for col in team_map.columns:
    df[team + col] = df_match[team_map.index.name].map(team_map[col])
In [126]:
df
Out[126]:
home_team_goal away_team_goal home_buildUpPlaySpeed home_buildUpPlaySpeedClass home_buildUpPlayDribblingClass home_buildUpPlayPassing home_buildUpPlayPassingClass home_buildUpPlayPositioningClass home_chanceCreationPassing home_chanceCreationPassingClass ... away_chanceCreationShooting away_chanceCreationShootingClass away_chanceCreationPositioningClass away_defencePressure away_defencePressureClass away_defenceAggression away_defenceAggressionClass away_defenceTeamWidth away_defenceTeamWidthClass away_defenceDefenderLineClass
0 1 1 56.333333 Balanced Little 44.333333 Mixed Organised 55.666667 Normal ... 53.750000 Normal Organised 48.250000 Medium 49.000000 Press 58.750000 Wide Cover
1 0 0 55.500000 Balanced Little 52.666667 Mixed Organised 50.000000 Normal ... 54.333333 Normal Organised 48.833333 Medium 55.166667 Press 55.500000 Normal Cover
2 0 3 53.666667 Balanced Little 44.166667 Mixed Organised 59.000000 Normal ... 51.333333 Normal Organised 59.666667 Medium 47.666667 Press 63.833333 Normal Cover
3 5 0 54.166667 Balanced Little 46.333333 Mixed Organised 44.500000 Normal ... 46.000000 Normal Organised 46.000000 Medium 45.000000 Press 47.000000 Normal Cover
4 1 3 NaN NaN NaN NaN NaN NaN NaN NaN ... 61.333333 Normal Organised 51.500000 Medium 57.500000 Press 56.500000 Normal Cover
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
25974 1 0 47.600000 Balanced Little 51.800000 Mixed Organised 39.800000 Normal ... 59.000000 Normal Organised 45.000000 Medium 42.800000 Press 51.600000 Normal Cover
25975 1 2 53.000000 Balanced Little 56.000000 Mixed Organised 38.000000 Normal ... 61.000000 Normal Organised 48.000000 Medium 48.333333 Press 50.333333 Normal Cover
25976 2 0 49.500000 Balanced Little 53.166667 Mixed Organised 53.000000 Normal ... 54.333333 Normal Organised 42.666667 Medium 43.833333 Press 50.000000 Normal Cover
25977 0 0 NaN NaN NaN NaN NaN NaN NaN NaN ... 53.500000 Normal Organised 48.666667 Medium 46.000000 Press 55.666667 Normal Cover
25978 4 3 53.833333 Balanced Little 63.000000 Mixed Organised 46.000000 Normal ... 52.500000 Normal Organised 47.666667 Medium 57.500000 Press 54.500000 Normal Cover

25979 rows × 42 columns

In [127]:
df.dropna(inplace=True)
In [128]:
df.head()
Out[128]:
home_team_goal away_team_goal home_buildUpPlaySpeed home_buildUpPlaySpeedClass home_buildUpPlayDribblingClass home_buildUpPlayPassing home_buildUpPlayPassingClass home_buildUpPlayPositioningClass home_chanceCreationPassing home_chanceCreationPassingClass ... away_chanceCreationShooting away_chanceCreationShootingClass away_chanceCreationPositioningClass away_defencePressure away_defencePressureClass away_defenceAggression away_defenceAggressionClass away_defenceTeamWidth away_defenceTeamWidthClass away_defenceDefenderLineClass
0 1 1 56.333333 Balanced Little 44.333333 Mixed Organised 55.666667 Normal ... 53.750000 Normal Organised 48.250000 Medium 49.000000 Press 58.750000 Wide Cover
1 0 0 55.500000 Balanced Little 52.666667 Mixed Organised 50.000000 Normal ... 54.333333 Normal Organised 48.833333 Medium 55.166667 Press 55.500000 Normal Cover
2 0 3 53.666667 Balanced Little 44.166667 Mixed Organised 59.000000 Normal ... 51.333333 Normal Organised 59.666667 Medium 47.666667 Press 63.833333 Normal Cover
3 5 0 54.166667 Balanced Little 46.333333 Mixed Organised 44.500000 Normal ... 46.000000 Normal Organised 46.000000 Medium 45.000000 Press 47.000000 Normal Cover
5 1 1 56.666667 Balanced Little 47.333333 Mixed Organised 47.500000 Normal ... 47.666667 Normal Organised 53.666667 Medium 53.833333 Press 54.666667 Normal Cover

5 rows × 42 columns

문제 8. 홈과 어웨이의 골 수를 승-무-패 범주로 변환하기

In [129]:
# 홈과 어웨이의 골 수를 범주형 데이터로 변환하기 (0: 홈팀 승, 1: 무승부, 2: 어웨이팀 승)
df['matchResult'] = df[['home_team_goal', 'away_team_goal']].aggregate(lambda x: 0 if x[0] > x[1] else 1 if x[0] == x[1] else 2, axis=1)
In [130]:
df.drop(['home_team_goal','away_team_goal'],axis=1,inplace=True)
In [131]:
df['matchResult']
Out[131]:
0        1
1        1
2        2
3        0
5        1
        ..
25972    1
25974    0
25975    2
25976    0
25978    0
Name: matchResult, Length: 25629, dtype: int64

Step 3. 모델 학습을 위한 데이터 전처리

문제 9. get_dummies를 이용하여 범주형 데이터 전처리하기

In [132]:
col_cats=list(filter(lambda s: s.find('Class')>=0,df.columns))
col_cats
Out[132]:
['home_buildUpPlaySpeedClass',
 'home_buildUpPlayDribblingClass',
 'home_buildUpPlayPassingClass',
 'home_buildUpPlayPositioningClass',
 'home_chanceCreationPassingClass',
 'home_chanceCreationCrossingClass',
 'home_chanceCreationShootingClass',
 'home_chanceCreationPositioningClass',
 'home_defencePressureClass',
 'home_defenceAggressionClass',
 'home_defenceTeamWidthClass',
 'home_defenceDefenderLineClass',
 'away_buildUpPlaySpeedClass',
 'away_buildUpPlayDribblingClass',
 'away_buildUpPlayPassingClass',
 'away_buildUpPlayPositioningClass',
 'away_chanceCreationPassingClass',
 'away_chanceCreationCrossingClass',
 'away_chanceCreationShootingClass',
 'away_chanceCreationPositioningClass',
 'away_defencePressureClass',
 'away_defenceAggressionClass',
 'away_defenceTeamWidthClass',
 'away_defenceDefenderLineClass']
In [133]:
df_cats = pd.get_dummies(df[col_cats], drop_first=True)
In [134]:
df_cats
Out[134]:
home_buildUpPlaySpeedClass_Fast home_buildUpPlaySpeedClass_Slow home_buildUpPlayDribblingClass_Lots home_buildUpPlayDribblingClass_Normal home_buildUpPlayPassingClass_Mixed home_buildUpPlayPassingClass_Short home_buildUpPlayPositioningClass_Organised home_chanceCreationPassingClass_Risky home_chanceCreationPassingClass_Safe home_chanceCreationCrossingClass_Lots ... away_chanceCreationShootingClass_Lots away_chanceCreationShootingClass_Normal away_chanceCreationPositioningClass_Organised away_defencePressureClass_High away_defencePressureClass_Medium away_defenceAggressionClass_Double away_defenceAggressionClass_Press away_defenceTeamWidthClass_Normal away_defenceTeamWidthClass_Wide away_defenceDefenderLineClass_Offside Trap
0 0 0 0 0 1 0 1 0 0 0 ... 0 1 1 0 1 0 1 0 1 0
1 0 0 0 0 1 0 1 0 0 0 ... 0 1 1 0 1 0 1 1 0 0
2 0 0 0 0 1 0 1 0 0 0 ... 0 1 1 0 1 0 1 1 0 0
3 0 0 0 0 1 0 1 0 0 0 ... 0 1 1 0 1 0 1 1 0 0
5 0 0 0 0 1 0 1 0 0 0 ... 0 1 1 0 1 0 1 1 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
25972 0 0 0 0 1 0 1 0 0 0 ... 0 1 1 0 1 0 1 1 0 0
25974 0 0 0 0 1 0 1 0 0 0 ... 0 1 1 0 1 0 1 1 0 0
25975 0 0 0 0 1 0 1 0 0 0 ... 0 1 1 0 1 0 1 1 0 0
25976 0 0 0 0 1 0 1 0 0 0 ... 0 1 1 0 1 0 1 1 0 0
25978 0 0 0 0 1 0 1 0 0 1 ... 0 1 1 0 1 0 1 1 0 0

25629 rows × 42 columns

문제 10. StandardScaler를 이용해 수치형 데이터 표준화하기

In [135]:
from sklearn.preprocessing import StandardScaler
In [136]:
# StandardScaler를 이용해 수치형 데이터를 표준화하기
# Hint) Multicollinearity를 피하기 위해 불필요한 컬럼은 drop한다.

X_num = df.drop(['matchResult'] + col_cats, axis=1)
scaler = StandardScaler()
scaler.fit(X_num)
X_scaled = scaler.transform(X_num)
X_scaled = pd.DataFrame(data=X_scaled, index=X_num.index, columns=X_num.columns)

X_cat = df_cats
X = pd.concat([X_scaled, X_cat], axis=1)
y = df['matchResult']
In [137]:
df['matchResult'].value_counts()
Out[137]:
0    11756
2     7363
1     6510
Name: matchResult, dtype: int64

문제 11. 학습데이터와 테스트데이터 분리하기

In [138]:
from sklearn.model_selection import train_test_split
In [139]:
# train_test_split() 함수로 학습 데이터와 테스트 데이터 분리하기
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

Step 4. Classification 모델 학습하기

문제 12. Logistic Regression 모델 생성/학습하기

In [140]:
from sklearn.linear_model import LogisticRegression
In [141]:
# LogisticRegression 모델 생성/학습
model_lr = LogisticRegression(max_iter=10000)
model_lr.fit(X_train,y_train)
Out[141]:
LogisticRegression(max_iter=10000)

문제 13. 모델 학습 결과 평가하기

In [142]:
from sklearn.metrics import classification_report
In [143]:
# Predict를 수행하고 classification_report() 결과 출력하기
pred = model_lr.predict(X_test)
print(classification_report(y_test, pred))
              precision    recall  f1-score   support

           0       0.50      0.87      0.63      3484
           1       0.38      0.00      0.01      1972
           2       0.45      0.31      0.37      2233

    accuracy                           0.49      7689
   macro avg       0.44      0.40      0.34      7689
weighted avg       0.45      0.49      0.40      7689

전체중의 값을 보면 어떤 지표가 가장 비슷한 결과를 보였는지 알 수 있음 -> f1score

In [144]:
print(sum((y_test == 0)) / len(y_test))
print(sum((y_test == 1)) / len(y_test))
print(sum((y_test == 2)) / len(y_test))
0.4531148393809338
0.2564702822213552
0.290414878397711

문제 14. XGBoost 모델 생성/학습하기

나름대로 절반 넘게 맞춤 -> 51 % 로 logistic 보다 향상

In [145]:
from xgboost import XGBClassifier
In [146]:
# XGBClassifier 모델 생성/학습
model_xgb = XGBClassifier()
model_xgb.fit(X_train, y_train)
C:\Users\Administrator\anaconda3\lib\site-packages\xgboost\sklearn.py:888: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
  warnings.warn(label_encoder_deprecation_msg, UserWarning)
[11:13:19] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.3.0/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
Out[146]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=8, num_parallel_tree=1,
              objective='multi:softprob', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

문제 15. 모델 학습 결과 평가하기

In [148]:
# Predict를 수행하고 classification_report() 결과 출력하기
pred = model_xgb.predict(X_test)
print(classification_report(y_test, pred))
              precision    recall  f1-score   support

           0       0.54      0.72      0.61      3484
           1       0.29      0.15      0.20      1972
           2       0.45      0.40      0.42      2233

    accuracy                           0.48      7689
   macro avg       0.42      0.42      0.41      7689
weighted avg       0.45      0.48      0.45      7689

Step5 모델 학습 결과 심화 분석하기

문제 16. Logistic Regression 모델 계수로 상관성 파악하기

away가 강하게 수비하면 홈팀이 이길가능성 up -> 공격이 최고의 수비..

In [149]:
# Logistic Regression 모델의 coef_ 속성을 plot하기
fig = plt.figure(figsize=(15, 5))
plt.plot(X.columns, model_lr.coef_[0])
plt.xticks(rotation=90)
plt.title('What makes Home Team Win?')
plt.grid()
plt.show()

문제 17. XGBoost 모델로 특징의 중요도 확인하기

수치보다 범주가 영향을 많이 주는 것보면 -> 상성이 영향을 많이 끼침

생각보다 수치가 별로 중요한 영향을 안줌

In [150]:
# XGBoost 모델의 feature_importances_ 속성을 plot하기
fig = plt.figure(figsize=(15, 6))
plt.bar(X.columns, model_xgb.feature_importances_)
plt.xticks(rotation=90)
plt.show()
반응형
반응형