Building Classification Models and Comparing Their Performance Using Steel Plate Manufacturing Process Data

robin0309 2021. 4. 7. 17:07
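Data Overview

The Steel Plates Faults dataset contains 1,941 samples and consists of the dependent variables listed below plus the remaining explanatory variables.

  • Dependent variables (7): indicate which type of fault occurred. They are:

    • Pastry, Z_Scratch, K_Scatch, Stains, Dirtiness, Bumps, Other_Faults
  • Explanatory variables (27): a range of measurements such as the length, luminosity, thickness, and steel type of the plate.

    • Column 1 (X_Minimum) through column 27 (SigmoidOfAreas)
  • Data source: https://www.kaggle.com/mahsateimourikia/faults-nna/notebooks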

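General Characteristics of Manufacturing Process Data

  • Manufacturing process data is typically used to predict defect rates so that the causes of defects can be removed, or to forecast inventory so that production can match demand.
  • Because data collection is often automated, such data tends to be of higher quality than other kinds of data and to contain fewer missing values.

Prepare the data as follows.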

In [91]:
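# Check the number of CPUs.
import os
n_cpu=os.cpu_count()
print("The number of cpus: ",n_cpu)
# Rough estimate that assumes two threads per CPU; os.cpu_count() typically already reports logical CPUs.
n_thread=n_cpu*2
print("Expected number of threads:",n_thread)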
The number of cpus:  8
Expected number of threads: 16
In [5]:
import os
import glob
In [6]:
os.listdir()
Out[6]:
['.ipynb_checkpoints',
 'data',
 '[문제] Chapter 01_철판 제조 공정 데이터를 활용한 분류모형 생성 및 성능 비교-Copy1.ipynb',
 '[문제] Chapter 01_철판 제조 공정 데이터를 활용한 분류모형 생성 및 성능 비교.ipynb',
 '[해설] Chapter 01_철판 제조 공정 데이터를 활용한 분류모형 생성 및 성능 비교.ipynb']
In [7]:
import pandas as pd 
import numpy as np
In [8]:
# Read the data.
df = pd.read_csv("data/Faults.NNA",  delimiter='\t', header=None)
df.head()
Out[8]:
0 1 2 3 4 5 6 7 8 9 ... 24 25 26 27 28 29 30 31 32 33
0 42 50 270900 270944 267 17 44 24220 76 108 ... 0.8182 -0.2913 0.5822 1 0 0 0 0 0 0
1 645 651 2538079 2538108 108 10 30 11397 84 123 ... 0.7931 -0.1756 0.2984 1 0 0 0 0 0 0
2 829 835 1553913 1553931 71 8 19 7972 99 125 ... 0.6667 -0.1228 0.2150 1 0 0 0 0 0 0
3 853 860 369370 369415 176 13 45 18996 99 126 ... 0.8444 -0.1568 0.5212 1 0 0 0 0 0 0
4 1289 1306 498078 498335 2409 60 260 246930 37 126 ... 0.9338 -0.1992 1.0000 1 0 0 0 0 0 0

5 rows × 34 columns

In [9]:
# Read the column labels and assign them as the DataFrame's column names.
attributes_name=pd.read_csv("data/Faults27x7_var",  delimiter=' ', header=None)
df.columns=attributes_name[0]
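Assigning names from a separate file silently depends on the two files lining up. The sketch below shows one way to guard that step. It uses small in-memory stand-ins (data_text, names_text, and df_toy are hypothetical, not the real files) and assigns a plain list, which also avoids carrying the Series name 0 over to the column index.

```python
import io
import pandas as pd

# Hypothetical stand-ins for the tab-separated data file and the name file.
data_text = "1\t2\t0\n3\t4\t1\n"
names_text = "feat_a\nfeat_b\nlabel\n"

df_toy = pd.read_csv(io.StringIO(data_text), delimiter="\t", header=None)
names = pd.read_csv(io.StringIO(names_text), delimiter=" ", header=None)[0]

# Guard: there must be exactly one name per data column.
if len(names) != df_toy.shape[1]:
    raise ValueError("number of column names does not match number of columns")

# A plain list keeps the column index unnamed (no stray "0" label in outputs).
df_toy.columns = names.tolist()
print(df_toy.columns.tolist())
```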
In [10]:
df.columns
Out[10]:
Index(['X_Minimum', 'X_Maximum', 'Y_Minimum', 'Y_Maximum', 'Pixels_Areas',
       'X_Perimeter', 'Y_Perimeter', 'Sum_of_Luminosity',
       'Minimum_of_Luminosity', 'Maximum_of_Luminosity', 'Length_of_Conveyer',
       'TypeOfSteel_A300', 'TypeOfSteel_A400', 'Steel_Plate_Thickness',
       'Edges_Index', 'Empty_Index', 'Square_Index', 'Outside_X_Index',
       'Edges_X_Index', 'Edges_Y_Index', 'Outside_Global_Index', 'LogOfAreas',
       'Log_X_Index', 'Log_Y_Index', 'Orientation_Index', 'Luminosity_Index',
       'SigmoidOfAreas', 'Pastry', 'Z_Scratch', 'K_Scatch', 'Stains',
       'Dirtiness', 'Bumps', 'Other_Faults'],
      dtype='object', name=0)
In [11]:
df.head()
Out[11]:
X_Minimum X_Maximum Y_Minimum Y_Maximum Pixels_Areas X_Perimeter Y_Perimeter Sum_of_Luminosity Minimum_of_Luminosity Maximum_of_Luminosity ... Orientation_Index Luminosity_Index SigmoidOfAreas Pastry Z_Scratch K_Scatch Stains Dirtiness Bumps Other_Faults
0 42 50 270900 270944 267 17 44 24220 76 108 ... 0.8182 -0.2913 0.5822 1 0 0 0 0 0 0
1 645 651 2538079 2538108 108 10 30 11397 84 123 ... 0.7931 -0.1756 0.2984 1 0 0 0 0 0 0
2 829 835 1553913 1553931 71 8 19 7972 99 125 ... 0.6667 -0.1228 0.2150 1 0 0 0 0 0 0
3 853 860 369370 369415 176 13 45 18996 99 126 ... 0.8444 -0.1568 0.5212 1 0 0 0 0 0 0
4 1289 1306 498078 498335 2409 60 260 246930 37 126 ... 0.9338 -0.1992 1.0000 1 0 0 0 0 0 0

5 rows × 34 columns

In [13]:
df.describe()
Out[13]:
X_Minimum X_Maximum Y_Minimum Y_Maximum Pixels_Areas X_Perimeter Y_Perimeter Sum_of_Luminosity Minimum_of_Luminosity Maximum_of_Luminosity ... Orientation_Index Luminosity_Index SigmoidOfAreas Pastry Z_Scratch K_Scatch Stains Dirtiness Bumps Other_Faults
count 1941.000000 1941.000000 1.941000e+03 1.941000e+03 1941.000000 1941.000000 1941.000000 1.941000e+03 1941.000000 1941.000000 ... 1941.000000 1941.000000 1941.000000 1941.000000 1941.000000 1941.000000 1941.000000 1941.000000 1941.000000 1941.000000
mean 571.136012 617.964451 1.650685e+06 1.650739e+06 1893.878413 111.855229 82.965997 2.063121e+05 84.548686 130.193715 ... 0.083288 -0.131305 0.585420 0.081401 0.097888 0.201443 0.037094 0.028336 0.207110 0.346728
std 520.690671 497.627410 1.774578e+06 1.774590e+06 5168.459560 301.209187 426.482879 5.122936e+05 32.134276 18.690992 ... 0.500868 0.148767 0.339452 0.273521 0.297239 0.401181 0.189042 0.165973 0.405339 0.476051
min 0.000000 4.000000 6.712000e+03 6.724000e+03 2.000000 2.000000 1.000000 2.500000e+02 0.000000 37.000000 ... -0.991000 -0.998900 0.119000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 51.000000 192.000000 4.712530e+05 4.712810e+05 84.000000 15.000000 13.000000 9.522000e+03 63.000000 124.000000 ... -0.333300 -0.195000 0.248200 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 435.000000 467.000000 1.204128e+06 1.204136e+06 174.000000 26.000000 25.000000 1.920200e+04 90.000000 127.000000 ... 0.095200 -0.133000 0.506300 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 1053.000000 1072.000000 2.183073e+06 2.183084e+06 822.000000 84.000000 83.000000 8.301100e+04 106.000000 140.000000 ... 0.511600 -0.066600 0.999800 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000
max 1705.000000 1713.000000 1.298766e+07 1.298769e+07 152655.000000 10449.000000 18152.000000 1.159141e+07 203.000000 253.000000 ... 0.991700 0.642100 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

8 rows × 34 columns

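Since there are seven dependent variables, each condition below sets exactly one of them to 1 and the rest to 0, much like one-hot encoding. It is a bit verbose.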

In [22]:
## Method 1. Build the conditions with the logical operator &.
conditions=[(df['Pastry'] == 1) & (df['Z_Scratch'] == 0)& (df['K_Scatch'] == 0)& (df['Stains'] == 0)& (df['Dirtiness'] == 0)& (df['Bumps'] == 0)& (df['Other_Faults'] == 0), 
            (df['Pastry'] == 0) & (df['Z_Scratch'] == 1)& (df['K_Scatch'] == 0)& (df['Stains'] == 0)& (df['Dirtiness'] == 0)& (df['Bumps'] == 0)& (df['Other_Faults'] == 0),
            (df['Pastry'] == 0) & (df['Z_Scratch'] == 0)& (df['K_Scatch'] == 1)& (df['Stains'] == 0)& (df['Dirtiness'] == 0)& (df['Bumps'] == 0)& (df['Other_Faults'] == 0),
            (df['Pastry'] == 0) & (df['Z_Scratch'] == 0)& (df['K_Scatch'] == 0)& (df['Stains'] == 1)& (df['Dirtiness'] == 0)& (df['Bumps'] == 0)& (df['Other_Faults'] == 0),
            (df['Pastry'] == 0) & (df['Z_Scratch'] == 0)& (df['K_Scatch'] == 0)& (df['Stains'] == 0)& (df['Dirtiness'] == 1)& (df['Bumps'] == 0)& (df['Other_Faults'] == 0),
            (df['Pastry'] == 0) & (df['Z_Scratch'] == 0)& (df['K_Scatch'] == 0)& (df['Stains'] == 0)& (df['Dirtiness'] == 0)& (df['Bumps'] == 1)& (df['Other_Faults'] == 0),
            (df['Pastry'] == 0) & (df['Z_Scratch'] == 0)& (df['K_Scatch'] == 0)& (df['Stains'] == 0)& (df['Dirtiness'] == 0)& (df['Bumps'] == 0)& (df['Other_Faults'] == 1)]

Method 2 below uses the astype function, which automatically converts 1 to True and 0 to False.
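As a quick illustration of that conversion, here is a minimal sketch on a made-up Series (flags is a hypothetical example, not a column from the dataset). It also shows that any non-zero value becomes True, so the shortcut relies on the fault columns holding only 0 and 1.

```python
import pandas as pd

# Hypothetical 0/1 indicator column.
flags = pd.Series([1, 0, 1, 0], name="Pastry")
as_bool = flags.astype(bool)
print(as_bool.tolist())

# Any non-zero value maps to True, not just 1.
non_binary = pd.Series([2, 0, -1]).astype(bool)
print(non_binary.tolist())
```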

In [19]:
conditions
Out[19]:
[0        True
 1        True
 2        True
 3        True
 4        True
         ...  
 1936    False
 1937    False
 1938    False
 1939    False
 1940    False
 Length: 1941, dtype: bool,
 0       False
 1       False
 2       False
 3       False
 4       False
         ...  
 1936    False
 1937    False
 1938    False
 1939    False
 1940    False
 Length: 1941, dtype: bool,
 0       False
 1       False
 2       False
 3       False
 4       False
         ...  
 1936    False
 1937    False
 1938    False
 1939    False
 1940    False
 Length: 1941, dtype: bool,
 0       False
 1       False
 2       False
 3       False
 4       False
         ...  
 1936    False
 1937    False
 1938    False
 1939    False
 1940    False
 Length: 1941, dtype: bool,
 0       False
 1       False
 2       False
 3       False
 4       False
         ...  
 1936    False
 1937    False
 1938    False
 1939    False
 1940    False
 Length: 1941, dtype: bool,
 0       False
 1       False
 2       False
 3       False
 4       False
         ...  
 1936    False
 1937    False
 1938    False
 1939    False
 1940    False
 Length: 1941, dtype: bool,
 0       False
 1       False
 2       False
 3       False
 4       False
         ...  
 1936     True
 1937     True
 1938     True
 1939     True
 1940     True
 Length: 1941, dtype: bool]
In [29]:
## Method 2. Use pandas.Series.astype.
conditions=[df['Pastry'].astype(bool),
            df['Z_Scratch'].astype(bool),
            df['K_Scatch'].astype(bool),
            df['Stains'].astype(bool),
            df['Dirtiness'].astype(bool),
            df['Bumps'].astype(bool),
            df['Other_Faults'].astype(bool)]
In [30]:
conditions
Out[30]:
[0        True
 1        True
 2        True
 3        True
 4        True
         ...  
 1936    False
 1937    False
 1938    False
 1939    False
 1940    False
 Name: Pastry, Length: 1941, dtype: bool,
 0       False
 1       False
 2       False
 3       False
 4       False
         ...  
 1936    False
 1937    False
 1938    False
 1939    False
 1940    False
 Name: Z_Scratch, Length: 1941, dtype: bool,
 0       False
 1       False
 2       False
 3       False
 4       False
         ...  
 1936    False
 1937    False
 1938    False
 1939    False
 1940    False
 Name: K_Scatch, Length: 1941, dtype: bool,
 0       False
 1       False
 2       False
 3       False
 4       False
         ...  
 1936    False
 1937    False
 1938    False
 1939    False
 1940    False
 Name: Stains, Length: 1941, dtype: bool,
 0       False
 1       False
 2       False
 3       False
 4       False
         ...  
 1936    False
 1937    False
 1938    False
 1939    False
 1940    False
 Name: Dirtiness, Length: 1941, dtype: bool,
 0       False
 1       False
 2       False
 3       False
 4       False
         ...  
 1936    False
 1937    False
 1938    False
 1939    False
 1940    False
 Name: Bumps, Length: 1941, dtype: bool,
 0       False
 1       False
 2       False
 3       False
 4       False
         ...  
 1936     True
 1937     True
 1938     True
 1939     True
 1940     True
 Name: Other_Faults, Length: 1941, dtype: bool]

To apply this astype conversion to every column in one pass, the following cell uses map with a lambda.

In [31]:
## (Exercise) Method 3. Use pandas.Series.astype with map and lambda.
# Build conditions_bf as a list holding the Series of each fault column.
# Then use map and lambda to apply astype to each element of conditions_bf.

conditions_bf=[
            df['Pastry'],
            df['Z_Scratch'],
            df['K_Scatch'],
            df['Stains'],
            df['Dirtiness'],
            df['Bumps'],
            df['Other_Faults']
]
conditions= list(map(lambda i: i.astype(bool), conditions_bf))
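The same result can be reached without map and lambda. The sketch below compares three spellings on a tiny made-up frame (toy and fault_cols are hypothetical and use only three of the seven fault columns). It is meant as an alternative to consider, not as the exercise's intended answer.

```python
import pandas as pd

# Hypothetical miniature of the fault indicator columns.
toy = pd.DataFrame({
    "Pastry":       [1, 0, 0],
    "Bumps":        [0, 1, 0],
    "Other_Faults": [0, 0, 1],
})
fault_cols = ["Pastry", "Bumps", "Other_Faults"]

# map + lambda, as in the exercise.
via_map = list(map(lambda s: s.astype(bool), [toy[c] for c in fault_cols]))

# A list comprehension does the same thing and is often easier to read.
via_comprehension = [toy[c].astype(bool) for c in fault_cols]

# Casting the whole sub-frame at once, then splitting it back into columns.
bool_frame = toy[fault_cols].astype(bool)
via_frame = [bool_frame[c] for c in fault_cols]

same = all(
    a.equals(b) and b.equals(c)
    for a, b, c in zip(via_map, via_comprehension, via_frame)
)
print(same)
```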
In [32]:
conditions_bf
Out[32]:
[0       1
 1       1
 2       1
 3       1
 4       1
        ..
 1936    0
 1937    0
 1938    0
 1939    0
 1940    0
 Name: Pastry, Length: 1941, dtype: int64,
 0       0
 1       0
 2       0
 3       0
 4       0
        ..
 1936    0
 1937    0
 1938    0
 1939    0
 1940    0
 Name: Z_Scratch, Length: 1941, dtype: int64,
 0       0
 1       0
 2       0
 3       0
 4       0
        ..
 1936    0
 1937    0
 1938    0
 1939    0
 1940    0
 Name: K_Scatch, Length: 1941, dtype: int64,
 0       0
 1       0
 2       0
 3       0
 4       0
        ..
 1936    0
 1937    0
 1938    0
 1939    0
 1940    0
 Name: Stains, Length: 1941, dtype: int64,
 0       0
 1       0
 2       0
 3       0
 4       0
        ..
 1936    0
 1937    0
 1938    0
 1939    0
 1940    0
 Name: Dirtiness, Length: 1941, dtype: int64,
 0       0
 1       0
 2       0
 3       0
 4       0
        ..
 1936    0
 1937    0
 1938    0
 1939    0
 1940    0
 Name: Bumps, Length: 1941, dtype: int64,
 0       0
 1       0
 2       0
 3       0
 4       0
        ..
 1936    1
 1937    1
 1938    1
 1939    1
 1940    1
 Name: Other_Faults, Length: 1941, dtype: int64]
In [33]:
print(type(conditions))
print(type(conditions[0]))
print(len(conditions))
print(len(conditions[0]))
<class 'list'>
<class 'pandas.core.series.Series'>
7
1941
In [34]:
choices = ['Pastry', 'Z_Scratch', 'K_Scatch', 'Stains', 'Dirtiness', 'Bumps', 'Other_Faults']
In [35]:
choices
Out[35]:
['Pastry',
 'Z_Scratch',
 'K_Scatch',
 'Stains',
 'Dirtiness',
 'Bumps',
 'Other_Faults']
In [36]:
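# A string default keeps the choices and the default in one dtype; recent NumPy releases may reject string choices combined with the integer default 0.
df['class']=np.select(conditions,choices,default='Unknown')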
In [37]:
df['class']
Out[37]:
0             Pastry
1             Pastry
2             Pastry
3             Pastry
4             Pastry
            ...     
1936    Other_Faults
1937    Other_Faults
1938    Other_Faults
1939    Other_Faults
1940    Other_Faults
Name: class, Length: 1941, dtype: object

In the end, class is the target variable, and the goal is multiclass classification.
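To make the labeling step concrete, the sketch below rebuilds a class column on a made-up frame (toy is hypothetical). It first checks that every row has exactly one fault flag, which is the condition under which the np.select approach and an idxmax-based shortcut agree. It also shows that np.select keeps the first matching condition when flags overlap. The idxmax variant is an alternative I am suggesting, not something taken from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the fault indicator columns.
toy = pd.DataFrame({
    "Pastry":       [1, 0, 0],
    "Bumps":        [0, 1, 0],
    "Other_Faults": [0, 0, 1],
})
fault_cols = list(toy.columns)

# Every row should carry exactly one fault flag for a clean single label.
exactly_one = toy[fault_cols].sum(axis=1).eq(1)

conditions = [toy[c].astype(bool).to_numpy() for c in fault_cols]
label_select = np.select(conditions, fault_cols, default="Unknown")

# With exactly one flag per row, the column holding the row maximum is the label.
label_idxmax = toy[fault_cols].idxmax(axis=1)

# When several conditions are True, np.select returns the first matching choice.
overlap = [np.array([True]), np.array([True])]
first_match = np.select(overlap, ["Pastry", "Bumps"], default="Unknown")

print(list(label_select), list(label_idxmax), first_match[0])
```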

In [40]:
df.tail(50)
Out[40]:
X_Minimum X_Maximum Y_Minimum Y_Maximum Pixels_Areas X_Perimeter Y_Perimeter Sum_of_Luminosity Minimum_of_Luminosity Maximum_of_Luminosity ... Luminosity_Index SigmoidOfAreas Pastry Z_Scratch K_Scatch Stains Dirtiness Bumps Other_Faults class
1891 1292 1372 1506506 1506552 1973 134 80 240091 105 140 ... -0.0493 1.0000 0 0 0 0 0 0 1 Other_Faults
1892 123 140 1477978 1477986 69 19 12 8701 113 142 ... -0.0148 0.2482 0 0 0 0 0 0 1 Other_Faults
1893 1551 1576 2071735 2071746 143 36 25 17042 109 135 ... -0.0689 0.4547 0 0 0 0 0 0 1 Other_Faults
1894 1379 1396 2128940 2128956 126 33 22 16498 116 148 ... 0.0229 0.4498 0 0 0 0 0 0 1 Other_Faults
1895 773 801 2240409 2240422 207 34 23 26766 114 143 ... 0.0102 0.6015 0 0 0 0 0 0 1 Other_Faults
1896 547 588 2084580 2084619 860 57 44 95173 91 132 ... -0.1354 0.9998 0 0 0 0 0 0 1 Other_Faults
1897 1557 1584 2375780 2375805 339 79 40 39556 104 132 ... -0.0884 0.9231 0 0 0 0 0 0 1 Other_Faults
1898 691 710 664879 664883 55 19 6 7211 114 150 ... 0.0243 0.1812 0 0 0 0 0 0 1 Other_Faults
1899 1398 1445 684296 684315 381 91 47 48528 111 143 ... -0.0049 0.9809 0 0 0 0 0 0 1 Other_Faults
1900 900 916 693632 693638 53 17 8 6697 110 143 ... -0.0128 0.2018 0 0 0 0 0 0 1 Other_Faults
1901 832 848 712437 712446 67 29 12 8140 102 141 ... -0.0508 0.2583 0 0 0 0 0 0 1 Other_Faults
1902 1027 1048 787015 787022 76 22 10 9934 109 148 ... 0.0212 0.2621 0 0 0 0 0 0 1 Other_Faults
1903 1275 1300 883635 883640 73 28 14 9247 111 143 ... -0.0104 0.2348 0 0 0 0 0 0 1 Other_Faults
1904 983 1005 895625 895642 102 42 22 12942 114 141 ... -0.0087 0.6173 0 0 0 0 0 0 1 Other_Faults
1905 374 382 303204 303209 24 10 5 3166 124 140 ... 0.0306 0.1483 0 0 0 0 0 0 1 Other_Faults
1906 196 209 157035 157044 60 20 14 7508 117 135 ... -0.0224 0.2253 0 0 0 0 0 0 1 Other_Faults
1907 122 146 203252 203261 95 33 22 12566 124 143 ... 0.0334 0.3601 0 0 0 0 0 0 1 Other_Faults
1908 189 203 238753 238766 63 20 14 8140 118 141 ... 0.0094 0.3097 0 0 0 0 0 0 1 Other_Faults
1909 213 229 260092 260110 137 50 27 17621 120 140 ... 0.0049 0.4763 0 0 0 0 0 0 1 Other_Faults
1910 345 362 283416 283428 60 23 18 7914 124 143 ... 0.0305 0.3419 0 0 0 0 0 0 1 Other_Faults
1911 121 148 373848 373861 149 49 29 19691 122 143 ... 0.0325 0.5805 0 0 0 0 0 0 1 Other_Faults
1912 182 211 402237 402266 268 98 55 35565 125 143 ... 0.0368 0.9732 0 0 0 0 0 0 1 Other_Faults
1913 273 298 473262 473272 94 30 18 11877 117 143 ... -0.0129 0.4138 0 0 0 0 0 0 1 Other_Faults
1914 389 406 516427 516436 67 22 16 8596 120 143 ... 0.0023 0.2699 0 0 0 0 0 0 1 Other_Faults
1915 856 871 531684 531693 54 23 16 7145 125 143 ... 0.0337 0.2469 0 0 0 0 0 0 1 Other_Faults
1916 767 788 566837 566850 83 28 20 10345 115 141 ... -0.0263 0.4514 0 0 0 0 0 0 1 Other_Faults
1917 154 169 260124 260136 75 27 17 9948 125 143 ... 0.0362 0.3068 0 0 0 0 0 0 1 Other_Faults
1918 275 299 1133715 1133734 318 39 32 39911 116 140 ... -0.0195 0.7359 0 0 0 0 0 0 1 Other_Faults
1919 271 291 1488866 1488885 204 38 22 23155 105 126 ... -0.1132 0.6268 0 0 0 0 0 0 1 Other_Faults
1920 214 247 1538494 1538509 209 66 46 27037 118 143 ... 0.0106 0.7833 0 0 0 0 0 0 1 Other_Faults
1921 270 301 1563477 1563499 329 61 37 41053 115 140 ... -0.0251 0.9263 0 0 0 0 0 0 1 Other_Faults
1922 21 47 2867489 2867513 281 43 28 31838 104 126 ... -0.1148 0.8952 0 0 0 0 0 0 1 Other_Faults
1923 23 43 2985008 2985026 170 42 20 19738 107 125 ... -0.0929 0.5951 0 0 0 0 0 0 1 Other_Faults
1924 21 45 3063739 3063768 277 52 42 33476 113 128 ... -0.0558 0.9324 0 0 0 0 0 0 1 Other_Faults
1925 21 39 3411415 3411438 201 46 28 23925 112 126 ... -0.0701 0.6781 0 0 0 0 0 0 1 Other_Faults
1926 1053 1063 173208 173226 68 26 18 8717 120 141 ... 0.0015 0.3068 0 0 0 0 0 0 1 Other_Faults
1927 160 170 252936 252958 97 24 23 12775 123 142 ... 0.0289 0.3663 0 0 0 0 0 0 1 Other_Faults
1928 1303 1344 33298 33323 452 44 26 42686 65 117 ... -0.2622 0.9920 0 0 0 0 0 0 1 Other_Faults
1929 130 153 197701 197718 240 47 18 30130 118 140 ... -0.0192 0.6438 0 0 0 0 0 0 1 Other_Faults
1930 247 272 244048 244067 235 36 22 29911 119 142 ... -0.0056 0.7598 0 0 0 0 0 0 1 Other_Faults
1931 523 567 266325 266337 209 67 30 26833 119 141 ... 0.0030 0.8183 0 0 0 0 0 0 1 Other_Faults
1932 239 269 276029 276047 299 51 22 37820 116 140 ... -0.0118 0.8299 0 0 0 0 0 0 1 Other_Faults
1933 367 422 289647 289665 355 116 58 46882 123 143 ... 0.0317 0.9899 0 0 0 0 0 0 1 Other_Faults
1934 137 170 301492 301511 304 59 26 35778 111 126 ... -0.0805 0.8971 0 0 0 0 0 0 1 Other_Faults
1935 238 287 315114 315142 671 91 39 86424 119 143 ... 0.0062 0.9992 0 0 0 0 0 0 1 Other_Faults
1936 249 277 325780 325796 273 54 22 35033 119 141 ... 0.0026 0.7254 0 0 0 0 0 0 1 Other_Faults
1937 144 175 340581 340598 287 44 24 34599 112 133 ... -0.0582 0.8173 0 0 0 0 0 0 1 Other_Faults
1938 145 174 386779 386794 292 40 22 37572 120 140 ... 0.0052 0.7079 0 0 0 0 0 0 1 Other_Faults
1939 137 170 422497 422528 419 97 47 52715 117 140 ... -0.0171 0.9919 0 0 0 0 0 0 1 Other_Faults
1940 1261 1281 87951 87967 103 26 22 11682 101 133 ... -0.1139 0.5296 0 0 0 0 0 0 1 Other_Faults

50 rows × 35 columns

EDA: Missing Values

In [42]:
df.isnull().sum()
Out[42]:
0
X_Minimum                0
X_Maximum                0
Y_Minimum                0
Y_Maximum                0
Pixels_Areas             0
X_Perimeter              0
Y_Perimeter              0
Sum_of_Luminosity        0
Minimum_of_Luminosity    0
Maximum_of_Luminosity    0
Length_of_Conveyer       0
TypeOfSteel_A300         0
TypeOfSteel_A400         0
Steel_Plate_Thickness    0
Edges_Index              0
Empty_Index              0
Square_Index             0
Outside_X_Index          0
Edges_X_Index            0
Edges_Y_Index            0
Outside_Global_Index     0
LogOfAreas               0
Log_X_Index              0
Log_Y_Index              0
Orientation_Index        0
Luminosity_Index         0
SigmoidOfAreas           0
Pastry                   0
Z_Scratch                0
K_Scatch                 0
Stains                   0
Dirtiness                0
Bumps                    0
Other_Faults             0
class                    0
dtype: int64

As is typical of manufacturing process data, there are no null values.

Next, check whether any values are implausible.

In [45]:
round(df.describe(),2)
Out[45]:
X_Minimum X_Maximum Y_Minimum Y_Maximum Pixels_Areas X_Perimeter Y_Perimeter Sum_of_Luminosity Minimum_of_Luminosity Maximum_of_Luminosity ... Orientation_Index Luminosity_Index SigmoidOfAreas Pastry Z_Scratch K_Scatch Stains Dirtiness Bumps Other_Faults
count 1941.00 1941.00 1941.00 1941.00 1941.00 1941.00 1941.00 1941.00 1941.00 1941.00 ... 1941.00 1941.00 1941.00 1941.00 1941.0 1941.0 1941.00 1941.00 1941.00 1941.00
mean 571.14 617.96 1650684.87 1650738.71 1893.88 111.86 82.97 206312.15 84.55 130.19 ... 0.08 -0.13 0.59 0.08 0.1 0.2 0.04 0.03 0.21 0.35
std 520.69 497.63 1774578.41 1774590.09 5168.46 301.21 426.48 512293.59 32.13 18.69 ... 0.50 0.15 0.34 0.27 0.3 0.4 0.19 0.17 0.41 0.48
min 0.00 4.00 6712.00 6724.00 2.00 2.00 1.00 250.00 0.00 37.00 ... -0.99 -1.00 0.12 0.00 0.0 0.0 0.00 0.00 0.00 0.00
25% 51.00 192.00 471253.00 471281.00 84.00 15.00 13.00 9522.00 63.00 124.00 ... -0.33 -0.20 0.25 0.00 0.0 0.0 0.00 0.00 0.00 0.00
50% 435.00 467.00 1204128.00 1204136.00 174.00 26.00 25.00 19202.00 90.00 127.00 ... 0.10 -0.13 0.51 0.00 0.0 0.0 0.00 0.00 0.00 0.00
75% 1053.00 1072.00 2183073.00 2183084.00 822.00 84.00 83.00 83011.00 106.00 140.00 ... 0.51 -0.07 1.00 0.00 0.0 0.0 0.00 0.00 0.00 1.00
max 1705.00 1713.00 12987661.00 12987692.00 152655.00 10449.00 18152.00 11591414.00 203.00 253.00 ... 0.99 0.64 1.00 1.00 1.0 1.0 1.00 1.00 1.00 1.00

8 rows × 34 columns
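Summary statistics only hint at implausible values, so explicit rule checks can help. The sketch below runs a few on a made-up frame (toy is hypothetical, with one row deliberately broken). The rules themselves are my assumptions about what "plausible" means here: a minimum should not exceed its maximum, and luminosity is assumed to lie in the 8-bit range 0 to 255.

```python
import pandas as pd

# Hypothetical rows; the last one has X_Minimum greater than X_Maximum on purpose.
toy = pd.DataFrame({
    "X_Minimum": [42, 645, 900],
    "X_Maximum": [50, 651, 880],
    "Minimum_of_Luminosity": [76, 84, 99],
    "Maximum_of_Luminosity": [108, 123, 125],
})

# Each check is a boolean Series that is True where the row looks plausible.
checks = {
    "x_range_ordered": toy["X_Minimum"] <= toy["X_Maximum"],
    "luminosity_ordered": toy["Minimum_of_Luminosity"] <= toy["Maximum_of_Luminosity"],
    "luminosity_in_0_255": toy["Maximum_of_Luminosity"].between(0, 255),
}

# Count the rows that fail each rule.
violations = {name: int((~ok).sum()) for name, ok in checks.items()}
print(violations)
```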

Stains and Dirtiness have very few samples, so the model is likely to perform poorly on these classes.

In [44]:
df['class'].value_counts()
Out[44]:
Other_Faults    673
Bumps           402
K_Scatch        391
Z_Scratch       190
Pastry          158
Stains           72
Dirtiness        55
Name: class, dtype: int64
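The counts above can be turned into class shares and into per-class weights. The sketch below uses the common "balanced" heuristic, n_samples / (n_classes * class_count). Whether to weight classes at all is a modeling choice this post does not make, so treat this as one possible way to respond to the imbalance.

```python
import pandas as pd

# Class counts copied from the value_counts output above.
counts = pd.Series({
    "Other_Faults": 673, "Bumps": 402, "K_Scatch": 391, "Z_Scratch": 190,
    "Pastry": 158, "Stains": 72, "Dirtiness": 55,
})

# Share of each class in the whole dataset.
share = counts / counts.sum()

# "Balanced" heuristic: rarer classes receive proportionally larger weights.
weights = counts.sum() / (len(counts) * counts)

print(share.round(3).to_dict())
print(weights.round(2).to_dict())
```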

Examining Relationships Between Variables with Scatter Plots

In [47]:
import matplotlib.pyplot as plt
In [48]:
color_code = {'Pastry':'Red', 'Z_Scratch':'Blue', 'K_Scatch':'Green', 'Stains':'Black', 'Dirtiness':'Pink', 'Bumps':'Brown', 'Other_Faults':'Gold'}
In [49]:
color_code
Out[49]:
{'Pastry': 'Red',
 'Z_Scratch': 'Blue',
 'K_Scatch': 'Green',
 'Stains': 'Black',
 'Dirtiness': 'Pink',
 'Bumps': 'Brown',
 'Other_Faults': 'Gold'}

The dict.get method returns the value stored for a given key.

In [56]:
color_list=[color_code.get(i) for i in df.loc[:,'class']]
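For comparison, here is a small sketch of how missing keys behave with dict.get and with Series.map. The class names and the fallback color "Gray" are made up for illustration; in the notebook every class has an entry in color_code, so the difference does not arise there.

```python
import pandas as pd

# Hypothetical, shortened color mapping and class column.
color_code = {"Pastry": "Red", "Bumps": "Brown"}
classes = pd.Series(["Pastry", "Bumps", "Pastry", "Unknown"])

# dict.get returns None for a missing key unless a default is supplied.
via_get = [color_code.get(c) for c in classes]
via_get_default = [color_code.get(c, "Gray") for c in classes]

# Series.map with a dict is a vectorized lookup; missing keys become NaN.
via_map = classes.map(color_code)

print(via_get, via_get_default, via_map.tolist())
```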
In [57]:
color_list
Out[57]:
['Red',
 'Red',
 'Red',
 ...
 'Blue',
 ...
 'Green',
 ...
 'Black',
 ...
 'Pink',
 ...
 'Brown',
 ...]
In [59]:
# (Exercise) Use pandas.plotting.scatter_matrix with the color_list built above to draw scatter plots, with histograms on the diagonal. Set figsize=[30,30], alpha=0.3, s=50.
pd.plotting.scatter_matrix(df.loc[:,df.columns!='class'], c=color_list, figsize= [30,30], alpha=0.3,s = 50, diagonal='hist')
Out[59]:
[Figure: scatter matrix of the 34 numeric columns, with points colored by class and histograms on the diagonal]

Scatter plot: shows the correlations between variables, whether each relationship is linear, and which combinations of variables separate the classes.

In [62]:
import seaborn as sns
sns.set_style('white')
In [64]:
# Count plot of defects per class (catplot is the current name of factorplot; height replaces size)
g=sns.catplot(x='class',data=df,kind='count',palette='YlGnBu',height=6)
g.ax.xaxis.set_label_text("Type of Defect")
g.ax.yaxis.set_label_text("Count")
g.ax.set_title("The number of Defects by Defect type")
Out[64]:
Text(0.5, 1.0, 'The number of Defects by Defect type')

The x-axis shows the kind of defect, so it is labeled "Type of Defect".

Printing the count above each bar

In [66]:
# Copy and paste the code completed in the previous cell.
g=sns.catplot(x='class',data=df,kind='count',palette='YlGnBu',height=6)
g.ax.xaxis.set_label_text("Type of Defect")
g.ax.yaxis.set_label_text("Count")
g.ax.set_title("The number of Defects by Defect type")
# (Exercise) Add each bar's value as text above the bar.
for p in g.ax.patches:
    g.ax.annotate(str(int(p.get_height())),(p.get_x()+0.2,p.get_height()+10))

Examining the relationships between variables with correlation coefficients

In [67]:
df_corTarget = df[['X_Minimum', 'X_Maximum', 'Y_Minimum', 'Y_Maximum', 'Pixels_Areas',
       'X_Perimeter', 'Y_Perimeter', 'Sum_of_Luminosity',
       'Minimum_of_Luminosity', 'Maximum_of_Luminosity', 'Length_of_Conveyer',
       'TypeOfSteel_A300', 'TypeOfSteel_A400', 'Steel_Plate_Thickness',
       'Edges_Index', 'Empty_Index', 'Square_Index', 'Outside_X_Index',
       'Edges_X_Index', 'Edges_Y_Index', 'Outside_Global_Index', 'LogOfAreas',
       'Log_X_Index', 'Log_Y_Index', 'Orientation_Index', 'Luminosity_Index',
       'SigmoidOfAreas']]
In [68]:
corr=df_corTarget.corr()
corr
Out[68]:
X_Minimum X_Maximum Y_Minimum Y_Maximum Pixels_Areas X_Perimeter Y_Perimeter Sum_of_Luminosity Minimum_of_Luminosity Maximum_of_Luminosity ... Outside_X_Index Edges_X_Index Edges_Y_Index Outside_Global_Index LogOfAreas Log_X_Index Log_Y_Index Orientation_Index Luminosity_Index SigmoidOfAreas
0
X_Minimum 1.000000 0.988314 0.041821 0.041807 -0.307322 -0.258937 -0.118757 -0.339045 0.237637 -0.075554 ... -0.361160 0.154778 0.367907 0.147282 -0.428553 -0.437944 -0.326851 0.178585 -0.031578 -0.355251
X_Maximum 0.988314 1.000000 0.052147 0.052135 -0.225399 -0.186326 -0.090138 -0.247052 0.168649 -0.062392 ... -0.214930 0.149259 0.271915 0.099253 -0.332169 -0.324012 -0.265990 0.115019 -0.038996 -0.286736
Y_Minimum 0.041821 0.052147 1.000000 1.000000 0.017670 0.023843 0.024150 0.007362 -0.065703 -0.067785 ... 0.054165 0.066085 -0.036543 -0.062911 0.044952 0.070406 -0.008442 -0.086497 -0.090654 0.025257
Y_Maximum 0.041807 0.052135 1.000000 1.000000 0.017840 0.024038 0.024380 0.007499 -0.065733 -0.067776 ... 0.054185 0.066051 -0.036549 -0.062901 0.044994 0.070432 -0.008382 -0.086480 -0.090666 0.025284
Pixels_Areas -0.307322 -0.225399 0.017670 0.017840 1.000000 0.966644 0.827199 0.978952 -0.497204 0.110063 ... 0.588606 -0.294673 -0.463571 -0.109655 0.650234 0.603072 0.578342 -0.137604 -0.043449 0.422947
X_Perimeter -0.258937 -0.186326 0.023843 0.024038 0.966644 1.000000 0.912436 0.912956 -0.400427 0.111363 ... 0.517098 -0.293039 -0.412100 -0.079106 0.563036 0.524716 0.523472 -0.101731 -0.032617 0.380605
Y_Perimeter -0.118757 -0.090138 0.024150 0.024380 0.827199 0.912436 1.000000 0.704876 -0.213758 0.061809 ... 0.209160 -0.195162 -0.136723 0.013438 0.294040 0.228485 0.344378 0.031381 -0.047778 0.191772
Sum_of_Luminosity -0.339045 -0.247052 0.007362 0.007499 0.978952 0.912956 0.704876 1.000000 -0.540566 0.136515 ... 0.658339 -0.327728 -0.529745 -0.121090 0.712128 0.667736 0.618795 -0.158483 -0.014067 0.464248
Minimum_of_Luminosity 0.237637 0.168649 -0.065703 -0.065733 -0.497204 -0.400427 -0.213758 -0.540566 1.000000 0.429605 ... -0.487574 0.252256 0.316610 0.035462 -0.678762 -0.567655 -0.588208 0.057123 0.669534 -0.514797
Maximum_of_Luminosity -0.075554 -0.062392 -0.067785 -0.067776 0.110063 0.111363 0.061809 0.136515 0.429605 1.000000 ... 0.099300 0.093522 -0.167441 -0.124039 0.007672 0.092823 -0.069522 -0.169747 0.870160 -0.039651
Length_of_Conveyer 0.316662 0.299390 -0.049211 -0.049219 -0.155853 -0.134240 -0.063825 -0.169331 -0.023579 -0.098009 ... -0.217417 0.123585 0.235732 0.128663 -0.193247 -0.219973 -0.157057 0.120715 -0.149769 -0.197543
TypeOfSteel_A300 0.144319 0.112009 0.075164 0.075151 -0.235591 -0.189250 -0.095154 -0.263632 0.042048 -0.216339 ... -0.244765 0.173836 0.240634 0.022142 -0.329614 -0.266955 -0.311796 0.010630 -0.252818 -0.308910
TypeOfSteel_A400 -0.144319 -0.112009 -0.075164 -0.075151 0.235591 0.189250 0.095154 0.263632 -0.042048 0.216339 ... 0.244765 -0.173836 -0.240634 -0.022142 0.329614 0.266955 0.311796 -0.010630 0.252818 0.308910
Steel_Plate_Thickness 0.136625 0.106119 -0.207640 -0.207644 -0.183735 -0.147712 -0.058889 -0.204812 0.103393 -0.128397 ... -0.228352 -0.077408 0.251985 0.221244 -0.176639 -0.252822 -0.037287 0.274097 -0.116499 -0.085159
Edges_Index 0.278075 0.242846 0.021314 0.021300 -0.275289 -0.227590 -0.111240 -0.301452 0.358915 0.149675 ... -0.296510 0.250178 0.285302 0.008282 -0.408619 -0.355853 -0.371989 0.020548 0.207516 -0.330006
Empty_Index -0.198461 -0.152680 -0.043117 -0.043085 0.272808 0.306348 0.188825 0.293691 -0.044111 0.031425 ... 0.334996 -0.389342 -0.459800 -0.165293 0.356685 0.448864 0.397289 -0.139420 0.061608 0.481738
Square_Index 0.063658 0.048575 -0.006135 -0.006152 0.017865 0.004507 -0.047511 0.049607 0.066748 0.065517 ... -0.113627 0.242779 0.081488 -0.069913 -0.189340 -0.082846 -0.257661 -0.162034 0.111977 -0.292251
Outside_X_Index -0.361160 -0.214930 0.054165 0.054185 0.588606 0.517098 0.209160 0.658339 -0.487574 0.099300 ... 1.000000 -0.076663 -0.689867 -0.337173 0.710837 0.820223 0.464860 -0.440358 -0.035721 0.518910
Edges_X_Index 0.154778 0.149259 0.066085 0.066051 -0.294673 -0.293039 -0.195162 -0.327728 0.252256 0.093522 ... -0.076663 1.000000 0.108144 -0.419383 -0.496206 -0.189262 -0.748892 -0.550302 0.126460 -0.558426
Edges_Y_Index 0.367907 0.271915 -0.036543 -0.036549 -0.463571 -0.412100 -0.136723 -0.529745 0.316610 -0.167441 ... -0.689867 0.108144 1.000000 0.537565 -0.642991 -0.855414 -0.321892 0.658049 -0.094368 -0.545393
Outside_Global_Index 0.147282 0.099253 -0.062911 -0.062901 -0.109655 -0.079106 0.013438 -0.121090 0.035462 -0.124039 ... -0.337173 -0.419383 0.537565 1.000000 -0.097762 -0.428060 0.241898 0.862670 -0.122321 -0.053770
LogOfAreas -0.428553 -0.332169 0.044952 0.044994 0.650234 0.563036 0.294040 0.712128 -0.678762 0.007672 ... 0.710837 -0.496206 -0.642991 -0.097762 1.000000 0.888919 0.882974 -0.123898 -0.175879 0.877768
Log_X_Index -0.437944 -0.324012 0.070406 0.070432 0.603072 0.524716 0.228485 0.667736 -0.567655 0.092823 ... 0.820223 -0.189262 -0.855414 -0.428060 0.888919 1.000000 0.598652 -0.536629 -0.064923 0.757343
Log_Y_Index -0.326851 -0.265990 -0.008442 -0.008382 0.578342 0.523472 0.344378 0.618795 -0.588208 -0.069522 ... 0.464860 -0.748892 -0.321892 0.241898 0.882974 0.598652 1.000000 0.316792 -0.219110 0.838188
Orientation_Index 0.178585 0.115019 -0.086497 -0.086480 -0.137604 -0.101731 0.031381 -0.158483 0.057123 -0.169747 ... -0.440358 -0.550302 0.658049 0.862670 -0.123898 -0.536629 0.316792 1.000000 -0.153464 -0.023978
Luminosity_Index -0.031578 -0.038996 -0.090654 -0.090666 -0.043449 -0.032617 -0.047778 -0.014067 0.669534 0.870160 ... -0.035721 0.126460 -0.094368 -0.122321 -0.175879 -0.064923 -0.219110 -0.153464 1.000000 -0.184840
SigmoidOfAreas -0.355251 -0.286736 0.025257 0.025284 0.422947 0.380605 0.191772 0.464248 -0.514797 -0.039651 ... 0.518910 -0.558426 -0.545393 -0.053770 0.877768 0.757343 0.838188 -0.023978 -0.184840 1.000000

27 rows × 27 columns
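
Rather than scanning the table by eye, strongly correlated pairs can be listed by walking the upper triangle of the matrix. The following is a minimal sketch on made-up data: the arrays `x`, `y`, `z` and the 0.9 threshold are illustrative assumptions, not part of the original notebook. On the real data the same loop could be run over `corr.values` with `corr.columns` as the names.

```python
import numpy as np

# Toy data (assumed for illustration): y is a linear function of x, z is unrelated.
names = ["x", "y", "z"]
x = np.arange(10, dtype=float)
y = 2 * x + 1
z = np.array([1, -1] * 5, dtype=float)

# Rows are variables, so corrcoef returns a 3x3 correlation matrix.
corr_toy = np.corrcoef(np.vstack([x, y, z]))

# k=1 skips the diagonal, so each pair is visited exactly once.
rows, cols = np.triu_indices_from(corr_toy, k=1)
strong_pairs = [
    (names[i], names[j])
    for i, j in zip(rows, cols)
    if abs(corr_toy[i, j]) > 0.9
]
print(strong_pairs)
```

In the table above, X_Minimum/X_Maximum (0.988) and Y_Minimum/Y_Maximum (1.000) are examples of pairs that such a threshold would flag.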

vmax and vmin set the values that map to the darkest colors at the two ends of the color scale (here 1 and -1).

In [69]:
# Set up the parameters for drawing the heatmap
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

f, ax = plt.subplots(figsize=(11, 9))
cmap = sns.diverging_palette(1,200, as_cmap=True)

# (Exercise) Use the saved corr, mask, and cmap to draw a heatmap of the correlations. Set the maximum, minimum, and center values to match the range of a correlation.
# Set linewidths=2. The figure size is figsize=(11,9).
sns.heatmap(corr,mask=mask,cmap=cmap,vmax=1,vmin=-1,center=0,linewidths=2)
Out[69]:
<matplotlib.axes._subplots.AxesSubplot at 0x2241ec867c0>

Splitting into training and test sets, with K_Scatch as y

In [71]:
x = df[['X_Minimum', 'X_Maximum', 'Y_Minimum', 'Y_Maximum', 'Pixels_Areas',
       'X_Perimeter', 'Y_Perimeter', 'Sum_of_Luminosity',
       'Minimum_of_Luminosity', 'Maximum_of_Luminosity', 'Length_of_Conveyer',
       'TypeOfSteel_A300',  'Steel_Plate_Thickness',
       'Edges_Index', 'Empty_Index', 'Square_Index', 'Outside_X_Index',
       'Edges_X_Index', 'Edges_Y_Index', 'Outside_Global_Index', 'LogOfAreas',
       'Log_X_Index', 'Log_Y_Index', 'Orientation_Index', 'Luminosity_Index',
       'SigmoidOfAreas']]
y = df['K_Scatch']
In [72]:
from sklearn.model_selection import train_test_split
from scipy.stats import zscore
In [73]:
# (Exercise) Use sklearn.model_selection.train_test_split to split the data into x_train, x_test, y_train, y_test.
# Use an 8:2 ratio and stratify the split on y.
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2,random_state=1, stratify=y)
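
To make the effect of `stratify=y` concrete, here is a simplified pure-Python sketch of a stratified split. The helper `stratified_split` and the toy labels are hypothetical. It is not how scikit-learn implements `train_test_split`, only an illustration of splitting each class separately so that the class ratio is preserved on both sides.

```python
import random

def stratified_split(labels, test_size=0.2, seed=1):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels (assumed): 80 negatives and 20 positives.
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(train_rate, test_rate)
```

Both sides keep the 20% positive rate of the full toy dataset, which is the property the stratified split in the cell above is asked to provide for K_Scatch.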

The z-score standardizes each variable to mean 0 and standard deviation 1.

In [74]:
# (Exercise) Use pandas.DataFrame.apply to standardize x_train and x_test.
x_train = x_train.apply(zscore)
x_test = x_test.apply(zscore)
In [77]:
round(x_train.describe(),2)
Out[77]:
X_Minimum X_Maximum Y_Minimum Y_Maximum Pixels_Areas X_Perimeter Y_Perimeter Sum_of_Luminosity Minimum_of_Luminosity Maximum_of_Luminosity ... Outside_X_Index Edges_X_Index Edges_Y_Index Outside_Global_Index LogOfAreas Log_X_Index Log_Y_Index Orientation_Index Luminosity_Index SigmoidOfAreas
count 1552.00 1552.00 1552.00 1552.00 1552.00 1552.00 1552.00 1552.00 1552.00 1552.00 ... 1552.00 1552.00 1552.00 1552.00 1552.00 1552.00 1552.00 1552.00 1552.00 1552.00
mean -0.00 -0.00 -0.00 -0.00 -0.00 -0.00 0.00 -0.00 0.00 -0.00 ... -0.00 0.00 -0.00 -0.00 0.00 0.00 -0.00 0.00 0.00 0.00
std 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 ... 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
min -1.10 -1.24 -0.91 -0.91 -0.35 -0.34 -0.18 -0.39 -2.64 -5.07 ... -0.53 -2.47 -3.28 -1.19 -2.77 -2.16 -3.07 -2.14 -5.84 -1.37
25% -0.99 -0.86 -0.66 -0.66 -0.33 -0.30 -0.15 -0.37 -0.66 -0.35 ... -0.44 -0.80 -0.93 -1.19 -0.71 -0.70 -0.70 -0.83 -0.44 -1.00
50% -0.26 -0.30 -0.25 -0.25 -0.32 -0.27 -0.13 -0.35 0.17 -0.19 ... -0.39 0.10 0.58 0.89 -0.32 -0.33 -0.17 0.02 -0.01 -0.24
75% 0.93 0.92 0.29 0.29 -0.20 -0.09 -0.01 -0.23 0.66 0.52 ... -0.16 0.76 0.79 0.89 0.53 0.38 0.73 0.84 0.43 1.23
max 2.18 2.20 6.16 6.16 27.59 31.93 38.17 21.41 3.66 6.60 ... 13.97 1.58 0.79 0.89 3.42 3.63 6.27 1.83 5.18 1.23

8 rows × 26 columns
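
The standardization step above scales `x_test` with its own mean and standard deviation. A common alternative is to compute the statistics on the training set only and reuse them for the test set, which is what scikit-learn's `StandardScaler` does when it is fit on the training data and then applied to the test data. A minimal NumPy sketch on made-up arrays (the names `train`, `test`, `mu`, and `sigma` are illustrative assumptions):

```python
import numpy as np

# Toy feature matrices (assumed for illustration): rows are samples.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
test = np.array([[2.5, 25.0], [5.0, 50.0]])

# Statistics come from the training set only.
# ddof=0 matches the default of scipy.stats.zscore.
mu = train.mean(axis=0)
sigma = train.std(axis=0, ddof=0)

train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma  # same shift and scale as the training data

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

With this approach the test columns no longer have mean exactly 0 and standard deviation exactly 1, but every sample is measured on the scale the model was trained on.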

9. [Logistic Regression] Building a baseline logistic model

In [85]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix
from sklearn import metrics
In [86]:
#### The liblinear solver supports both the L1 (Lasso) and L2 (Ridge) penalties, so both can be compared
In [87]:
lm=LogisticRegression(solver='liblinear')
Set the penalty types to consider for logistic regression (Ridge, Lasso) and the range of the regularization parameter, and store them in parameters as a dictionary.
In [88]:
parameters={'penalty':['l1','l2'],'C':[0.01,0.1,0.5,0.9,1,5,10],'tol':[1e-4,1e-2,1,1e2]}

Grid search: used to find the best hyperparameters.
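
As a rough sense of scale, the grid defined above can be enumerated by hand. This sketch uses only `itertools` and the values from `parameters`; the fold count of 10 is the `cv` value passed to `GridSearchCV` in the cell that follows, and the name `param_grid` is illustrative.

```python
from itertools import product

param_grid = {
    'penalty': ['l1', 'l2'],
    'C': [0.01, 0.1, 0.5, 0.9, 1, 5, 10],
    'tol': [1e-4, 1e-2, 1, 1e2],
}

# Every combination of one value per parameter is one candidate model.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_folds = 10

print(len(candidates))            # number of candidate settings
print(len(candidates) * n_folds)  # model fits during cross-validation
```

The grid has 2 x 7 x 4 = 56 candidates, so 10-fold cross-validation fits 560 models before the final refit on the full training set.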

In [92]:
GSLR=GridSearchCV(lm,parameters,cv=10,n_jobs=n_thread,scoring="accuracy")
In [93]:
GSLR.fit(x_train,y_train)
Out[93]:
GridSearchCV(cv=10, estimator=LogisticRegression(solver='liblinear'), n_jobs=16,
             param_grid={'C': [0.01, 0.1, 0.5, 0.9, 1, 5, 10],
                         'penalty': ['l1', 'l2'],
                         'tol': [0.0001, 0.01, 1, 100.0]},
             scoring='accuracy')

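Ideally a separate validation set would be held out for tuning, but doing so shrinks the training set. Cross-validation (CV) is used instead, which tunes the parameters without that loss and also helps guard against overfitting.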
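
The idea behind cross-validation can be sketched in a few lines of plain Python. The helper `kfold_indices` is hypothetical and deliberately simplified: it does no shuffling or stratification, whereas `GridSearchCV` with an integer `cv` and a classifier uses stratified folds.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so fold sizes differ by at most 1.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(kfold_indices(n_samples=1552, n_folds=10))
for train_idx, val_idx in folds[:3]:
    print(len(train_idx), len(val_idx))
```

Each of the 1552 training samples is used for validation in exactly one fold and for fitting in the other nine, so no data has to be set aside permanently.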

In [95]:
# Print the best parameters and the mean cross-validated accuracy
print('final params', GSLR.best_params_)   
print('best score', GSLR.best_score_)  
final params {'C': 1, 'penalty': 'l2', 'tol': 0.0001}
best score 0.9722911497105045
In [97]:
predicted=GSLR.predict(x_test)
In [98]:
predicted
Out[98]:
array([0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1,
       1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int64)
In [99]:
cMatrix = confusion_matrix(y_test,predicted)
print(cMatrix)
print("\n Accuracy:", GSLR.score(x_test,y_test))
[[305   6]
 [  6  72]]

 Accuracy: 0.9691516709511568
In [100]:
# Print a report with sklearn.metrics.classification_report.
print(metrics.classification_report(y_test,predicted))
              precision    recall  f1-score   support

           0       0.98      0.98      0.98       311
           1       0.92      0.92      0.92        78

    accuracy                           0.97       389
   macro avg       0.95      0.95      0.95       389
weighted avg       0.97      0.97      0.97       389
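
The numbers in the report can be reproduced by hand from the confusion matrix printed above. A short worked example in plain Python, using scikit-learn's layout in which rows are true classes and columns are predicted classes:

```python
# Confusion matrix from the logistic regression cell:
# rows = true class (0, 1), columns = predicted class (0, 1).
tn, fp = 305, 6
fn, tp = 6, 72

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the samples predicted as 1, how many are truly 1
recall = tp / (tp + fn)      # of the true 1 samples, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 2), round(recall, 2), round(f1, 2))
```

The results agree with the report: accuracy 377/389 and precision, recall, and F1 of 0.92 for class 1.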

In [101]:
# Print the accuracy values computed during cross-validation.
means = GSLR.cv_results_['mean_test_score']
stds = GSLR.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, GSLR.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r"
          % (mean, std * 2, params))
print()
0.945 (+/-0.037) for {'C': 0.01, 'penalty': 'l1', 'tol': 0.0001}
0.945 (+/-0.037) for {'C': 0.01, 'penalty': 'l1', 'tol': 0.01}
0.945 (+/-0.037) for {'C': 0.01, 'penalty': 'l1', 'tol': 1}
0.798 (+/-0.005) for {'C': 0.01, 'penalty': 'l1', 'tol': 100.0}
0.949 (+/-0.031) for {'C': 0.01, 'penalty': 'l2', 'tol': 0.0001}
0.949 (+/-0.031) for {'C': 0.01, 'penalty': 'l2', 'tol': 0.01}
0.952 (+/-0.034) for {'C': 0.01, 'penalty': 'l2', 'tol': 1}
0.798 (+/-0.005) for {'C': 0.01, 'penalty': 'l2', 'tol': 100.0}
0.964 (+/-0.028) for {'C': 0.1, 'penalty': 'l1', 'tol': 0.0001}
0.964 (+/-0.027) for {'C': 0.1, 'penalty': 'l1', 'tol': 0.01}
0.955 (+/-0.019) for {'C': 0.1, 'penalty': 'l1', 'tol': 1}
0.798 (+/-0.005) for {'C': 0.1, 'penalty': 'l1', 'tol': 100.0}
0.966 (+/-0.021) for {'C': 0.1, 'penalty': 'l2', 'tol': 0.0001}
0.966 (+/-0.021) for {'C': 0.1, 'penalty': 'l2', 'tol': 0.01}
0.957 (+/-0.030) for {'C': 0.1, 'penalty': 'l2', 'tol': 1}
0.798 (+/-0.005) for {'C': 0.1, 'penalty': 'l2', 'tol': 100.0}
0.969 (+/-0.028) for {'C': 0.5, 'penalty': 'l1', 'tol': 0.0001}
0.969 (+/-0.022) for {'C': 0.5, 'penalty': 'l1', 'tol': 0.01}
0.958 (+/-0.032) for {'C': 0.5, 'penalty': 'l1', 'tol': 1}
0.798 (+/-0.005) for {'C': 0.5, 'penalty': 'l1', 'tol': 100.0}
0.968 (+/-0.023) for {'C': 0.5, 'penalty': 'l2', 'tol': 0.0001}
0.968 (+/-0.022) for {'C': 0.5, 'penalty': 'l2', 'tol': 0.01}
0.957 (+/-0.032) for {'C': 0.5, 'penalty': 'l2', 'tol': 1}
0.798 (+/-0.005) for {'C': 0.5, 'penalty': 'l2', 'tol': 100.0}
0.969 (+/-0.030) for {'C': 0.9, 'penalty': 'l1', 'tol': 0.0001}
0.969 (+/-0.028) for {'C': 0.9, 'penalty': 'l1', 'tol': 0.01}
0.959 (+/-0.024) for {'C': 0.9, 'penalty': 'l1', 'tol': 1}
0.798 (+/-0.005) for {'C': 0.9, 'penalty': 'l1', 'tol': 100.0}
0.972 (+/-0.023) for {'C': 0.9, 'penalty': 'l2', 'tol': 0.0001}
0.970 (+/-0.023) for {'C': 0.9, 'penalty': 'l2', 'tol': 0.01}
0.957 (+/-0.032) for {'C': 0.9, 'penalty': 'l2', 'tol': 1}
0.798 (+/-0.005) for {'C': 0.9, 'penalty': 'l2', 'tol': 100.0}
0.970 (+/-0.030) for {'C': 1, 'penalty': 'l1', 'tol': 0.0001}
0.970 (+/-0.027) for {'C': 1, 'penalty': 'l1', 'tol': 0.01}
0.957 (+/-0.035) for {'C': 1, 'penalty': 'l1', 'tol': 1}
0.798 (+/-0.005) for {'C': 1, 'penalty': 'l1', 'tol': 100.0}
0.972 (+/-0.024) for {'C': 1, 'penalty': 'l2', 'tol': 0.0001}
0.972 (+/-0.025) for {'C': 1, 'penalty': 'l2', 'tol': 0.01}
0.957 (+/-0.032) for {'C': 1, 'penalty': 'l2', 'tol': 1}
0.798 (+/-0.005) for {'C': 1, 'penalty': 'l2', 'tol': 100.0}
0.970 (+/-0.027) for {'C': 5, 'penalty': 'l1', 'tol': 0.0001}
0.971 (+/-0.028) for {'C': 5, 'penalty': 'l1', 'tol': 0.01}
0.952 (+/-0.036) for {'C': 5, 'penalty': 'l1', 'tol': 1}
0.798 (+/-0.005) for {'C': 5, 'penalty': 'l1', 'tol': 100.0}
0.972 (+/-0.028) for {'C': 5, 'penalty': 'l2', 'tol': 0.0001}
0.971 (+/-0.028) for {'C': 5, 'penalty': 'l2', 'tol': 0.01}
0.958 (+/-0.030) for {'C': 5, 'penalty': 'l2', 'tol': 1}
0.798 (+/-0.005) for {'C': 5, 'penalty': 'l2', 'tol': 100.0}
0.972 (+/-0.025) for {'C': 10, 'penalty': 'l1', 'tol': 0.0001}
0.970 (+/-0.026) for {'C': 10, 'penalty': 'l1', 'tol': 0.01}
0.961 (+/-0.043) for {'C': 10, 'penalty': 'l1', 'tol': 1}
0.798 (+/-0.005) for {'C': 10, 'penalty': 'l1', 'tol': 100.0}
0.972 (+/-0.024) for {'C': 10, 'penalty': 'l2', 'tol': 0.0001}
0.971 (+/-0.024) for {'C': 10, 'penalty': 'l2', 'tol': 0.01}
0.958 (+/-0.030) for {'C': 10, 'penalty': 'l2', 'tol': 1}
0.798 (+/-0.005) for {'C': 10, 'penalty': 'l2', 'tol': 100.0}

[Decision Tree] Building a baseline decision tree model

In [102]:
from sklearn.tree import DecisionTreeClassifier
In [103]:
# (Exercise) Create a decision tree model and store it in dt.
dt=DecisionTreeClassifier()
In [104]:
# (Exercise) Run a grid search over the decision tree settings criterion, min_samples_split, max_depth, min_samples_leaf, and max_features.
# Set the GridSearchCV options to cv=10, n_jobs=n_thread, scoring="accuracy".
parameters={'criterion':['gini','entropy'],'min_samples_split':[2,5,10,15], 'max_depth':[None,2],'min_samples_leaf':[1,3,10,15],'max_features':[None,'sqrt','log2']}
In [105]:
GSDT=GridSearchCV(dt,parameters,cv=10,n_jobs=n_thread,scoring="accuracy")
GSDT.fit(x_train,y_train)
Out[105]:
GridSearchCV(cv=10, estimator=DecisionTreeClassifier(), n_jobs=16,
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [None, 2],
                         'max_features': [None, 'sqrt', 'log2'],
                         'min_samples_leaf': [1, 3, 10, 15],
                         'min_samples_split': [2, 5, 10, 15]},
             scoring='accuracy')
In [106]:
print('final params',GSDT.best_params_)
print('ACC.',GSDT.best_score_)
final params {'criterion': 'entropy', 'max_depth': None, 'max_features': None, 'min_samples_leaf': 3, 'min_samples_split': 10}
ACC. 0.9787386269644335
In [107]:
# (Exercise) Use predict to get the predicted values, store them in predicted, and print the confusion matrix, the accuracy, and the classification_report.
predicted=GSDT.predict(x_test)
cMatrix = confusion_matrix(y_test,predicted)
print(cMatrix)
print(round(GSDT.score(x_test,y_test),3))
print(metrics.classification_report(y_test,predicted))
[[308   3]
 [  8  70]]
0.972
              precision    recall  f1-score   support

           0       0.97      0.99      0.98       311
           1       0.96      0.90      0.93        78

    accuracy                           0.97       389
   macro avg       0.97      0.94      0.95       389
weighted avg       0.97      0.97      0.97       389

In [108]:
# Distribution of the dependent variable in the training set
print(y_train.value_counts())
0    1239
1     313
Name: K_Scatch, dtype: int64
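
These counts are useful as a reference point. The sketch below computes the accuracy of always predicting the majority class, plus the entropy and Gini impurity of the undivided training set, which are the quantities the `criterion` options in the grid measure at each node. The function names are illustrative.

```python
import math

counts = {0: 1239, 1: 313}  # y_train.value_counts() from the cell above
total = sum(counts.values())
probs = [c / total for c in counts.values()]

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = max(probs)

def entropy(p):
    """Shannon entropy in bits."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def gini(p):
    """Gini impurity."""
    return 1 - sum(q * q for q in p)

print(round(baseline_accuracy, 3), round(entropy(probs), 3), round(gini(probs), 3))
```

The baseline of about 0.798 matches the cross-validated accuracy reported for every `tol=100.0` setting in the logistic regression grid, which is consistent with those models predicting class 0 for every sample.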
In [111]:
# Visualize the tree
import graphviz
from sklearn import tree  # needed for tree.export_graphviz
dt2=DecisionTreeClassifier(criterion='entropy',max_depth=None,max_features=None,min_samples_leaf=1,min_samples_split=5)
dt2.fit(x_train,y_train)
dot_data=tree.export_graphviz(dt2,feature_names=x_train.columns,filled=True,rounded=True)
graph=graphviz.Source(dot_data)
graph

Random Forest

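  • Random Forest is a model that uses Bagging and Drop-out, described below, to reduce the variance of decision trees and improve predictive power.
    • Bootstrapping: sampling with replacement generates several datasets whose sample composition differs slightly.
    • Aggregating: combining the results of many models lowers the variance of the overall model.
    • Drop-out: some variables are left out when building each tree. This reduces the correlation between trees, which also lowers the variance of the model.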
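
The three ingredients can be sketched without any library. This is a toy illustration, not scikit-learn's implementation: the helper names are hypothetical, the trees are replaced by fixed votes, and the feature subset is drawn once per tree, whereas `RandomForestClassifier` considers a fresh random subset of features at every split.

```python
import math
import random
from collections import Counter

rng = random.Random(0)
n_samples, n_features = 1552, 26

def bootstrap_indices(n):
    """Bootstrapping: draw n row indices with replacement."""
    return [rng.randrange(n) for _ in range(n)]

def random_feature_subset(n_feat):
    """Drop-out: keep only about sqrt(n_feat) randomly chosen features."""
    k = max(1, int(math.sqrt(n_feat)))
    return sorted(rng.sample(range(n_feat), k))

def majority_vote(votes):
    """Aggregating: the most common prediction wins."""
    return Counter(votes).most_common(1)[0][0]

rows = bootstrap_indices(n_samples)
features = random_feature_subset(n_features)
print(len(rows), len(set(rows)), features)

# Stand-in predictions from five hypothetical trees for one sample.
print(majority_vote([1, 0, 1, 1, 0]))
```

Because rows are drawn with replacement, each bootstrap sample contains repeated rows and leaves some rows out, which is what makes the individual trees differ from one another.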
In [113]:
from sklearn.ensemble import RandomForestClassifier
In [114]:
rf=RandomForestClassifier()
In [116]:
# Run a grid search over the Random Forest settings n_estimators, min_samples_split, max_depth, min_samples_leaf, and max_features.
# Set the GridSearchCV options to cv=10, n_jobs=n_thread, scoring="accuracy".
parameters={'n_estimators':[50,100],'criterion':['entropy'],'min_samples_split':[2,5],'max_depth':[None,2],'min_samples_leaf':[1,3,10],'max_features':['sqrt']}
GSRF=GridSearchCV(rf,parameters,cv=10,n_jobs=n_thread,scoring="accuracy")
GSRF.fit(x_train,y_train)
Out[116]:
GridSearchCV(cv=10, estimator=RandomForestClassifier(), n_jobs=16,
             param_grid={'criterion': ['entropy'], 'max_depth': [None, 2],
                         'max_features': ['sqrt'],
                         'min_samples_leaf': [1, 3, 10],
                         'min_samples_split': [2, 5],
                         'n_estimators': [50, 100]},
             scoring='accuracy')
In [117]:
print('final params',GSRF.best_params_)
print('best score',GSRF.best_score_)
final params {'criterion': 'entropy', 'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
best score 0.9838875103391231
In [118]:
predicted=GSRF.predict(x_test)
cMatrix=confusion_matrix(y_test,predicted)
print(cMatrix)
print(metrics.classification_report(y_test,predicted))
[[311   0]
 [  5  73]]
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       311
           1       1.00      0.94      0.97        78

    accuracy                           0.99       389
   macro avg       0.99      0.97      0.98       389
weighted avg       0.99      0.99      0.99       389

Support Vector Machine

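  • The goal is to find the boundary that maximizes the margin (shown in yellow).
  • The degree to which errors are tolerated is expressed by C.
    • In scikit-learn's SVC, C weights the penalty on errors: a larger C tolerates fewer errors, and a smaller C tolerates more.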
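
The role of C can be seen in the soft-margin objective, 0.5 * ||w||^2 + C * (sum of hinge losses). The sketch below evaluates it in plain Python for a made-up two-dimensional example. The points, `w`, and `b` are illustrative assumptions, and the code only shows how C weights margin violations, not how an SVM is trained.

```python
def soft_margin_objective(w, b, points, labels, C):
    """0.5*||w||^2 + C * total hinge loss, with labels in {-1, +1}."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge_total = 0.0
    for x, y in zip(points, labels):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        hinge_total += max(0.0, 1.0 - y * score)
    return margin_term + C * hinge_total

# Toy data (assumed): the last point lies on the wrong side of the boundary.
points = [(2.0, 0.0), (-2.0, 0.0), (-0.5, 0.0)]
labels = [1, -1, 1]
w, b = (1.0, 0.0), 0.0

for C in (0.01, 1, 10):
    print(C, soft_margin_objective(w, b, points, labels, C))
```

The single violating point contributes a hinge loss of 1.5. With C=0.01 it barely changes the objective, while with C=10 it dominates, so the optimizer is pushed to move the boundary rather than accept the error.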
In [121]:
from sklearn import svm
In [122]:
svc=svm.SVC()
In [124]:
# (Exercise) Run a grid search over the Support Vector Machine settings C, kernel, and gamma.
# Set the GridSearchCV options to cv=10, n_jobs=n_thread, scoring="accuracy".
parameters={'C':[0.01,0.1,0.5,0.9,1,5,10],'kernel':['linear','rbf','poly'],'gamma':[0.1,1,10]}
GS_SVM=GridSearchCV(svc,parameters,cv=10,n_jobs=n_thread,scoring="accuracy")
GS_SVM.fit(x_train,y_train)
Out[124]:
GridSearchCV(cv=10, estimator=SVC(), n_jobs=16,
             param_grid={'C': [0.01, 0.1, 0.5, 0.9, 1, 5, 10],
                         'gamma': [0.1, 1, 10],
                         'kernel': ['linear', 'rbf', 'poly']},
             scoring='accuracy')
In [126]:
print('final params',GS_SVM.best_params_)
print('best score',GS_SVM.best_score_)
final params {'C': 5, 'gamma': 0.1, 'kernel': 'rbf'}
best score 0.9826137303556658
In [127]:
# (Exercise) Use predict to get the predicted values, store them in predicted, and print the confusion matrix and the classification_report.
predicted=GS_SVM.predict(x_test)
cMatrix=confusion_matrix(y_test,predicted)
print(cMatrix)
print(metrics.classification_report(y_test,predicted))
[[310   1]
 [  7  71]]
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       311
           1       0.99      0.91      0.95        78

    accuracy                           0.98       389
   macro avg       0.98      0.95      0.97       389
weighted avg       0.98      0.98      0.98       389

Neural Network Model

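  • A neural network is a structure that stacks and combines Perceptrons, each of which aggregates its input data to produce an output as shown above. A Perceptron can be divided into the two parts below.
    • A transfer function, which is a linear sum of the inputs
    • An activation function f()
  • The inputs can themselves be the outputs of other perceptrons; stacking them this way gives the structure shown below, which is called a neural network.
    • Input Layer: the layer where the input data sits.
    • Hidden Layer: the layer of perceptrons that take the input data, or the outputs of another hidden layer, as their inputs.
    • Output Layer: the layer of nodes that take the outputs of the last hidden layer as inputs and return the result of the output function.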
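
The structure described above can be written out as a tiny forward pass. This is a minimal sketch with made-up weights (all numbers and the helper names `perceptron` and `sigmoid` are illustrative assumptions); it shows only how outputs of one layer become inputs of the next, not how the weights are learned.

```python
import math

def perceptron(inputs, weights, bias, activation=math.tanh):
    """One perceptron: weighted sum (transfer function), then activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Made-up weights for a network with 2 inputs, 2 hidden units, 1 output.
x = [0.5, -1.0]
hidden = [
    perceptron(x, weights=[1.0, 0.5], bias=0.0),
    perceptron(x, weights=[-1.0, 1.0], bias=0.5),
]
output = perceptron(hidden, weights=[2.0, -1.0], bias=0.0, activation=sigmoid)
print(hidden, output)
```

The hidden layer uses tanh, one of the activations tried in the grid search below, and the output unit uses a sigmoid so that the result can be read as a probability of the positive class.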
In [129]:
from sklearn.neural_network import MLPClassifier
In [130]:
# (Exercise) Create a neural network model and store it in nn_model.
nn_model=MLPClassifier(random_state=1)
In [131]:
x_train.shape
Out[131]:
(1552, 26)
In [132]:
# (Exercise) Run a grid search over the neural network settings alpha, hidden_layer_sizes, and activation.
# Set the GridSearchCV options to cv=10, n_jobs=n_thread, scoring="accuracy".
parameters={'alpha':[1e-3,1e-1,1e1],'hidden_layer_sizes':[(5),(30),(60)],'activation':['tanh','relu'],'solver':['adam','lbfgs']}
GS_NN=GridSearchCV(nn_model,parameters,cv=10,n_jobs=n_thread,scoring="accuracy")
GS_NN.fit(x_train,y_train)
Out[132]:
GridSearchCV(cv=10, estimator=MLPClassifier(random_state=1), n_jobs=16,
             param_grid={'activation': ['tanh', 'relu'],
                         'alpha': [0.001, 0.1, 10.0],
                         'hidden_layer_sizes': [5, 30, 60],
                         'solver': ['adam', 'lbfgs']},
             scoring='accuracy')
In [133]:
print('final params', GS_NN.best_params_)
print('best score', GS_NN.best_score_)
final params {'activation': 'tanh', 'alpha': 0.1, 'hidden_layer_sizes': 30, 'solver': 'lbfgs'}
best score 0.9793796526054592
In [134]:
means = GS_NN.cv_results_['mean_test_score']
stds = GS_NN.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, GS_NN.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r"
          % (mean, std * 2, params))
print()
0.970 (+/-0.021) for {'activation': 'tanh', 'alpha': 0.001, 'hidden_layer_sizes': 5, 'solver': 'adam'}
0.972 (+/-0.028) for {'activation': 'tanh', 'alpha': 0.001, 'hidden_layer_sizes': 5, 'solver': 'lbfgs'}
0.974 (+/-0.022) for {'activation': 'tanh', 'alpha': 0.001, 'hidden_layer_sizes': 30, 'solver': 'adam'}
0.974 (+/-0.020) for {'activation': 'tanh', 'alpha': 0.001, 'hidden_layer_sizes': 30, 'solver': 'lbfgs'}
0.977 (+/-0.025) for {'activation': 'tanh', 'alpha': 0.001, 'hidden_layer_sizes': 60, 'solver': 'adam'}
0.976 (+/-0.021) for {'activation': 'tanh', 'alpha': 0.001, 'hidden_layer_sizes': 60, 'solver': 'lbfgs'}
0.969 (+/-0.021) for {'activation': 'tanh', 'alpha': 0.1, 'hidden_layer_sizes': 5, 'solver': 'adam'}
0.973 (+/-0.030) for {'activation': 'tanh', 'alpha': 0.1, 'hidden_layer_sizes': 5, 'solver': 'lbfgs'}
0.974 (+/-0.021) for {'activation': 'tanh', 'alpha': 0.1, 'hidden_layer_sizes': 30, 'solver': 'adam'}
0.979 (+/-0.026) for {'activation': 'tanh', 'alpha': 0.1, 'hidden_layer_sizes': 30, 'solver': 'lbfgs'}
0.977 (+/-0.023) for {'activation': 'tanh', 'alpha': 0.1, 'hidden_layer_sizes': 60, 'solver': 'adam'}
0.978 (+/-0.022) for {'activation': 'tanh', 'alpha': 0.1, 'hidden_layer_sizes': 60, 'solver': 'lbfgs'}
0.957 (+/-0.028) for {'activation': 'tanh', 'alpha': 10.0, 'hidden_layer_sizes': 5, 'solver': 'adam'}
0.971 (+/-0.027) for {'activation': 'tanh', 'alpha': 10.0, 'hidden_layer_sizes': 5, 'solver': 'lbfgs'}
0.955 (+/-0.030) for {'activation': 'tanh', 'alpha': 10.0, 'hidden_layer_sizes': 30, 'solver': 'adam'}
0.972 (+/-0.028) for {'activation': 'tanh', 'alpha': 10.0, 'hidden_layer_sizes': 30, 'solver': 'lbfgs'}
0.956 (+/-0.029) for {'activation': 'tanh', 'alpha': 10.0, 'hidden_layer_sizes': 60, 'solver': 'adam'}
0.972 (+/-0.028) for {'activation': 'tanh', 'alpha': 10.0, 'hidden_layer_sizes': 60, 'solver': 'lbfgs'}
0.967 (+/-0.036) for {'activation': 'relu', 'alpha': 0.001, 'hidden_layer_sizes': 5, 'solver': 'adam'}
0.975 (+/-0.015) for {'activation': 'relu', 'alpha': 0.001, 'hidden_layer_sizes': 5, 'solver': 'lbfgs'}
0.977 (+/-0.023) for {'activation': 'relu', 'alpha': 0.001, 'hidden_layer_sizes': 30, 'solver': 'adam'}
0.972 (+/-0.029) for {'activation': 'relu', 'alpha': 0.001, 'hidden_layer_sizes': 30, 'solver': 'lbfgs'}
0.977 (+/-0.023) for {'activation': 'relu', 'alpha': 0.001, 'hidden_layer_sizes': 60, 'solver': 'adam'}
0.974 (+/-0.022) for {'activation': 'relu', 'alpha': 0.001, 'hidden_layer_sizes': 60, 'solver': 'lbfgs'}
0.967 (+/-0.036) for {'activation': 'relu', 'alpha': 0.1, 'hidden_layer_sizes': 5, 'solver': 'adam'}
0.972 (+/-0.021) for {'activation': 'relu', 'alpha': 0.1, 'hidden_layer_sizes': 5, 'solver': 'lbfgs'}
0.976 (+/-0.024) for {'activation': 'relu', 'alpha': 0.1, 'hidden_layer_sizes': 30, 'solver': 'adam'}
0.973 (+/-0.029) for {'activation': 'relu', 'alpha': 0.1, 'hidden_layer_sizes': 30, 'solver': 'lbfgs'}
0.977 (+/-0.022) for {'activation': 'relu', 'alpha': 0.1, 'hidden_layer_sizes': 60, 'solver': 'adam'}
0.976 (+/-0.021) for {'activation': 'relu', 'alpha': 0.1, 'hidden_layer_sizes': 60, 'solver': 'lbfgs'}
0.959 (+/-0.031) for {'activation': 'relu', 'alpha': 10.0, 'hidden_layer_sizes': 5, 'solver': 'adam'}
0.970 (+/-0.029) for {'activation': 'relu', 'alpha': 10.0, 'hidden_layer_sizes': 5, 'solver': 'lbfgs'}
0.958 (+/-0.028) for {'activation': 'relu', 'alpha': 10.0, 'hidden_layer_sizes': 30, 'solver': 'adam'}
0.973 (+/-0.024) for {'activation': 'relu', 'alpha': 10.0, 'hidden_layer_sizes': 30, 'solver': 'lbfgs'}
0.958 (+/-0.030) for {'activation': 'relu', 'alpha': 10.0, 'hidden_layer_sizes': 60, 'solver': 'adam'}
0.974 (+/-0.024) for {'activation': 'relu', 'alpha': 10.0, 'hidden_layer_sizes': 60, 'solver': 'lbfgs'}
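
To line the five models up side by side, the best cross-validated accuracies printed in this post can be collected and sorted. The values below are copied from the best-score outputs above and rounded to four decimals; the dictionary name is illustrative.

```python
# Best mean cross-validated accuracy reported for each grid search above.
cv_scores = {
    'Logistic Regression': 0.9723,
    'Decision Tree': 0.9787,
    'Random Forest': 0.9839,
    'SVM': 0.9826,
    'Neural Network': 0.9794,
}

ranking = sorted(cv_scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(f"{name:<20} {score:.4f}")
```

By this measure Random Forest ranks first and logistic regression last, although the gaps are small relative to the fold-to-fold spread shown in the grid search listings.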
