Project & Kaggle

Building and Comparing Classification Models Using Steel Plate Manufacturing Process Data

판교데싸 2021. 4. 7. 17:07

Data Overview

The Steel Plates Faults dataset has 1,941 samples, made up of the target variables below plus the remaining explanatory variables.

  • Target variables (7) - indicate which type of fault occurred:

    • Pastry, Z_Scratch, K_Scatch, Stains, Dirtiness, Bumps, Other_Faults
  • Explanatory variables (27) - various attributes of the steel plate, such as its length, luminosity, thickness, and steel type.

    • From the 1st column, X_Minimum, to the 27th column, SigmoidOfAreas
  • Data source: https://www.kaggle.com/mahsateimourikia/faults-nna/notebooks

General Characteristics of Manufacturing Process Data

  • Manufacturing process data is typically used to predict defect rates (so that the causes of defects can be identified and removed) or to forecast inventory (so that production matches demand).
  • Because data collection is usually automated, data quality tends to be high and missing values rare compared with other domains.

Prepare the data as follows.

In [91]:
# Check the number of CPUs; os is imported here because this cell runs before the import cell below.
import os

n_cpu = os.cpu_count()
print("The number of cpus: ", n_cpu)
n_thread = n_cpu * 2
print("Expected number of threads:", n_thread)
The number of cpus:  8
Expected number of threads: 16
In [5]:
import os
import glob
In [6]:
os.listdir()
Out[6]:
['.ipynb_checkpoints',
 'data',
 '[문제] Chapter 01_철판 제조 공정 데이터를 활용한 분류모형 생성 및 성능 비교-Copy1.ipynb',
 '[문제] Chapter 01_철판 제조 공정 데이터를 활용한 분류모형 생성 및 성능 비교.ipynb',
 '[해설] Chapter 01_철판 제조 공정 데이터를 활용한 분류모형 생성 및 성능 비교.ipynb']
In [7]:
import pandas as pd 
import numpy as np
In [8]:
# Read in the data.
df = pd.read_csv("data/Faults.NNA",  delimiter='\t', header=None)
df.head()
Out[8]:
0 1 2 3 4 5 6 7 8 9 ... 24 25 26 27 28 29 30 31 32 33
0 42 50 270900 270944 267 17 44 24220 76 108 ... 0.8182 -0.2913 0.5822 1 0 0 0 0 0 0
1 645 651 2538079 2538108 108 10 30 11397 84 123 ... 0.7931 -0.1756 0.2984 1 0 0 0 0 0 0
2 829 835 1553913 1553931 71 8 19 7972 99 125 ... 0.6667 -0.1228 0.2150 1 0 0 0 0 0 0
3 853 860 369370 369415 176 13 45 18996 99 126 ... 0.8444 -0.1568 0.5212 1 0 0 0 0 0 0
4 1289 1306 498078 498335 2409 60 260 246930 37 126 ... 0.9338 -0.1992 1.0000 1 0 0 0 0 0 0

5 rows × 34 columns

In [9]:
# Read the column labels and assign them as the DataFrame's column names.
attributes_name=pd.read_csv("data/Faults27x7_var",  delimiter=' ', header=None)
df.columns=attributes_name[0]
In [10]:
df.columns
Out[10]:
Index(['X_Minimum', 'X_Maximum', 'Y_Minimum', 'Y_Maximum', 'Pixels_Areas',
       'X_Perimeter', 'Y_Perimeter', 'Sum_of_Luminosity',
       'Minimum_of_Luminosity', 'Maximum_of_Luminosity', 'Length_of_Conveyer',
       'TypeOfSteel_A300', 'TypeOfSteel_A400', 'Steel_Plate_Thickness',
       'Edges_Index', 'Empty_Index', 'Square_Index', 'Outside_X_Index',
       'Edges_X_Index', 'Edges_Y_Index', 'Outside_Global_Index', 'LogOfAreas',
       'Log_X_Index', 'Log_Y_Index', 'Orientation_Index', 'Luminosity_Index',
       'SigmoidOfAreas', 'Pastry', 'Z_Scratch', 'K_Scatch', 'Stains',
       'Dirtiness', 'Bumps', 'Other_Faults'],
      dtype='object', name=0)
In [11]:
df.head()
Out[11]:
X_Minimum X_Maximum Y_Minimum Y_Maximum Pixels_Areas X_Perimeter Y_Perimeter Sum_of_Luminosity Minimum_of_Luminosity Maximum_of_Luminosity ... Orientation_Index Luminosity_Index SigmoidOfAreas Pastry Z_Scratch K_Scatch Stains Dirtiness Bumps Other_Faults
0 42 50 270900 270944 267 17 44 24220 76 108 ... 0.8182 -0.2913 0.5822 1 0 0 0 0 0 0
1 645 651 2538079 2538108 108 10 30 11397 84 123 ... 0.7931 -0.1756 0.2984 1 0 0 0 0 0 0
2 829 835 1553913 1553931 71 8 19 7972 99 125 ... 0.6667 -0.1228 0.2150 1 0 0 0 0 0 0
3 853 860 369370 369415 176 13 45 18996 99 126 ... 0.8444 -0.1568 0.5212 1 0 0 0 0 0 0
4 1289 1306 498078 498335 2409 60 260 246930 37 126 ... 0.9338 -0.1992 1.0000 1 0 0 0 0 0 0

5 rows × 34 columns

In [13]:
df.describe()
Out[13]:
X_Minimum X_Maximum Y_Minimum Y_Maximum Pixels_Areas X_Perimeter Y_Perimeter Sum_of_Luminosity Minimum_of_Luminosity Maximum_of_Luminosity ... Orientation_Index Luminosity_Index SigmoidOfAreas Pastry Z_Scratch K_Scatch Stains Dirtiness Bumps Other_Faults
count 1941.000000 1941.000000 1.941000e+03 1.941000e+03 1941.000000 1941.000000 1941.000000 1.941000e+03 1941.000000 1941.000000 ... 1941.000000 1941.000000 1941.000000 1941.000000 1941.000000 1941.000000 1941.000000 1941.000000 1941.000000 1941.000000
mean 571.136012 617.964451 1.650685e+06 1.650739e+06 1893.878413 111.855229 82.965997 2.063121e+05 84.548686 130.193715 ... 0.083288 -0.131305 0.585420 0.081401 0.097888 0.201443 0.037094 0.028336 0.207110 0.346728
std 520.690671 497.627410 1.774578e+06 1.774590e+06 5168.459560 301.209187 426.482879 5.122936e+05 32.134276 18.690992 ... 0.500868 0.148767 0.339452 0.273521 0.297239 0.401181 0.189042 0.165973 0.405339 0.476051
min 0.000000 4.000000 6.712000e+03 6.724000e+03 2.000000 2.000000 1.000000 2.500000e+02 0.000000 37.000000 ... -0.991000 -0.998900 0.119000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 51.000000 192.000000 4.712530e+05 4.712810e+05 84.000000 15.000000 13.000000 9.522000e+03 63.000000 124.000000 ... -0.333300 -0.195000 0.248200 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 435.000000 467.000000 1.204128e+06 1.204136e+06 174.000000 26.000000 25.000000 1.920200e+04 90.000000 127.000000 ... 0.095200 -0.133000 0.506300 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 1053.000000 1072.000000 2.183073e+06 2.183084e+06 822.000000 84.000000 83.000000 8.301100e+04 106.000000 140.000000 ... 0.511600 -0.066600 0.999800 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000
max 1705.000000 1713.000000 1.298766e+07 1.298769e+07 152655.000000 10449.000000 18152.000000 1.159141e+07 203.000000 253.000000 ... 0.991700 0.642100 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

8 rows × 34 columns

In [ ]:
 

The seven target columns are effectively one-hot encoded: each row has exactly one 1 among them. They need to be collapsed into a single class label, which is a bit fiddly.

In [22]:
## Method 1: build the conditions with the logical operator &.
conditions=[(df['Pastry'] == 1) & (df['Z_Scratch'] == 0)& (df['K_Scatch'] == 0)& (df['Stains'] == 0)& (df['Dirtiness'] == 0)& (df['Bumps'] == 0)& (df['Other_Faults'] == 0), 
            (df['Pastry'] == 0) & (df['Z_Scratch'] == 1)& (df['K_Scatch'] == 0)& (df['Stains'] == 0)& (df['Dirtiness'] == 0)& (df['Bumps'] == 0)& (df['Other_Faults'] == 0),
            (df['Pastry'] == 0) & (df['Z_Scratch'] == 0)& (df['K_Scatch'] == 1)& (df['Stains'] == 0)& (df['Dirtiness'] == 0)& (df['Bumps'] == 0)& (df['Other_Faults'] == 0),
            (df['Pastry'] == 0) & (df['Z_Scratch'] == 0)& (df['K_Scatch'] == 0)& (df['Stains'] == 1)& (df['Dirtiness'] == 0)& (df['Bumps'] == 0)& (df['Other_Faults'] == 0),
            (df['Pastry'] == 0) & (df['Z_Scratch'] == 0)& (df['K_Scatch'] == 0)& (df['Stains'] == 0)& (df['Dirtiness'] == 1)& (df['Bumps'] == 0)& (df['Other_Faults'] == 0),
            (df['Pastry'] == 0) & (df['Z_Scratch'] == 0)& (df['K_Scatch'] == 0)& (df['Stains'] == 0)& (df['Dirtiness'] == 0)& (df['Bumps'] == 1)& (df['Other_Faults'] == 0),
            (df['Pastry'] == 0) & (df['Z_Scratch'] == 0)& (df['K_Scatch'] == 0)& (df['Stains'] == 0)& (df['Dirtiness'] == 0)& (df['Bumps'] == 0)& (df['Other_Faults'] == 1)]

Use the astype function, which automatically converts 1 to True and 0 to False.

In [19]:
conditions
Out[19]:
[0        True
 1        True
 2        True
 3        True
 4        True
         ...  
 1936    False
 1937    False
 1938    False
 1939    False
 1940    False
 Length: 1941, dtype: bool,
 ... (six more boolean Series of length 1941, one per remaining fault type; in the last one, for Other_Faults, rows 1936-1940 are True)]
In [29]:
## Method 2: use pandas.Series.astype.
conditions=[df['Pastry'].astype(bool),
            df['Z_Scratch'].astype(bool),
            df['K_Scatch'].astype(bool),
            df['Stains'].astype(bool),
            df['Dirtiness'].astype(bool),
            df['Bumps'].astype(bool),
            df['Other_Faults'].astype(bool)]
In [30]:
conditions
Out[30]:
[0        True
 1        True
 2        True
 3        True
 4        True
         ...  
 1936    False
 1937    False
 1938    False
 1939    False
 1940    False
 Name: Pastry, Length: 1941, dtype: bool,
 ... (six more boolean Series named Z_Scratch, K_Scatch, Stains, Dirtiness, Bumps, and Other_Faults)]

We want to apply astype to all of them at once, which is why the next cell writes it with map and a lambda.

In [31]:
## (Exercise) Method 3: use pandas.Series.astype with map and lambda.
# Build conditions_bf as a list of the Series for each fault variable.
# Then use map and lambda to apply astype to every element of conditions_bf.

conditions_bf = [df['Pastry'],
                 df['Z_Scratch'],
                 df['K_Scatch'],
                 df['Stains'],
                 df['Dirtiness'],
                 df['Bumps'],
                 df['Other_Faults']]
conditions = list(map(lambda i: i.astype(bool), conditions_bf))
In [32]:
conditions_bf
Out[32]:
[0       1
 1       1
 2       1
 3       1
 4       1
        ..
 1936    0
 1937    0
 1938    0
 1939    0
 1940    0
 Name: Pastry, Length: 1941, dtype: int64,
 ... (six more int64 Series, one per remaining fault column)]
In [33]:
print(type(conditions))
print(type(conditions[0]))
print(len(conditions))
print(len(conditions[0]))
<class 'list'>
<class 'pandas.core.series.Series'>
7
1941
In [34]:
choices = ['Pastry', 'Z_Scratch', 'K_Scatch', 'Stains', 'Dirtiness', 'Bumps', 'Other_Faults']
In [35]:
choices
Out[35]:
['Pastry',
 'Z_Scratch',
 'K_Scatch',
 'Stains',
 'Dirtiness',
 'Bumps',
 'Other_Faults']
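Before applying it to the whole frame, here is a minimal toy sketch of how numpy.select pairs conditions with choices (the 4-row frame is purely illustrative): for each row it returns the choice of the first condition that is True, and the default (0) where none match.

import numpy as np
import pandas as pd

toy = pd.DataFrame({'A': [1, 0, 0, 0], 'B': [0, 1, 0, 1]})
conds = [toy['A'] == 1, toy['B'] == 1]
labels = ['A', 'B']
print(np.select(conds, labels))  # ['A' 'B' '0' 'B'] -- the default 0 is cast to the string '0'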
In [36]:
df['class']=np.select(conditions,choices)
In [37]:
df['class']
Out[37]:
0             Pastry
1             Pastry
2             Pastry
3             Pastry
4             Pastry
            ...     
1936    Other_Faults
1937    Other_Faults
1938    Other_Faults
1939    Other_Faults
1940    Other_Faults
Name: class, Length: 1941, dtype: object

In the end, class is the target variable, and the goal is multi-class classification.
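As an aside: because exactly one fault flag is 1 per row, pandas' idxmax gives the same label column in one line (a minimal sketch, not the intended exercise solution):

df['class'] = df[choices].idxmax(axis=1)  # column name of the single 1 in each row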

In [40]:
df.tail(50)
Out[40]:
X_Minimum X_Maximum Y_Minimum Y_Maximum Pixels_Areas X_Perimeter Y_Perimeter Sum_of_Luminosity Minimum_of_Luminosity Maximum_of_Luminosity ... Luminosity_Index SigmoidOfAreas Pastry Z_Scratch K_Scatch Stains Dirtiness Bumps Other_Faults class
1891 1292 1372 1506506 1506552 1973 134 80 240091 105 140 ... -0.0493 1.0000 0 0 0 0 0 0 1 Other_Faults
1892 123 140 1477978 1477986 69 19 12 8701 113 142 ... -0.0148 0.2482 0 0 0 0 0 0 1 Other_Faults
1893 1551 1576 2071735 2071746 143 36 25 17042 109 135 ... -0.0689 0.4547 0 0 0 0 0 0 1 Other_Faults
1894 1379 1396 2128940 2128956 126 33 22 16498 116 148 ... 0.0229 0.4498 0 0 0 0 0 0 1 Other_Faults
1895 773 801 2240409 2240422 207 34 23 26766 114 143 ... 0.0102 0.6015 0 0 0 0 0 0 1 Other_Faults
1896 547 588 2084580 2084619 860 57 44 95173 91 132 ... -0.1354 0.9998 0 0 0 0 0 0 1 Other_Faults
1897 1557 1584 2375780 2375805 339 79 40 39556 104 132 ... -0.0884 0.9231 0 0 0 0 0 0 1 Other_Faults
1898 691 710 664879 664883 55 19 6 7211 114 150 ... 0.0243 0.1812 0 0 0 0 0 0 1 Other_Faults
1899 1398 1445 684296 684315 381 91 47 48528 111 143 ... -0.0049 0.9809 0 0 0 0 0 0 1 Other_Faults
1900 900 916 693632 693638 53 17 8 6697 110 143 ... -0.0128 0.2018 0 0 0 0 0 0 1 Other_Faults
1901 832 848 712437 712446 67 29 12 8140 102 141 ... -0.0508 0.2583 0 0 0 0 0 0 1 Other_Faults
1902 1027 1048 787015 787022 76 22 10 9934 109 148 ... 0.0212 0.2621 0 0 0 0 0 0 1 Other_Faults
1903 1275 1300 883635 883640 73 28 14 9247 111 143 ... -0.0104 0.2348 0 0 0 0 0 0 1 Other_Faults
1904 983 1005 895625 895642 102 42 22 12942 114 141 ... -0.0087 0.6173 0 0 0 0 0 0 1 Other_Faults
1905 374 382 303204 303209 24 10 5 3166 124 140 ... 0.0306 0.1483 0 0 0 0 0 0 1 Other_Faults
1906 196 209 157035 157044 60 20 14 7508 117 135 ... -0.0224 0.2253 0 0 0 0 0 0 1 Other_Faults
1907 122 146 203252 203261 95 33 22 12566 124 143 ... 0.0334 0.3601 0 0 0 0 0 0 1 Other_Faults
1908 189 203 238753 238766 63 20 14 8140 118 141 ... 0.0094 0.3097 0 0 0 0 0 0 1 Other_Faults
1909 213 229 260092 260110 137 50 27 17621 120 140 ... 0.0049 0.4763 0 0 0 0 0 0 1 Other_Faults
1910 345 362 283416 283428 60 23 18 7914 124 143 ... 0.0305 0.3419 0 0 0 0 0 0 1 Other_Faults
1911 121 148 373848 373861 149 49 29 19691 122 143 ... 0.0325 0.5805 0 0 0 0 0 0 1 Other_Faults
1912 182 211 402237 402266 268 98 55 35565 125 143 ... 0.0368 0.9732 0 0 0 0 0 0 1 Other_Faults
1913 273 298 473262 473272 94 30 18 11877 117 143 ... -0.0129 0.4138 0 0 0 0 0 0 1 Other_Faults
1914 389 406 516427 516436 67 22 16 8596 120 143 ... 0.0023 0.2699 0 0 0 0 0 0 1 Other_Faults
1915 856 871 531684 531693 54 23 16 7145 125 143 ... 0.0337 0.2469 0 0 0 0 0 0 1 Other_Faults
1916 767 788 566837 566850 83 28 20 10345 115 141 ... -0.0263 0.4514 0 0 0 0 0 0 1 Other_Faults
1917 154 169 260124 260136 75 27 17 9948 125 143 ... 0.0362 0.3068 0 0 0 0 0 0 1 Other_Faults
1918 275 299 1133715 1133734 318 39 32 39911 116 140 ... -0.0195 0.7359 0 0 0 0 0 0 1 Other_Faults
1919 271 291 1488866 1488885 204 38 22 23155 105 126 ... -0.1132 0.6268 0 0 0 0 0 0 1 Other_Faults
1920 214 247 1538494 1538509 209 66 46 27037 118 143 ... 0.0106 0.7833 0 0 0 0 0 0 1 Other_Faults
1921 270 301 1563477 1563499 329 61 37 41053 115 140 ... -0.0251 0.9263 0 0 0 0 0 0 1 Other_Faults
1922 21 47 2867489 2867513 281 43 28 31838 104 126 ... -0.1148 0.8952 0 0 0 0 0 0 1 Other_Faults
1923 23 43 2985008 2985026 170 42 20 19738 107 125 ... -0.0929 0.5951 0 0 0 0 0 0 1 Other_Faults
1924 21 45 3063739 3063768 277 52 42 33476 113 128 ... -0.0558 0.9324 0 0 0 0 0 0 1 Other_Faults
1925 21 39 3411415 3411438 201 46 28 23925 112 126 ... -0.0701 0.6781 0 0 0 0 0 0 1 Other_Faults
1926 1053 1063 173208 173226 68 26 18 8717 120 141 ... 0.0015 0.3068 0 0 0 0 0 0 1 Other_Faults
1927 160 170 252936 252958 97 24 23 12775 123 142 ... 0.0289 0.3663 0 0 0 0 0 0 1 Other_Faults
1928 1303 1344 33298 33323 452 44 26 42686 65 117 ... -0.2622 0.9920 0 0 0 0 0 0 1 Other_Faults
1929 130 153 197701 197718 240 47 18 30130 118 140 ... -0.0192 0.6438 0 0 0 0 0 0 1 Other_Faults
1930 247 272 244048 244067 235 36 22 29911 119 142 ... -0.0056 0.7598 0 0 0 0 0 0 1 Other_Faults
1931 523 567 266325 266337 209 67 30 26833 119 141 ... 0.0030 0.8183 0 0 0 0 0 0 1 Other_Faults
1932 239 269 276029 276047 299 51 22 37820 116 140 ... -0.0118 0.8299 0 0 0 0 0 0 1 Other_Faults
1933 367 422 289647 289665 355 116 58 46882 123 143 ... 0.0317 0.9899 0 0 0 0 0 0 1 Other_Faults
1934 137 170 301492 301511 304 59 26 35778 111 126 ... -0.0805 0.8971 0 0 0 0 0 0 1 Other_Faults
1935 238 287 315114 315142 671 91 39 86424 119 143 ... 0.0062 0.9992 0 0 0 0 0 0 1 Other_Faults
1936 249 277 325780 325796 273 54 22 35033 119 141 ... 0.0026 0.7254 0 0 0 0 0 0 1 Other_Faults
1937 144 175 340581 340598 287 44 24 34599 112 133 ... -0.0582 0.8173 0 0 0 0 0 0 1 Other_Faults
1938 145 174 386779 386794 292 40 22 37572 120 140 ... 0.0052 0.7079 0 0 0 0 0 0 1 Other_Faults
1939 137 170 422497 422528 419 97 47 52715 117 140 ... -0.0171 0.9919 0 0 0 0 0 0 1 Other_Faults
1940 1261 1281 87951 87967 103 26 22 11682 101 133 ... -0.1139 0.5296 0 0 0 0 0 0 1 Other_Faults

50 rows × 35 columns

EDA: Missing Values

In [42]:
df.isnull().sum()
Out[42]:
0
X_Minimum                0
X_Maximum                0
Y_Minimum                0
Y_Maximum                0
Pixels_Areas             0
X_Perimeter              0
Y_Perimeter              0
Sum_of_Luminosity        0
Minimum_of_Luminosity    0
Maximum_of_Luminosity    0
Length_of_Conveyer       0
TypeOfSteel_A300         0
TypeOfSteel_A400         0
Steel_Plate_Thickness    0
Edges_Index              0
Empty_Index              0
Square_Index             0
Outside_X_Index          0
Edges_X_Index            0
Edges_Y_Index            0
Outside_Global_Index     0
LogOfAreas               0
Log_X_Index              0
Log_Y_Index              0
Orientation_Index        0
Luminosity_Index         0
SigmoidOfAreas           0
Pastry                   0
Z_Scratch                0
K_Scatch                 0
Stains                   0
Dirtiness                0
Bumps                    0
Other_Faults             0
class                    0
dtype: int64

A characteristic of manufacturing process data - no nulls.

Check whether any values are implausible.

In [45]:
round(df.describe(),2)
Out[45]:
X_Minimum X_Maximum Y_Minimum Y_Maximum Pixels_Areas X_Perimeter Y_Perimeter Sum_of_Luminosity Minimum_of_Luminosity Maximum_of_Luminosity ... Orientation_Index Luminosity_Index SigmoidOfAreas Pastry Z_Scratch K_Scatch Stains Dirtiness Bumps Other_Faults
count 1941.00 1941.00 1941.00 1941.00 1941.00 1941.00 1941.00 1941.00 1941.00 1941.00 ... 1941.00 1941.00 1941.00 1941.00 1941.0 1941.0 1941.00 1941.00 1941.00 1941.00
mean 571.14 617.96 1650684.87 1650738.71 1893.88 111.86 82.97 206312.15 84.55 130.19 ... 0.08 -0.13 0.59 0.08 0.1 0.2 0.04 0.03 0.21 0.35
std 520.69 497.63 1774578.41 1774590.09 5168.46 301.21 426.48 512293.59 32.13 18.69 ... 0.50 0.15 0.34 0.27 0.3 0.4 0.19 0.17 0.41 0.48
min 0.00 4.00 6712.00 6724.00 2.00 2.00 1.00 250.00 0.00 37.00 ... -0.99 -1.00 0.12 0.00 0.0 0.0 0.00 0.00 0.00 0.00
25% 51.00 192.00 471253.00 471281.00 84.00 15.00 13.00 9522.00 63.00 124.00 ... -0.33 -0.20 0.25 0.00 0.0 0.0 0.00 0.00 0.00 0.00
50% 435.00 467.00 1204128.00 1204136.00 174.00 26.00 25.00 19202.00 90.00 127.00 ... 0.10 -0.13 0.51 0.00 0.0 0.0 0.00 0.00 0.00 0.00
75% 1053.00 1072.00 2183073.00 2183084.00 822.00 84.00 83.00 83011.00 106.00 140.00 ... 0.51 -0.07 1.00 0.00 0.0 0.0 0.00 0.00 0.00 1.00
max 1705.00 1713.00 12987661.00 12987692.00 152655.00 10449.00 18152.00 11591414.00 203.00 253.00 ... 0.99 0.64 1.00 1.00 1.0 1.0 1.00 1.00 1.00 1.00

8 rows × 34 columns

Stains and Dirtiness have very few samples, so the model is likely to perform poorly on those classes.

In [44]:
df['class'].value_counts()
Out[44]:
Other_Faults    673
Bumps           402
K_Scatch        391
Z_Scratch       190
Pastry          158
Stains           72
Dirtiness        55
Name: class, dtype: int64

Examining Relationships Between Variables with Scatter Plots

In [47]:
import matplotlib.pyplot as plt
In [48]:
color_code = {'Pastry':'Red', 'Z_Scratch':'Blue', 'K_Scatch':'Green', 'Stains':'Black', 'Dirtiness':'Pink', 'Bumps':'Brown', 'Other_Faults':'Gold'}
In [49]:
color_code
Out[49]:
{'Pastry': 'Red',
 'Z_Scratch': 'Blue',
 'K_Scatch': 'Green',
 'Stains': 'Black',
 'Dirtiness': 'Pink',
 'Bumps': 'Brown',
 'Other_Faults': 'Gold'}

The dict get method: given a key, it returns the corresponding value.

In [56]:
color_list=[color_code.get(i) for i in df.loc[:,'class']]
In [57]:
color_list
Out[57]:
['Red',
 'Red',
 'Red',
 ...,
 'Blue',
 ...,
 'Green',
 ...,
 'Black',
 ...,
 'Pink',
 ...,
 'Brown',
 ...]
(1,941 entries in row order: 158 'Red', then 190 'Blue', 391 'Green', 72 'Black', 55 'Pink', 402 'Brown', and 'Gold' for the remaining Other_Faults rows; the full dump is elided here)
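Equivalently, since df['class'] is a Series, pandas' map produces the same colors without a comprehension:

color_list = df['class'].map(color_code).tolist()  # same contents as the list comprehension above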
In [59]:
# (Exercise) Use pandas.plotting.scatter_matrix and the color_list built above to draw scatter plots, with histograms on the diagonal. Use figsize=[30,30], alpha=0.3, s=50.
pd.plotting.scatter_matrix(df.loc[:,df.columns!='class'], c=color_list, figsize= [30,30], alpha=0.3,s = 50, diagonal='hist')
Out[59]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000002247A7E9280>, ...],
       ...], dtype=object)
(a 34 x 34 grid of AxesSubplot objects; the scatter matrix itself renders as a figure)

Scatter plots reveal pairwise relationships: how correlated variables are, whether relationships are linear, and which variable combinations separate the classes.

In [62]:
import seaborn as sns
sns.set_style('white')
In [64]:
g=sns.factorplot(x='class',data=df,kind='count',palette='YlGnBu',size=6)
g.ax.xaxis.set_label_text("Type of Defect")
g.ax.yaxis.set_label_text("Count")
g.ax.set_title("The number of Defects by Defect type")
C:\Users\Administrator\anaconda3\lib\site-packages\seaborn\categorical.py:3714: UserWarning: The `factorplot` function has been renamed to `catplot`. The original name will be removed in a future release. Please update your code. Note that the default `kind` in `factorplot` (`'point'`) has changed `'strip'` in `catplot`.
  warnings.warn(msg)
C:\Users\Administrator\anaconda3\lib\site-packages\seaborn\categorical.py:3720: UserWarning: The `size` parameter has been renamed to `height`; please update your code.
  warnings.warn(msg, UserWarning)
Out[64]:
Text(0.5, 1.0, 'The number of Defects by Defect type')

Since these are kinds of defects, the x-axis is labeled "Type of Defect".

Printing the count value on top of each bar

In [66]:
# Copy and paste the code completed in the previous cell.
g=sns.factorplot(x='class',data=df,kind='count',palette='YlGnBu',size=6)
g.ax.xaxis.set_label_text("Type of Defect")
g.ax.yaxis.set_label_text("Count")
g.ax.set_title("The number of Defects by Defect type")
# (Exercise) Annotate the top of each bar with its value.
for p in g.ax.patches:
  g.ax.annotate((p.get_height()),(p.get_x()+0.2,p.get_height()+10))
C:\Users\Administrator\anaconda3\lib\site-packages\seaborn\categorical.py:3714: UserWarning: The `factorplot` function has been renamed to `catplot`. The original name will be removed in a future release. Please update your code. Note that the default `kind` in `factorplot` (`'point'`) has changed `'strip'` in `catplot`.
  warnings.warn(msg)
C:\Users\Administrator\anaconda3\lib\site-packages\seaborn\categorical.py:3720: UserWarning: The `size` parameter has been renamed to `height`; please update your code.
  warnings.warn(msg, UserWarning)
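Per the deprecation warnings above, on a current seaborn the same chart would be written with catplot and height (a sketch of the renamed API):

g = sns.catplot(x='class', data=df, kind='count', palette='YlGnBu', height=6)
g.ax.xaxis.set_label_text("Type of Defect")
g.ax.yaxis.set_label_text("Count")
g.ax.set_title("The number of Defects by Defect type")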

Examining Relationships Between Variables with Correlation Coefficients

In [67]:
df_corTarget = df[['X_Minimum', 'X_Maximum', 'Y_Minimum', 'Y_Maximum', 'Pixels_Areas',
       'X_Perimeter', 'Y_Perimeter', 'Sum_of_Luminosity',
       'Minimum_of_Luminosity', 'Maximum_of_Luminosity', 'Length_of_Conveyer',
       'TypeOfSteel_A300', 'TypeOfSteel_A400', 'Steel_Plate_Thickness',
       'Edges_Index', 'Empty_Index', 'Square_Index', 'Outside_X_Index',
       'Edges_X_Index', 'Edges_Y_Index', 'Outside_Global_Index', 'LogOfAreas',
       'Log_X_Index', 'Log_Y_Index', 'Orientation_Index', 'Luminosity_Index',
       'SigmoidOfAreas']]
In [68]:
corr=df_corTarget.corr()
corr
Out[68]:
X_Minimum X_Maximum Y_Minimum Y_Maximum Pixels_Areas X_Perimeter Y_Perimeter Sum_of_Luminosity Minimum_of_Luminosity Maximum_of_Luminosity ... Outside_X_Index Edges_X_Index Edges_Y_Index Outside_Global_Index LogOfAreas Log_X_Index Log_Y_Index Orientation_Index Luminosity_Index SigmoidOfAreas
X_Minimum 1.000000 0.988314 0.041821 0.041807 -0.307322 -0.258937 -0.118757 -0.339045 0.237637 -0.075554 ... -0.361160 0.154778 0.367907 0.147282 -0.428553 -0.437944 -0.326851 0.178585 -0.031578 -0.355251
X_Maximum 0.988314 1.000000 0.052147 0.052135 -0.225399 -0.186326 -0.090138 -0.247052 0.168649 -0.062392 ... -0.214930 0.149259 0.271915 0.099253 -0.332169 -0.324012 -0.265990 0.115019 -0.038996 -0.286736
Y_Minimum 0.041821 0.052147 1.000000 1.000000 0.017670 0.023843 0.024150 0.007362 -0.065703 -0.067785 ... 0.054165 0.066085 -0.036543 -0.062911 0.044952 0.070406 -0.008442 -0.086497 -0.090654 0.025257
Y_Maximum 0.041807 0.052135 1.000000 1.000000 0.017840 0.024038 0.024380 0.007499 -0.065733 -0.067776 ... 0.054185 0.066051 -0.036549 -0.062901 0.044994 0.070432 -0.008382 -0.086480 -0.090666 0.025284
Pixels_Areas -0.307322 -0.225399 0.017670 0.017840 1.000000 0.966644 0.827199 0.978952 -0.497204 0.110063 ... 0.588606 -0.294673 -0.463571 -0.109655 0.650234 0.603072 0.578342 -0.137604 -0.043449 0.422947
X_Perimeter -0.258937 -0.186326 0.023843 0.024038 0.966644 1.000000 0.912436 0.912956 -0.400427 0.111363 ... 0.517098 -0.293039 -0.412100 -0.079106 0.563036 0.524716 0.523472 -0.101731 -0.032617 0.380605
Y_Perimeter -0.118757 -0.090138 0.024150 0.024380 0.827199 0.912436 1.000000 0.704876 -0.213758 0.061809 ... 0.209160 -0.195162 -0.136723 0.013438 0.294040 0.228485 0.344378 0.031381 -0.047778 0.191772
Sum_of_Luminosity -0.339045 -0.247052 0.007362 0.007499 0.978952 0.912956 0.704876 1.000000 -0.540566 0.136515 ... 0.658339 -0.327728 -0.529745 -0.121090 0.712128 0.667736 0.618795 -0.158483 -0.014067 0.464248
Minimum_of_Luminosity 0.237637 0.168649 -0.065703 -0.065733 -0.497204 -0.400427 -0.213758 -0.540566 1.000000 0.429605 ... -0.487574 0.252256 0.316610 0.035462 -0.678762 -0.567655 -0.588208 0.057123 0.669534 -0.514797
Maximum_of_Luminosity -0.075554 -0.062392 -0.067785 -0.067776 0.110063 0.111363 0.061809 0.136515 0.429605 1.000000 ... 0.099300 0.093522 -0.167441 -0.124039 0.007672 0.092823 -0.069522 -0.169747 0.870160 -0.039651
Length_of_Conveyer 0.316662 0.299390 -0.049211 -0.049219 -0.155853 -0.134240 -0.063825 -0.169331 -0.023579 -0.098009 ... -0.217417 0.123585 0.235732 0.128663 -0.193247 -0.219973 -0.157057 0.120715 -0.149769 -0.197543
TypeOfSteel_A300 0.144319 0.112009 0.075164 0.075151 -0.235591 -0.189250 -0.095154 -0.263632 0.042048 -0.216339 ... -0.244765 0.173836 0.240634 0.022142 -0.329614 -0.266955 -0.311796 0.010630 -0.252818 -0.308910
TypeOfSteel_A400 -0.144319 -0.112009 -0.075164 -0.075151 0.235591 0.189250 0.095154 0.263632 -0.042048 0.216339 ... 0.244765 -0.173836 -0.240634 -0.022142 0.329614 0.266955 0.311796 -0.010630 0.252818 0.308910
Steel_Plate_Thickness 0.136625 0.106119 -0.207640 -0.207644 -0.183735 -0.147712 -0.058889 -0.204812 0.103393 -0.128397 ... -0.228352 -0.077408 0.251985 0.221244 -0.176639 -0.252822 -0.037287 0.274097 -0.116499 -0.085159
Edges_Index 0.278075 0.242846 0.021314 0.021300 -0.275289 -0.227590 -0.111240 -0.301452 0.358915 0.149675 ... -0.296510 0.250178 0.285302 0.008282 -0.408619 -0.355853 -0.371989 0.020548 0.207516 -0.330006
Empty_Index -0.198461 -0.152680 -0.043117 -0.043085 0.272808 0.306348 0.188825 0.293691 -0.044111 0.031425 ... 0.334996 -0.389342 -0.459800 -0.165293 0.356685 0.448864 0.397289 -0.139420 0.061608 0.481738
Square_Index 0.063658 0.048575 -0.006135 -0.006152 0.017865 0.004507 -0.047511 0.049607 0.066748 0.065517 ... -0.113627 0.242779 0.081488 -0.069913 -0.189340 -0.082846 -0.257661 -0.162034 0.111977 -0.292251
Outside_X_Index -0.361160 -0.214930 0.054165 0.054185 0.588606 0.517098 0.209160 0.658339 -0.487574 0.099300 ... 1.000000 -0.076663 -0.689867 -0.337173 0.710837 0.820223 0.464860 -0.440358 -0.035721 0.518910
Edges_X_Index 0.154778 0.149259 0.066085 0.066051 -0.294673 -0.293039 -0.195162 -0.327728 0.252256 0.093522 ... -0.076663 1.000000 0.108144 -0.419383 -0.496206 -0.189262 -0.748892 -0.550302 0.126460 -0.558426
Edges_Y_Index 0.367907 0.271915 -0.036543 -0.036549 -0.463571 -0.412100 -0.136723 -0.529745 0.316610 -0.167441 ... -0.689867 0.108144 1.000000 0.537565 -0.642991 -0.855414 -0.321892 0.658049 -0.094368 -0.545393
Outside_Global_Index 0.147282 0.099253 -0.062911 -0.062901 -0.109655 -0.079106 0.013438 -0.121090 0.035462 -0.124039 ... -0.337173 -0.419383 0.537565 1.000000 -0.097762 -0.428060 0.241898 0.862670 -0.122321 -0.053770
LogOfAreas -0.428553 -0.332169 0.044952 0.044994 0.650234 0.563036 0.294040 0.712128 -0.678762 0.007672 ... 0.710837 -0.496206 -0.642991 -0.097762 1.000000 0.888919 0.882974 -0.123898 -0.175879 0.877768
Log_X_Index -0.437944 -0.324012 0.070406 0.070432 0.603072 0.524716 0.228485 0.667736 -0.567655 0.092823 ... 0.820223 -0.189262 -0.855414 -0.428060 0.888919 1.000000 0.598652 -0.536629 -0.064923 0.757343
Log_Y_Index -0.326851 -0.265990 -0.008442 -0.008382 0.578342 0.523472 0.344378 0.618795 -0.588208 -0.069522 ... 0.464860 -0.748892 -0.321892 0.241898 0.882974 0.598652 1.000000 0.316792 -0.219110 0.838188
Orientation_Index 0.178585 0.115019 -0.086497 -0.086480 -0.137604 -0.101731 0.031381 -0.158483 0.057123 -0.169747 ... -0.440358 -0.550302 0.658049 0.862670 -0.123898 -0.536629 0.316792 1.000000 -0.153464 -0.023978
Luminosity_Index -0.031578 -0.038996 -0.090654 -0.090666 -0.043449 -0.032617 -0.047778 -0.014067 0.669534 0.870160 ... -0.035721 0.126460 -0.094368 -0.122321 -0.175879 -0.064923 -0.219110 -0.153464 1.000000 -0.184840
SigmoidOfAreas -0.355251 -0.286736 0.025257 0.025284 0.422947 0.380605 0.191772 0.464248 -0.514797 -0.039651 ... 0.518910 -0.558426 -0.545393 -0.053770 0.877768 0.757343 0.838188 -0.023978 -0.184840 1.000000

27 rows × 27 columns

vmax and vmin set which values map to the most saturated colors, here 1 and -1.

In [69]:
# Set up the parameters for the heatmap
mask = np.zeros_like(corr, dtype=bool)  # np.bool is deprecated in newer numpy; plain bool is equivalent
mask[np.triu_indices_from(mask)] = True

f, ax = plt.subplots(figsize=(11, 9))
cmap = sns.diverging_palette(1,200, as_cmap=True)

# (Exercise) Use the saved corr, mask, and cmap to draw a heatmap of the correlations, with max, min, and center set to match the correlation scale.
# Use linewidths=2 and figure size figsize=(11,9).
sns.heatmap(corr,mask=mask,cmap=cmap,vmax=1,vmin=-1,center=0,linewidths=2)
Out[69]:
<matplotlib.axes._subplots.AxesSubplot at 0x2241ec867c0>

Splitting Into Training and Test Sets, with K_Scatch as the Target y

In [71]:
x = df[['X_Minimum', 'X_Maximum', 'Y_Minimum', 'Y_Maximum', 'Pixels_Areas',
       'X_Perimeter', 'Y_Perimeter', 'Sum_of_Luminosity',
       'Minimum_of_Luminosity', 'Maximum_of_Luminosity', 'Length_of_Conveyer',
       'TypeOfSteel_A300',  'Steel_Plate_Thickness',
       'Edges_Index', 'Empty_Index', 'Square_Index', 'Outside_X_Index',
       'Edges_X_Index', 'Edges_Y_Index', 'Outside_Global_Index', 'LogOfAreas',
       'Log_X_Index', 'Log_Y_Index', 'Orientation_Index', 'Luminosity_Index',
       'SigmoidOfAreas']]
y = df['K_Scatch']
In [72]:
from sklearn.model_selection import train_test_split
from scipy.stats import zscore
In [73]:
# (Exercise) Use sklearn.model_selection.train_test_split to split the data into x_train, x_test, y_train, y_test
# with an 8:2 ratio, stratified by y.
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2,random_state=1, stratify=y)

The z-score standardizes each variable to zero mean and unit variance (normalization).

In [74]:
# (Exercise) Use pandas.DataFrame.apply to standardize x_train and x_test.
x_train = x_train.apply(zscore)
x_test = x_test.apply(zscore)
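One caveat: apply(zscore) standardizes x_test with its own mean and standard deviation. Strictly, the test set should be scaled with statistics estimated on the training set only, to avoid leakage; a minimal sketch with sklearn's StandardScaler (the _s names are illustrative):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(x_train)  # statistics come from the training set only
x_train_s = pd.DataFrame(scaler.transform(x_train), columns=x_train.columns, index=x_train.index)
x_test_s = pd.DataFrame(scaler.transform(x_test), columns=x_test.columns, index=x_test.index)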
In [ ]:
 
In [77]:
round(x_train.describe(),2)
Out[77]:
X_Minimum X_Maximum Y_Minimum Y_Maximum Pixels_Areas X_Perimeter Y_Perimeter Sum_of_Luminosity Minimum_of_Luminosity Maximum_of_Luminosity ... Outside_X_Index Edges_X_Index Edges_Y_Index Outside_Global_Index LogOfAreas Log_X_Index Log_Y_Index Orientation_Index Luminosity_Index SigmoidOfAreas
count 1552.00 1552.00 1552.00 1552.00 1552.00 1552.00 1552.00 1552.00 1552.00 1552.00 ... 1552.00 1552.00 1552.00 1552.00 1552.00 1552.00 1552.00 1552.00 1552.00 1552.00
mean -0.00 -0.00 -0.00 -0.00 -0.00 -0.00 0.00 -0.00 0.00 -0.00 ... -0.00 0.00 -0.00 -0.00 0.00 0.00 -0.00 0.00 0.00 0.00
std 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 ... 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
min -1.10 -1.24 -0.91 -0.91 -0.35 -0.34 -0.18 -0.39 -2.64 -5.07 ... -0.53 -2.47 -3.28 -1.19 -2.77 -2.16 -3.07 -2.14 -5.84 -1.37
25% -0.99 -0.86 -0.66 -0.66 -0.33 -0.30 -0.15 -0.37 -0.66 -0.35 ... -0.44 -0.80 -0.93 -1.19 -0.71 -0.70 -0.70 -0.83 -0.44 -1.00
50% -0.26 -0.30 -0.25 -0.25 -0.32 -0.27 -0.13 -0.35 0.17 -0.19 ... -0.39 0.10 0.58 0.89 -0.32 -0.33 -0.17 0.02 -0.01 -0.24
75% 0.93 0.92 0.29 0.29 -0.20 -0.09 -0.01 -0.23 0.66 0.52 ... -0.16 0.76 0.79 0.89 0.53 0.38 0.73 0.84 0.43 1.23
max 2.18 2.20 6.16 6.16 27.59 31.93 38.17 21.41 3.66 6.60 ... 13.97 1.58 0.79 0.89 3.42 3.63 6.27 1.83 5.18 1.23

8 rows × 26 columns

9. [Logistic Regression] Building a Basic Logistic Model

In [85]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix
from sklearn import metrics
In [86]:
#### The liblinear solver supports both ridge (L2) and lasso (L1) penalties
In [87]:
lm=LogisticRegression(solver='liblinear')
Store the penalty types to consider for logistic regression (ridge, lasso) and a range of regularization parameters in a dictionary named parameters.
In [88]:
parameters={'penalty':['l1','l2'],'C':[0.01,0.1,0.5,0.9,1,5,10],'tol':[1e-4,1e-2,1,1e2]}

Grid search -> to find the optimal hyperparameters

In [92]:
GSLR=GridSearchCV(lm,parameters,cv=10,n_jobs=n_thread,scoring="accuracy")
In [93]:
GSLR.fit(x_train,y_train)
Out[93]:
GridSearchCV(cv=10, estimator=LogisticRegression(solver='liblinear'), n_jobs=16,
             param_grid={'C': [0.01, 0.1, 0.5, 0.9, 1, 5, 10],
                         'penalty': ['l1', 'l2'],
                         'tol': [0.0001, 0.01, 1, 100.0]},
             scoring='accuracy')

Ideally you would hold out a separate validation set for tuning, but that sacrifices training data; cross-validation avoids the loss while still guarding against overfitting, which is why CV is used here.
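For a single parameter setting, GridSearchCV's inner loop amounts to cross_val_score; a minimal sketch of the same 10-fold estimate for one combination:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(LogisticRegression(solver='liblinear', C=1, penalty='l2'),
                         x_train, y_train, cv=10, scoring='accuracy')
print(scores.mean(), scores.std())  # CV accuracy for this one setting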

In [95]:
# Print the optimal parameter values and the accuracy
print('final params', GSLR.best_params_)   
print('best score', GSLR.best_score_)  
final params {'C': 1, 'penalty': 'l2', 'tol': 0.0001}
best score 0.9722911497105045
In [97]:
predicted=GSLR.predict(x_test)
In [98]:
predicted
Out[98]:
array([0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1,
       1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int64)
In [99]:
cMatrix = confusion_matrix(y_test,predicted)
print(cMatrix)
print("\n Accuracy:", GSLR.score(x_test,y_test))
[[305   6]
 [  6  72]]

 Accuracy: 0.9691516709511568
In [100]:
# Print the report using sklearn.metrics.classification_report.
print(metrics.classification_report(y_test,predicted))
              precision    recall  f1-score   support

           0       0.98      0.98      0.98       311
           1       0.92      0.92      0.92        78

    accuracy                           0.97       389
   macro avg       0.95      0.95      0.95       389
weighted avg       0.97      0.97      0.97       389
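Reading the report against the confusion matrix above: class 1 has 72 true positives, 6 false positives, and 6 false negatives, so precision = 72 / (72 + 6) ≈ 0.92 and recall = 72 / (72 + 6) ≈ 0.92, matching the table; overall accuracy = (305 + 72) / 389 ≈ 0.969.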

In [101]:
# Print the accuracy values computed during cross-validation.
means = GSLR.cv_results_['mean_test_score']
stds = GSLR.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, GSLR.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r"
          % (mean, std * 2, params))
print()
0.945 (+/-0.037) for {'C': 0.01, 'penalty': 'l1', 'tol': 0.0001}
0.945 (+/-0.037) for {'C': 0.01, 'penalty': 'l1', 'tol': 0.01}
0.945 (+/-0.037) for {'C': 0.01, 'penalty': 'l1', 'tol': 1}
0.798 (+/-0.005) for {'C': 0.01, 'penalty': 'l1', 'tol': 100.0}
0.949 (+/-0.031) for {'C': 0.01, 'penalty': 'l2', 'tol': 0.0001}
0.949 (+/-0.031) for {'C': 0.01, 'penalty': 'l2', 'tol': 0.01}
0.952 (+/-0.034) for {'C': 0.01, 'penalty': 'l2', 'tol': 1}
0.798 (+/-0.005) for {'C': 0.01, 'penalty': 'l2', 'tol': 100.0}
0.964 (+/-0.028) for {'C': 0.1, 'penalty': 'l1', 'tol': 0.0001}
0.964 (+/-0.027) for {'C': 0.1, 'penalty': 'l1', 'tol': 0.01}
0.955 (+/-0.019) for {'C': 0.1, 'penalty': 'l1', 'tol': 1}
0.798 (+/-0.005) for {'C': 0.1, 'penalty': 'l1', 'tol': 100.0}
0.966 (+/-0.021) for {'C': 0.1, 'penalty': 'l2', 'tol': 0.0001}
0.966 (+/-0.021) for {'C': 0.1, 'penalty': 'l2', 'tol': 0.01}
0.957 (+/-0.030) for {'C': 0.1, 'penalty': 'l2', 'tol': 1}
0.798 (+/-0.005) for {'C': 0.1, 'penalty': 'l2', 'tol': 100.0}
0.969 (+/-0.028) for {'C': 0.5, 'penalty': 'l1', 'tol': 0.0001}
0.969 (+/-0.022) for {'C': 0.5, 'penalty': 'l1', 'tol': 0.01}
0.958 (+/-0.032) for {'C': 0.5, 'penalty': 'l1', 'tol': 1}
0.798 (+/-0.005) for {'C': 0.5, 'penalty': 'l1', 'tol': 100.0}
0.968 (+/-0.023) for {'C': 0.5, 'penalty': 'l2', 'tol': 0.0001}
0.968 (+/-0.022) for {'C': 0.5, 'penalty': 'l2', 'tol': 0.01}
0.957 (+/-0.032) for {'C': 0.5, 'penalty': 'l2', 'tol': 1}
0.798 (+/-0.005) for {'C': 0.5, 'penalty': 'l2', 'tol': 100.0}
0.969 (+/-0.030) for {'C': 0.9, 'penalty': 'l1', 'tol': 0.0001}
0.969 (+/-0.028) for {'C': 0.9, 'penalty': 'l1', 'tol': 0.01}
0.959 (+/-0.024) for {'C': 0.9, 'penalty': 'l1', 'tol': 1}
0.798 (+/-0.005) for {'C': 0.9, 'penalty': 'l1', 'tol': 100.0}
0.972 (+/-0.023) for {'C': 0.9, 'penalty': 'l2', 'tol': 0.0001}
0.970 (+/-0.023) for {'C': 0.9, 'penalty': 'l2', 'tol': 0.01}
0.957 (+/-0.032) for {'C': 0.9, 'penalty': 'l2', 'tol': 1}
0.798 (+/-0.005) for {'C': 0.9, 'penalty': 'l2', 'tol': 100.0}
0.970 (+/-0.030) for {'C': 1, 'penalty': 'l1', 'tol': 0.0001}
0.970 (+/-0.027) for {'C': 1, 'penalty': 'l1', 'tol': 0.01}
0.957 (+/-0.035) for {'C': 1, 'penalty': 'l1', 'tol': 1}
0.798 (+/-0.005) for {'C': 1, 'penalty': 'l1', 'tol': 100.0}
0.972 (+/-0.024) for {'C': 1, 'penalty': 'l2', 'tol': 0.0001}
0.972 (+/-0.025) for {'C': 1, 'penalty': 'l2', 'tol': 0.01}
0.957 (+/-0.032) for {'C': 1, 'penalty': 'l2', 'tol': 1}
0.798 (+/-0.005) for {'C': 1, 'penalty': 'l2', 'tol': 100.0}
0.970 (+/-0.027) for {'C': 5, 'penalty': 'l1', 'tol': 0.0001}
0.971 (+/-0.028) for {'C': 5, 'penalty': 'l1', 'tol': 0.01}
0.952 (+/-0.036) for {'C': 5, 'penalty': 'l1', 'tol': 1}
0.798 (+/-0.005) for {'C': 5, 'penalty': 'l1', 'tol': 100.0}
0.972 (+/-0.028) for {'C': 5, 'penalty': 'l2', 'tol': 0.0001}
0.971 (+/-0.028) for {'C': 5, 'penalty': 'l2', 'tol': 0.01}
0.958 (+/-0.030) for {'C': 5, 'penalty': 'l2', 'tol': 1}
0.798 (+/-0.005) for {'C': 5, 'penalty': 'l2', 'tol': 100.0}
0.972 (+/-0.025) for {'C': 10, 'penalty': 'l1', 'tol': 0.0001}
0.970 (+/-0.026) for {'C': 10, 'penalty': 'l1', 'tol': 0.01}
0.961 (+/-0.043) for {'C': 10, 'penalty': 'l1', 'tol': 1}
0.798 (+/-0.005) for {'C': 10, 'penalty': 'l1', 'tol': 100.0}
0.972 (+/-0.024) for {'C': 10, 'penalty': 'l2', 'tol': 0.0001}
0.971 (+/-0.024) for {'C': 10, 'penalty': 'l2', 'tol': 0.01}
0.958 (+/-0.030) for {'C': 10, 'penalty': 'l2', 'tol': 1}
0.798 (+/-0.005) for {'C': 10, 'penalty': 'l2', 'tol': 100.0}

[Decision Tree] Building a Basic Decision Tree Model

In [102]:
from sklearn.tree import DecisionTreeClassifier
In [103]:
# (Exercise) Build a decision tree model and store it in dt.
dt=DecisionTreeClassifier()
In [104]:
# (Exercise) Run a grid search over criterion, min_samples_split, max_depth, min_samples_leaf, max_features, etc.
# Set the GridSearchCV options to cv=10, n_jobs=n_thread, scoring="accuracy".
parameters={'criterion':['gini','entropy'],'min_samples_split':[2,5,10,15], 'max_depth':[None,2],'min_samples_leaf':[1,3,10,15],'max_features':[None,'sqrt','log2']}
In [105]:
GSDT=GridSearchCV(dt,parameters,cv=10,n_jobs=n_thread,scoring="accuracy")
GSDT.fit(x_train,y_train)
Out[105]:
GridSearchCV(cv=10, estimator=DecisionTreeClassifier(), n_jobs=16,
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [None, 2],
                         'max_features': [None, 'sqrt', 'log2'],
                         'min_samples_leaf': [1, 3, 10, 15],
                         'min_samples_split': [2, 5, 10, 15]},
             scoring='accuracy')
In [106]:
print('final params',GSDT.best_params_)
print('ACC.',GSDT.best_score_)
final params {'criterion': 'entropy', 'max_depth': None, 'max_features': None, 'min_samples_leaf': 3, 'min_samples_split': 10}
ACC. 0.9787386269644335
In [107]:
# (Exercise) Use predict to compute predictions, store them in predicted, and print them along with the classification_report.
predicted=GSDT.predict(x_test)
cMatrix = confusion_matrix(y_test,predicted)
print(cMatrix)
print(round(GSDT.score(x_test,y_test),3))
print(metrics.classification_report(y_test,predicted))
[[308   3]
 [  8  70]]
0.972
              precision    recall  f1-score   support

           0       0.97      0.99      0.98       311
           1       0.96      0.90      0.93        78

    accuracy                           0.97       389
   macro avg       0.97      0.94      0.95       389
weighted avg       0.97      0.97      0.97       389

In [108]:
# Distribution of the target variable in the training set
print(y_train.value_counts())
0    1239
1     313
Name: K_Scatch, dtype: int64
In [111]:
# Visualize the tree
import graphviz
from sklearn import tree  # this import was missing in the original run, causing a NameError
dt2 = DecisionTreeClassifier(criterion='entropy', max_depth=None, max_features=None, min_samples_leaf=1, min_samples_split=5)
dt2.fit(x_train, y_train)
dot_data = tree.export_graphviz(dt2, feature_names=x_train.columns, filled=True, rounded=True)
graph = graphviz.Source(dot_data)
graph

Random Forest

  • Random Forest tames the variance of decision trees and improves predictive power using the bagging and feature drop-out described below; a toy sketch follows this list.
    • Bootstrapping: resampling with replacement to create multiple datasets with slightly different sample compositions.
    • Aggregating: combining the results of many models to reduce the variance of the ensemble.
    • Drop-out: dropping a random subset of variables when building each tree, which decorrelates the trees and further reduces variance.
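A toy sketch of these ingredients, assuming x_train, y_train, and x_test from above (it illustrates the idea, not sklearn's internal implementation):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
trees = []
for _ in range(25):
    idx = rng.integers(0, len(x_train), len(x_train))        # bootstrapping: resample rows with replacement
    t = DecisionTreeClassifier(max_features='sqrt')          # drop-out: random feature subset at each split
    trees.append(t.fit(x_train.iloc[idx], y_train.iloc[idx]))

votes = np.mean([t.predict(x_test) for t in trees], axis=0)  # aggregating: majority vote over the trees
bagged_pred = (votes >= 0.5).astype(int)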
In [113]:
from sklearn.ensemble import RandomForestClassifier
In [114]:
rf=RandomForestClassifier()
In [116]:
# Run a grid search over n_estimators, min_samples_split, max_depth, min_samples_leaf, max_features, etc.
# Set the GridSearchCV options to cv=10, n_jobs=n_thread, scoring="accuracy".
parameters={'n_estimators':[50,100],'criterion':['entropy'],'min_samples_split':[2,5],'max_depth':[None,2],'min_samples_leaf':[1,3,10],'max_features':['sqrt']}
GSRF=GridSearchCV(rf,parameters,cv=10,n_jobs=n_thread,scoring="accuracy")
GSRF.fit(x_train,y_train)
Out[116]:
GridSearchCV(cv=10, estimator=RandomForestClassifier(), n_jobs=16,
             param_grid={'criterion': ['entropy'], 'max_depth': [None, 2],
                         'max_features': ['sqrt'],
                         'min_samples_leaf': [1, 3, 10],
                         'min_samples_split': [2, 5],
                         'n_estimators': [50, 100]},
             scoring='accuracy')
In [117]:
print('final params',GSRF.best_params_)
print('best score',GSRF.best_score_)
final params {'criterion': 'entropy', 'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
best score 0.9838875103391231
In [118]:
predicted=GSRF.predict(x_test)
cMatrix=confusion_matrix(y_test,predicted)
print(cMatrix)
print(metrics.classification_report(y_test,predicted))
[[311   0]
 [  5  73]]
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       311
           1       1.00      0.94      0.97        78

    accuracy                           0.99       389
   macro avg       0.99      0.97      0.98       389
weighted avg       0.99      0.99      0.99       389

Support Vector Machine

  • The goal is to find the boundary that maximizes the margin between the classes (the yellow band in the original figure).
  • The degree to which errors are tolerated is controlled by C.
    • A large C penalizes errors heavily, so few margin violations are allowed; a small C tolerates more errors and yields a softer margin (see the sketch after this list).
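To see the effect of C directly, a small comparison sketch (not in the original notebook; the two C values and the rbf/gamma settings are illustrative, not tuned):

from sklearn import svm

for C in [0.01, 100]:
    m = svm.SVC(C=C, kernel='rbf', gamma=0.1).fit(x_train, y_train)
    # A softer margin (small C) tolerates more violations and typically keeps more support vectors
    print("C=%s: support vectors=%d, test accuracy=%.3f"
          % (C, m.n_support_.sum(), m.score(x_test, y_test)))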
In [121]:
from sklearn import svm
In [122]:
svc=svm.SVC()
In [124]:
# (Exercise) Perform a grid search over the SVM hyperparameters to consider: C, kernel, gamma, etc.
# Set the GridSearchCV options to cv=10, n_jobs=n_thread, scoring="accuracy".
parameters={'C':[0.01,0.1,0.5,0.9,1,5,10],'kernel':['linear','rbf','poly'],'gamma':[0.1,1,10]}
GS_SVM=GridSearchCV(svc,parameters,cv=10,n_jobs=n_thread,scoring="accuracy")
GS_SVM.fit(x_train,y_train)
Out[124]:
GridSearchCV(cv=10, estimator=SVC(), n_jobs=16,
             param_grid={'C': [0.01, 0.1, 0.5, 0.9, 1, 5, 10],
                         'gamma': [0.1, 1, 10],
                         'kernel': ['linear', 'rbf', 'poly']},
             scoring='accuracy')
In [126]:
print('final params',GS_SVM.best_params_)
print('best score',GS_SVM.best_score_)
final params {'C': 5, 'gamma': 0.1, 'kernel': 'rbf'}
best score 0.9826137303556658
In [127]:
# (Exercise) Obtain predictions with the predict function, store them in predicted, and print the confusion matrix and the classification_report.
predicted=GS_SVM.predict(x_test)
cMatrix=confusion_matrix(y_test,predicted)
print(cMatrix)
print(metrics.classification_report(y_test,predicted))
[[310   1]
 [  7  71]]
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       311
           1       0.99      0.91      0.95        78

    accuracy                           0.98       389
   macro avg       0.98      0.95      0.97       389
weighted avg       0.98      0.98      0.98       389

Neural Network Model

  • A neural network is built by nesting and combining perceptrons, each of which aggregates its input values to produce an output. A perceptron can be split into two parts:
    • the transfer function, a linear combination of the input values
    • the activation function f()
  • A perceptron's inputs may themselves be the outputs of other perceptrons; stacking them this way gives the neural network model, organized into layers (a minimal forward-pass sketch follows this list):
    • Input Layer: the layer where the input data sits.
    • Hidden Layer: a layer of perceptrons whose inputs are the input data or the outputs of another hidden layer.
    • Output Layer: the layer of nodes that take the last hidden layer's outputs as input and apply the output function to produce the result.
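A single perceptron is easy to write out; the following sketch (not in the original notebook; the weights are random placeholders) separates the transfer function from the activation f():

import numpy as np

def perceptron(x, w, b, f=np.tanh):
    z = np.dot(w, x) + b   # transfer function: linear combination of the inputs
    return f(z)            # activation function f()

rng = np.random.RandomState(1)
x0 = rng.rand(26)                          # one sample with 26 features, as in x_train
h = perceptron(x0, rng.rand(26), 0.1)      # a hidden-layer node
out = perceptron(np.array([h]), np.array([0.5]), 0.0)  # its output feeds the next layer
print(h, out)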
In [129]:
from sklearn.neural_network import MLPClassifier
In [130]:
# (Exercise) Build a neural network model and store it in nn_model.
nn_model=MLPClassifier(random_state=1)
In [131]:
x_train.shape
Out[131]:
(1552, 26)
In [132]:
# (Exercise) Perform a grid search over the neural-network hyperparameters to consider: alpha, hidden_layer_sizes, activation, etc.
# Set the GridSearchCV options to cv=10, n_jobs=n_thread, scoring="accuracy".
parameters={'alpha':[1e-3,1e-1,1e1],'hidden_layer_sizes':[(5),(30),(60)],'activation':['tanh','relu'],'solver':['adam','lbfgs']}
GS_NN=GridSearchCV(nn_model,parameters,cv=10,n_jobs=n_thread,scoring="accuracy")
GS_NN.fit(x_train,y_train)
Out[132]:
GridSearchCV(cv=10, estimator=MLPClassifier(random_state=1), n_jobs=16,
             param_grid={'activation': ['tanh', 'relu'],
                         'alpha': [0.001, 0.1, 10.0],
                         'hidden_layer_sizes': [5, 30, 60],
                         'solver': ['adam', 'lbfgs']},
             scoring='accuracy')
In [133]:
print('final params', GS_NN.best_params_)
print('best score', GS_NN.best_score_)
final params {'activation': 'tanh', 'alpha': 0.1, 'hidden_layer_sizes': 30, 'solver': 'lbfgs'}
best score 0.9793796526054592
In [134]:
# Print the mean CV accuracy for every hyperparameter combination, with a +/- 2 standard deviation band across the 10 folds
means = GS_NN.cv_results_['mean_test_score']
stds = GS_NN.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, GS_NN.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r"
          % (mean, std * 2, params))
print()
0.970 (+/-0.021) for {'activation': 'tanh', 'alpha': 0.001, 'hidden_layer_sizes': 5, 'solver': 'adam'}
0.972 (+/-0.028) for {'activation': 'tanh', 'alpha': 0.001, 'hidden_layer_sizes': 5, 'solver': 'lbfgs'}
0.974 (+/-0.022) for {'activation': 'tanh', 'alpha': 0.001, 'hidden_layer_sizes': 30, 'solver': 'adam'}
0.974 (+/-0.020) for {'activation': 'tanh', 'alpha': 0.001, 'hidden_layer_sizes': 30, 'solver': 'lbfgs'}
0.977 (+/-0.025) for {'activation': 'tanh', 'alpha': 0.001, 'hidden_layer_sizes': 60, 'solver': 'adam'}
0.976 (+/-0.021) for {'activation': 'tanh', 'alpha': 0.001, 'hidden_layer_sizes': 60, 'solver': 'lbfgs'}
0.969 (+/-0.021) for {'activation': 'tanh', 'alpha': 0.1, 'hidden_layer_sizes': 5, 'solver': 'adam'}
0.973 (+/-0.030) for {'activation': 'tanh', 'alpha': 0.1, 'hidden_layer_sizes': 5, 'solver': 'lbfgs'}
0.974 (+/-0.021) for {'activation': 'tanh', 'alpha': 0.1, 'hidden_layer_sizes': 30, 'solver': 'adam'}
0.979 (+/-0.026) for {'activation': 'tanh', 'alpha': 0.1, 'hidden_layer_sizes': 30, 'solver': 'lbfgs'}
0.977 (+/-0.023) for {'activation': 'tanh', 'alpha': 0.1, 'hidden_layer_sizes': 60, 'solver': 'adam'}
0.978 (+/-0.022) for {'activation': 'tanh', 'alpha': 0.1, 'hidden_layer_sizes': 60, 'solver': 'lbfgs'}
0.957 (+/-0.028) for {'activation': 'tanh', 'alpha': 10.0, 'hidden_layer_sizes': 5, 'solver': 'adam'}
0.971 (+/-0.027) for {'activation': 'tanh', 'alpha': 10.0, 'hidden_layer_sizes': 5, 'solver': 'lbfgs'}
0.955 (+/-0.030) for {'activation': 'tanh', 'alpha': 10.0, 'hidden_layer_sizes': 30, 'solver': 'adam'}
0.972 (+/-0.028) for {'activation': 'tanh', 'alpha': 10.0, 'hidden_layer_sizes': 30, 'solver': 'lbfgs'}
0.956 (+/-0.029) for {'activation': 'tanh', 'alpha': 10.0, 'hidden_layer_sizes': 60, 'solver': 'adam'}
0.972 (+/-0.028) for {'activation': 'tanh', 'alpha': 10.0, 'hidden_layer_sizes': 60, 'solver': 'lbfgs'}
0.967 (+/-0.036) for {'activation': 'relu', 'alpha': 0.001, 'hidden_layer_sizes': 5, 'solver': 'adam'}
0.975 (+/-0.015) for {'activation': 'relu', 'alpha': 0.001, 'hidden_layer_sizes': 5, 'solver': 'lbfgs'}
0.977 (+/-0.023) for {'activation': 'relu', 'alpha': 0.001, 'hidden_layer_sizes': 30, 'solver': 'adam'}
0.972 (+/-0.029) for {'activation': 'relu', 'alpha': 0.001, 'hidden_layer_sizes': 30, 'solver': 'lbfgs'}
0.977 (+/-0.023) for {'activation': 'relu', 'alpha': 0.001, 'hidden_layer_sizes': 60, 'solver': 'adam'}
0.974 (+/-0.022) for {'activation': 'relu', 'alpha': 0.001, 'hidden_layer_sizes': 60, 'solver': 'lbfgs'}
0.967 (+/-0.036) for {'activation': 'relu', 'alpha': 0.1, 'hidden_layer_sizes': 5, 'solver': 'adam'}
0.972 (+/-0.021) for {'activation': 'relu', 'alpha': 0.1, 'hidden_layer_sizes': 5, 'solver': 'lbfgs'}
0.976 (+/-0.024) for {'activation': 'relu', 'alpha': 0.1, 'hidden_layer_sizes': 30, 'solver': 'adam'}
0.973 (+/-0.029) for {'activation': 'relu', 'alpha': 0.1, 'hidden_layer_sizes': 30, 'solver': 'lbfgs'}
0.977 (+/-0.022) for {'activation': 'relu', 'alpha': 0.1, 'hidden_layer_sizes': 60, 'solver': 'adam'}
0.976 (+/-0.021) for {'activation': 'relu', 'alpha': 0.1, 'hidden_layer_sizes': 60, 'solver': 'lbfgs'}
0.959 (+/-0.031) for {'activation': 'relu', 'alpha': 10.0, 'hidden_layer_sizes': 5, 'solver': 'adam'}
0.970 (+/-0.029) for {'activation': 'relu', 'alpha': 10.0, 'hidden_layer_sizes': 5, 'solver': 'lbfgs'}
0.958 (+/-0.028) for {'activation': 'relu', 'alpha': 10.0, 'hidden_layer_sizes': 30, 'solver': 'adam'}
0.973 (+/-0.024) for {'activation': 'relu', 'alpha': 10.0, 'hidden_layer_sizes': 30, 'solver': 'lbfgs'}
0.958 (+/-0.030) for {'activation': 'relu', 'alpha': 10.0, 'hidden_layer_sizes': 60, 'solver': 'adam'}
0.974 (+/-0.024) for {'activation': 'relu', 'alpha': 10.0, 'hidden_layer_sizes': 60, 'solver': 'lbfgs'}
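The same cv_results_ dictionary can also be ranked with pandas; this convenience cell is a sketch, not part of the original notebook:

import pandas as pd

res = pd.DataFrame(GS_NN.cv_results_)
# Show the five best hyperparameter combinations by CV rank
cols = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
print(res[cols].sort_values('rank_test_score').head())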
