[Deep Learning] Classifying Handwritten Digit Images

퓨어맨 2022. 7. 18. 12:48

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Import the handwritten-digit dataset bundled with Keras (MNIST: a Modified version of data from the National Institute of Standards and Technology, NIST)
from tensorflow.keras.datasets import mnist

data = mnist.load_data()

len(data)

# load_data() returns a nested tuple, not an array: the outer tuple holds (train, test),
# and each of those is itself an (images, labels) pair

print(len(data[0]))      # train
print(len(data[1]))      # test
print(len(data[0][0]))   # X_train
print(len(data[0][1]))   # y_train
print(len(data[1][0]))   # X_test
print(len(data[1][1]))   # y_test

X_train = data[0][0]
y_train = data[0][1]
X_test = data[1][0]
y_test = data[1][1]
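
The four index lookups above can also be written as one tuple unpacking. The sketch below uses small zero-filled arrays as a stand-in for the real return value of `mnist.load_data()`, so that it runs without downloading anything. The stand-in `data` and the names `X_tr`, `y_tr`, `X_te`, `y_te` are assumptions for illustration; only the nesting mirrors the real data.

```python
import numpy as np

# Stand-in with the same nesting as mnist.load_data():
# ((train_images, train_labels), (test_images, test_labels)).
# Sizes are shrunk from 60000/10000 to 6/2 for illustration.
data = (
    (np.zeros((6, 28, 28), dtype=np.uint8), np.zeros(6, dtype=np.uint8)),
    (np.zeros((2, 28, 28), dtype=np.uint8), np.zeros(2, dtype=np.uint8)),
)

# One unpacking statement replaces data[0][0], data[0][1], data[1][0], data[1][1].
(X_tr, y_tr), (X_te, y_te) = data

print(len(data))              # outer tuple: train and test
print(X_tr.shape, y_tr.shape)
print(X_te.shape, y_te.shape)
```

With the real dataset the same line would read `(X_train, y_train), (X_test, y_test) = mnist.load_data()`.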

# Each image is 2-D, so the image arrays have a 3-element shape:
# (number of samples, pixel rows (height), pixel columns (width))
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(60000, 28, 28)
(60000,)
(10000, 28, 28)
(10000,)

plt.imshow(X_train[59999], cmap='gray');
# imshow : draws an array of pixel values as an image
# cmap='gray' : renders the values with a grayscale colormap

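# subplots : places several plots in one figure
# f : the Figure object, which controls the figure as a whole
# axes : the array of Axes objects, one per inner plot
f, axes = plt.subplots(2,2) # a grid of 2 rows and 2 columns

# Set the size of the whole figure
f.set_size_inches(8,8)

# Index into axes to reach each inner plot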
axes[0][0].imshow(X_train[59999], cmap='gray')
axes[0][1].imshow(X_train[59998], cmap='gray')
axes[1][0].imshow(X_train[59997], cmap='gray')
axes[1][1].imshow(X_train[59996], cmap='gray')

plt.show()

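When a multiclass model is trained with categorical_crossentropy, the label data must be one-hot encoded.

pd.get_dummies : the one-hot encoding function provided by pandas
to_categorical : the one-hot encoding function provided by Keras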

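- Why one-hot encode

The MNIST labels are already integers (0 to 9), so this is not the string-to-number preprocessing that scikit-learn's machine learning algorithms need before they accept string-valued data. The reason here is that the softmax output layer produces a vector of 10 class probabilities, and categorical_crossentropy compares that vector with a target of the same shape. Each label therefore has to become a 10-element vector with a 1 at the index of the correct class and 0 elsewhere.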

from tensorflow.keras.utils import to_categorical

y_train_one_hot = to_categorical(y_train)
y_test_one_hot = to_categorical(y_test)

print(y_train_one_hot.shape)
print(y_test_one_hot.shape)

(60000, 10)
(10000, 10)
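
To make the (60000, 10) shape concrete, here is a minimal NumPy sketch of the same encoding. It is not how `to_categorical` is implemented, only an equivalent layout. The `labels` array is a hypothetical handful of digit labels.

```python
import numpy as np

num_classes = 10
labels = np.array([5, 0, 4, 1, 9])  # hypothetical integer labels

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing it with the label array encodes every label at once.
one_hot = np.eye(num_classes, dtype=np.float32)[labels]

print(one_hot.shape)
print(one_hot[0])

# argmax along the class axis recovers the original integer labels.
recovered = one_hot.argmax(axis=1)
print(recovered)
```

Each row contains exactly one 1, at the column given by the label, which is why the number of columns equals the number of classes.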

A Dense layer takes each sample as a 1-D feature vector, so the 2-D image data cannot be fed in as is; each 28x28 image has to be flattened into 784 values.

# -1 : NumPy infers this dimension, multiplying the remaining axes (28 * 28) into one row of 784 values
X_train = X_train.reshape(60000, -1)
X_test = X_test.reshape(10000, -1)

X_train.shape

(60000, 784)
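
The effect of `-1` can be checked on a tiny array. This sketch uses three hypothetical 28x28 "images" filled with running numbers, so it is easy to see where each pixel ends up. Passing `len(images)` instead of a hard-coded count is a design choice of the sketch, not something the code above does.

```python
import numpy as np

images = np.arange(3 * 28 * 28).reshape(3, 28, 28)  # 3 hypothetical "images"

# -1 lets NumPy infer the second dimension: 28 * 28 = 784.
flat = images.reshape(len(images), -1)
print(flat.shape)

# Flattening is row-major: pixel (r, c) lands at index r * 28 + c.
r, c = 2, 5
assert flat[0, r * 28 + c] == images[0, r, c]

# The operation is reversible, so no pixel values are lost.
restored = flat.reshape(-1, 28, 28)
print(restored.shape)
```

Flattening keeps every pixel value but discards the explicit 2-D neighbourhood structure, which is the trade-off of feeding images to Dense layers.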

Neural network design

Number of input features : 784
Output layer activation : softmax
Number of output neurons : 10
Loss function : categorical_crossentropy
Optimizer : Adam
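
How softmax and categorical_crossentropy fit together can be sketched in plain NumPy. This is an illustration under simplifying assumptions, not the Keras implementation. The helper names `softmax` and `categorical_crossentropy`, the `eps` clipping value, and the `logits` values are all hypothetical.

```python
import numpy as np

def softmax(logits):
    # Subtract the row max for numerical stability; the result is unchanged.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_true_one_hot, y_prob, eps=1e-7):
    # Mean over samples of -sum(y_true * log(y_prob)).
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(np.mean(-np.sum(y_true_one_hot * np.log(y_prob), axis=1)))

logits = np.array([[2.0, 0.5, -1.0] + [0.0] * 7])  # hypothetical raw outputs, 10 classes
target = np.eye(10)[[0]]                           # one-hot target for class 0

probs = softmax(logits)
loss = categorical_crossentropy(target, probs)

print(probs.sum(axis=1))   # each row sums to 1
print(loss)
```

Because the target is one-hot, only the probability assigned to the correct class contributes to the loss, so the loss reduces to the negative log of that probability.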

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()

# Input layer + first hidden layer (input_dim is the number of features in X_train)
model.add(Dense(500, input_dim=784, activation='relu'))

# Hidden layers
model.add(Dense(400, activation='relu'))
model.add(Dense(200, activation='relu'))

# The output layer has one neuron per one-hot column
model.add(Dense(10, activation='softmax'))

model.summary()

Model: "sequential_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_23 (Dense)            (None, 500)               392500    
                                                                 
 dense_24 (Dense)            (None, 400)               200400    
                                                                 
 dense_25 (Dense)            (None, 200)               80200     
                                                                 
 dense_26 (Dense)            (None, 10)                2010      
                                                                 
=================================================================
Total params: 675,110
Trainable params: 675,110
Non-trainable params: 0
_________________________________________________________________
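
The Param # column in the summary can be reproduced by hand: a Dense layer has one weight per input-output pair plus one bias per unit. A short sketch of that arithmetic, where the list `layer_sizes` is just the layer widths of the model written out:

```python
layer_sizes = [784, 500, 400, 200, 10]  # input features, then units per Dense layer

params_per_layer = [
    n_in * n_out + n_out          # weight matrix plus one bias per unit
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

print(params_per_layer)       # [392500, 200400, 80200, 2010]
print(sum(params_per_layer))  # 675110
```

The first layer alone accounts for more than half of the parameters, because it connects all 784 pixels to 500 units.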

# Configure training and evaluation
# categorical_crossentropy : loss function for multiclass classification with one-hot labels
model.compile(loss='categorical_crossentropy',
              optimizer='Adam',   # optimizer : Adam is widely used and generally performs well
              metrics=['acc']    # metrics : evaluation metric (accuracy, since this is classification)
              )

# Train
h = model.fit(X_train, y_train_one_hot, epochs=30 , verbose=1)
# verbose : controls training output (0: silent, 1: progress bar (the default here), 2: one line per epoch, no bar)

plt.figure(figsize=(15,5))

plt.plot(h.history['acc'], label='acc')

plt.legend()
plt.show()

- To check for overfitting during training, split a validation set off the train data and report its accuracy alongside the training accuracy

from sklearn.model_selection import train_test_split

X_train, X_val, y_train_one_hot, y_val_one_hot = train_test_split(X_train,
                                                                  y_train_one_hot,
                                                                  random_state=3
                                                                  )
                                                                  
print(X_train.shape)
print(X_val.shape)              # validation inputs
print(y_train_one_hot.shape)
print(y_val_one_hot.shape)      # validation labels

(45000, 784)
(15000, 784)
(45000, 10)
(15000, 10)
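
The 45000/15000 split follows from `train_test_split` holding out 25% of the rows when no `test_size` is given. The NumPy sketch below shows the idea of a shuffled hold-out split. It is not scikit-learn's implementation, and seeding NumPy with 3 will not reproduce the same shuffle as `random_state=3`. The names `val_fraction`, `train_idx` and `val_idx` are hypothetical.

```python
import numpy as np

n_samples = 60000
val_fraction = 0.25           # matches train_test_split's default test_size

rng = np.random.default_rng(3)        # fixed seed for a reproducible shuffle
indices = rng.permutation(n_samples)

n_val = int(np.ceil(n_samples * val_fraction))
val_idx, train_idx = indices[:n_val], indices[n_val:]

print(len(train_idx), len(val_idx))
# Arrays would then be split as X[train_idx] and X[val_idx] (X is hypothetical here).
```

Shuffling before the cut matters: it keeps both parts representative of the whole set instead of taking the validation rows from one contiguous block.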

model1 = Sequential()

# Input layer + first hidden layer (input_dim is the number of features in X_train)
model1.add(Dense(500, input_dim=784, activation='relu'))

# Hidden layers
model1.add(Dense(400, activation='relu'))
model1.add(Dense(200, activation='relu'))

# The output layer has one neuron per one-hot column
model1.add(Dense(10, activation='softmax'))

model1.summary()

Model: "sequential_8"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_31 (Dense)            (None, 500)               392500    
                                                                 
 dense_32 (Dense)            (None, 400)               200400    
                                                                 
 dense_33 (Dense)            (None, 200)               80200     
                                                                 
 dense_34 (Dense)            (None, 10)                2010      
                                                                 
=================================================================
Total params: 675,110
Trainable params: 675,110
Non-trainable params: 0
_________________________________________________________________

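# Configure training and evaluation
# categorical_crossentropy : loss function for multiclass classification with one-hot labels
model1.compile(loss='categorical_crossentropy',
              optimizer='Adam',   # optimizer : Adam is widely used and generally performs well
              metrics=['acc']    # metrics : evaluation metric (accuracy, since this is classification)
              )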
              
# Train
h1 = model1.fit(X_train, y_train_one_hot, 
                epochs=30 , 
                verbose=1,
                # pass the validation set so it is evaluated after every epoch
                validation_data=(X_val, y_val_one_hot)
                )
# verbose : controls training output (0: silent, 1: progress bar (the default here), 2: one line per epoch, no bar)

plt.figure(figsize=(15,5))

# Training accuracy
plt.plot(h1.history['acc'],
         label='acc',
         c = 'blue',
         marker = '.'
         )

# Validation accuracy
plt.plot(h1.history['val_acc'],
         label='val_acc',
         c = 'red',
         marker='.'
         )

plt.xlabel("epochs")
plt.ylabel("accuracy")

plt.legend()
plt.show()
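
Besides reading the plot by eye, the two curves can be compared numerically. The sketch below uses made-up accuracy values purely for illustration; they are not results from the training run above. In practice `history` would be `h1.history`.

```python
import numpy as np

# Made-up curves for illustration only; these are NOT results from the run above.
history = {
    "acc":     [0.90, 0.95, 0.97, 0.98, 0.99, 0.995],
    "val_acc": [0.91, 0.94, 0.955, 0.96, 0.958, 0.957],
}

acc = np.array(history["acc"])
val_acc = np.array(history["val_acc"])

gap = acc - val_acc                       # widens when the model starts to overfit
best_epoch = int(np.argmax(val_acc)) + 1  # 1-based epoch with the best validation accuracy

print("best epoch:", best_epoch)
print("gap at last epoch:", round(float(gap[-1]), 3))
```

A training accuracy that keeps rising while the validation accuracy flattens or drops is the usual sign of overfitting, and the epoch with the best validation accuracy is a reasonable point to stop.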

model1.evaluate(X_test, y_test_one_hot)

313/313 [==============================] - 1s 3ms/step - loss: 0.2567 - acc: 0.9751
[0.256673663854599, 0.9750999808311462]
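
Two numbers in this output can be reproduced by hand. The 313 is the number of batches needed for 10000 test images at the default batch size of 32. The accuracy is the share of samples whose most probable class equals the true label. The sketch uses a tiny hypothetical set of predictions (`probs`, `y_true`), not the model's actual output.

```python
import math
import numpy as np

# 10000 test images at the default batch size of 32 -> steps shown by evaluate().
steps = math.ceil(10000 / 32)
print(steps)   # 313

# Hypothetical predicted probabilities for 4 samples and 3 classes.
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
])
y_true = np.array([0, 1, 2, 1])   # hypothetical integer labels

y_pred = probs.argmax(axis=1)     # most probable class per sample
accuracy = float((y_pred == y_true).mean())
print(y_pred, accuracy)           # [0 1 2 0] 0.75
```

With the real model, the same comparison would use `model1.predict(X_test).argmax(axis=1)` against `y_test`.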