Scikit-learn: A Hands-On Introduction to Machine Learning

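0. Series

Basics	Title	Role
Basics Part 1	NumPy Complete Guide: Working with Numeric Arrays	Data fundamentals
Basics Part 2	Pandas Complete Guide: Loading and Cleaning Data	Data preprocessing
Basics Part 3	Matplotlib / Seaborn: Data Visualization	Visualization
Basics Part 4 ⬅️	Scikit-learn: A Hands-On Introduction to Machine Learning	ML practice
Basics Part 5	TensorFlow / Keras: Building Deep Learning Models	DL practice
Basics Part 6	Introduction to PyTorch: How Does It Differ from Keras?	Advanced DL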

Practice code

git clone https://github.com/duckgeunpark/Ai-practice.git

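1. About Scikit-learn

1-1. What is Scikit-learn?

Scikit-learn is a Python library that collects machine learning algorithms behind an easy-to-use interface. Dozens of algorithms for classification, regression, clustering, and more are implemented with a consistent API, so once you learn the usage pattern for one estimator you can apply nearly the same pattern to the others.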

1-2. Where Scikit-learn Sits in the AI Pipeline

[Raw data (CSV)]
       ↓
 📦 Pandas: preprocessing, X/y split   (Part 2)
       ↓
 📊 Matplotlib/Seaborn: exploration  (Part 3)
       ↓
 🤖 Scikit-learn: model training, prediction, evaluation  ← you are here
       ↓
 🧠 TensorFlow/Keras: deep learning  (Part 5)

1-3. Installation and Import

pip install scikit-learn

import sklearn
print(sklearn.__version__)  # e.g. 1.4.0

Google Colab ships with scikit-learn preinstalled, so no separate installation is needed there.

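2. The Basic Machine Learning Workflow

2-1. Overall Flow

1. Prepare data      - load and preprocess with Pandas
2. Choose a model    - KNN? Decision tree? Random forest?
3. Train (fit)       - the model finds patterns in the data
4. Predict (predict) - output results for new data
5. Evaluate (score)  - check how well the model performs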

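2-2. Scikit-learn's Common API Pattern

Scikit-learn's biggest strength is that every supervised model (classifier or regressor) is used through the same three steps.

from sklearn.XXX import ModelName   # placeholder: substitute a real module and class

model = ModelName()
model.fit(X_train, y_train)   # train
model.predict(X_test)         # predict
model.score(X_test, y_test)   # evaluate

Whether it is KNN, a random forest, or linear regression, the pattern is always the same.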
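To make the shared pattern concrete, here is a minimal sketch that runs the same fit / predict / score calls over three different classifiers. The choice of models, the iris dataset, and the variable names (`models`, `scores`) are illustrative assumptions, not part of the original walkthrough.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Three different algorithms, all driven by the same three method calls
models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Decision tree": DecisionTreeClassifier(max_depth=3, random_state=42),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                  # train
    predictions = model.predict(X_test)          # predict
    scores[name] = model.score(X_test, y_test)   # evaluate (accuracy for classifiers)
    print(f"{name}: {scores[name]:.4f}")
```

Only the constructor line differs between models; the loop body never needs to know which algorithm it is running.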

Preparing the data

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data              # features (150, 4)
y = iris.target            # labels (150,)
feature_names = iris.feature_names

print(f"Data shape: {X.shape}")
print(f"Feature names: {feature_names}")
print(f"Classes: {iris.target_names.tolist()}")  # tolist() gives plain Python strings

Output

Data shape: (150, 4)
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Classes: ['setosa', 'versicolor', 'virginica']

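2-3. Splitting into Training and Test Data

To keep the model from seeing and memorizing the "exam questions" in advance, we split the data into a training set and a test set.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # hold out 20% for testing
    random_state=42,    # seed for reproducible results
    stratify=y          # preserve class proportions (recommended for classification)
)

print(X_train.shape)
print(X_test.shape)

Output

(120, 4)
(30, 4)

💡 random_state=42 fixes the random shuffle so that the split is identical on every run. The number itself has no special meaning; 42 is just a common convention.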
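The effect of `random_state` and `stratify` can be checked directly. The following is a small sketch on the iris data (50 samples per class); the variable names are made up for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Two independent calls with the same seed
split_a = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
split_b = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# train_test_split returns X_train, X_test, y_train, y_test in that order
X_test_a, y_train_a, y_test_a = split_a[1], split_a[2], split_a[3]
X_test_b = split_b[1]

same_split = np.array_equal(X_test_a, X_test_b)
test_counts = np.bincount(y_test_a)     # samples per class in the test set
train_counts = np.bincount(y_train_a)   # samples per class in the training set

print("Same split with same seed:", same_split)   # True
print("Test samples per class:", test_counts)     # [10 10 10]
print("Train samples per class:", train_counts)   # [40 40 40]
```

Because the classes are perfectly balanced, stratification gives exactly 10 test samples per class; without `stratify` the counts would generally be uneven.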

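3. Classification

3-1. What is Classification?

Classification is the task of deciding which category an input belongs to.

Is this email spam or not? → binary classification
Is this flower setosa, versicolor, or virginica? → multiclass classification

3-2. K-Nearest Neighbors (KNN)

An intuitive algorithm: "look at the K closest neighbors and decide by majority vote."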

from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print(model.score(X_test, y_test))

Output

1.0
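To show what "nearest neighbors plus majority vote" means mechanically, here is a hand-written sketch in NumPy. It is not scikit-learn's implementation; the helper `knn_predict` and the toy data are hypothetical, and it assumes Euclidean distance with equally weighted votes (which matches the KNeighborsClassifier defaults).

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distance to every training point
    nearest = np.argsort(distances)[:k]                  # indices of the k closest points
    votes = np.bincount(y_train[nearest])                # count labels among those neighbors
    return int(np.argmax(votes))                         # most common label (ties go to the smaller label)

# Toy data (made up for illustration): two clusters with labels 0 and 1
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=3)
pred_b = knn_predict(X_toy, y_toy, np.array([4.9, 5.0]), k=3)
print(pred_a, pred_b)  # 0 1
```

Note that there is no real "training" step: the model simply stores the data, and all the work happens at prediction time.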

3-3. Decision Tree

An algorithm that works like a game of Twenty Questions: follow one condition after another until you reach an answer.

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))

Output

0.9666666666666667

3-4. Random Forest

A random forest builds many decision trees and decides by majority vote among them. It is generally much more stable than a single decision tree because the errors of individual trees tend to cancel out.

from sklearn.ensemble import RandomForestClassifier
import pandas as pd

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model.fit(X_train, y_train)
print(f"Accuracy: {model.score(X_test, y_test):.4f}")

# Feature importances, indexed by feature name
importance = pd.Series(model.feature_importances_, index=feature_names)
print(importance.sort_values(ascending=False))

Output

Accuracy: 0.9333
petal width (cm)     0.438141
petal length (cm)    0.431641
sepal length (cm)    0.115972
sepal width (cm)     0.014246
dtype: float64
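The "many trees voting" idea can be inspected through the fitted forest's `estimators_` list. The sketch below counts the individual trees' predictions for one test sample. One caveat: scikit-learn's RandomForestClassifier combines trees by averaging their predicted class probabilities rather than counting hard votes, so this tally is an approximation of the idea, not the exact internal rule.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
forest.fit(X_train, y_train)

sample = X_test[:1]  # one test flower, kept 2-D as scikit-learn expects

# Ask each fitted tree for its own prediction (class indices 0, 1, 2)
tree_votes = np.array([int(tree.predict(sample)[0]) for tree in forest.estimators_])
vote_counts = np.bincount(tree_votes, minlength=3)

print("Number of trees:", len(forest.estimators_))
print("Votes per class:", vote_counts)
print("Forest prediction:", forest.predict(sample)[0])
```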

3-5. Practice: Classifying Iris Species

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print(f"Accuracy: {model.score(X_test, y_test):.4f}")

Output

Accuracy: 0.9000

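4. Regression

4-1. What is Regression?

Regression is the task of predicting a continuous numeric value.

Predicting an exam score from study hours
Predicting a house price from its area, number of floors, and location
Predicting tomorrow's temperature from weather data

4-2. Linear Regression

Linear regression finds the straight line that best fits the data and uses it to predict.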

from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[2], [4], [6], [8], [10]])   # one feature, shape (5, 1)
y = np.array([50, 65, 75, 85, 95])

model = LinearRegression()
model.fit(X, y)

print(model.predict([[7]]))            # prediction for x = 7
print(f"Slope: {model.coef_[0]:.2f}")
print(f"Intercept: {model.intercept_:.2f}")
print(f"R² score: {model.score(X, y):.4f}")

Output

[79.5]
Slope: 5.50
Intercept: 41.00
R² score: 0.9918
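For a single feature, the "best-fitting line" has a closed-form least-squares solution, so the numbers above can be reproduced by hand. This is a sketch of the textbook formulas in NumPy, not of how scikit-learn computes the fit internally.

```python
import numpy as np

x = np.array([2, 4, 6, 8, 10], dtype=float)
y = np.array([50, 65, 75, 85, 95], dtype=float)

x_mean, y_mean = x.mean(), y.mean()

# slope = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
# The fitted line always passes through the point (x_mean, y_mean)
intercept = y_mean - slope * x_mean

y_hat = slope * x + intercept
# R^2 = 1 - (residual sum of squares) / (total sum of squares)
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y_mean) ** 2)

print(f"slope={slope:.2f}, intercept={intercept:.2f}")      # slope=5.50, intercept=41.00
print(f"prediction for x=7: {slope * 7 + intercept:.1f}")   # 79.5
print(f"R^2={r2:.4f}")                                      # 0.9918
```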

4-3. Practice: Predicting Scores from Study Hours + Visualization

import matplotlib.pyplot as plt
import matplotlib
matplotlib.rcParams["font.family"] = "Malgun Gothic"   # Windows font for Korean text; optional with English labels
matplotlib.rcParams["axes.unicode_minus"] = False
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[2], [4], [6], [8], [10], [12]])   # study hours
y = np.array([50, 63, 72, 84, 91, 98])           # scores

model = LinearRegression()
model.fit(X, y)

# Evenly spaced x values for drawing the fitted line
x_range = np.linspace(0, 14, 100).reshape(-1, 1)
y_pred = model.predict(x_range)

plt.scatter(X, y, color="blue", label="Actual data", s=80)
plt.plot(x_range, y_pred, color="red", label="Fitted line")
plt.xlabel("Study hours")
plt.ylabel("Score")
plt.title("Linear Regression: Study Hours vs Score")
plt.legend()
plt.grid(True)
plt.show()

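5. Model Evaluation

5-1. Classification Metrics

Accuracy alone is often not enough to evaluate a model. For example, if 99% of the items in a defect-detection task are normal, a model that always predicts "normal" still reaches 99% accuracy while catching no defects at all.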
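The defect-detection scenario is easy to reproduce. The sketch below uses made-up data (990 normal items, 10 defective) and a "model" that always predicts normal.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical inspection data: 990 normal items (0) and 10 defective items (1)
y_true = np.array([0] * 990 + [1] * 10)

# A "model" that always predicts normal
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
defect_recall = recall_score(y_true, y_pred)  # recall for the positive class (1 = defective)

print(f"Accuracy: {accuracy:.2f}")               # 0.99
print(f"Recall (defects): {defect_recall:.2f}")  # 0.00
```

The accuracy looks excellent, yet recall on the class we actually care about is zero, which is why per-class metrics matter.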

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Rebuild the iris classification model (the variables were overwritten in Section 4)
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=iris.target_names))

Output

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       0.82      0.90      0.86        10
   virginica       0.89      0.80      0.84        10

    accuracy                           0.90        30
   macro avg       0.90      0.90      0.90        30
weighted avg       0.90      0.90      0.90        30

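Metric	Meaning
Accuracy	Fraction of all samples predicted correctly
Precision	Of the samples predicted "positive", the fraction that are actually positive
Recall	Of the actually positive samples, the fraction predicted as positive
F1 Score	Harmonic mean of precision and recall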

5-2. Regression Metrics

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
import numpy as np

# Prepare the regression data
X_reg = np.array([[2], [4], [6], [8], [10], [12]])
y_reg = np.array([50, 63, 72, 84, 91, 98])

reg_model = LinearRegression()
reg_model.fit(X_reg, y_reg)
y_reg_pred = reg_model.predict(X_reg)

print(f"MAE:  {mean_absolute_error(y_reg, y_reg_pred):.2f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_reg, y_reg_pred)):.2f}")
print(f"R²:   {r2_score(y_reg, y_reg_pred):.4f}")

Output

MAE:  1.56
RMSE: 1.85
R²:   0.9874
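For reference, these are the standard definitions of the three metrics, written with $y_i$ for the true values, $\hat{y}_i$ for the predictions, $\bar{y}$ for the mean of the true values, and $n$ for the number of samples (notation chosen here for illustration):

```latex
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

MAE and RMSE are in the same units as the target (points on the exam here), and lower is better. R² is unitless; 1.0 means a perfect fit, and 0 means no better than always predicting the mean.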

5-3. Visualizing the Confusion Matrix

import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib
matplotlib.rcParams["font.family"] = "Malgun Gothic"   # Windows font for Korean text; optional with English labels
matplotlib.rcParams["axes.unicode_minus"] = False
from sklearn.metrics import confusion_matrix

# Uses the classification results (y_test, y_pred, iris) from 5-1
cm = confusion_matrix(y_test, y_pred)

sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=iris.target_names,
            yticklabels=iris.target_names)

plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()

How to read a confusion matrix:
               Predicted: cat   Predicted: dog
Actual: cat     10 (correct)      2 (wrong)
Actual: dog      1 (wrong)       12 (correct)

→ The diagonal holds the correct predictions; everything else is an error.
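The metrics from 5-1 can be derived directly from the four cells of the cat/dog matrix above. In this sketch "cat" is treated as the positive class, which is an arbitrary choice made for illustration.

```python
import numpy as np

# Confusion matrix from the cat/dog example: rows = actual, columns = predicted
cm = np.array([[10, 2],    # actual cat: 10 predicted cat, 2 predicted dog
               [1, 12]])   # actual dog:  1 predicted cat, 12 predicted dog

# With "cat" as the positive class:
tp, fn = cm[0, 0], cm[0, 1]   # cats found, cats missed
fp, tn = cm[1, 0], cm[1, 1]   # dogs mistaken for cats, dogs correctly rejected

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.880 precision=0.909 recall=0.833 f1=0.870
```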

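6. Data Preprocessing the Scikit-learn Way

6-1. Why Is Scaling Needed?

Age: 25, 30, 27  →  range: 20 to 40
Salary: 3000, 5000, 4200  →  range: 1000 to 10000

When features are on different scales, distance-based algorithms such as KNN are dominated by the feature with the larger range (here, salary). Scaling puts the features on comparable ranges so that each one contributes fairly.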
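A quick calculation shows how strongly salary dominates. The sketch below takes the three example people from the text, measures the Euclidean distance between the first two, and then repeats the measurement after standardization. The helper `distance` is a hypothetical name.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: age, salary (the three example people from the text)
people = np.array([[25, 3000],
                   [30, 5000],
                   [27, 4200]], dtype=float)

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

raw_distance = distance(people[0], people[1])
salary_gap = abs(people[0, 1] - people[1, 1])

scaled = StandardScaler().fit_transform(people)
scaled_distance = distance(scaled[0], scaled[1])
scaled_age_gap = abs(scaled[0, 0] - scaled[1, 0])

print(f"Raw distance:       {raw_distance:.2f}")   # 2000.01, almost exactly the salary gap
print(f"Salary gap alone:   {salary_gap:.2f}")     # 2000.00
print(f"Scaled distance:    {scaled_distance:.2f}")
print(f"Scaled age gap:     {scaled_age_gap:.2f}")
```

Before scaling, the 5-year age difference changes the distance by less than 0.01; after scaling, age accounts for a substantial share of it.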

6-2. Standardization / Normalization

from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Uses X_train and X_test from 5-1

# Standardization: transform to mean 0, standard deviation 1
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit + transform
X_test_scaled  = scaler.transform(X_test)       # transform only (never fit on test data!)
print("Mean after standardization:", X_train_scaled.mean(axis=0).round(2))
print("Std after standardization:", X_train_scaled.std(axis=0).round(2))

# Normalization: rescale to the range 0 to 1
scaler2 = MinMaxScaler()
X_train_norm = scaler2.fit_transform(X_train)
X_test_norm  = scaler2.transform(X_test)
print("\nMin after normalization:", X_train_norm.min(axis=0))
print("Max after normalization:", X_train_norm.max(axis=0))

Output

Mean after standardization: [-0. -0.  0.  0.]
Std after standardization: [1. 1. 1. 1.]

Min after normalization: [0. 0. 0. 0.]
Max after normalization: [1. 1. 1. 1.]

⚠️ Important: use fit_transform only on the training data, and use transform alone on the test data. This prevents data leakage, where information from the test data seeps into training.
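What "transform only" means can be verified by hand: after `fit`, the scaler stores the training mean and standard deviation in `mean_` and `scale_`, and `transform` simply reuses them. A minimal sketch on the iris split:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler().fit(X_train)   # statistics come from the training set only

# transform() reuses the stored training mean and standard deviation
manual_test = (X_test - scaler.mean_) / scaler.scale_
matches = np.allclose(manual_test, scaler.transform(X_test))
mean_from_train = np.allclose(scaler.mean_, X_train.mean(axis=0))

print("Stored mean equals training mean:", mean_from_train)   # True
print("Manual transform matches:", matches)                   # True
print("Test-set mean after scaling:", scaler.transform(X_test).mean(axis=0).round(2))
```

The scaled test set does not have a mean of exactly 0, and that is expected: it was scaled with the training statistics, just as unseen data would be in production.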

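6-3. Encoding Categorical Data

Most machine learning models cannot handle text directly, so strings must be converted to numbers.

from sklearn.preprocessing import LabelEncoder
import pandas as pd

# LabelEncoder: maps each class to an integer (codes follow sorted label order).
# It is intended for target labels (y); the integers do not express a meaningful order.
le = LabelEncoder()
labels = ["dog", "cat", "bird", "dog", "cat"]
print(le.fit_transform(labels))  # [2 1 0 2 1]

# One-hot encoding (here via pandas get_dummies): recommended for unordered categories
df = pd.DataFrame({"animal": ["dog", "cat", "bird"]})
encoded = pd.get_dummies(df["animal"])
print(encoded)

Output

[2 1 0 2 1]
    bird    cat    dog
0  False  False   True
1  False   True  False
2   True  False  False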
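7. Full Pipeline Practice

7-1. Connecting What We Have Learned

Part 2 (Pandas)       → load data, preprocess, split X/y
Part 3 (Seaborn)      → visualize distributions and correlations
Part 4 (Scikit-learn) → scaling → model training → evaluation

7-2. Handling Everything at Once with Pipeline

A Pipeline chains preprocessing and model training into a single flow.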

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier())   # no random_state, so the score can vary between runs
])

pipeline.fit(X_train, y_train)   # X_train, y_train: the iris split from 5-1
print(pipeline.score(X_test, y_test))

Output (example; varies between runs)

0.9333333333333333
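To see what a Pipeline does under the hood, the sketch below compares it with the same steps written out by hand. KNN is used instead of a random forest here only because it is deterministic, which makes the two versions directly comparable; that substitution is an assumption of this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Version 1: the Pipeline handles scaling and the model together
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", KNeighborsClassifier(n_neighbors=5)),
])
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

# Version 2: the same steps written out by hand
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)        # fit on training data only
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
manual_pred = knn.predict(scaler.transform(X_test))   # transform only on test data

identical = np.array_equal(pipe_pred, manual_pred)
print("Identical predictions:", identical)   # True
```

The Pipeline applies `fit_transform` during `fit` and `transform` during `predict` automatically, so the fit/transform rule from 6-2 cannot be broken by accident.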

7-3. Practice: Titanic Survival Prediction (Start to Finish)

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib
matplotlib.rcParams["font.family"] = "Malgun Gothic"   # Windows font for Korean text; optional with English labels
matplotlib.rcParams["axes.unicode_minus"] = False
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# 1. Load the data
df = sns.load_dataset("titanic")

# 2. Explore
sns.countplot(data=df, x="survived", hue="sex")
plt.title("Survival by Sex")
plt.show()

# 3. Preprocess: keep a few columns, drop rows with missing values, encode sex as 0/1
df = df[["survived", "pclass", "sex", "age", "fare"]].dropna()
df["sex"] = df["sex"].map({"male": 0, "female": 1})

X = df.drop(columns=["survived"]).to_numpy()
y = df["survived"].to_numpy()

# 4. Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 5. Train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# 6. Evaluate
print(f"Accuracy: {model.score(X_test, y_test):.4f}")
print(classification_report(y_test, model.predict(X_test)))

# 7. Visualize the confusion matrix
cm = confusion_matrix(y_test, model.predict(X_test))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["Died", "Survived"],
            yticklabels=["Died", "Survived"])
plt.title("Titanic Survival Prediction: Confusion Matrix")
plt.show()

Output

Accuracy: 0.7972
              precision    recall  f1-score   support

           0       0.82      0.85      0.83        85
           1       0.76      0.72      0.74        58

    accuracy                           0.80       143
   macro avg       0.79      0.79      0.79       143
weighted avg       0.80      0.80      0.80       143

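8. Wrap-Up

8-1. Summary of What We Learned Today

Concept	Key point
Common API	`fit` → `predict` → `score`: the same pattern for every supervised model
Classification	KNN, decision tree, random forest
Regression	Predict continuous values with linear regression
Evaluation	F1 Score for classification, R² / RMSE for regression
Scaling	Use only `transform` on test data
Pipeline	Preprocessing + training in a single flow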