🧠 Step 1: Define the Problem and Collect Data
Before doing any AI modeling, you must clearly define what problem you are solving. Is the goal classification, regression, recommendation, or forecasting?
📌 Example task: predict whether a customer will churn (Customer Churn)
✅ Example dataset: Telco Customer Churn (Kaggle)
# Download the data
wget https://raw.githubusercontent.com/blastchar/telco-customer-churn/master/WA_Fn-UseC_-Telco-Customer-Churn.csv -O churn.csv
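A quick sanity check confirms the download before any cleaning. This is a minimal sketch; the roughly 7,000 × 21 shape and the object dtype of TotalCharges are properties of this Kaggle dataset:
import pandas as pd
# Peek at the raw file before Step 2
df = pd.read_csv('churn.csv')
print(df.shape)                                  # roughly 7,000 rows x 21 columns
print(df.dtypes)                                 # TotalCharges loads as object; handled in Step 2
print(df['Churn'].value_counts(normalize=True))  # class balance of the target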
🛠️ Step 2: Clean the Data + Feature Engineering
import pandas as pd
df = pd.read_csv('churn.csv')
# Drop the customer ID (an identifier, not a predictive feature)
df.drop('customerID', axis=1, inplace=True)
# Treat 'No internet service' as 'No'
df.replace('No internet service', 'No', inplace=True)
# Handle missing values: TotalCharges contains blank strings, so coerce to numeric and drop NaNs
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df.dropna(inplace=True)
# Encode the target label
df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})
🚀 One-Hot Encoding + Modeling Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
categorical_cols = df.select_dtypes(include='object').columns.tolist()
X = df.drop('Churn', axis=1)
y = df['Churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Build the pipeline: one-hot encode the categorical columns, pass the numeric ones through, then fit a random forest
pipeline = Pipeline(steps=[
    ('preprocess', ColumnTransformer([
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ], remainder='passthrough')),
    ('model', RandomForestClassifier(n_estimators=100, random_state=42))
])
🧪 Step 3: Train + Evaluate the Model
from sklearn.metrics import classification_report, accuracy_score
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
Sample output:
Accuracy: 0.80
Precision: 0.75
Recall: 0.72
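Churn labels in this dataset are imbalanced (most customers do not churn), so accuracy alone can overstate model quality. A minimal follow-up sketch, reusing the fitted pipeline and predictions above, adds ROC-AUC and a confusion matrix:
from sklearn.metrics import roc_auc_score, confusion_matrix
# Probability of the positive class (churn = 1)
y_proba = pipeline.predict_proba(X_test)[:, 1]
print("ROC-AUC:", roc_auc_score(y_test, y_proba))
print(confusion_matrix(y_test, y_pred))  # rows = actual, columns = predicted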
🧪 Step 4: Optimize Model Performance
Things worth trying:
- ❇️ Hyperparameter search: GridSearchCV
- ❇️ Swapping in a different model: XGBoost, LightGBM
- ❇️ Feature selection: PCA, RFE (see the sketch after this list)
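As a sketch of the feature-selection idea only (the RFE step name, the LogisticRegression selector, and n_features_to_select=20 below are assumptions, not part of the original pipeline), recursive feature elimination can sit between preprocessing and the model, reusing the objects defined in Step 2:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# Hypothetical variant of the Step 2 pipeline with an added RFE step
rfe_pipeline = Pipeline(steps=[
    ('preprocess', ColumnTransformer([
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ], remainder='passthrough')),
    ('select', RFE(LogisticRegression(max_iter=1000), n_features_to_select=20)),
    ('model', RandomForestClassifier(n_estimators=100, random_state=42))
])
rfe_pipeline.fit(X_train, y_train)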
🎯 Example: GridSearchCV Tuning
from sklearn.model_selection import GridSearchCV
param_grid = {
    'model__n_estimators': [100, 200],
    'model__max_depth': [5, 10, None],
}
search = GridSearchCV(pipeline, param_grid, cv=3, scoring='accuracy', verbose=2)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
🌐 Step 5: Deploy as an Online Service (Flask API)
Create app.py:
from flask import Flask, request, jsonify
import pandas as pd
import joblib

app = Flask(__name__)
model = joblib.load('churn_model.pkl')

@app.route('/predict', methods=['POST'])
def predict():
    # Expect a JSON object with the same feature fields used at training time
    data = request.get_json()
    df = pd.DataFrame([data])
    prediction = model.predict(df)[0]
    return jsonify({'churn': int(prediction)})

if __name__ == '__main__':
    app.run(debug=True)
Save the trained pipeline:
import joblib
joblib.dump(pipeline, 'churn_model.pkl')
Run the service:
python app.py
Send a test request:
curl -X POST http://localhost:5000/predict \
  -H "Content-Type: application/json" \
  -d '{"gender": "Female", "SeniorCitizen": 0, ... }'
🗂️ Project Structure Reference
├── churn.csv
├── churn_model.pkl
├── app.py
├── train.py
├── requirements.txt
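A minimal requirements.txt covering the libraries used above could look like this (versions are left unpinned here; pin them for reproducible deployments):
flask
pandas
scikit-learn
joblib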
✅ Summary
| Step | Content |
| --- | --- |
| Step 1 | Define the problem, prepare the dataset |
| Step 2 | Data cleaning + feature engineering |
| Step 3 | Model training and evaluation |
| Step 4 | Performance optimization |
| Step 5 | Online deployment with Flask |
📌 Additional Recommendations
- 📦 Recommended model management platforms: MLflow, DVC
- 🧠 Recommended deployment: FastAPI + Docker (see the sketch after this list)
- 🔁 Recommended automation: CI/CD with GitHub Actions
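As a hedged sketch of the FastAPI recommendation (a hypothetical main.py, not part of the project files above; run it with uvicorn main:app), the Flask endpoint translates roughly to:
from fastapi import FastAPI
import pandas as pd
import joblib

app = FastAPI()
model = joblib.load('churn_model.pkl')

@app.post('/predict')
def predict(data: dict):
    # FastAPI parses the JSON request body into the dict parameter
    df = pd.DataFrame([data])
    prediction = model.predict(df)[0]
    return {'churn': int(prediction)}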