How Is an AI Model Actually Built? A Complete Code Walkthrough!

🧠 Step 1: Define the Problem and Collect Data

Before building any AI model, you must clearly define the problem you are solving: is the goal classification, regression, recommendation, or forecasting?

📌 Example task: predict whether a customer will churn (Customer Churn)

✅ Example dataset: Telco Customer Churn (Kaggle)

# Download the data
wget https://raw.githubusercontent.com/blastchar/telco-customer-churn/master/WA_Fn-UseC_-Telco-Customer-Churn.csv -O churn.csv
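
If you'd rather skip the shell step, pandas can read the CSV straight from the same URL (a minimal sketch; it assumes an internet connection and keeps a local copy so the later steps work unchanged):

import pandas as pd

# Read the Telco churn CSV directly from the raw GitHub URL used above
url = 'https://raw.githubusercontent.com/blastchar/telco-customer-churn/master/WA_Fn-UseC_-Telco-Customer-Churn.csv'
df = pd.read_csv(url)
df.to_csv('churn.csv', index=False)  # save locally as churn.csv for the next steps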

🛠️ Step 2: Clean the Data + Feature Engineering

import pandas as pd

df = pd.read_csv('churn.csv')

# Drop the customer ID column (an identifier, not a feature)
df.drop('customerID', axis=1, inplace=True)

# Treat 'No internet service' the same as 'No'
df.replace('No internet service', 'No', inplace=True)

# Handle missing values: TotalCharges contains blank strings, so coerce to numeric and drop NaNs
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df.dropna(inplace=True)

# Encode the label as 0/1
df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})
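
Before encoding, it's worth a quick sanity check on the cleaned frame (a minimal sketch; the column names follow the Telco dataset loaded above):

print(df.shape)                                  # rows x columns after dropping NaNs
print(df['Churn'].value_counts(normalize=True))  # class balance of the target
print(df.dtypes.value_counts())                  # how many object vs. numeric columns remain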

🚀 One-Hot Encoding + Feature Selection

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

categorical_cols = df.select_dtypes(include='object').columns.tolist()
X = df.drop('Churn', axis=1)
y = df['Churn']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build the preprocessing + model pipeline
pipeline = Pipeline(steps=[
    ('preprocess', ColumnTransformer([
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ], remainder='passthrough')),
    ('model', RandomForestClassifier(n_estimators=100, random_state=42))
])

🧪 Step 3: Train + Evaluate the Model

from sklearn.metrics import classification_report, accuracy_score

pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)

print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))

Sample output:

Accuracy: 0.80
Precision: 0.75
Recall: 0.72

🧪 Step 4: Optimize Model Performance

Things you can try:

  • ❇️ Hyperparameter search: GridSearchCV
  • ❇️ Swap in a different model: XGBoost, LightGBM (see the sketch after this list)
  • ❇️ Feature selection / dimensionality reduction: RFE, PCA
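
For example, swapping the random forest for XGBoost only means changing the 'model' step of the pipeline (a minimal sketch, assuming the xgboost package is installed; the hyperparameters are illustrative):

from xgboost import XGBClassifier

xgb_pipeline = Pipeline(steps=[
    ('preprocess', ColumnTransformer([
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ], remainder='passthrough')),
    ('model', XGBClassifier(n_estimators=200, max_depth=6,
                            learning_rate=0.1, eval_metric='logloss'))
])

xgb_pipeline.fit(X_train, y_train)
print("XGBoost accuracy:", xgb_pipeline.score(X_test, y_test))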

🎯 Example: GridSearchCV Tuning

from sklearn.model_selection import GridSearchCV

param_grid = {
    'model__n_estimators': [100, 200],
    'model__max_depth': [5, 10, None],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring='accuracy', verbose=2)
search.fit(X_train, y_train)

print("最佳参数:", search.best_params_)

🌐 Step 5: Deploy as an Online Service (Flask API)

Create app.py:

from flask import Flask, request, jsonify
import joblib
import pandas as pd

app = Flask(__name__)
model = joblib.load('churn_model.pkl')  # the pipeline saved below with joblib.dump

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    df = pd.DataFrame([data])            # wrap the single record in a one-row DataFrame
    prediction = model.predict(df)[0]
    return jsonify({'churn': int(prediction)})

if __name__ == '__main__':
    app.run(debug=True)

Save the model (run this after training, before starting the service):

import joblib
joblib.dump(pipeline, 'churn_model.pkl')
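
A quick round-trip check that the saved file loads and predicts correctly before deploying (a sketch reusing X_test from above):

reloaded = joblib.load('churn_model.pkl')
print(reloaded.predict(X_test.iloc[[0]]))  # should match pipeline.predict on the same row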

Run the service:

python app.py

Send a test request:

curl -X POST http://localhost:5000/predict \
-H "Content-Type: application/json" \
-d '{"gender": "Female", "SeniorCitizen": 0, ... }'

🗂️ Project Structure Reference

├── churn.csv
├── churn_model.pkl
├── app.py
├── train.py
├── requirements.txt

✅ Summary

  • Step 1: Define the problem, prepare the dataset
  • Step 2: Data cleaning + feature engineering
  • Step 3: Model training and validation
  • Step 4: Performance optimization
  • Step 5: Online deployment with Flask

📌 Further Recommendations

  • 📦 Recommended model management tools: MLflow, DVC
  • 🧠 Recommended deployment stack: FastAPI + Docker (see the sketch after this list)
  • 🔁 Recommended automation: CI/CD with GitHub Actions
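
As a taste of the FastAPI route mentioned above, the same /predict endpoint looks roughly like this (a minimal sketch, assuming fastapi and uvicorn are installed; the file name fastapi_app.py is arbitrary):

# fastapi_app.py
from fastapi import FastAPI
import joblib
import pandas as pd

app = FastAPI()
model = joblib.load('churn_model.pkl')  # the same pipeline saved earlier

@app.post('/predict')
def predict(data: dict):
    df = pd.DataFrame([data])
    prediction = model.predict(df)[0]
    return {'churn': int(prediction)}

# Run with: uvicorn fastapi_app:app --reload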