Tags: Machine Learning, Python, Scikit-learn, FastAPI, Docker, MLOps
Table of Contents
- Preface: why still write a "Hello World"-level ML post?
- Environment ready in 30 seconds: conda + poetry in one command
- Data: regression on seaborn's built-in Tips dataset
- Modeling: Pipeline + GridSearchCV in three lines
- Interpretability: SHAP global & local explanations
- Serving: FastAPI with auto-generated interactive docs
- Containerization: Dockerfile + docker-compose, one command to ship
- CI/CD: GitHub Actions runs tests and builds the image automatically
- Wrap-up & next steps: how far is a Notebook from production?
1. Preface
It's 2025. LLMs, multimodal models, and agents are all the rage, so why write an "intro-level" machine learning post?
The answer is simple:
- 90% of small and mid-size business problems can still be solved with "tabular data + tree models";
- 80% of developers hit pitfalls the first time they move a model to production;
- this post walks you through one complete end-to-end pipeline with a minimum of concepts. Want to swap the framework or the data later? You only replace a "plugin".
2. Environment ready in 30 seconds
```bash
# 1. Create an isolated environment
conda create -n ml-demo python=3.11 -y && conda activate ml-demo
# 2. Manage dependencies with poetry (nicer than requirements.txt)
pip install poetry
poetry init --no-interaction --python=">=3.11"
poetry add scikit-learn pandas seaborn fastapi uvicorn shap joblib
```
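
After `poetry add`, Poetry records everything in `pyproject.toml`. Roughly what it ends up looking like (a sketch only: section names vary by Poetry version, and the pinned version ranges are replaced by `"*"` placeholders here):

```toml
# pyproject.toml -- illustrative sketch, not a verbatim Poetry output
[tool.poetry]
name = "ml-demo"
version = "0.1.0"
description = ""

[tool.poetry.dependencies]
python = ">=3.11"
scikit-learn = "*"   # poetry add pins caret ranges; "*" is a placeholder
pandas = "*"
seaborn = "*"
fastapi = "*"
uvicorn = "*"
shap = "*"
joblib = "*"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
```

The companion `poetry.lock` pins exact versions, which is what makes the Docker build below reproducible.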
3. Data: load + EDA in two lines
```python
import seaborn as sns
import pandas as pd

df = sns.load_dataset("tips")
print(df.head())
# Quick look at missing values
print(df.isna().mean().sort_values(ascending=False))
```
The Tips dataset is 244 rows × 7 columns with no missing values, so we can get straight to work.
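
Beyond `head()`, a one-line `groupby` often reveals the main signal in ten seconds. The inline mini-sample below is hypothetical (same schema as Tips) so the snippet runs offline; on the real data just reuse `df` from above:

```python
import pandas as pd

# Hypothetical mini-sample with the Tips schema, for offline experimentation
df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68],
    "tip":        [1.01, 1.66, 3.50, 3.30],
    "day":        ["Sun", "Sun", "Sat", "Sat"],
    "size":       [2, 3, 3, 2],
})

# Average tip per day: a quick check of how the target varies by category
print(df.groupby("day")["tip"].mean())
```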
4. Modeling: Pipeline + GridSearchCV
```python
from sklearn.model_selection import GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
import joblib

X = df.drop(columns=["tip"])
y = df["tip"]
num_cols = X.select_dtypes(include=["int64", "float64"]).columns
cat_cols = X.select_dtypes(include=["category", "object"]).columns

pre = ColumnTransformer([
    ("num", "passthrough", num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), cat_cols),
])
pipe = Pipeline([
    ("prep", pre),
    ("gbr", GradientBoostingRegressor()),
])
param = {
    "gbr__n_estimators": [100, 200],
    "gbr__max_depth": [2, 3],
    "gbr__learning_rate": [0.05, 0.1],
}
model = GridSearchCV(pipe, param, cv=5, n_jobs=-1,
                     scoring="neg_root_mean_squared_error")
model.fit(X, y)
print("Best RMSE:", -model.best_score_)
joblib.dump(model.best_estimator_, "tip_model.pkl")
```

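The cross-validated score from `GridSearchCV` is an internal estimate; before shipping, it is worth confirming performance on data the search never saw. A minimal sketch using synthetic regression data (so it runs without the Tips download; in practice substitute `X`, `y` from above):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Hold out 20% BEFORE the grid search so the test score is untouched by tuning
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    {"n_estimators": [100, 200], "max_depth": [2, 3]},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_tr, y_tr)

# RMSE on the untouched hold-out set (version-safe: sqrt of MSE)
rmse = mean_squared_error(y_te, search.predict(X_te)) ** 0.5
print(f"held-out RMSE: {rmse:.2f}")
```

If the held-out RMSE is much worse than the CV score, the grid search has overfit its own validation folds.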
5. Interpretability: SHAP global & local
```python
import shap
import pandas as pd

# Explain in the one-hot feature space so the SHAP columns line up
# with what the regressor actually sees
prep = model.best_estimator_["prep"]
X_mat = prep.transform(X)
if hasattr(X_mat, "toarray"):  # densify if the transformer returned sparse
    X_mat = X_mat.toarray()
X_trans = pd.DataFrame(X_mat, columns=prep.get_feature_names_out())

explainer = shap.TreeExplainer(model.best_estimator_["gbr"])
shap_values = explainer.shap_values(X_trans)

# Global
shap.summary_plot(shap_values, X_trans, show=False)
# Local: sample 0
shap.force_plot(
    explainer.expected_value,
    shap_values[0],
    X_trans.iloc[0],
    matplotlib=True,
)
```
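
SHAP values are Shapley values from cooperative game theory: each feature's marginal contribution averaged over all orderings, and they always sum to `prediction - expected_value`. A brute-force toy computation for two features (the coalition values below are hypothetical, not taken from the model above) makes that additivity property concrete:

```python
from itertools import permutations

# Hypothetical value v(S) of each feature coalition S
# (think: model output when only features in S are "known")
v = {(): 10.0, ("bill",): 14.0, ("size",): 11.0, ("bill", "size"): 16.0}

def key(s):
    return tuple(sorted(s))

features = ["bill", "size"]
orders = list(permutations(features))
shapley = {f: 0.0 for f in features}
for order in orders:
    seen = []
    for f in order:
        before = v[key(seen)]      # value before adding feature f
        seen.append(f)
        shapley[f] += (v[key(seen)] - before) / len(orders)

print(shapley)                     # per-feature contributions
print(sum(shapley.values()))      # equals v(all) - v(empty) = 6.0
```

The same additivity is what lets `force_plot` draw each feature pushing the prediction away from `expected_value`.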
6. Serving: FastAPI + Pydantic
```python
# main.py
from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import pandas as pd

model = joblib.load("tip_model.pkl")

class Item(BaseModel):
    total_bill: float
    sex: str
    smoker: str
    day: str
    time: str
    size: int

app = FastAPI(title="Tips tip-prediction API")

@app.post("/predict")
def predict(item: Item):
    df = pd.DataFrame([item.model_dump()])
    pred = model.predict(df)[0]
    return {"tip": round(float(pred), 2)}
```
Start it:
```bash
uvicorn main:app --reload
```
Open http://127.0.0.1:8000/docs in a browser for interactive debugging.
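
Outside the Swagger UI, any HTTP client works. A sketch that builds a valid request body for the `Item` model above and prints the equivalent `curl` call (the field values are made up):

```python
import json

# Field names must match the Item model exactly
payload = {
    "total_bill": 24.5, "sex": "Female", "smoker": "No",
    "day": "Sun", "time": "Dinner", "size": 4,
}
body = json.dumps(payload)
print(
    f"curl -X POST http://127.0.0.1:8000/predict "
    f"-H 'Content-Type: application/json' -d '{body}'"
)
```

FastAPI validates the payload against `Item` before your handler runs, so a missing field or a wrong type comes back as a 422 with a precise error message.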
7. Containerization: Dockerfile
```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY poetry.lock pyproject.toml ./
RUN pip install poetry && poetry config virtualenvs.create false \
    && poetry install --no-root --only main
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```
Run it in one go:
```bash
docker build -t tip-api .
docker run -p 8000:8000 tip-api
```
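
The table of contents also promises docker-compose. A minimal sketch (the service name and options here are my choices, not something defined earlier in the post):

```yaml
# docker-compose.yml -- minimal sketch
services:
  tip-api:
    build: .
    ports:
      - "8000:8000"
    restart: unless-stopped
```

Then `docker compose up -d --build` builds the image and starts the API in one command, and `docker compose down` tears it back down.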
8. CI/CD: GitHub Actions
```yaml
# .github/workflows/ci.yml
name: ci
on: [push]
permissions:
  contents: read
  packages: write   # required to push to ghcr.io with GITHUB_TOKEN
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: ghcr.io/${{ github.repository_owner }}/tip-api:latest
```
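
The section title mentions running tests, but the workflow above only builds and pushes. A hedged sketch of steps that could go before the Docker steps (it assumes a `tests/` directory with pytest tests and pytest added as a dev dependency, neither of which this post has created):

```yaml
      # Run unit tests before building the image
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install poetry && poetry install --no-root
      - run: poetry run pytest -q
```

A failing test then stops the job before any image is pushed to the registry.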
9. Wrap-up & next steps
| Stage | This post | Production-grade upgrade |
|---|---|---|
| Data | Built-in dataset | Feature store, data lineage |
| Training | GridSearchCV | Optuna, Ray Tune |
| Explanation | SHAP | Custom business metrics |
| Deployment | FastAPI | Seldon Core, Kubeflow |
| Monitoring | None | Prometheus + Evidently |
Replace these "building blocks" one by one, and your MLOps stack grows organically out of this pipeline.
