An End-to-End Spark ML Project: Flight Delay Prediction with Kaggle Data
Below we use the Kaggle flight delays dataset to build a complete distributed machine learning project, focusing on how Pipelines are used and on what makes the training distributed.
Environment Setup
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, dayofweek, month
from pyspark.sql.types import IntegerType, FloatType
import matplotlib.pyplot as plt
# Initialize the SparkSession (the distributed entry point)
spark = SparkSession.builder \
    .appName("FlightDelayPrediction") \
    .config("spark.executor.memory", "8g") \
    .config("spark.driver.memory", "4g") \
    .config("spark.sql.shuffle.partitions", "200") \
    .getOrCreate()
Data Loading and Exploration
# Load the data from S3 (distributed storage)
flights = spark.read.csv("s3a://flight-delays/2015.csv", header=True, inferSchema=True)
# Check the dataset size (distributed count)
print(f"Dataset size: {flights.count():,} rows x {len(flights.columns)} columns")
# Show a small sample of the data
flights.select("YEAR", "MONTH", "DAY", "AIRLINE", "ORIGIN_AIRPORT",
               "DESTINATION_AIRPORT", "DEPARTURE_DELAY", "ARRIVAL_DELAY") \
    .sample(0.001).show(5)
# Distribution of the target variable
flights.groupBy(when(col("ARRIVAL_DELAY") > 15, 1).otherwise(0).alias("is_delayed")) \
    .count() \
    .withColumn("percentage", col("count") / flights.count() * 100) \
    .show()
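The exploration above triggers several full scans of the CSV (two count() calls plus the sample); caching the DataFrame avoids re-reading it from S3 each time. A small optional sketch:
from pyspark.storagelevel import StorageLevel

# Keep the raw data in memory (spilling to disk if it does not fit) for repeated exploration
flights.persist(StorageLevel.MEMORY_AND_DISK)
flights.count()  # the first action materializes the cache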
Feature Engineering
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml import Pipeline
# 1. Define the target variable
flights = flights.withColumn("DELAYED", when(col("ARRIVAL_DELAY") > 15, 1).otherwise(0))
# 2. Feature selection and transformation
# Time features: CRS_DEP_TIME is the scheduled departure in HHMM format (e.g. 800 = 08:00),
# so the hour is obtained by integer division by 100
flights = flights.withColumn("DEP_HOUR", (col("CRS_DEP_TIME") / 100).cast(IntegerType()))
flights = flights.withColumn("DAY_OF_WEEK", dayofweek("FL_DATE").cast(IntegerType()))
flights = flights.withColumn("MONTH", month("FL_DATE").cast(IntegerType()))
# 3. Index and encode the categorical features
categorical_cols = ["AIRLINE", "ORIGIN_AIRPORT", "DESTINATION_AIRPORT"]
stages = []
for col_name in categorical_cols:
    # String indexing
    indexer = StringIndexer(inputCol=col_name, outputCol=col_name + "_IDX")
    # One-hot encoding
    encoder = OneHotEncoder(inputCol=col_name + "_IDX", outputCol=col_name + "_VEC")
    stages += [indexer, encoder]
# 4. Numeric features
numeric_cols = ["DISTANCE", "DEP_HOUR", "DAY_OF_WEEK", "MONTH"]
# 5. Assemble all features into a single vector
assembler_inputs = [c + "_VEC" for c in categorical_cols] + numeric_cols
assembler = VectorAssembler(inputCols=assembler_inputs, outputCol="features")
stages += [assembler]
# 6. Build the feature-engineering Pipeline
feature_pipeline = Pipeline(stages=stages)
feature_model = feature_pipeline.fit(flights)
processed_data = feature_model.transform(flights)
# Inspect the processed data
processed_data.select("features", "DELAYED").show(5, truncate=False)
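To sanity-check the fitted feature stages, the category-to-index mapping learned by each StringIndexerModel can be inspected through its labels attribute; a short sketch:
from pyspark.ml.feature import StringIndexerModel

# The fitted pipeline stores its fitted stages in order: indexer, encoder, ..., assembler
for stage in feature_model.stages:
    if isinstance(stage, StringIndexerModel):
        # Labels are ordered by frequency, so index 0 is the most common category
        print(stage.getOutputCol(), stage.labels[:5])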
Model Training and Tuning
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator
# 1. Train/test split
train_data, test_data = processed_data.randomSplit([0.8, 0.2], seed=42)
# 2. Initialize the model
rf = RandomForestClassifier(labelCol="DELAYED", featuresCol="features",
                            seed=42, cacheNodeIds=True, maxBins=1000)
# 3. Build the parameter grid
param_grid = (ParamGridBuilder()
              .addGrid(rf.numTrees, [50, 100])    # number of trees
              .addGrid(rf.maxDepth, [5, 10])      # maximum tree depth
              .addGrid(rf.maxBins, [500, 1000])   # bins for continuous features
              .build())
# 4. Create the evaluator
evaluator = BinaryClassificationEvaluator(labelCol="DELAYED", metricName="areaUnderROC")
# 5. Create the cross-validator (distributed tuning)
cv = CrossValidator(estimator=rf,
                    estimatorParamMaps=param_grid,
                    evaluator=evaluator,
                    numFolds=3,
                    parallelism=8)  # number of models evaluated in parallel
# 6. Train the model (distributed training)
cv_model = cv.fit(train_data)
# 7. Get the best model
best_model = cv_model.bestModel
print("Best model parameters:")
print(f"• Number of trees: {best_model.getNumTrees}")
print(f"• Maximum tree depth: {best_model.getMaxDepth()}")
print(f"• Feature subset strategy: {best_model.getFeatureSubsetStrategy()}")
Model Evaluation and Interpretation
# 1. Predict on the test set
predictions = best_model.transform(test_data)
# 2. Evaluate performance
auc = evaluator.evaluate(predictions)
print(f"Test set AUC = {auc:.4f}")
# 3. Feature importance analysis
# featureImportances has one entry per slot of the assembled vector (each one-hot column
# expands into many slots), so recover the slot names from the "features" column metadata
feature_importances = best_model.featureImportances.toArray()
attrs = processed_data.schema["features"].metadata["ml_attr"]["attrs"]
feature_names = [a["name"] for a in sorted(sum(attrs.values(), []), key=lambda a: a["idx"])]
# Build a feature-importance DataFrame (cast to plain floats so Spark can infer the schema)
importance_df = spark.createDataFrame(
    [(name, float(imp)) for name, imp in zip(feature_names, feature_importances)],
    ["feature", "importance"]
).orderBy(col("importance").desc())
# Plot the 20 most important features
plt.figure(figsize=(12, 8))
top_20 = importance_df.toPandas().head(20)
plt.barh(top_20["feature"], top_20["importance"])
plt.xlabel("Feature Importance")
plt.title("Top 20 Important Features")
plt.gca().invert_yaxis()
plt.show()
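AUC alone can look flattering on an imbalanced target, so it is worth supplementing it with threshold-based metrics; this is an addition of mine, not part of the original walkthrough:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Accuracy and weighted precision/recall/F1 on the hard 0/1 predictions
for metric in ["accuracy", "weightedPrecision", "weightedRecall", "f1"]:
    m = MulticlassClassificationEvaluator(labelCol="DELAYED",
                                          predictionCol="prediction",
                                          metricName=metric)
    print(f"{metric}: {m.evaluate(predictions):.4f}")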
Model Deployment and Prediction
# 1. Save the complete Pipeline (feature engineering + model)
from pyspark.ml import PipelineModel
# Assemble a full PipelineModel from the already-fitted feature stages plus the best model.
# Reusing the fitted stages keeps the category-to-index mappings identical to the ones the
# model was trained on (refitting them on different data could silently shuffle the indices).
full_model = PipelineModel(stages=feature_model.stages + [cv_model.bestModel])
# Save the model
full_model.write().overwrite().save("s3a://ml-models/flight_delay_model")
# 2. Load the model for prediction
loaded_model = PipelineModel.load("s3a://ml-models/flight_delay_model")
# 3. Simulate predictions on new data
new_data = spark.createDataFrame([
    ("AA", "JFK", "LAX", "2015-12-24", 800, 2475.0),
    ("DL", "ATL", "SFO", "2015-07-15", 1400, 2139.0)
], ["AIRLINE", "ORIGIN_AIRPORT", "DESTINATION_AIRPORT",
    "FL_DATE", "CRS_DEP_TIME", "DISTANCE"])
# Derive the same time features used in training (CRS_DEP_TIME is in HHMM format)
new_data = new_data.withColumn("DEP_HOUR", (col("CRS_DEP_TIME") / 100).cast(IntegerType()))
new_data = new_data.withColumn("DAY_OF_WEEK", dayofweek("FL_DATE").cast(IntegerType()))
new_data = new_data.withColumn("MONTH", month("FL_DATE").cast(IntegerType()))
# Predict
predictions = loaded_model.transform(new_data)
predictions.select("prediction", "probability").show(truncate=False)
Key Characteristics of Distributed Training
- Data partitioning:
  # Inspect how the data is partitioned across the cluster
  print(f"Number of training partitions: {train_data.rdd.getNumPartitions()}")
  sizes = train_data.rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()
  print(f"Example partition sizes: {sizes[:5]}")
- Parallel computation:
  - Different parameter combinations in cross-validation are trained in parallel (governed by the parallelism setting)
  - The random forest grows its trees through parallel passes over the partitioned data on every executor
  - Feature-engineering steps execute in parallel on each partition
- Memory optimization:
  # Cache intermediate results to speed up repeated passes over the data
  from pyspark.storagelevel import StorageLevel
  train_data.persist(StorageLevel.MEMORY_AND_DISK)
  # The size of the cached data can be checked in the "Storage" tab of the Spark UI
- Fault tolerance:
  - Lost partitions are recomputed automatically from their lineage
  - Checkpointing prevents failures in long lineage chains (see the sketch after this list)
- Dynamic resource allocation:
  # Dynamic allocation is a cluster-level setting: enable it when the SparkSession is created
  # (or via spark-submit); it cannot be switched on at runtime with spark.conf.set
  spark = SparkSession.builder \
      .config("spark.dynamicAllocation.enabled", "true") \
      .config("spark.shuffle.service.enabled", "true") \
      .getOrCreate()
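A minimal sketch of the checkpoint mechanism mentioned in the fault-tolerance item; the checkpoint path is a placeholder assumption:
# Configure a checkpoint directory so long lineage chains can be truncated (path is a placeholder)
spark.sparkContext.setCheckpointDir("s3a://ml-models/checkpoints")

# Tree ensembles can periodically checkpoint the cached node IDs to keep recovery cheap
from pyspark.ml.classification import RandomForestClassifier
rf = RandomForestClassifier(labelCol="DELAYED", featuresCol="features",
                            cacheNodeIds=True, checkpointInterval=10)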
Core Advantages of the Pipeline
- End-to-end encapsulation:
  # A single Pipeline holds every processing step
  stages = [
      StringIndexer(...),
      OneHotEncoder(...),
      VectorAssembler(...),
      RandomForestClassifier(...)
  ]
  pipeline = Pipeline(stages=stages)
- Train/predict consistency:
  # At training time
  model = pipeline.fit(train_data)
  # At prediction time (the same transformations are applied automatically)
  predictions = model.transform(new_data)
- Simplified deployment:
  # Save/load the entire workflow
  model.write().save("path/to/model")
  loaded_model = PipelineModel.load("path/to/model")
- Avoiding data leakage:
  # Feature-engineering statistics are computed on the training set only
  indexer = StringIndexer(inputCol="AIRLINE", outputCol="AIRLINE_IDX", handleInvalid="keep")
  # Inside a Pipeline, unseen categories in the test set are handled automatically
- Unified hyperparameter tuning (this requires the whole Pipeline to be the estimator, as sketched after this list):
  # Optimize feature-processing and model parameters together
  param_grid = ParamGridBuilder() \
      .addGrid(indexer.handleInvalid, ["keep", "skip"]) \
      .addGrid(rf.maxDepth, [5, 10]) \
      .build()
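For the grid above to actually tune indexer.handleInvalid, the estimator handed to the CrossValidator must be the whole Pipeline rather than just the classifier, and it must be fit on the raw (not yet transformed) training split. A sketch reusing the indexer, rf, param_grid, and evaluator names from the snippets above; encoder, assembler, and raw_train_data are stand-in names:
# Cross-validate the whole Pipeline so feature-processing and model params are tuned together
pipeline = Pipeline(stages=[indexer, encoder, assembler, rf])
cv = CrossValidator(estimator=pipeline,            # the Pipeline, not just rf
                    estimatorParamMaps=param_grid,
                    evaluator=evaluator,
                    numFolds=3)
best_pipeline_model = cv.fit(raw_train_data).bestModel  # a fitted PipelineModel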
Performance Optimization Tips
- Partition tuning:
  # Repartition by airline to improve locality for operations grouped by airline
  flights = flights.repartition(100, "AIRLINE")
- Dimensionality reduction:
  from pyspark.ml.feature import PCA
  pca = PCA(k=50, inputCol="features", outputCol="pca_features")
- Sampling strategy (an alternative using class weights is sketched after this list):
  # Downsample the majority class to handle the class imbalance
  delayed = processed_data.filter(col("DELAYED") == 1)
  not_delayed = processed_data.filter(col("DELAYED") == 0).sample(0.2)
  balanced_data = delayed.union(not_delayed)
- Model selection:
  # Try different algorithms
  from pyspark.ml.classification import GBTClassifier, LogisticRegression
  gbt = GBTClassifier(labelCol="DELAYED", featuresCol="features", maxIter=20)
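As an alternative to the downsampling shown in the sampling item, all rows can be kept and weighted by inverse class frequency (RandomForestClassifier accepts weightCol in Spark 3.0+); a sketch in which the weighting formula is my own choice:
# Weight each row inversely to its class frequency instead of discarding data
n_total = processed_data.count()
n_delayed = processed_data.filter(col("DELAYED") == 1).count()
weighted = processed_data.withColumn(
    "class_weight",
    when(col("DELAYED") == 1, n_total / (2.0 * n_delayed))
    .otherwise(n_total / (2.0 * (n_total - n_delayed))))
rf_weighted = RandomForestClassifier(labelCol="DELAYED", featuresCol="features",
                                     weightCol="class_weight")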
Conclusions and Extensions
In this hands-on project we:
- Built an end-to-end machine learning workflow with the Spark ML Pipeline
- Used distributed computation to process a large dataset (5M+ flight records)
- Covered the full cycle of feature engineering, model training, evaluation, and deployment
- Demonstrated the core strengths of distributed training:
  - handling very large datasets
  - parallel training that speeds up model development
  - fault tolerance that protects long-running jobs
Extension directions:
- Real-time prediction: integrate Structured Streaming
  # Score incoming flights as a stream with the saved pipeline; the stream must carry the
  # same derived time features (DEP_HOUR, DAY_OF_WEEK, MONTH) the model was trained on
  streaming_df = spark.readStream.schema(flights.schema).csv("s3a://real-time-flights")
  predictions = loaded_model.transform(streaming_df)
- Model monitoring: track experiments with MLflow (a minimal sketch follows below)
- Feature store: build a feature repository so feature engineering can be reused
- Model explanation: integrate SHAP value analysis
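A minimal sketch of the MLflow idea above, assuming the mlflow package is installed on the driver; the run name and metric key are placeholders:
import mlflow
import mlflow.spark

with mlflow.start_run(run_name="flight_delay_rf"):
    mlflow.log_metric("test_auc", auc)           # AUC computed on the test set earlier
    mlflow.spark.log_model(full_model, "model")  # logs the full PipelineModel as an artifact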