Spark User-guide Summary - Tuning

License: CC 4.0 BY-SA

Original link: https://segmentfault.com/a/1190000013558587

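This article describes how to tune hyperparameters in Apache Spark with cross-validation and train-validation split. Its examples show how to build a pipeline, set up a parameter grid, and use a binary classification evaluator to select the best parameter combination.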

Summarized from http://spark.apache.org/docs/...

Hyper-parameter Tuning

Spark ML ships two model selection tools: CrossValidator (k-fold cross-validation) and TrainValidationSplit (a single train/validation split).

Example: cross-validation over a pipeline

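from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Pipeline with three stages: tokenizer -> hashingTF -> lr
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])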

paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()
# 3 values for numFeatures x 2 values for regParam = 6 parameter combinations to evaluate

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=2)  # use 3+ folds in practice

# Run cross-validation, and choose the best set of parameters.
# `training` (not shown here) is a DataFrame with "text" and "label" columns.
cvModel = crossval.fit(training)
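
The grid built above is the Cartesian product of the value lists passed to addGrid. The sketch below reproduces that expansion in plain Python so the six combinations can be inspected without a Spark session. `build_grid` is a hypothetical helper written for this illustration, not part of PySpark, and plain dicts with string keys stand in for the param maps that ParamGridBuilder returns.

```python
from itertools import product


def build_grid(**choices):
    """Expand {param name: [values]} into one dict per combination.

    Hypothetical helper for illustration only; not a PySpark API.
    """
    names = list(choices)
    return [dict(zip(names, values)) for values in product(*choices.values())]


grid = build_grid(numFeatures=[10, 100, 1000], regParam=[0.1, 0.01])

for params in grid:
    print(params)

print(len(grid))  # 6
```

Every extra addGrid call multiplies the grid size by the length of its value list, so the number of models to fit grows quickly.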

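CrossValidator splits the data into numFolds folds and uses each fold once as the validation set while training on the rest. With numFolds=2 and the 6 parameter combinations above, that is 12 model fits; the best combination is then refit on the full dataset.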
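
To make the fold mechanics concrete, here is a minimal plain-Python sketch of how row indices could be divided for k-fold validation. It is a simplification, not Spark's implementation: `k_fold_indices` is a hypothetical helper that deals rows out round-robin, whereas Spark assigns rows to folds randomly.

```python
def k_fold_indices(n_rows, num_folds):
    """Yield (train, validation) index lists, one pair per fold.

    Hypothetical helper: rows are dealt out round-robin for readability.
    """
    folds = [list(range(i, n_rows, num_folds)) for i in range(num_folds)]
    for i, validation in enumerate(folds):
        # Training rows are all rows that are not in the current validation fold.
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, validation


splits = list(k_fold_indices(n_rows=6, num_folds=2))

for train, validation in splits:
    print("train:", train, "validation:", validation)
```

Each row lands in a validation set exactly once, which is why the averaged metric uses all of the data for evaluation.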

Example: train-validation split

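from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

lr = LinearRegression(maxIter=10)  # assumption: a regression task, hence RegressionEvaluator
paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01]).build()
tvs = TrainValidationSplit(estimator=lr,
                           estimatorParamMaps=paramGrid,
                           evaluator=RegressionEvaluator(),
                           # 80% of the data will be used for training, 20% for validation.
                           trainRatio=0.8)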
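# Unlike CrossValidator, each parameter combination is evaluated only once, on a single
# trainRatio split (there is no numFolds): cheaper, but less reliable on small datasets.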
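
The cost difference between the two tools can be sketched in plain Python. The block below is an illustration under stated assumptions, not Spark's implementation: `train_validation_split` is a hypothetical helper that shuffles and cuts a list at trainRatio, and the fit counts assume a 6-combination grid and numFolds=2, excluding the final refit of the best model. Spark's own split is random, so the exact 8/2 sizes are a property of this sketch only.

```python
import random


def train_validation_split(rows, train_ratio, seed=0):
    """Shuffle rows and cut them once at train_ratio (hypothetical helper)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(10))
train, validation = train_validation_split(rows, train_ratio=0.8)
print(len(train), len(validation))  # 8 2

# Model fits needed to score every parameter combination.
n_param_maps = 6   # assumed grid size: 3 x 2
num_folds = 2      # assumed fold count
cv_fits = n_param_maps * num_folds   # cross-validation: one fit per fold
tvs_fits = n_param_maps * 1          # train-validation split: one fit each
print(cv_fits, tvs_fits)  # 12 6
```

With a single split, the metric depends on which rows happen to land in the validation set, which is the reliability cost paid for the lower fit count.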