Spark User-guide Summary - Tuning

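This post describes how to tune hyper-parameters in Apache Spark using cross-validation and train-validation split. Worked examples show how to build a pipeline, set up a parameter grid, and use a binary classification evaluator to select the best parameter combination.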

Summarized from http://spark.apache.org/docs/...

Hyper-parameter Tuning

Spark ML provides two model selection tools: CrossValidator (cross-validation) and TrainValidationSplit (train-validation split).

Example: cross-validation over a pipeline

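from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Three-stage pipeline: tokenize the text, hash the words into feature vectors,
# then fit a logistic regression on those features.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()
# 3 values of numFeatures x 2 values of regParam = 6 parameter combinations to evaluate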

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=2)  # use 3+ folds in practice

# Run cross-validation and choose the best set of parameters.
# `training` is assumed to be a DataFrame with "text" and "label" columns.
cvModel = crossval.fit(training)
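
To make the "3*2=6" count concrete, the parameter grid can be thought of as the Cartesian product of the value lists. The following is a plain-Python sketch of that idea only; `build_param_grid` is a hypothetical helper written for illustration, not part of Spark, and the real ParamGridBuilder works with Param objects rather than string names.

```python
from itertools import product


def build_param_grid(grid):
    """Expand {param_name: [values]} into a list of {param_name: value} dicts.

    Hypothetical helper: it only mimics the idea of a parameter grid.
    """
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# Same candidate values as in the example above.
param_maps = build_param_grid({
    "numFeatures": [10, 100, 1000],
    "regParam": [0.1, 0.01],
})

for param_map in param_maps:
    print(param_map)
```

Each printed dict corresponds to one candidate setting that the search has to train and evaluate.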

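Cross-validation splits the data into numFolds folds; each fold serves once as the validation set while the remaining folds form the training set. With numFolds=2 and the 6 parameter combinations above, 12 models are trained, after which the best combination is refit on the full dataset.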
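
The fold mechanics can be sketched in plain Python. This is an illustration under simplifying assumptions: `k_fold_splits` is a hypothetical helper that assigns rows to folds by index, whereas Spark assigns rows to folds randomly.

```python
def k_fold_splits(n_rows, num_folds):
    """Yield (train_indices, validation_indices), one pair per fold.

    Hypothetical helper: rows are assigned to folds round-robin by index
    so the result is easy to read; this is not Spark's splitting logic.
    """
    indices = list(range(n_rows))
    for fold in range(num_folds):
        validation = indices[fold::num_folds]
        held_out = set(validation)
        train = [i for i in indices if i not in held_out]
        yield train, validation


splits = list(k_fold_splits(n_rows=6, num_folds=2))
for train, validation in splits:
    print("train:", train, "validation:", validation)
```

Every row lands in the validation set exactly once, which is why the metric averaged over folds uses all of the data.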

Example: train-validation split

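from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import TrainValidationSplit

# This snippet uses RegressionEvaluator, so lr must be a regression estimator
# (the Spark guide uses LinearRegression) with paramGrid rebuilt over its own parameters.
tvs = TrainValidationSplit(estimator=lr,
                           estimatorParamMaps=paramGrid,
                           evaluator=RegressionEvaluator(),
                           # 80% of the data will be used for training, 20% for validation.
                           trainRatio=0.8)
# Unlike CrossValidator, each parameter combination is evaluated only once, on a single
# split set by trainRatio: cheaper, but less reliable when the dataset is small.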
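
As a rough illustration of the trade-off, the sketch below splits a dataset once by a ratio and compares how many models each strategy trains during the search. Both helpers are hypothetical and written in plain Python; they are not Spark APIs, and the fit counts exclude the final refit of the best parameters on the full dataset.

```python
import random


def train_validation_split(rows, train_ratio, seed=0):
    """Shuffle rows and split them once into (train, validation)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def fits_needed(n_param_maps, num_folds=1):
    """Models trained during the search (final refit not counted)."""
    return n_param_maps * num_folds


train, validation = train_validation_split(range(10), train_ratio=0.8)
print(len(train), "training rows,", len(validation), "validation rows")

# A 6-combination grid: single split versus 2-fold cross-validation.
print("train-validation split fits:", fits_needed(6))
print("2-fold cross-validation fits:", fits_needed(6, num_folds=2))
```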