Spark ML Official Documentation - ML Pipelines

In Spark MLlib, a Transformer covers both feature transformers and learned models; it implements a transform() method that converts one DataFrame into another. An Estimator represents a learning algorithm; it implements a fit() method that trains a model on a DataFrame. A Pipeline is a sequence of Transformer and Estimator stages that are run in order to transform data and train a model. A PipelineModel is the fitted result of a Pipeline and is used at test time, ensuring that training and test data go through the same processing steps.

Transformer: A Transformer is an abstraction that includes feature transformers and learned models. Technically, a Transformer implements a method transform(), which converts one DataFrame into another, generally by appending one or more columns.
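As an illustration (a minimal sketch, not part of the quoted documentation), Tokenizer is a simple Transformer: calling transform() returns a new DataFrame with an appended words column.

from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(0, "spark ml pipelines")], ["id", "text"])

# transform() returns a new DataFrame with a "words" column appended;
# the input DataFrame itself is not modified.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tokenizer.transform(df).show(truncate=False)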
Estimator: An Estimator abstracts the concept of a learning algorithm or any algorithm that fits or trains on data. Technically, an Estimator implements a method fit(), which accepts a DataFrame and produces a Model, which is a Transformer.
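For example (again a minimal sketch), LogisticRegression is an Estimator: fit() consumes a DataFrame and returns a LogisticRegressionModel, which is itself a Transformer.

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()
train = spark.createDataFrame(
    [(1.0, Vectors.dense([0.0, 1.1])), (0.0, Vectors.dense([2.0, 1.0]))],
    ["label", "features"])

lr = LogisticRegression(maxIter=10)  # an Estimator
model = lr.fit(train)                # fit() produces a LogisticRegressionModel
model.transform(train).show()        # the Model is used as a Transformer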
Pipeline: A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage. For Transformer stages, the transform() method is called on the DataFrame. For Estimator stages, the fit() method is called to produce a Transformer, and that Transformer’s transform() method is called on the DataFrame.
[Figure: a Pipeline at training time, with two Transformer stages (Tokenizer, HashingTF) followed by an Estimator stage (LogisticRegression)]
The first two (Tokenizer and HashingTF) are Transformers (blue), and the third (LogisticRegression) is an Estimator (red). Since LogisticRegression is an Estimator, the Pipeline first calls LogisticRegression.fit() to produce a LogisticRegressionModel. If the Pipeline had more Estimators, it would call the LogisticRegressionModel’s transform() method on the DataFrame before passing the DataFrame to the next stage.
A Pipeline is an Estimator. Thus, after a Pipeline’s fit() method runs, it produces a PipelineModel, which is a Transformer. This PipelineModel is used at test time.
[Figure: the fitted PipelineModel at test time, with the same three stages but LogisticRegressionModel in place of LogisticRegression]
In the figure above, the PipelineModel has the same number of stages as the original Pipeline, but all Estimators in the original Pipeline have become Transformers.
Pipelines and PipelineModels help to ensure that training and test data go through identical feature processing steps.
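The workflow in the two figures can be sketched in PySpark roughly as follows (a minimal, self-contained adaptation of the text-classification pipeline described above):

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

spark = SparkSession.builder.getOrCreate()
# Training documents: (id, text, label).
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0)], ["id", "text", "label"])

# Three stages: two Transformers followed by an Estimator.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# Fitting the Pipeline (an Estimator) yields a PipelineModel (a Transformer).
model = pipeline.fit(training)

# At test time the PipelineModel replays the same feature steps, then predicts.
test = spark.createDataFrame([(4, "spark i j k"), (5, "l m n")], ["id", "text"])
model.transform(test).select("id", "text", "prediction").show()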

Parameters:
A Param is a named parameter with self-contained documentation. A ParamMap is a set of (parameter, value) pairs. There are two main ways to pass parameters to an algorithm:

  1. Set parameters for an instance. E.g., lr.setMaxIter(10)
  2. Pass a ParamMap to fit() or transform(). Any parameters in the ParamMap will override parameters previously specified via setter methods.

Parameters belong to specific instances of Estimators and Transformers. For example, if lr1 and lr2 are two LogisticRegression instances, we can build a single ParamMap that specifies maxIter for both: ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20).
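In PySpark a param map is simply a Python dict keyed by each instance's Param objects; a minimal sketch of the two-instance case above (the instances here are created only for illustration):

from pyspark.ml.classification import LogisticRegression

lr1, lr2 = LogisticRegression(), LogisticRegression()
# Each key is bound to its own instance, so the two maxIter settings do not clash.
paramMap = {lr1.maxIter: 10, lr2.maxIter: 20}

The fuller example below is rewritten to be runnable end to end: it creates a small training DataFrame, prints the Estimator's parameters, fits a model, and builds a param map for a single instance.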

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()
# Minimal training data as (label, features) rows.
training = spark.createDataFrame(
    [(1.0, Vectors.dense([0.0, 1.1, 0.1])),
     (0.0, Vectors.dense([2.0, 1.0, -1.0])),
     (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"])

lr = LogisticRegression(maxIter=10, regParam=0.01)  # an Estimator instance
# Print out the parameters, documentation, and any default values.
print("LogisticRegression parameters:\n" + lr.explainParams() + "\n")

model1 = lr.fit(training)        # model1 is a Transformer produced by the Estimator
print(model1.extractParamMap())  # the (parameter, value) pairs used during fit()

paramMap = {lr.maxIter: 20}
paramMap.update({lr.regParam: 0.1, lr.threshold: 0.55})  # Specify multiple Params.
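Continuing the snippet above, the param map can also be passed straight into fit(); anything in the map overrides values set earlier on lr, which corresponds to the second way of passing parameters listed above.

# Parameters in paramMap override those set via the constructor or setters.
model2 = lr.fit(training, paramMap)
print(model2.extractParamMap())  # maxIter is now 20, regParam 0.1, threshold 0.55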