Spark Pipeline: Notes on How It Works


Last modified 2024-06-04 15:11:23


License: CC 4.0 BY-SA

Tags: #spark #workflow

First published 2017-03-24 13:29:11

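This article introduces the concepts behind Spark Pipelines: DataFrames, Transformers, and Estimators. A DataFrame is the machine learning dataset; a Transformer, such as a fitted model, converts one DataFrame into another; an Estimator is fit on a DataFrame to produce a model. A Pipeline chains multiple Transformers and Estimators into a workflow. The article also covers the properties and parameters of Pipeline components and how Pipelines work.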

Concepts

MLlib standardizes APIs for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline, or workflow. This section covers the key concepts introduced by the Pipelines API, where the pipeline concept is mostly inspired by the scikit-learn project.

• DataFrame: The ML API uses DataFrame from Spark SQL as an ML dataset, which can hold a variety of data types. E.g., a DataFrame could have different columns storing text, feature vectors, true labels, and predictions.
• Transformer: A Transformer is an algorithm which can transform one DataFrame into another DataFrame. E.g., an ML model is a Transformer which transforms a DataFrame with features into a DataFrame with predictions.
• Estimator: An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a DataFrame and produces a model.
• Pipeline: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.
• Parameter: All Transformers and Estimators share a common API for specifying parameters.
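
To make the contract between these concepts concrete, here is a minimal sketch in plain Python. It is not Spark's implementation and uses no Spark APIs: a "DataFrame" is modelled as a list of dicts, and `Tokenizer`, `KeywordClassifier`, and `KeywordModel` are hypothetical toy stages invented for illustration. The only thing it aims to show is the shape of the idea: an Estimator's `fit` returns a Transformer, and a Pipeline is itself an Estimator whose `fit` returns a fitted pipeline that is a Transformer.

```python
from typing import Dict, List

# Stand-in for a DataFrame: a list of rows, each row a dict keyed by column name.
Rows = List[Dict[str, object]]


class Transformer:
    """Turns one dataset into another, typically by appending columns."""
    def transform(self, df: Rows) -> Rows:
        raise NotImplementedError


class Estimator:
    """Learns from a dataset and returns a Transformer (the fitted model)."""
    def fit(self, df: Rows) -> Transformer:
        raise NotImplementedError


class Tokenizer(Transformer):
    """Hypothetical stage: adds a 'words' column split from 'text'."""
    def transform(self, df: Rows) -> Rows:
        return [{**row, "words": str(row["text"]).lower().split()} for row in df]


class KeywordModel(Transformer):
    """Hypothetical fitted model: predicts 1.0 if any learned keyword appears."""
    def __init__(self, keywords: set):
        self.keywords = keywords

    def transform(self, df: Rows) -> Rows:
        return [
            {**row, "prediction": 1.0 if self.keywords & set(row["words"]) else 0.0}
            for row in df
        ]


class KeywordClassifier(Estimator):
    """Hypothetical learner: keeps words seen only in positively labelled rows."""
    def fit(self, df: Rows) -> Transformer:
        pos, neg = set(), set()
        for row in df:
            (pos if row["label"] == 1.0 else neg).update(row["words"])
        return KeywordModel(pos - neg)


class PipelineModel(Transformer):
    """A fitted pipeline: every stage is a Transformer, applied in order."""
    def __init__(self, stages: List[Transformer]):
        self.stages = stages

    def transform(self, df: Rows) -> Rows:
        for stage in self.stages:
            df = stage.transform(df)
        return df


class Pipeline(Estimator):
    """Chains stages; fitting replaces each Estimator with the model it produces."""
    def __init__(self, stages: list):
        self.stages = stages

    def fit(self, df: Rows) -> PipelineModel:
        fitted = []
        for stage in self.stages:
            if isinstance(stage, Estimator):
                stage = stage.fit(df)   # Estimator -> Transformer
            df = stage.transform(df)    # feed the output to the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


training = [
    {"text": "spark is fast", "label": 1.0},
    {"text": "hadoop mapreduce is slow", "label": 0.0},
]
model = Pipeline([Tokenizer(), KeywordClassifier()]).fit(training)
predictions = model.transform([{"text": "I like Spark"}, {"text": "plain mapreduce"}])
```

Modelling the fitted pipeline as a Transformer is what lets the same chain of stages be reused unchanged at prediction time; the toy classifier stands in for any learning algorithm.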

DataFrame

Machine learning can be applied to a wide variety of data types, such as vectors, text, images, and structured data. This API adopts the DataFrame from Spark SQL in order to support a variety of data types.
DataFrame supports many basic and structured types; see the Spark SQL datatype reference for a list of supported types. In addition to the types listed in the Spark SQL guide, DataFrame can use ML Vector types.
A DataFrame can be created either implicitly or explicitly from a regular RDD. See the code examples below and the Spark SQL programming guide for examples.
Columns in a DataFrame are named.