Labeled point
A labeled point is a local vector, either dense or sparse, associated with a label/response. In MLlib, labeled points are used in supervised learning algorithms. The label is stored as a double, so labeled points can be used in both regression and classification. For binary classification, a label should be either 0 (negative) or 1 (positive). For multiclass classification, labels should be class indices starting from zero: 0, 1, 2, ….
Python
A labeled point is represented by LabeledPoint.
Refer to the LabeledPoint Python docs for more details on the API.
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint
# Create a labeled point with a positive label and a dense feature vector.
pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])
# Create a labeled point with a negative label and a sparse feature vector.
neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))
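The sparse vector above is given as a size, a list of indices, and a list of values. As a rough illustration of what that triple encodes, the following plain-Python sketch expands it into a dense list. The helper name sparse_to_dense is hypothetical and is not part of the MLlib API; it only mirrors the (size, indices, values) layout used in the example.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) triple into a dense list of floats.

    Hypothetical helper for illustration only; not an MLlib function.
    """
    dense = [0.0] * size              # every unlisted position is zero
    for i, v in zip(indices, values):
        dense[i] = v                  # indices are zero-based here
    return dense


# The sparse features of `neg` hold the same values as the dense features of `pos`.
print(sparse_to_dense(3, [0, 2], [1.0, 3.0]))  # prints [1.0, 0.0, 3.0]
```

Storing only the non-zero entries is what makes the sparse form worthwhile when most features are zero.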
Sparse data
Sparse training data is very common in practice. MLlib supports reading training examples stored in LIBSVM format, which is the default format used by LIBSVM and LIBLINEAR. It is a text format in which each line represents a labeled sparse feature vector using the following format:
label index1:value1 index2:value2 ...
where the indices are one-based and in ascending order. After loading, the feature indices are converted to zero-based.
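To make the index conversion concrete, here is a minimal plain-Python sketch that parses a single line of this format and shifts the one-based indices to zero-based. It is not MLlib's implementation; the function name parse_libsvm_line is hypothetical, and the sketch assumes a well-formed line with whitespace-separated index:value pairs.

```python
def parse_libsvm_line(line):
    """Parse 'label index1:value1 index2:value2 ...' into (label, indices, values).

    Hypothetical helper for illustration only; assumes a well-formed line.
    """
    parts = line.split()
    label = float(parts[0])               # the label is stored as a double
    indices, values = [], []
    for item in parts[1:]:
        idx, val = item.split(":")
        indices.append(int(idx) - 1)      # one-based in the file, zero-based after loading
        values.append(float(val))
    return label, indices, values


label, indices, values = parse_libsvm_line("1 1:1.0 3:3.0")
print(label, indices, values)  # prints 1.0 [0, 2] [1.0, 3.0]
```

The resulting indices and values have the same shape as the arguments passed to SparseVector in the labeled point example, once a vector size is chosen.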
Python
MLUtils.loadLibSVMFile reads training examples stored in LIBSVM format.
Refer to the MLUtils Python docs for more details on the API.
from pyspark.mllib.util import MLUtils
# Assumes `sc` is an existing SparkContext, for example the one provided by the pyspark shell.
examples = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
Content from http://spark.apache.org/docs/latest/mllib-data-types.html



