PySpark MLlib Feature Processing in Detail
PySpark MLlib provides a rich set of feature processing tools for feature extraction, transformation, and selection. Below is a brief overview of the commonly used feature processing classes in PySpark MLlib.
1. Binarizer
Binarizer is a transformer that binarizes continuous features against a threshold: values greater than the threshold become 1.0, and values less than or equal to it become 0.0.
from pyspark.ml.feature import Binarizer
binarizer = Binarizer(threshold=0.5, inputCol="feature", outputCol="binarized_feature")
binarizedData = binarizer.transform(data)
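All of the snippets in this article assume an existing DataFrame named data and an active SparkSession; a minimal runnable version of this first example might look like the following (the column values are illustrative):
from pyspark.sql import SparkSession
from pyspark.ml.feature import Binarizer
spark = SparkSession.builder.getOrCreate()
data = spark.createDataFrame([(0.1,), (0.8,), (0.5,)], ["feature"])
binarizer = Binarizer(threshold=0.5, inputCol="feature", outputCol="binarized_feature")
binarizer.transform(data).show()  # 0.1 -> 0.0, 0.5 -> 0.0, 0.8 -> 1.0 (only values strictly above the threshold become 1.0)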
2. BucketedRandomProjectionLSH
BucketedRandomProjectionLSH is a locality-sensitive hashing (LSH) class for the Euclidean distance metric.
from pyspark.ml.feature import BucketedRandomProjectionLSH
brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes", bucketLength=2.0)
model = brp.fit(data)
transformedData = model.transform(data)
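Beyond transform, the fitted model supports approximate similarity search; a sketch, where key, dataA, and dataB are illustrative (dataA and dataB stand for DataFrames with a "features" column):
from pyspark.ml.linalg import Vectors
key = Vectors.dense([1.0, 2.0])
model.approxNearestNeighbors(data, key, numNearestNeighbors=2).show()
model.approxSimilarityJoin(dataA, dataB, threshold=1.5, distCol="EuclideanDistance").show()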
3. Bucketizer
Bucketizer maps a column of continuous features to feature buckets defined by user-supplied split points.
from pyspark.ml.feature import Bucketizer
splits = [-float("inf"), 0.0, float("inf")]
bucketizer = Bucketizer(splits=splits, inputCol="feature", outputCol="bucketed_feature")
bucketedData = bucketizer.transform(data)
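The splits must be strictly increasing and cover the full value range; n+1 boundary points define n buckets with indices 0 through n-1. For example (boundary values are illustrative):
splits = [-float("inf"), -0.5, 0.0, 0.5, float("inf")]  # 4 buckets, indices 0-3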
4. ChiSqSelector
ChiSqSelector is a chi-squared feature selector that picks the categorical features most predictive of a categorical label. (Note that recent Spark releases deprecate it in favor of UnivariateFeatureSelector.)
from pyspark.ml.feature import ChiSqSelector
selector = ChiSqSelector(numTopFeatures=50, featuresCol="features", labelCol="label", outputCol="selected_features")
result = selector.fit(data).transform(data)
5. CountVectorizer
CountVectorizer extracts a vocabulary from a collection of documents and produces a CountVectorizerModel, which converts arrays of tokens into sparse term-count vectors.
from pyspark.ml.feature import CountVectorizer
cv = CountVectorizer(inputCol="text", outputCol="features", vocabSize=10000, minDF=5)
model = cv.fit(data)
vectorizedData = model.transform(data)
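The fitted model exposes the learned vocabulary, which is handy for mapping vector indices back to terms:
print(model.vocabulary)  # terms ordered by corpus-wide frequency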
6. DCT
DCT is a feature transformer that applies the one-dimensional discrete cosine transform to a real-valued vector; setting inverse=True applies the inverse transform instead.
from pyspark.ml.feature import DCT
dct = DCT(inverse=False, inputCol="features", outputCol="dct_features")
dctData = dct.transform(data)
7. ElementwiseProduct
ElementwiseProduct multiplies each input vector by a provided "weight" vector using the Hadamard (element-wise) product.
from pyspark.ml.feature import ElementwiseProduct
from pyspark.ml.linalg import Vectors
scalingVec = Vectors.dense([0.0, 1.0, 2.0])
transformer = ElementwiseProduct(scalingVec=scalingVec, inputCol="features", outputCol="scaled_features")
scaledData = transformer.transform(data)
8. FeatureHasher
FeatureHasher projects a set of categorical or numerical features into a feature vector of a specified dimension using the hashing trick.
from pyspark.ml.feature import FeatureHasher
hasher = FeatureHasher(inputCols=["cat1", "cat2", "num1"], outputCol="features")
hashedData = hasher.transform(data)
9. HashingTF
HashingTF maps a sequence of terms to term frequencies using the hashing trick; its input column must contain arrays of tokens (for example, the output of Tokenizer).
from pyspark.ml.feature import HashingTF
hashingTF = HashingTF(inputCol="words", outputCol="raw_features", numFeatures=10000)
tfData = hashingTF.transform(data)
10. IDF
IDF computes the inverse document frequency over a collection of documents and uses it to rescale term-frequency vectors, typically those produced by HashingTF or CountVectorizer (note that tfData above already uses the matching "raw_features" column).
from pyspark.ml.feature import IDF
idf = IDF(inputCol="raw_features", outputCol="features", minDocFreq=5)
model = idf.fit(tfData)
tfidfData = model.transform(tfData)
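Putting the two stages together end to end; a minimal sketch, assuming a DataFrame named sentences with a string column "sentence":
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="raw_features", numFeatures=10000)
idf = IDF(inputCol="raw_features", outputCol="features")
pipeline = Pipeline(stages=[tokenizer, hashingTF, idf])
tfidfData = pipeline.fit(sentences).transform(sentences)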
11. Imputer
Imputer fills in missing values in a column using the column's mean, median, or mode.
from pyspark.ml.feature import Imputer
imputer = Imputer(inputCols=["feature1", "feature2"], outputCols=["imputed_feature1", "imputed_feature2"])
model = imputer.fit(data)
imputedData = model.transform(data)
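The default strategy is the mean; "median" is also supported, and newer Spark releases (3.1+) add "mode":
imputer.setStrategy("median")  # equivalent to passing strategy="median" to the constructor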
12. IndexToString
IndexToString maps a column of label indices back to a column of the corresponding original string values; it is typically used as the inverse of StringIndexer.
from pyspark.ml.feature import IndexToString
converter = IndexToString(inputCol="index", outputCol="string", labels=["a", "b", "c"])
convertedData = converter.transform(data)
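In practice the labels usually come from a fitted StringIndexerModel rather than being hard-coded; a sketch, assuming a string column named "category":
from pyspark.ml.feature import StringIndexer
indexerModel = StringIndexer(inputCol="category", outputCol="index").fit(data)
converter = IndexToString(inputCol="index", outputCol="string", labels=indexerModel.labels)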
13. Interaction
Interaction implements a feature interaction transform: it takes vector or double-valued columns and outputs a single vector containing the product of every combination of one value from each input column.
from pyspark.ml.feature import Interaction
interaction = Interaction(inputCols=["col1", "col2"], outputCol="interacted_col")
interactedData = interaction.transform(data)
14. MaxAbsScaler
MaxAbsScaler rescales each feature individually to the range [-1, 1] by dividing by that feature's maximum absolute value; it does not shift or center the data.
from pyspark.ml.feature import MaxAbsScaler
scaler = MaxAbsScaler(inputCol="features", outputCol="scaled_features")
model = scaler.fit(data)
scaledData = model.transform(data)
15. MinHashLSH
MinHashLSH is an LSH class for the Jaccard distance; input vectors are treated as binary sets, where non-zero entries mark set membership.
from pyspark.ml.feature import MinHashLSH
mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=3)
model = mh.fit(data)
transformedData = model.transform(data)
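As with BucketedRandomProjectionLSH, the fitted model supports approximate joins, here by Jaccard distance (dataA and dataB are illustrative DataFrames of sparse binary vectors):
model.approxSimilarityJoin(dataA, dataB, threshold=0.6, distCol="JaccardDistance").show()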
16. MinMaxScaler
MinMaxScaler linearly rescales each feature individually to the range [min, max] (default [0, 1]) using column summary statistics; this is also known as min-max normalization or rescaling.
from pyspark.ml.feature import MinMaxScaler
scaler = MinMaxScaler(inputCol="features", outputCol="scaled_features")
model = scaler.fit(data)
scaledData = model.transform(data)
17. NGram
NGram is a feature transformer that converts an input array of strings into an array of n-grams.
from pyspark.ml.feature import NGram
ngram = NGram(n=2, inputCol="words", outputCol="ngrams")
ngramData = ngram.transform(data)
18. Normalizer
Normalizer normalizes each vector to unit norm using the given p-norm (p=1.0 here; the default is p=2.0, the Euclidean norm).
from pyspark.ml.feature import Normalizer
normalizer = Normalizer(p=1.0, inputCol="features", outputCol="norm_features")
normData = normalizer.transform(data)
19. OneHotEncoder
OneHotEncoder maps a column of categorical indices to a column of binary vectors with at most one active entry per row. In Spark 3.x it is an estimator, so it must be fit before transforming, as sketched below.
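A sketch completing the entry in the style of the others (the column names are illustrative):
from pyspark.ml.feature import OneHotEncoder
encoder = OneHotEncoder(inputCols=["category_index"], outputCols=["category_vec"])
model = encoder.fit(data)
encodedData = model.transform(data)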