Processing Text Features with Spark CountVectorizer


License: CC 4.0 BY-SA


This post walks through a concrete CountVectorizer example, using a small sample dataset to show how text data is converted into numeric feature vectors.


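About the author: 风雪夜归子 (Allen) is a machine learning algorithm engineer who enjoys digging into cutting-edge Machine Learning techniques, is keenly interested in Deep Learning and Artificial Intelligence, and regularly follows the Kaggle data mining competition platform. Readers interested in data, Machine Learning, and Artificial Intelligence are welcome to join the discussion. Personal 优快云 blog: http://blog.youkuaiyun.com/u013719780?viewmode=contents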

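CountVectorizer converts a collection of tokenized text documents into sparse numeric vectors of token counts (term-frequency vectors). These vectors can then be passed to other algorithms, such as LDA. During fitting, CountVectorizer builds a vocabulary ordered by term frequency across the corpus, with the most frequent terms first, and keeps at most vocabSize terms. The optional parameter minDF sets the minimum number of documents a term must appear in to be included in the vocabulary (a value below 1.0 is interpreted as a fraction of the documents). Here is a concrete example.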
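Before the PySpark code, a plain-Python sketch may help make the vocabulary rule concrete. It is only an illustration of the behaviour described above, not Spark's implementation: the function names fit_vocabulary and transform are hypothetical, and breaking ties alphabetically is an assumption made here so that the result is deterministic.

```python
from collections import Counter


def fit_vocabulary(docs, vocab_size, min_df):
    """Pick the vocabulary: filter by document frequency, then rank by term count."""
    term_counts = Counter()  # total occurrences of each term in the corpus
    doc_counts = Counter()   # number of documents that contain each term
    for tokens in docs:
        term_counts.update(tokens)
        doc_counts.update(set(tokens))
    # A min_df below 1.0 is read as a fraction of the documents.
    threshold = min_df if min_df >= 1.0 else min_df * len(docs)
    kept = [t for t in term_counts if doc_counts[t] >= threshold]
    # Most frequent first; alphabetical tie-break is an assumption of this sketch.
    kept.sort(key=lambda t: (-term_counts[t], t))
    return kept[:vocab_size]


def transform(tokens, vocabulary):
    """Count the vocabulary terms in one document as (size, indices, values)."""
    index = {term: i for i, term in enumerate(vocabulary)}
    counts = Counter(index[t] for t in tokens if t in index)
    indices = sorted(counts)
    return (len(vocabulary), indices, [float(counts[i]) for i in indices])


docs = ["a b c".split(" "), "a b b c a".split(" ")]
vocabulary = fit_vocabulary(docs, vocab_size=3, min_df=2.0)
vectors = [transform(tokens, vocabulary) for tokens in docs]
print(vocabulary)
print(vectors)
```

With the two sample documents, every term appears in both documents, so minDF=2.0 filters nothing out; a term that occurred in only one document would be dropped from the vocabulary.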

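from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizer

# Reuse the active session or start one (replaces the older sqlContext entry point).
spark = SparkSession.builder.getOrCreate()

# Input data: each row is a bag of words with an ID.
df = spark.createDataFrame([
    (0, "a b c".split(" ")),
    (1, "a b b c a".split(" "))
], ["id", "words"])

# Fit a CountVectorizerModel from the corpus: keep at most 3 terms,
# each of which must appear in at least 2 documents.
cv = CountVectorizer(inputCol="words", outputCol="features", vocabSize=3, minDF=2.0)
model = cv.fit(df)
result = model.transform(df)
result.show()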

+---+---------------+--------------------+
| id|          words|            features|
+---+---------------+--------------------+
|  0|      [a, b, c]|(3,[0,1,2],[1.0,1...|
|  1|[a, b, b, c, a]|(3,[0,1,2],[2.0,2...|
+---+---------------+--------------------+
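Each entry in the features column is a sparse vector printed as (size, [indices], [values]): the vocabulary size, the vocabulary positions that occur in the row, and their counts. The sketch below shows one way to read such a triple in plain Python. The helper names to_dense and label_counts are hypothetical, the vocabulary order ["a", "b", "c"] is an assumption (a and b are tied in total count, so their order could differ), and the values for row 1 are written out from the word counts in that row because the printed table truncates them.

```python
def to_dense(size, indices, values):
    """Expand a (size, indices, values) triple into a full list of counts."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


def label_counts(vocabulary, indices, values):
    """Map each stored index back to its vocabulary term."""
    return {vocabulary[i]: v for i, v in zip(indices, values)}


vocabulary = ["a", "b", "c"]            # assumed vocabulary order
row1 = (3, [0, 1, 2], [2.0, 2.0, 1.0])  # counts for [a, b, b, c, a]

dense = to_dense(*row1)
labeled = label_counts(vocabulary, row1[1], row1[2])
print(dense)
print(labeled)

# A hypothetical row containing only "a" and "c" would leave index 1 empty.
print(to_dense(3, [0, 2], [1.0, 1.0]))
```

In PySpark itself, model.vocabulary returns the fitted term list, and result.show(truncate=False) prints the vectors without truncation.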
