sklearn.feature_extraction.text.CountVectorizer


License: CC 4.0 BY-SA

Tags:

#sklearn.feature_extraction.tex #CountVector #python #text-feature-extraction

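This article covers the parameters of sklearn.feature_extraction.text.CountVectorizer, a usage example, and the concept of 2-grams. The class is used for text feature extraction, and the article compares it with other approaches such as TfidfVectorizer. It also explains the token_pattern parameter in detail and shows how 1-grams and 2-grams differ in use.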

1. Parameters

sklearn.feature_extraction.text.CountVectorizer is one of the text feature extraction classes provided by sklearn.feature_extraction.text.

sklearn.feature_extraction.text offers four text feature extraction classes:

CountVectorizer
TfidfVectorizer
TfidfTransformer
HashingVectorizer
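
To make the relationship between the four classes listed above more concrete, here is a minimal sketch. The two-sentence corpus and the n_features=16 setting are made up for illustration. It relies on the documented behaviour that TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer, and that HashingVectorizer is stateless and therefore needs no fitting.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus, used only for illustration.
docs = ["the cat sat on the mat", "the dog sat on the log"]

# Raw term counts: one row per document, one column per vocabulary term.
counts = CountVectorizer().fit_transform(docs)

# TF-IDF in one step, straight from the raw text.
tfidf_direct = TfidfVectorizer().fit_transform(docs)

# TF-IDF in two steps: count first, then reweight the count matrix.
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# Hashing trick: no vocabulary is stored, the number of columns is fixed up front.
hashed = HashingVectorizer(n_features=16).transform(docs)

# The one-step and two-step TF-IDF matrices should agree.
same = np.allclose(tfidf_direct.toarray(), tfidf_two_step.toarray())
print(counts.shape, hashed.shape, same)
```

CountVectorizer and TfidfVectorizer both learn a vocabulary during fitting, while HashingVectorizer trades that vocabulary (and the ability to map columns back to terms) for constant memory use.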

Here are the constructor parameters of this class:

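sklearn.feature_extraction.text.CountVectorizer(
    input='content',          # input type: a file name, a file object, or the text content itself
    encoding='utf-8',         # default encoding
    decode_error='strict',    # how decoding errors are handled, one of {'strict', 'ignore', 'replace'}
    strip_accents=None,       # accent removal, one of {'ascii', 'unicode', None}; 'ascii' is fast but only handles characters with a direct ASCII mapping, 'unicode' handles any character but is slower
    lowercase=True,           # convert text to lowercase
    preprocessor=None,
    tokenizer=None,
    stop_words=None,          # stop words: very frequent but uninformative words such as "a", "the", "an"
    token_pattern=r'(?u)\b\w\w+\b',
    ngram_range=(1, 1),
    analyzer='word',
    max_df=1.0,               # ignore terms whose document frequency is above this threshold
    min_df=1,                 # ignore terms whose document frequency is below this threshold
    max_features=None,        # maximum number of features (vocabulary size)
    vocabulary=None,
    binary=False,
    dtype=numpy.int64)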
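
As a rough illustration of how a few of these parameters change the result, the sketch below fits several vectorizers on a made-up two-document corpus. The corpus and all variable names are assumptions introduced for this example, not part of the original article. It shows the default token_pattern dropping one-character tokens, ngram_range=(1, 2) adding 2-grams, a custom token_pattern that keeps single characters, and min_df=2 filtering out terms that appear in fewer than two documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for illustration.
docs = ["I like a red apple", "a red apple a day"]

# Defaults: text is lowercased and only tokens of 2+ word characters are kept,
# so "I" and "a" never reach the vocabulary.
default_vec = CountVectorizer()
default_counts = default_vec.fit_transform(docs)
default_vocab = sorted(default_vec.vocabulary_)
print(default_vocab)  # ['apple', 'day', 'like', 'red']

# ngram_range=(1, 2): keep single words and also pairs of adjacent tokens.
bigram_vec = CountVectorizer(ngram_range=(1, 2)).fit(docs)
bigram_vocab = sorted(bigram_vec.vocabulary_)

# A looser token_pattern that also keeps one-character tokens.
single_vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b").fit(docs)
single_vocab = sorted(single_vec.vocabulary_)

# min_df=2: keep only terms that occur in at least two documents.
frequent_vec = CountVectorizer(min_df=2).fit(docs)
frequent_vocab = sorted(frequent_vec.vocabulary_)

print(default_counts.toarray())
print(bigram_vocab)
print(single_vocab)
print(frequent_vocab)
```

Columns of the count matrix follow the alphabetical order of the vocabulary, so sorting vocabulary_ gives the column labels without depending on a version-specific helper method.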