sklearn Spectral Clustering and a First Look at Text Mining (Part 2)

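This post shows how to do text mining with sklearn. TfidfVectorizer processes the text and removes stop words; SpectralCoclustering then performs biclustering, with a look at how the choice of SVD method affects accuracy; finally MiniBatchKMeans clusters the same data and the V-measure scores of the two are compared.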


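The tf-idf concept starts from term frequency (TF): how often a term occurs in a document.
It is combined with inverse document frequency (IDF), a factor that plays a role loosely
comparable to guarding against overfitting. "Overfitting" here refers to words that occur
very often but carry little meaning, much like the stop words mentioned earlier;
IDF reduces the relative weight of such words.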

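tf-idf is defined as the product of term frequency and inverse document frequency.
"Product" is meant in a broad sense: different choices for the TF and IDF terms
give a whole family of tf-idf statistics.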
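To make the product concrete, here is a minimal sketch using the textbook variant, raw counts for TF and log(N / df) for IDF. The toy corpus and the helper names `tf`, `idf` and `tfidf` are made up for illustration. sklearn's TfidfVectorizer by default uses a smoothed IDF and normalizes each document vector, so its numbers will differ from these.

```python
import math
from collections import Counter

# Toy corpus, already tokenized (hypothetical data for illustration).
docs = [
    ["graphics", "card", "driver"],
    ["graphics", "image", "image"],
    ["hockey", "game", "score"],
]


def tf(term, doc):
    """Raw term frequency: how often `term` occurs in `doc`."""
    return Counter(doc)[term]


def idf(term, corpus):
    """Textbook IDF, log(N / df); assumes `term` occurs in the corpus."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)


def tfidf(term, doc, corpus):
    """One member of the tf-idf family: raw count times log(N / df)."""
    return tf(term, doc) * idf(term, corpus)


# "graphics" occurs in 2 of 3 documents, "image" in only 1 of 3.
score_common = tfidf("graphics", docs[1], docs)
score_rare = tfidf("image", docs[1], docs)
print(score_common, score_rare)
```

The term that is spread across more documents gets the lower weight, which is the down-weighting effect described above.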

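A typical tf-idf pipeline first removes stop words.
sklearn.feature_extraction.text.TfidfVectorizer
has a stop_words argument for this (note: no trailing underscore). It accepts the string
'english' for the built-in English list, or a custom list of words to filter out.
The parameters used in this post are explained below: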

TfidfVectorizer(stop_words='english', min_df=5,
tokenizer=number_aware_tokenizer)

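stop_words='english' filters the text with the English stop word list. min_df sets a
threshold: terms whose document frequency falls below it are dropped before weighting.
An int is an absolute number of documents, a float is a proportion of documents.
(Its counterpart max_df drops terms whose document frequency is above the threshold.)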
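The effect of these two parameters can be checked on a tiny corpus. This is only a sketch: the three sentences are invented, and the default tokenizer is used rather than number_aware_tokenizer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; "the" is in scikit-learn's English stop word list.
toy_docs = [
    "the cat sat",
    "the cat ran",
    "the dog barked",
]

# Stop word removal only.
loose = TfidfVectorizer(stop_words="english")
loose.fit(toy_docs)

# Additionally drop every term that occurs in fewer than 2 documents.
strict = TfidfVectorizer(stop_words="english", min_df=2)
strict.fit(toy_docs)

loose_vocab = sorted(loose.vocabulary_)
strict_vocab = sorted(strict.vocabulary_)
print(loose_vocab)
print(strict_vocab)
```

With min_df=2 only "cat" survives, because it is the only non-stop word that appears in at least two documents. On a real corpus, min_df=5 trims rare tokens and typos in the same way.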

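The example below takes an unsupervised approach: it runs a biclustering algorithm on a
dataset whose class labels are already known, to see how accurate biclustering is for text mining.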
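Accuracy in the script is measured with V-measure, which compares cluster labels against the known classes. A small sketch of how the score behaves, using invented label vectors:

```python
from sklearn.metrics.cluster import v_measure_score

# Hypothetical ground-truth classes and two clusterings of six documents.
y_true = [0, 0, 0, 1, 1, 1]
y_perfect = [1, 1, 1, 0, 0, 0]  # same grouping, cluster ids swapped
y_merged = [0, 0, 0, 0, 0, 0]   # everything lumped into one cluster

score_perfect = v_measure_score(y_true, y_perfect)
score_merged = v_measure_score(y_true, y_merged)
print(score_perfect, score_merged)
```

The score does not depend on which integer a cluster happens to get, only on the grouping, so a clustering algorithm can be scored directly against class labels. A perfect grouping scores 1, and putting everything into one cluster scores 0.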

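A few details are worth noting when using the biclustering class.

sklearn.cluster.bicluster.SpectralCoclustering
(importable as sklearn.cluster.SpectralCoclustering in current scikit-learn releases)
has an svd_method argument. The default, 'randomized', uses a randomized method and may lose
some accuracy. Switching to 'arpack' can improve accuracy, but the computation is slower.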
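One way to compare the two settings is to run both on synthetic data with planted biclusters. The sketch below assumes a 200 x 150 matrix with 4 biclusters and a noise level of 5; these values are arbitrary choices for illustration, not taken from the newsgroups experiment.

```python
from sklearn.cluster import SpectralCoclustering
from sklearn.datasets import make_biclusters
from sklearn.metrics import consensus_score

# Synthetic matrix with 4 planted biclusters (assumed size and noise level).
data, rows, columns = make_biclusters(
    shape=(200, 150), n_clusters=4, noise=5, shuffle=True, random_state=0
)

scores = {}
for method in ("randomized", "arpack"):
    model = SpectralCoclustering(
        n_clusters=4, svd_method=method, random_state=0
    )
    model.fit(data)
    # consensus_score compares the found biclusters with the planted ones;
    # 1.0 means a perfect match.
    scores[method] = consensus_score(model.biclusters_, (rows, columns))

print(scores)
```

On easy synthetic data both methods may well score the same; a difference is more likely to show up on large, sparse matrices such as a tf-idf matrix.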

Python sets the number of formatted digits with a format spec: {:.2f} formats a float with two decimal places. Note that the : cannot be omitted.
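A short illustration of the format spec, with made-up values for the elapsed time and the score:

```python
elapsed = 12.34567  # hypothetical timing in seconds
score = 0.5         # hypothetical V-measure

# ":" separates the field from its format spec; ".2f" means fixed-point
# notation with two digits after the decimal point.
line = "Done in {:.2f}s. V-measure: {:.4f}".format(elapsed, score)
print(line)  # Done in 12.35s. V-measure: 0.5000

# The same spec works inside an f-string.
same_line = f"Done in {elapsed:.2f}s. V-measure: {score:.4f}"
```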

The example code follows:

import re
from time import time

# Public import paths. Older releases also exposed these names under
# sklearn.cluster.bicluster and sklearn.datasets.twenty_newsgroups.
from sklearn.cluster import MiniBatchKMeans, SpectralCoclustering
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.cluster import v_measure_score


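def number_aware_tokenizer(doc):
    """Tokenize `doc`, mapping number-like tokens to one placeholder.

    Tokens that begin with a digit or an underscore are replaced with
    "#NUMBER", which keeps the vocabulary smaller.
    """
    # Same pattern as TfidfVectorizer's default: words of 2+ characters.
    token_pattern = re.compile(u"(?u)\\b\\w\\w+\\b")
    tokens = token_pattern.findall(doc)
    tokens = ["#NUMBER" if token[0] in "0123456789_" else token
              for token in tokens]
    return tokens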




categories = ['alt.atheism', 'comp.graphics',
              'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
              'comp.windows.x', 'misc.forsale', 'rec.autos',
              'rec.motorcycles', 'rec.sport.baseball',
              'rec.sport.hockey', 'sci.crypt', 'sci.electronics',
              'sci.med', 'sci.space', 'soc.religion.christian',
              'talk.politics.guns', 'talk.politics.mideast',
              'talk.politics.misc', 'talk.religion.misc']


newsgroups = fetch_20newsgroups(categories=categories)
y_true = newsgroups.target


vectorizer = TfidfVectorizer(stop_words="english", min_df=5,
                             tokenizer=number_aware_tokenizer)
cocluster = SpectralCoclustering(n_clusters=len(categories),
                                 svd_method="arpack", random_state=0)
kmeans = MiniBatchKMeans(n_clusters=len(categories), batch_size=20000,
                         random_state=0)


print("vectorizing...")
X = vectorizer.fit_transform(newsgroups.data)


print("Coclustering...")
start_time = time()
cocluster.fit(X)
y_cocluster = cocluster.row_labels_
print("Done in {:.2f}s. V-measure: {:.4f}".format(
    time() - start_time, v_measure_score(y_cocluster, y_true)))


print("MiniBatchKMeans...")
start_time = time()
y_kmeans = kmeans.fit_predict(X)
print("Done in {:.2f}s. V-measure: {:.4f}".format(
    time() - start_time, v_measure_score(y_kmeans, y_true)))
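
To see what number_aware_tokenizer from the script does to a piece of text, it can be run on its own. The function is repeated here so the snippet is self-contained; the sample sentence is made up.

```python
import re


def number_aware_tokenizer(doc):
    """Tokenize `doc`, mapping number-like tokens to "#NUMBER"."""
    token_pattern = re.compile(r"(?u)\b\w\w+\b")
    tokens = token_pattern.findall(doc)
    return ["#NUMBER" if token[0] in "0123456789_" else token
            for token in tokens]


# Hypothetical sample text, in the style of a for-sale posting.
sample_tokens = number_aware_tokenizer("Sold 2 drives for 150 dollars in 1995")
print(sample_tokens)
# ['Sold', 'drives', 'for', '#NUMBER', 'dollars', 'in', '#NUMBER']
```

The single-character token "2" is dropped because the pattern requires at least two word characters, while "150" and "1995" collapse into the same "#NUMBER" feature instead of each taking its own column in the tf-idf matrix.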