# coding=utf-8
# Use sklearn's TfidfVectorizer (from sklearn.feature_extraction.text import
# TfidfVectorizer) to extract a vocabulary from the documents and then, using
# TF-IDF weighting over that vocabulary, represent each document as a vector
# in the shared vocabulary space.
'''
Meaning of min_df:
min_df is used for removing terms that appear too infrequently. For example:
  * min_df = 0.01 means "ignore terms that appear in less than 1% of the documents".
  * min_df = 5 means "ignore terms that appear in less than 5 documents".
The default min_df is 1, which means "ignore terms that appear in less than
1 document". Thus, the default setting does not ignore any terms.

Meaning of max_df:
max_df is used for removing terms that appear too frequently, also known as
"corpus-specific stop words". For example:
  * max_df = 0.50 means "ignore terms that appear in more than 50% of the documents".
  * max_df = 25 means "ignore terms that appear in more than 25 documents".
The default max_df is 1.0, which means "ignore terms that appear in more than
100% of the documents". Thus, the default setting does not ignore any terms.
'''
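# A minimal standalone sketch of the min_df / max_df behaviour described above,
# using a small hypothetical English corpus (the names and data here are
# illustrative, not part of the original script):
from sklearn.feature_extraction.text import CountVectorizer

_demo_docs = ['apple banana', 'apple cherry', 'apple banana cherry', 'durian']

# min_df=2 drops terms that appear in fewer than 2 documents ('durian').
cv_min = CountVectorizer(min_df=2).fit(_demo_docs)
print(sorted(cv_min.vocabulary_))  # ['apple', 'banana', 'cherry']

# max_df=0.5 drops terms that appear in more than 50% of documents ('apple').
cv_max = CountVectorizer(max_df=0.5).fit(_demo_docs)
print(sorted(cv_max.vocabulary_))  # ['banana', 'cherry', 'durian']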
mydoclist = [u'温馨 提示 : 家庭 畅享 套餐 介绍 、 主卡 添加 / 取消 副 卡 短信 办理 方式 , 可 点击 文档 左上方 短信 图标 即可 将 短信 指令 发送给 客户',
u'客户 申请 i 我家 , 家庭 畅享 计划 后 , 可 选择 设置 1 - 6 个 同一 归属 地 的 中国移动 网 内 号码 作为 亲情 号码 , 组建 一个 家庭 亲情 网 家庭 内 ',
u'所有 成员 可 享受 本地 互打 免费 优惠 , 家庭 主卡 号码 还 可 享受 省内 / 国内 漫游 接听 免费 的 优惠']
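# Caveat (an assumption about sklearn defaults, not stated in the original
# script): the documents above are already segmented with spaces, but
# TfidfVectorizer's default token_pattern, r"(?u)\b\w\w+\b", silently drops
# single-character tokens such as u'可' and u'卡'. A looser pattern keeps
# every whitespace-separated token:
from sklearn.feature_extraction.text import TfidfVectorizer

demo_vec = TfidfVectorizer(token_pattern=r'(?u)\b\w+\b')
demo_vec.fit([u'家庭 卡 可 办理'])
print(sorted(demo_vec.vocabulary_))  # now includes the single characters u'可' and u'卡'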
from sklearn.feature_extraction.text import CountVectorizer
# count_vectorizer = CountVectorizer(min_df=1)
# term_freq_matrix = count_vectorizer.fit_transform(mydoclist)
# print("Vocabulary:", count_vectorizer.vocabulary_)
#
# from sklearn.feature_extraction.text import TfidfTransformer
#
# tfidf = TfidfTransformer(norm="l2")
# tfidf.fit(term_freq_matrix)
#
# tf_idf_matrix = tfidf.transform(term_freq_matrix)
# print(tf_idf_matrix.todense())
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(min_df=1)
tfidf_matrix = tfidf_vectorizer.fit_transform(mydoclist)
# Vocabulary terms in dict (insertion) order, joined for display; avoid
# shadowing the built-in name `str`.
vocab_terms = ' '.join(tfidf_vectorizer.vocabulary_)
print(vocab_terms)
print(tfidf_matrix.todense())
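# Follow-on sketch (not in the original script): documents embedded in the same
# TF-IDF space can be compared with cosine similarity; since TfidfVectorizer
# L2-normalises rows by default, this equals the plain dot product. The corpus
# below is an illustrative fragment, not the script's mydoclist.
from sklearn.feature_extraction.text import TfidfVectorizer as _TV
from sklearn.metrics.pairwise import cosine_similarity

_docs = [u'家庭 畅享 套餐', u'家庭 亲情 号码', u'漫游 接听 免费']
similarity = cosine_similarity(_TV(min_df=1).fit_transform(_docs))
print(similarity.round(2))  # diagonal is 1.0; docs 0 and 2 share no terms, so 0.0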
new_docs = [u'一个 客户 号码 只能 办理 一种 家庭 畅享 计划 套餐 , 且 只能 加入 一个 家庭网']
new_term_freq_matrix = tfidf_vectorizer.transform(new_docs)
print(tfidf_vectorizer.vocabulary_, type(tfidf_vectorizer.vocabulary_))
# Vocabulary terms sorted by their column index in the TF-IDF matrix.
sorted_vocab = sorted(tfidf_vectorizer.vocabulary_.items(), key=lambda d: d[1])
print(' '.join(term for term, _ in sorted_vocab))
print(sorted(tfidf_vectorizer.vocabulary_.values()))
print(sorted_vocab)
print(new_term_freq_matrix.todense())
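# Note on transform (a standalone sketch, not part of the original script):
# transform() reuses the vocabulary learned by fit(), so terms unseen during
# fitting, such as u'流量' below, are silently ignored rather than added.
from sklearn.feature_extraction.text import TfidfVectorizer as _TV2

_vec = _TV2(min_df=1).fit([u'家庭 套餐 办理', u'号码 办理'])
_row = _vec.transform([u'流量 办理']).toarray()[0]
print(_row)  # only the column for u'办理' is non-zero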