Understanding and Using tf-idf

一、tf-idf Summary

tf-idf is obtained by multiplying two parts, tf and idf.

  • 1. tf is the frequency of each word within a sentence. A high frequency means the sentence puts particular weight on that word, so its main topic is probably related to it.
  • 2. idf is log10(total number of sentences in the corpus / number of sentences containing the word). It reflects how important the word is: if a word appears in every sentence, it is certainly not important.

二、Formulas

  • tf = (number of times the word appears in the sentence) / (total number of words in the sentence)
  • idf = log10(total number of sentences in the corpus / number of sentences containing the word)
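
The two formulas can be sketched directly in Python. The corpus and the word below are made up purely for illustration:

```python
import math

# Toy corpus: each "sentence" is a list of words.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "the cat ran".split(),
]
word, sentence = "cat", corpus[0]

# tf = times the word appears in the sentence / total words in the sentence
tf = sentence.count(word) / len(sentence)          # 1/6

# idf = log10(total sentences / sentences containing the word)
containing = sum(1 for s in corpus if word in s)   # 2 of 3 sentences
idf = math.log10(len(corpus) / containing)         # log10(3/2)

tfidf = tf * idf
print(tf, idf, tfidf)
```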

三、Usage

3.1 Getting the result directly
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
     'This is the first document.',
     'This document is the second document.',
     'And this is the third one.',
     'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out().tolist())  # get_feature_names() was removed in scikit-learn 1.2
print(X.toarray())
print(X.shape)

Result

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762
  0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.
  0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]
(4, 9)
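
Note that these numbers do not come from the log10 formula in section 二: by default scikit-learn uses raw counts for tf, a smoothed natural-log idf, and then L2-normalizes each row. A sketch reproducing row 0's weight for 'document' by hand (the counts and document frequencies are read off the four sentences above):

```python
import math

# scikit-learn's default weighting (a sketch, not the library source):
#   tf  = raw count of the term in the sentence
#   idf = ln((1 + n) / (1 + df)) + 1   (smooth_idf=True, natural log)
#   each row is then L2-normalized.
n = 4                                                   # sentences in the corpus
counts = {"this": 1, "is": 1, "the": 1, "first": 1, "document": 1}  # row 0
df = {"this": 4, "is": 4, "the": 4, "first": 2, "document": 3}      # doc frequencies

weights = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in counts.items()}
norm = math.sqrt(sum(w * w for w in weights.values()))
print(round(weights["document"] / norm, 8))  # matches X[0]'s 0.46979139
```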
3.2 Different n-gram ranges
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
     'This is the first document.',
     'This document is the second document.',
     'And this is the third one.',
     'Is this the first document?',
]
vectorizer = TfidfVectorizer(ngram_range=(1,5))
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out().tolist())

Result

['and', 'and this', 'and this is', 'and this is the', 'and this is the third', 'document', 'document is', 'document is the', 'document is the second', 'document is the second document', 'first', 'first document', 'is', 'is the', 'is the first', 'is the first document', 'is the second', 'is the second document', 'is the third', 'is the third one', 'is this', 'is this the', 'is this the first', 'is this the first document', 'one', 'second', 'second document', 'the', 'the first', 'the first document', 'the second', 'the second document', 'the third', 'the third one', 'third', 'third one', 'this', 'this document', 'this document is', 'this document is the', 'this document is the second', 'this is', 'this is the', 'this is the first', 'this is the first document', 'this is the third', 'this is the third one', 'this the', 'this the first', 'this the first document']
3.3 Characters as the unit
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
     'This is the first document.',
     'This document is the second document.',
     'And this is the third one.',
     'Is this the first document?',
]
vectorizer = TfidfVectorizer(analyzer='char_wb')
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out().tolist())

Result

[' ', '.', '?', 'a', 'c', 'd', 'e', 'f', 'h', 'i', 'm', 'n', 'o', 'r', 's', 't', 'u']
3.4 Character n-grams
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
     'This is the first document.',
     'This document is the second document.',
     'And this is the third one.',
     'Is this the first document?',
]
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 5))
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out().tolist(), X.shape)

Result

[' a', ' an', ' and', ' and ', ' d', ' do', ' doc', ' docu', ' f', ' fi', ' fir', ' firs', ' i', ' is', ' is ', ' o', ' on', ' one', ' one.', ' s', ' se', ' sec', ' seco', ' t', ' th', ' the', ' the ', ' thi', ' thir', ' this', '. ', '? ', 'an', 'and', 'and ', 'co', 'con', 'cond', 'cond ', 'cu', 'cum', 'cume', 'cumen', 'd ', 'do', 'doc', 'docu', 'docum', 'e ', 'e.', 'e. ', 'ec', 'eco', 'econ', 'econd', 'en', 'ent', 'ent ', 'ent.', 'ent. ', 'ent?', 'ent? ', 'fi', 'fir', 'firs', 'first', 'he', 'he ', 'hi', 'hir', 'hird', 'hird ', 'his', 'his ', 'ir', 'ird', 'ird ', 'irs', 'irst', 'irst ', 'is', 'is ', 'me', 'men', 'ment', 'ment ', 'ment.', 'ment?', 'nd', 'nd ', 'ne', 'ne.', 'ne. ', 'nt', 'nt ', 'nt.', 'nt. ', 'nt?', 'nt? ', 'oc', 'ocu', 'ocum', 'ocume', 'on', 'ond', 'ond ', 'one', 'one.', 'one. ', 'rd', 'rd ', 'rs', 'rst', 'rst ', 's ', 'se', 'sec', 'seco', 'secon', 'st', 'st ', 't ', 't.', 't. ', 't?', 't? ', 'th', 'the', 'the ', 'thi', 'thir', 'third', 'this', 'this ', 'um', 'ume', 'umen', 'ument'] (4, 138)