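An Improved Vector Space Model

This article describes an improved vector space model (VSM) that uses TF-IDF weighting in document similarity computation to address unevenly distributed term counts, and links to a C# implementation.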
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"><span style=""><span style="font-size: small;">Disclaimer: this post only introduces (or, if you prefer, popularizes) the vector space model; it contains no new theoretical work.</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"><span style="font-size: small;">My two earlier posts, "<a href="http://blog.youkuaiyun.com/Felomeng/archive/2009/03/25/4024078.aspx" target="_blank">A Brief Introduction to the Vector Space Model (VSM) for Document Similarity</a>" and "<a href="http://blog.youkuaiyun.com/Felomeng/archive/2009/03/25/4023990.aspx" target="_blank">Implementing Vector Space Model Document Similarity (C#)</a>", introduced the simple VSM and its implementation, respectively.</span></p>
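<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"><span class="title"><span style="font-size: small;"><span style="">Using only raw term counts (the number of times a term occurs in the current document), I implemented a naive version of the vector space model. It works reasonably well, but there is still a lot of room for improvement.</span></span></span></p>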
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"><span style="font-size: small;"><span class="title"><span style="">Raw term counts cause problems when a very long document is compared with a very short one. Suppose document I contains 10000 words and the term a occurs 10 times, while document II contains 100 words and a occurs 5 times. In the similarity computation, a then carries more weight in document I than in document II. This is clearly unreasonable, because a accounts for only 0.1% of document I but 5% of document II. To address this kind of problem, we introduce two concepts: term frequency (TF) and inverse document frequency (IDF).</span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"><span style="font-size: small;"><span class="title"><span style="">TF = f/m, where f is the number of times the current term occurs in the current document and m is the count of the most frequent term in that document. TF therefore lies between 0 and 1, and equals 1 for the most frequent term. This normalization reduces the error caused by skewed term counts within a document.</span></span></span></p>
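<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"><span style="font-size: small;">The TF definition above can be sketched in a few lines. This is a minimal illustration in Python, not the author's C# code; the function name <code>term_frequencies</code> and the assumption that a document arrives as an already tokenized list of words are mine.</span></p>

```python
from collections import Counter


def term_frequencies(tokens):
    """Return {term: f / m} for one document.

    f is the count of the term in the document and m is the count of the
    most frequent term, so every value lies in (0, 1].
    """
    counts = Counter(tokens)
    if not counts:
        # An empty document has no terms and therefore no weights.
        return {}
    m = max(counts.values())
    return {term: f / m for term, f in counts.items()}


# Hypothetical toy document: "a" occurs 3 times, "b" twice, "c" once.
doc = ["a", "b", "a", "c", "a", "b"]
tf = term_frequencies(doc)
```

<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"><span style="font-size: small;">Here the most frequent term "a" gets TF = 1, while "b" and "c" get 2/3 and 1/3.</span></p>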
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"><span style="font-size: small;"><span style="font-family: Calibri;"><span class="title"><span lang="EN-US">IDF = </span></span><span lang="EN-US">log<sub>2</sub>(<em>n</em>/<em>n<sub>j</sub></em>) + 1</span></span><span style="">, where <em>n</em> is the total number of documents in the corpus and <em>n<sub>j</sub></em> is the number of documents that contain the current term. A term that appears in every document gets IDF = 1, and rarer terms get larger values. This reduces the similarity error caused by terms being unevenly distributed across the corpus.</span></span></p>
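<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"><span style="font-size: small;">As a rough sketch of the IDF definition above (Python, not the author's C# code), the function below counts in how many documents each term appears and applies log<sub>2</sub>(n/n<sub>j</sub>) + 1. The name <code>inverse_document_frequencies</code> and the tokenized-list input format are assumptions made for illustration.</span></p>

```python
import math


def inverse_document_frequencies(documents):
    """Return {term: log2(n / n_j) + 1} for every term in the corpus.

    n is the number of documents and n_j is the number of documents
    that contain the term at least once.
    """
    n = len(documents)
    doc_freq = {}
    for tokens in documents:
        # set() makes sure a term is counted once per document.
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log2(n / n_j) + 1 for term, n_j in doc_freq.items()}


# Hypothetical toy corpus of four tokenized documents.
corpus = [["a", "b"], ["a", "c"], ["a", "b", "d"], ["a"]]
idf = inverse_document_frequencies(corpus)
```

<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"><span style="font-size: small;">In this corpus "a" occurs in all four documents and gets IDF = 1, "b" occurs in two and gets 2, and "c" and "d" occur in one each and get 3.</span></p>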
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"><span style="font-size: small;"><span style=""><img src="https://p-blog.youkuaiyun.com/images/p_blog_youkuaiyun.com/Felomeng/EntryImages/20090409/tfidf.JPG" alt="" width="457" height="117"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"><span style="font-size: small;"><span style="">Finally, multiplying the two gives T = TF * IDF. Using this weight in place of the raw term count in "<a href="http://blog.youkuaiyun.com/Felomeng/archive/2009/03/25/4024078.aspx" target="_blank">A Brief Introduction to the Vector Space Model (VSM) for Document Similarity</a>" yields the vector space model commonly used in practice.</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"><span style="font-size: small;"></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"><span style="font-size: small;"><span class="title"><span style=""><img src="https://p-blog.youkuaiyun.com/images/p_blog_youkuaiyun.com/Felomeng/EntryImages/20090409/cos.JPG" alt="" width="402" height="202"></span></span></span></p>
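<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"><span style="font-size: small;">Putting the pieces together, the following Python sketch builds T = TF * IDF vectors and compares them. It assumes that the similarity measure is the standard cosine of the angle between two weight vectors; the function names <code>tfidf_vectors</code> and <code>cosine_similarity</code>, the sparse dictionary representation, and the toy corpus are illustrative choices, not taken from the author's C# code.</span></p>

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one sparse {term: TF * IDF} vector per tokenized document."""
    n = len(documents)
    # n_j: number of documents that contain each term.
    doc_freq = Counter(term for tokens in documents for term in set(tokens))
    vectors = []
    for tokens in documents:
        counts = Counter(tokens)
        m = max(counts.values()) if counts else 1
        vectors.append({
            term: (f / m) * (math.log2(n / doc_freq[term]) + 1)
            for term, f in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus.
corpus = [
    ["vector", "space", "model", "vector"],
    ["vector", "space", "model", "model"],
    ["term", "frequency", "weight"],
]
vecs = tfidf_vectors(corpus)
sim_01 = cosine_similarity(vecs[0], vecs[1])  # share all three terms
sim_02 = cosine_similarity(vecs[0], vecs[2])  # share no terms
```

<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"><span style="font-size: small;">The first two documents share all of their terms and score well above zero, while the first and third share none and score exactly 0.</span></p>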
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"><span style="font-size: small;"></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"><span style="font-size: small;"><span class="title"><span style="">I have also optimized and improved the source code of the <a href="http://download.youkuaiyun.com/source/1143450">original vector space model</a> (mainly by trading space for time); the new version can be downloaded <a href="http://download.youkuaiyun.com/source/1191463">here</a>.</span></span></span></p>
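<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"><span style="font-size: small;">The post does not say which space-for-time optimizations the download contains. One common way to apply that idea to cosine similarity, shown here purely as an assumption, is to store each document's vector norm once and to cache pairwise results, so repeated queries skip the recomputation. The class name <code>SimilarityIndex</code> and its methods are hypothetical.</span></p>

```python
import math


class SimilarityIndex:
    """Trades memory for speed: norms are precomputed and results are cached."""

    def __init__(self, vectors):
        # vectors: list of sparse {term: weight} dictionaries.
        self.vectors = vectors
        # Extra memory: one stored norm per document.
        self.norms = [
            math.sqrt(sum(w * w for w in v.values())) for v in vectors
        ]
        # Extra memory: one stored result per queried pair.
        self._cache = {}

    def similarity(self, i, j):
        # Cosine similarity is symmetric, so (i, j) and (j, i) share a key.
        key = (i, j) if i <= j else (j, i)
        if key not in self._cache:
            u, v = self.vectors[i], self.vectors[j]
            if len(u) > len(v):
                u, v = v, u  # iterate over the shorter vector
            dot = sum(w * v.get(t, 0.0) for t, w in u.items())
            denom = self.norms[i] * self.norms[j]
            self._cache[key] = dot / denom if denom else 0.0
        return self._cache[key]


# Hypothetical weight vectors; the second is a scaled copy of the first.
index = SimilarityIndex([
    {"a": 1.0, "b": 2.0},
    {"a": 2.0, "b": 4.0},
    {"c": 3.0},
])
```

<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"><span style="font-size: small;">Calling <code>index.similarity(0, 1)</code> and then <code>index.similarity(1, 0)</code> computes the dot product only once; the second call is a dictionary lookup.</span></p>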
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; text-indent: 21.2pt;"><span style="font-size: small;"></span></p>