Document Similarity with the Vector Space Model: A C# Implementation

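This article describes a text similarity method based on the vector space model (VSM) and gives a concrete C# implementation. The method first counts the term frequencies of each document, then computes the similarity between the two documents with the cosine similarity formula.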
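The cosine formula mentioned above can be written out as follows. This is a sketch in notation that is not in the original article: it assumes a_i and b_i are the counts of the i-th term in the first and second document, taken over the union of both vocabularies (a term missing from a document counts as 0). In the listing's terms, `numerator` accumulates the sum of products, and `denominator1` and `denominator2` accumulate the two sums of squares.

```latex
\[
\mathrm{sim}(d_1, d_2)
  = \frac{\sum_{i} a_i\, b_i}
         {\sqrt{\sum_{i} a_i^{2}}\;\sqrt{\sum_{i} b_i^{2}}}
\]
% Worked example with assumed counts a = (2, 1, 0) and b = (1, 0, 1):
\[
\mathrm{sim}
  = \frac{2\cdot 1 + 1\cdot 0 + 0\cdot 1}
         {\sqrt{2^{2} + 1^{2} + 0^{2}}\;\sqrt{1^{2} + 0^{2} + 1^{2}}}
  = \frac{2}{\sqrt{5}\,\sqrt{2}}
  = \frac{2}{\sqrt{10}} \approx 0.632
\]
```

Because term counts are never negative, the result lies between 0 (no shared terms) and 1 (identical term distributions).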
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="color: #000000;"><span style="font-size: small;">Readers can wrap or adapt the code to suit their own needs; this article is only meant as a starting point.</span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="color: #000000;"><span style="font-size: small;">The author's own wrapper is available at:</span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="color: #000000;"><span style="font-size: small;"><span style="text-decoration: underline;"><span style="color: #800080;"><a href="http://download.youkuaiyun.com/source/1143450">http://download.youkuaiyun.com/source/1143450</a></span></span></span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="color: #000000;"><span style="font-size: small;">An introduction to the vector space model:</span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="color: #000000;"><span style="font-size: small;"><span style="color: #0000ff;"><a href="http://blog.youkuaiyun.com/Felomeng/archive/2009/03/25/4024078.aspx">http://blog.youkuaiyun.com/Felomeng/archive/2009/03/25/4024078.aspx</a></span></span></span></span></p>
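Before reading the full listing, it may help to see the two steps (count term frequencies, then take the cosine) in a compact form. The following is a minimal sketch, not the author's code: `CosineSketch`, `Count`, and `Cosine` are hypothetical names, and splitting on whitespace is an assumption standing in for the listing's own word-boundary rule.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not part of the article's class.
public static class CosineSketch
{
    // Count how often each whitespace-separated token occurs.
    // Assumption: the text is already segmented with whitespace.
    public static Dictionary<string, int> Count(string text)
    {
        var counts = new Dictionary<string, int>();
        foreach (string token in text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
        {
            counts.TryGetValue(token, out int n); // n stays 0 for a new token
            counts[token] = n + 1;
        }
        return counts;
    }

    // Cosine of the angle between two term-frequency vectors.
    public static double Cosine(Dictionary<string, int> a, Dictionary<string, int> b)
    {
        // An empty document has no direction, so report no similarity.
        if (a.Count == 0 || b.Count == 0) return 0.0;

        // Only terms present in both documents contribute to the dot product.
        double dot = 0.0;
        foreach (KeyValuePair<string, int> pair in a)
        {
            if (b.TryGetValue(pair.Key, out int other))
            {
                dot += (double)pair.Value * other;
            }
        }

        double normA = Math.Sqrt(a.Values.Sum(v => (double)v * v));
        double normB = Math.Sqrt(b.Values.Sum(v => (double)v * v));
        return dot / (normA * normB);
    }
}
```

Unlike the listing below, this sketch does not copy or modify either dictionary: it reads the second dictionary only through `TryGetValue` and computes each norm in a separate pass. Two texts with the same token counts score 1 (up to floating-point rounding), and two texts with no token in common score 0.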
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style="font-size: small;"><span style="">using</span><span style=""> <span style="color: #010001;">System</span>;</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style="font-size: small;"><span style="">using</span><span style=""> <span style="color: #010001;">System</span>.<span style="color: #010001;">Collections</span>.<span style="color: #010001;">Generic</span>;</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style="font-size: small;"><span style="">using</span><span style=""> <span style="color: #010001;">System</span>.<span style="color: #010001;">Linq</span>;</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style="font-size: small;"><span style="">using</span><span style=""> <span style="color: #010001;">System</span>.<span style="color: #010001;">Text</span>;</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style="font-size: small;"><span style="">using</span><span style=""> <span style="color: #010001;">System</span>.<span style="color: #010001;">Text</span>.<span style="color: #010001;">RegularExpressions</span>;</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style="font-size: small;"><span style="">namespace</span><span style=""> <span style="color: #010001;">Felomeng</span>.<span style="color: #010001;">VSMSimilarity</span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;">{</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: blue;">class</span> <span style="color: #2b91af;">SVMModle</span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span>{</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: gray;">///</span><span style="color: green;"> </span><span style="color: gray;"><summary></span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: gray;">///</span><span style="color: green;"> <span lang="ZH-CN">Word list used for dimensionality reduction</span></span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: gray;">///</span><span style="color: green;"> </span><span style="color: gray;"></summary></span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: blue;">private</span> <span style="color: #2b91af;">List</span><<span style="color: blue;">string</span>> <span style="color: #010001;">reducingKeys</span> = <span style="color: blue;">new</span> <span style="color: #2b91af;">List</span><<span style="color: blue;">string</span>>();</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: gray;">///</span><span style="color: green;"> </span><span style="color: gray;"><summary></span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: gray;">///</span><span style="color: green;"> <span lang="ZH-CN">Constructor: uses a dimensionality-reduction word list</span></span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: gray;">///</span><span style="color: green;"> </span><span style="color: gray;"></summary></span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: gray;">///</span><span style="color: green;"> </span><span style="color: gray;"><param name="reducingKeys"></span><span style="color: green;" lang="ZH-CN">Word list used for dimensionality reduction</span><span style="color: gray;"></param></span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: blue;">public</span> <span style="color: #010001;">SVMModle</span>(<span style="color: #2b91af;">List</span><<span style="color: blue;">string</span>> <span style="color: #010001;">reducingKeys</span>)</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span>{</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: blue;">this</span>.<span style="color: #010001;">reducingKeys</span> = <span style="color: #010001;">reducingKeys</span>;</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span>}</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: gray;">///</span><span style="color: green;"> </span><span style="color: gray;"><summary></span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: gray;">///</span><span style="color: green;"> <span lang="ZH-CN">Constructor: no dimensionality-reduction word list</span></span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: gray;">///</span><span style="color: green;"> </span><span style="color: gray;"></summary></span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: blue;">public</span> <span style="color: #010001;">SVMModle</span>()</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span>{</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span>}</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: gray;">///</span><span style="color: green;"> </span><span style="color: gray;"><summary></span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: gray;">///</span><span style="color: green;"> <span lang="ZH-CN">Computes the similarity of two segmented documents</span></span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: gray;">///</span><span style="color: green;"> </span><span style="color: gray;"></summary></span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: gray;">///</span><span style="color: green;"> </span><span style="color: gray;"><param name="text1"></span><span style="color: green;" lang="ZH-CN">Document 1 (already segmented; words are separated by non-Chinese characters)</span><span style="color: gray;"></param></span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: gray;">///</span><span style="color: green;"> </span><span style="color: gray;"><param name="text2"></span><span style="color: green;" lang="ZH-CN">Document 2 (already segmented; words are separated by non-Chinese characters)</span><span style="color: gray;"></param></span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: gray;">///</span><span style="color: green;"> </span><span style="color: gray;"><returns></span><span style="color: green;" lang="ZH-CN">The similarity of the two documents</span><span style="color: gray;"></returns></span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: blue;">public</span> <span style="color: blue;">double</span> <span style="color: #010001;">Similarity</span>(<span style="color: blue;">string</span> <span style="color: #010001;">text1</span>, <span style="color: blue;">string</span> <span style="color: #010001;">text2</span>)</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span>{</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: blue;">double</span> <span style="color: #010001;">similarity</span> = 0.0, <span style="color: #010001;">numerator</span> = 0.0, <span style="color: #010001;">denominator1</span> = 0.0, <span style="color: #010001;">denominator2</span> = 0.0;</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: blue;">int</span> <span style="color: #010001;">temp1</span>, <span style="color: #010001;">temp2</span>;</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: #2b91af;">Dictionary</span><<span style="color: blue;">string</span>, <span style="color: blue;">int</span>> <span style="color: #010001;">dictionary1</span> = <span style="color: #010001;">GetDictionary</span>(<span style="color: #010001;">text1</span>);</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: #2b91af;">Dictionary</span><<span style="color: blue;">string</span>, <span style="color: blue;">int</span>> <span style="color: #010001;">dictionary2</span> = <span style="color: #010001;">GetDictionary</span>(<span style="color: #010001;">text2</span>);</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: blue;">if</span> ((<span style="color: #010001;">dictionary1</span>.<span style="color: #010001;">Count</span> < 1) || (<span style="color: #010001;">dictionary2</span>.<span style="color: #010001;">Count</span> < 1))<span style="color: green;">//<span lang="ZH-CN"> either document contains no Chinese characters</span></span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span>{</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: blue;">return</span> 0.0;</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span>}</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: #2b91af;">Dictionary</span><<span style="color: blue;">string</span>, <span style="color: blue;">int</span>>.<span style="color: #2b91af;">KeyCollection</span> <span style="color: #010001;">keys1</span> = <span style="color: #010001;">dictionary1</span>.<span style="color: #010001;">Keys</span>;</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: blue;">foreach</span> (<span style="color: blue;">string</span> <span style="color: #010001;">key</span> <span style="color: blue;">in</span> <span style="color: #010001;">keys1</span>)</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span>{</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: #010001;">dictionary1</span>.<span style="color: #010001;">TryGetValue</span>(<span style="color: #010001;">key</span>, <span style="color: blue;">out</span> <span style="color: #010001;">temp1</span>);</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: blue;">if</span> (!<span style="color: #010001;">dictionary2</span>.<span style="color: #010001;">TryGetValue</span>(<span style="color: #010001;">key</span>, <span style="color: blue;">out</span> <span style="color: #010001;">temp2</span>))</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span>{</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: #010001;">temp2</span> = 0;</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span>}</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style=""></span><span style="color: #010001;">dictionary2</span>.<span style="color: #010001;">Remove</span>(<span style="color: #010001;">key</span>);</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: #010001;">numerator</span> += <span style="color: #010001;">temp1</span> * <span style="color: #010001;">temp2</span>;</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: #010001;">denominator1</span> += <span style="color: #010001;">temp1</span> * <span style="color: #010001;">temp1</span>;</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: #010001;">denominator2</span> += <span style="color: #010001;">temp2</span> * <span style="color: #010001;">temp2</span>;</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span>}</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: #2b91af;">Dictionary</span><<span style="color: blue;">string</span>, <span style="color: blue;">int</span>>.<span style="color: #2b91af;">KeyCollection</span> <span style="color: #010001;">keys2</span> = <span style="color: #010001;">dictionary2</span>.<span style="color: #010001;">Keys</span>;</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: blue;">foreach</span> (<span style="color: blue;">string</span> <span style="color: #010001;">key</span> <span style="color: blue;">in</span> <span style="color: #010001;">keys2</span>)</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span>{</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: #010001;">dictionary2</span>.<span style="color: #010001;">TryGetValue</span>(<span style="color: #010001;">key</span>, <span style="color: blue;">out</span> <span style="color: #010001;">temp2</span>);</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: #010001;">denominator2</span> += <span style="color: #010001;">temp2</span> * <span style="color: #010001;">temp2</span>;</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span>}</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: #010001;">similarity</span> = <span style="color: #010001;">numerator</span> / (<span style="color: #2b91af;">Math</span>.<span style="color: #010001;">Sqrt</span>(<span style="color: #010001;">denominator1</span> * <span style="color: #010001;">denominator2</span>));</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style=""></span><span style="color: blue;">return</span> <span style="color: #010001;">similarity</span>;</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span>}</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: gray;">///</span><span style="color: green;"> </span><span style="color: gray;"><summary></span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: gray;">///</span><span style="color: green;"> <span lang="ZH-CN">Computes the similarity of two term-frequency dictionaries</span></span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: gray;">///</span><span style="color: green;"> </span><span style="color: gray;"></summary></span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: gray;">///</span><span style="color: green;"> </span><span style="color: gray;"><param name="text1"></span><span style="color: green;" lang="ZH-CN">Term-frequency dictionary of the first document</span><span style="color: gray;"></param></span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: gray;">///</span><span style="color: green;"> </span><span style="color: gray;"><param name="text2"></span><span style="color: green;" lang="ZH-CN">Term-frequency dictionary of the second document</span><span style="color: gray;"></param></span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: gray;">///</span><span style="color: green;"> </span><span style="color: gray;"><returns></span><span style="color: green;" lang="ZH-CN">The similarity of the two documents</span><span style="color: gray;"></returns></span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: blue;">public</span> <span style="color: blue;">double</span> <span style="color: #010001;">Similarity</span>(<span style="color: #2b91af;">Dictionary</span><<span style="color: blue;">string</span>, <span style="color: blue;">int</span>> <span style="color: #010001;">text1</span>, <span style="color: #2b91af;">Dictionary</span><<span style="color: blue;">string</span>, <span style="color: blue;">int</span>> <span style="color: #010001;">text2</span>)</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span>{</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: blue;">double</span> <span style="color: #010001;">similarity</span> = 0.0, <span style="color: #010001;">numerator</span> = 0.0, <span style="color: #010001;">denominator1</span> = 0.0, <span style="color: #010001;">denominator2</span> = 0.0;</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: blue;">int</span> <span style="color: #010001;">temp1</span>, <span style="color: #010001;">temp2</span>;</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: #2b91af;">Dictionary</span><<span style="color: blue;">string</span>, <span style="color: blue;">int</span>> <span style="color: #010001;">dictionary1</span> = <span style="color: blue;">new</span> <span style="color: #2b91af;">Dictionary</span><<span style="color: blue;">string</span>,<span style="color: blue;">int</span>>( <span style="color: #010001;">text1</span>);</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: #2b91af;">Dictionary</span><<span style="color: blue;">string</span>, <span style="color: blue;">int</span>> <span style="color: #010001;">dictionary2</span> = <span style="color: blue;">new</span> <span style="color: #2b91af;">Dictionary</span><<span style="color: blue;">string</span>,<span style="color: blue;">int</span>>( <span style="color: #010001;">text2</span>);</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: blue;">if</span> ((<span style="color: #010001;">dictionary1</span>.<span style="color: #010001;">Count</span> < 1) || (<span style="color: #010001;">dictionary2</span>.<span style="color: #010001;">Count</span> < 1))<span style="color: green;">//<span lang="ZH-CN"> either document has no terms (it contained no Chinese characters)</span></span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span>{</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: blue;">return</span> 0.0;</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span>}</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: #2b91af;">Dictionary</span><<span style="color: blue;">string</span>, <span style="color: blue;">int</span>>.<span style="color: #2b91af;">KeyCollection</span> <span style="color: #010001;">keys1</span> = <span style="color: #010001;">dictionary1</span>.<span style="color: #010001;">Keys</span>;</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: blue;">foreach</span> (<span style="color: blue;">string</span> <span style="color: #010001;">key</span> <span style="color: blue;">in</span> <span style="color: #010001;">keys1</span>)</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span>{</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: #010001;">dictionary1</span>.<span style="color: #010001;">TryGetValue</span>(<span style="color: #010001;">key</span>, <span style="color: blue;">out</span> <span style="color: #010001;">temp1</span>);</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: blue;">if</span> (!<span style="color: #010001;">dictionary2</span>.<span style="color: #010001;">TryGetValue</span>(<span style="color: #010001;">key</span>, <span style="color: blue;">out</span> <span style="color: #010001;">temp2</span>))</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span>{</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: #010001;">temp2</span> = 0;</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span>}</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: #010001;">dictionary2</span>.<span style="color: #010001;">Remove</span>(<span style="color: #010001;">key</span>);</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: #010001;">numerator</span> += <span style="color: #010001;">temp1</span> * <span style="color: #010001;">temp2</span>;</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: #010001;">denominator1</span> += <span style="color: #010001;">temp1</span> * <span style="color: #010001;">temp1</span>;</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: #010001;">denominator2</span> += <span style="color: #010001;">temp2</span> * <span style="color: #010001;">temp2</span>;</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span>}</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: #2b91af;">Dictionary</span><<span style="color: blue;">string</span>, <span style="color: blue;">int</span>>.<span style="color: #2b91af;">KeyCollection</span> <span style="color: #010001;">keys2</span> = <span style="color: #010001;">dictionary2</span>.<span style="color: #010001;">Keys</span>;</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: blue;">foreach</span> (<span style="color: blue;">string</span> <span style="color: #010001;">key</span> <span style="color: blue;">in</span> <span style="color: #010001;">keys2</span>)</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span>{</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: #010001;">dictionary2</span>.<span style="color: #010001;">TryGetValue</span>(<span style="color: #010001;">key</span>, <span style="color: blue;">out</span> <span style="color: #010001;">temp2</span>);</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: #010001;">denominator2</span> += <span style="color: #010001;">temp2</span> * <span style="color: #010001;">temp2</span>;</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span>}</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: #010001;">similarity</span> = <span style="color: #010001;">numerator</span> / (<span style="color: #2b91af;">Math</span>.<span style="color: #010001;">Sqrt</span>(<span style="color: #010001;">denominator1</span> * <span style="color: #010001;">denominator2</span>));</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: blue;">return</span> <span style="color: #010001;">similarity</span>;</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span>}</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: gray;">///</span><span style="color: green;"> </span><span style="color: gray;"><summary></span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: gray;">///</span><span style="color: green;"> <span lang="ZH-CN">Builds the term-frequency dictionary of a document.</span></span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: gray;">///</span><span style="color: green;"> </span><span style="color: gray;"></summary></span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: gray;">///</span><span style="color: green;"> </span><span style="color: gray;"><param name="text"></span><span style="color: green;" lang="ZH-CN">Word-segmented document; words are separated by non-Chinese characters.</span><span style="color: gray;"></param></span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: gray;">///</span><span style="color: green;"> </span><span style="color: gray;"><returns></span><span style="color: green;" lang="ZH-CN">The term-frequency dictionary of the document.</span><span style="color: gray;"></returns></span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: blue;">public</span> <span style="color: #2b91af;">Dictionary</span><<span style="color: blue;">string</span>, <span style="color: blue;">int</span>> <span style="color: #010001;">GetDictionary</span>(<span style="color: blue;">string</span> <span style="color: #010001;">text</span>)</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span>{</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: #2b91af;">Dictionary</span><<span style="color: blue;">string</span>, <span style="color: blue;">int</span>> <span style="color: #010001;">dictionary</span> = <span style="color: blue;">new</span> <span style="color: #2b91af;">Dictionary</span><<span style="color: blue;">string</span>, <span style="color: blue;">int</span>>();</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: #2b91af;">Regex</span> <span style="color: #010001;">regex</span> = <span style="color: blue;">new</span> <span style="color: #2b91af;">Regex</span>(<span style="color: #a31515;">@"[\u4e00-\u9fa5]+"</span>);</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: #2b91af;">MatchCollection</span> <span style="color: #010001;">results</span> = <span style="color: #010001;">regex</span>.<span style="color: #010001;">Matches</span>(<span style="color: #010001;">text</span>);</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: blue;">int</span> <span style="color: #010001;">temp</span>;</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: blue;">foreach</span> (<span style="color: #2b91af;">Match</span> <span style="color: #010001;">word</span> <span style="color: blue;">in</span> <span style="color: #010001;">results</span>)</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span>{</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: blue;">if</span> (<span style="color: #010001;">dictionary</span>.<span style="color: #010001;">TryGetValue</span>(<span style="color: #010001;">word</span>.<span style="color: #010001;">Value</span>, <span style="color: blue;">out</span> <span style="color: #010001;">temp</span>))</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span>{</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: #010001;">dictionary</span>[<span style="color: #010001;">word</span>.<span style="color: #010001;">Value</span>] = <span style="color: #010001;">temp</span> + 1;</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span>}</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: blue;">else</span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span>{</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: #010001;">dictionary</span>.<span style="color: #010001;">Add</span>(<span style="color: #010001;">word</span>.<span style="color: #010001;">Value</span>, 1);</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span>}</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span>}</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span><span style="color: blue;">return</span> <span style="color: #010001;">dictionary</span>;</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span>}</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;"><span style=""> </span>}</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;">}</span></span></p>
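<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;">For reference, the final division in the similarity method corresponds to the cosine formula below. This is a sketch under one assumption: that <span style="color: #010001;">numerator</span> holds the sum of products of the counts of words shared by both documents. The symbols f_1(w) and f_2(w), the word counts in each document, and the two-word example are illustrative and do not come from the code.</span></span></p>

```latex
\mathrm{sim}(d_1, d_2)
  = \frac{\sum_{w} f_1(w)\, f_2(w)}
         {\sqrt{\left(\sum_{w} f_1(w)^2\right)\left(\sum_{w} f_2(w)^2\right)}}

% Illustrative example: d_1 has counts {A: 2, B: 1}, d_2 has counts {A: 1, C: 1}.
% numerator    = 2 \cdot 1 = 2          (only A is shared)
% denominator1 = 2^2 + 1^2 = 5
% denominator2 = 1^2 + 1^2 = 2
\mathrm{sim}(d_1, d_2) = \frac{2}{\sqrt{5 \cdot 2}} = \frac{2}{\sqrt{10}} \approx 0.632
```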
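<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;">There is still plenty of room for optimization, and readers are encouraged to explore it. With suitable optimization, the code could run considerably faster.</span></span></p>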
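<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;">As one possible direction, here is a minimal sketch that computes the dot product and the first norm in a single pass over the smaller dictionary, and reads the second norm from <span style="color: #010001;">Values</span> without key lookups. The class <span style="color: #010001;">CosineSketch</span> and the method <span style="color: #010001;">Cosine</span> are hypothetical names, not part of the original code, and any speed-up has not been measured here.</span></span></p>

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper class; not part of the original article's code.
public static class CosineSketch
{
    public static double Cosine(Dictionary<string, int> a, Dictionary<string, int> b)
    {
        // Walk the smaller dictionary so the dot product needs fewer lookups.
        if (a.Count > b.Count)
        {
            Dictionary<string, int> swap = a;
            a = b;
            b = swap;
        }

        double dot = 0, normA = 0, normB = 0;
        foreach (KeyValuePair<string, int> pair in a)
        {
            // Accumulate in double to avoid int overflow on large counts.
            normA += (double)pair.Value * pair.Value;
            int other;
            if (b.TryGetValue(pair.Key, out other))
            {
                dot += (double)pair.Value * other;
            }
        }

        // The norm only needs the counts, so iterate Values directly.
        foreach (int count in b.Values)
        {
            normB += (double)count * count;
        }

        // An empty document gives a zero vector; report 0 instead of NaN.
        if (normA == 0 || normB == 0)
        {
            return 0;
        }
        return dot / Math.Sqrt(normA * normB);
    }
}
```

<p class="MsoNormal" style="margin: 0cm 0cm 0pt; line-height: normal;"><span style=""><span style="font-size: small;">The two arguments would be the dictionaries returned by <span style="color: #010001;">GetDictionary</span> for each document. Swapping the arguments does not change the result, because cosine similarity is symmetric.</span></span></p>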
<p class="MsoNormal" style="margin: 0cm 0cm 10pt;"><span style="font-size: 6pt; line-height: 115%;"><span style="font-family: Calibri;"><span style="font-size: small;"></span></span></span></p>