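TF-IDF calculation:
TF-IDF reflects how important a word is to a document within a document collection, and is often
used as a weighting factor in text data mining and information retrieval. Within a given document,
term frequency (TF) is the frequency with which a given term appears in that document. Inverse
document frequency (IDF) is a measure of how much general importance a term carries. The IDF of a
particular term is obtained by dividing the total number of documents by the number of documents
containing that term, and then taking the logarithm of the quotient.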
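The definitions above can be written compactly as follows. The symbols are notation chosen here for illustration, not taken from the original post: n_{t,d} is the number of times term t occurs in document d, D is the document collection, and N = |D| is the total number of documents.

```latex
\mathrm{tf}(t,d) = \frac{n_{t,d}}{\sum_{k} n_{k,d}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{\lvert \{\, d \in D : t \in d \,\} \rvert}, \qquad
\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t)
```

A term that appears in every document has idf = log(1) = 0, so its TF-IDF weight is zero no matter how often it occurs in any single document.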
Related code:
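- // Imports required at the top of the file
- import java.util.List;
- import java.util.regex.Pattern;
- // Token delimiters: space, tab, newline and common punctuation
- private static Pattern r = Pattern.compile("([ \\t{}()\",:;.\n])");
- private static List<String> documentCollection;
- // Calculates the TF-IDF weight for term t in document d
- private static float findTFIDF(String document, String term)
- {
-     float tf = findTermFrequency(document, term);
-     float idf = findInverseDocumentFrequency(term);
-     return tf * idf;
- }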
- // Term frequency: occurrences of the term divided by the number of tokens in the document
- private static float findTermFrequency(String document, String term)
- {
-     int count = getFrequencyInOneDoc(document, term);
-     return (float) count / (float) r.split(document).length;
- }
- // Counts case-insensitive occurrences of the term in one document
- private static int getFrequencyInOneDoc(String document, String term)
- {
-     int count = 0;
-     for (String s : r.split(document))
-     {
-         if (s.equalsIgnoreCase(term)) {
-             count++;
-         }
-     }
-     return count;
- }
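- private static float findInverseDocumentFrequency(String term)
- {
-     // Count the documents in the whole collection that contain the term
-     // (each document counts once, however often the term occurs in it)
-     int count = 0;
-     for (String doc : documentCollection)
-     {
-         if (getFrequencyInOneDoc(doc, term) > 0) {
-             count++;
-         }
-     }
-     // No document contains the term: return 0 to avoid dividing by zero
-     if (count == 0) {
-         return 0;
-     }
-     /*
-      * Log of the ratio of the total number of documents in the collection
-      * to the number of documents containing the term.
-      * Math.log((float) documentCollection.size() / (1 + count)) is another
-      * way to deal with the divide-by-zero case.
-      */
-     return (float) Math.log((float) documentCollection.size() / (float) count);
- }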
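To make the calculation concrete, here is a minimal, self-contained sketch that applies the same TF and IDF definitions to a tiny in-memory corpus. The class name `TfIdfDemo`, the whitespace tokenizer, and the three sample sentences are assumptions made for illustration; they are not part of the code above.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: class name, tokenizer and sample corpus are assumptions.
class TfIdfDemo {
    // Lowercase the text and split on runs of whitespace (simpler than the regex above).
    static List<String> tokenize(String document) {
        return Arrays.asList(document.toLowerCase().trim().split("\\s+"));
    }

    // Term frequency: occurrences of the term divided by the number of tokens.
    static double tf(String document, String term) {
        List<String> tokens = tokenize(document);
        long count = tokens.stream().filter(t -> t.equals(term.toLowerCase())).count();
        return (double) count / tokens.size();
    }

    // Inverse document frequency: log(N / number of documents containing the term).
    // Returns 0 when no document contains the term, to avoid dividing by zero.
    static double idf(List<String> corpus, String term) {
        long containing = corpus.stream()
                .filter(doc -> tokenize(doc).contains(term.toLowerCase()))
                .count();
        return containing == 0 ? 0.0 : Math.log((double) corpus.size() / containing);
    }

    static double tfIdf(List<String> corpus, String document, String term) {
        return tf(document, term) * idf(corpus, term);
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList(
                "the cat sat on the mat",
                "the dog chased the cat",
                "birds sing");
        String first = corpus.get(0);
        System.out.println("mat: " + tfIdf(corpus, first, "mat"));
        System.out.println("cat: " + tfIdf(corpus, first, "cat"));
    }
}
```

In the first sentence "mat" and "cat" have the same term frequency (1/6), but "mat" occurs in one document and "cat" in two, so "mat" receives the higher weight.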
Related code (cosine similarity):
- public static float findCosineSimilarity(float[] vecA, float[] vecB)
- {
-     float dotProduct = dotProduct(vecA, vecB);
-     float magnitudeOfA = magnitude(vecA);
-     float magnitudeOfB = magnitude(vecB);
-     float result = dotProduct / (magnitudeOfA * magnitudeOfB);
-     // When 0 is divided by 0 the result is NaN, so return 0 in that case.
-     if (Float.isNaN(result))
-         return 0;
-     else
-         return result;
- }
- // Assumes vecA and vecB have the same length.
- public static float dotProduct(float[] vecA, float[] vecB)
- {
-     float dotProduct = 0;
-     for (int i = 0; i < vecA.length; i++)
-     {
-         dotProduct += (vecA[i] * vecB[i]);
-     }
-     return dotProduct;
- }
- // Magnitude of the vector is the square root of the dot product of the vector with itself.
- public static float magnitude(float[] vector)
- {
-     return (float) Math.sqrt(dotProduct(vector, vector));
- }
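As a quick illustration of how these functions behave, the sketch below computes the cosine similarity of a few hand-made weight vectors. The class name `CosineDemo` and the vector values are made up for the example; in practice each position would hold the TF-IDF weight of one vocabulary term.

```java
// Illustrative sketch only: class name and sample vectors are assumptions.
class CosineDemo {
    static double dot(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have equal length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Cosine of the angle between a and b; 0 if either vector is all zeros.
    static double cosine(double[] a, double[] b) {
        double denom = Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b));
        return denom == 0.0 ? 0.0 : dot(a, b) / denom;
    }

    public static void main(String[] args) {
        double[] docA = {0.2, 0.0, 0.5};
        double[] docB = {0.4, 0.0, 1.0}; // same direction as docA
        double[] docC = {0.0, 0.7, 0.0}; // shares no terms with docA
        System.out.println("A vs B: " + cosine(docA, docB));
        System.out.println("A vs C: " + cosine(docA, docC));
    }
}
```

Vectors that point in the same direction score close to 1 regardless of their length, while documents with no terms in common score 0.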
Stop-word filtering
Stop-word list:
ftp://ftp.cs.cornell.edu/pub/smart/english.stop
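A stop-word filter can be sketched as a simple set lookup applied to the tokens before any weights are computed. The class name `StopWordFilterDemo` and the six hard-coded words are assumptions for illustration; a real filter would load a full list, such as the one linked above, from a file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: class name and the tiny word list are assumptions.
class StopWordFilterDemo {
    // Small hard-coded sample; a real list would be read from a file.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "on", "is"));

    // Returns the tokens that are not stop words (case-insensitive match).
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token.toLowerCase())) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("The", "cat", "sat", "on", "the", "mat");
        System.out.println(removeStopWords(tokens)); // prints [cat, sat, mat]
    }
}
```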
For more on TF-IDF, see:
Link: http://en.wikipedia.org/wiki/Tf*idf