Text/HTML Similarity Algorithms

Answer  
Subject: Re: Text/HTML Similarity Algorithms
Answered By: eiffel-ga on 29 Apr 2004 05:48 PDT
Rated:5 out of 5 stars
 
Hi nathanntg,

Using computers for sophisticated similarity analysis of text documents
is an area of active research interest, but straightforward similarity
measurement is a "solved problem".

Jack Lynch, of the University of Pennsylvania, has written a web page
that provides an overview of algorithms for text comparison, working
from the simplest to the most sophisticated. The methods include:

   -  Measuring shared letters (discarded as being not useful)
   -  Measuring shared runs of letters
   -  Measuring shared words
   -  Measuring shared runs of words
   -  Measuring shared runs of word stems
   -  Measuring shared vocabularies
   -  Measuring shared meanings
   -  Measuring shared syntax, even when vocabulary differs

Lynch discusses the advantages and disadvantages for each algorithm,
and provides open source code in C (for measuring shared runs of
letters) and in Perl (for measuring shared vocabularies):

Text Analysis with Compare
http://www.english.upenn.edu/~jlynch/Computing/compare.html
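The shared-vocabulary measure from Lynch's list can be sketched in a few lines of Python (an illustrative reimplementation of the idea, not his Perl code):

```python
def vocabulary_similarity(text_a: str, text_b: str) -> float:
    """Jaccard similarity of the two texts' word sets:
    shared distinct words divided by total distinct words."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty texts are trivially identical
    return len(words_a & words_b) / len(words_a | words_b)

# Example: 4 shared words out of 6 distinct
vocabulary_similarity("the cat sat on the mat",
                      "the cat sat on the hat")
```

A production version would also strip punctuation and possibly stem words, which moves it toward the "shared runs of word stems" measure.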

Lynch dismisses as impractical the measuring of shared meanings,
because it "would involve not only the most powerful processor, but an
almost unimaginably large database -- the sort of thing that would
require a large team working full-time for years".

But he was writing in 1995, and since then processors have become more
powerful, databases have become larger, and a large team has been
working for years...

A paper published by Michael Lee, Brandon Pincombe and Matthew Welsh
of the University of Adelaide discusses text similarity comparison
using word-based, n-gram and Latent Semantic Analysis methods. They
correlate the results of these methods with human similarity
judgements and find that the machine comparisons are useful, though
they fall short of human judgement:

A Comparison of Machine Measures of Document Similarity with Human Judgements
http://www.psychology.adelaide.edu.au/members/staff/michaellee/homepage/lee_pincombe_welsh.pdf
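To illustrate the n-gram family of measures compared in that paper, here is a minimal character-trigram similarity in Python (my own sketch using the Dice coefficient, not the authors' code):

```python
def ngrams(text: str, n: int = 3) -> set:
    """Set of overlapping character n-grams, whitespace-normalised."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Dice coefficient over the two texts' n-gram sets."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))
```

Character n-grams are robust to small spelling differences, which is one reason such measures correlate reasonably well with human judgements on noisy text.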

Further practical work on document comparison using Latent Semantic
Analysis has been done since then. Thomas Landauer describes LSA in
detail in the following article, which doesn't include a complete
algorithm but does contain a complete textual description of the
method from which an algorithm can be derived:

Chapter 13. Landauer - The computational basis of learning
http://lsa.colorado.edu/papers/Ross-final-submit.pdf

Web applications have been written to demonstrate that it is possible
to implement the concepts of LSA described in the above paper. I found
the "One to many comparison" application to be most interesting. I
entered the text of your question as the main text. For the texts to
compare against, I entered the text of your clarification and an
unrelated text. It picked your two pieces as being semantically
closely related, compared to the third piece:

Latent Semantic Analysis at the University of Colorado Boulder
http://lsa.colorado.edu/
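The final comparison step in LSA is a cosine similarity between document vectors. That step can be sketched in plain Python as follows; full LSA first reduces a large term-document matrix with a singular value decomposition, which is omitted here for brevity:

```python
from collections import Counter
from math import sqrt

def term_vector(text: str) -> Counter:
    """Raw term-frequency vector for a text."""
    return Counter(text.lower().split())

def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the two term vectors (1.0 = identical direction)."""
    va, vb = term_vector(a), term_vector(b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(c * c for c in va.values())) * \
           sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0
```

Without the SVD step this is only a word-overlap measure; the SVD is what lets LSA relate documents that share meaning but not vocabulary.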

If you want something more down-to-earth and less cutting-edge, you
could consider Dick Grune's software and text similarity tester. This
has been in use at the Vrije Universiteit Amsterdam for a number of
years to compare submitted assignments and check for copying. It compares
texts in programming languages or in natural language. It is available
for free as C source code and as DOS binaries:

The software and text similarity tester SIM
http://www.cs.vu.nl/~dick/sim.html

The SIM algorithms are available as pseudocode:

Concise Report on the Algorithms in SIM
ftp://ftp.cs.vu.nl/pub/dick/similarity_tester/TechnReport
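Python's standard library offers a related run-matching facility. The following is not SIM's algorithm, but `difflib.SequenceMatcher` likewise rewards long shared runs of tokens, so it gives a quick feel for this style of comparison:

```python
from difflib import SequenceMatcher

def token_run_similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] based on matching runs of whitespace-separated tokens."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()
```

For plagiarism detection, run-based measures are preferred over bag-of-words measures because reordering copied passages barely disturbs the runs.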

The above applications are designed to work with plain text files,
whereas you also wish to work with HTML files. Two approaches are
available:

1. Treat the HTML file as if it were a text file, in which case
   your comparisons will be based on similarity in the HTML markup
   as well as similarity in the text, or

2. Convert each HTML file to a text file (on the fly if necessary).
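The second approach can be prototyped with Python's standard library before reaching for a browser-based tool. This is a minimal sketch: it drops script and style content and collapses whitespace, but does not attempt the layout-aware rendering that lynx performs:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text content, skipping <script> and <style> blocks."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(" ".join(parser.parts).split())

html_to_text("<p>Hello <b>world</b></p><script>var x;</script>")
# → "Hello world"
```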

An easy way to convert HTML to text is to use a text-mode web browser
that has a command-line "dump to file" option. The lynx browser can do
this using the "-dump" option:

Lynx Information
http://www.lynx.browser.org/

You can try out the HTML-to-text functionality of the lynx browser
online. Given a URL it will display the text content of that web page:

Lynx Viewer
http://www.delorie.com/web/lynxview.html

I trust that this answer provides the information that you are
seeking. Please request clarification if this answer does not yet meet
your needs.


Google Search Strategy:

"compare two text files" "index of similarity"
http://www.google.com/search?q=%22compare+two+text+files%22+%22index+of+similarity%22

text similarity comparing OR compare OR comparison
http://www.google.com/search?q=text+similarity+comparing+OR+compare+OR+comparison

"text mode browser"
http://www.google.com/search?q=%22text+mode+browser%22


Regards,
eiffel-ga
 