Text/HTML Similarity Algorithms

Answer  
Subject: Re: Text/HTML Similarity Algorithms
Answered By: eiffel-ga on 29 Apr 2004 05:48 PDT
Rated:5 out of 5 stars
 
Hi nathanntg,

Using computers for sophisticated similarity analysis of text documents
is an area of active research interest, but straightforward similarity
measurement is a "solved problem".

Jack Lynch, of the University of Pennsylvania, has written a web page
that provides an overview of algorithms for text comparison, working
from the simplest to the most sophisticated. The methods include:

   -  Measuring shared letters (discarded as being not useful)
   -  Measuring shared runs of letters
   -  Measuring shared words
   -  Measuring shared runs of words
   -  Measuring shared runs of word stems
   -  Measuring shared vocabularies
   -  Measuring shared meanings
   -  Measuring shared syntax, even when vocabulary differs

Lynch discusses the advantages and disadvantages for each algorithm,
and provides open source code in C (for measuring shared runs of
letters) and in Perl (for measuring shared vocabularies):

Text Analysis with Compare
http://www.english.upenn.edu/~jlynch/Computing/compare.html
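The shared-vocabulary measure from Lynch's list can be sketched in a few lines of Python (an illustrative reimplementation of the idea, not his Perl code):

```python
def vocabulary_similarity(text_a: str, text_b: str) -> float:
    """Jaccard similarity of the two texts' word sets:
    shared distinct words divided by total distinct words."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty texts are trivially identical
    return len(words_a & words_b) / len(words_a | words_b)

# Example: 4 shared words out of 6 distinct
vocabulary_similarity("the cat sat on the mat",
                      "the cat sat on the hat")
```

A production version would also strip punctuation and possibly stem words, which moves it toward the "shared runs of word stems" measure.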

Lynch dismisses as impractical the measuring of shared meanings,
because it "would involve not only the most powerful processor, but an
almost unimaginably large database -- the sort of thing that would
require a large team working full-time for years".

But he was writing in 1995, and since then processors have become more
powerful, databases have become larger, and a large team has been
working for years...

A paper published by Michael Lee, Brandon Pincombe and Matthew Welsh
of the University of Adelaide discusses text similarity comparison
using word-based, n-gram and Latent Semantic Analysis methods. They
correlate the results of these methods with human similarity
judgements and find that the machine comparisons are useful, though
they fall short of human judgement:

A Comparison of Machine Measures of Document Similarity with Human Judgements
http://www.psychology.adelaide.edu.au/members/staff/michaellee/homepage/lee_pincombe_welsh.pdf
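To illustrate the n-gram family of measures compared in that paper, here is a minimal character-trigram similarity in Python (my own sketch using the Dice coefficient, not the authors' code):

```python
def ngrams(text: str, n: int = 3) -> set:
    """Set of overlapping character n-grams, whitespace-normalised."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Dice coefficient over the two texts' n-gram sets."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))
```

Character n-grams are robust to small spelling differences, which is one reason such measures correlate reasonably well with human judgements on noisy text.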

Further practical work on document comparison using Latent Semantic
Analysis has been done since then. Thomas Landauer describes LSA in
detail in the following article, which doesn't include a complete
algorithm but does contain a complete textual description of the
method from which an algorithm can be derived:

Chapter 13. Landauer - The computational basis of learning
http://lsa.colorado.edu/papers/Ross-final-submit.pdf

Web applications have been written to demonstrate that it is possible
to implement the concepts of LSA described in the above paper. I found
the "One to many comparison" application to be most interesting. I
entered the text of your question as the main text. For the texts to
compare against, I entered the text of your clarification and an
unrelated text. It picked your two pieces as being semantically
closely related, compared to the third piece:

Latent Semantic Analysis at the University of Colorado Boulder
http://lsa.colorado.edu/
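The final comparison step in LSA is a cosine similarity between document vectors. That step can be sketched in plain Python as follows; full LSA first reduces a large term-document matrix with a singular value decomposition, which is omitted here for brevity:

```python
from collections import Counter
from math import sqrt

def term_vector(text: str) -> Counter:
    """Raw term-frequency vector for a text."""
    return Counter(text.lower().split())

def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the two term vectors (1.0 = identical direction)."""
    va, vb = term_vector(a), term_vector(b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(c * c for c in va.values())) * \
           sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0
```

Without the SVD step this is only a word-overlap measure; the SVD is what lets LSA relate documents that share meaning but not vocabulary.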

If you want something more down-to-earth and less cutting-edge, you
could consider Dick Grune's software and text similarity tester. This
has been in use at the Vrije Universiteit Amsterdam for a number of
years to compare submitted assignments and check for copying. It compares
texts in programming languages or in natural language. It is available
for free as C source code and as DOS binaries:

The software and text similarity tester SIM
http://www.cs.vu.nl/~dick/sim.html

The SIM algorithms are available as pseudocode:

Concise Report on the Algorithms in SIM
ftp://ftp.cs.vu.nl/pub/dick/similarity_tester/TechnReport
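Python's standard library offers a related run-matching facility. The following is not SIM's algorithm, but `difflib.SequenceMatcher` likewise rewards long shared runs of tokens, so it gives a quick feel for this style of comparison:

```python
from difflib import SequenceMatcher

def token_run_similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] based on matching runs of whitespace-separated tokens."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()
```

For plagiarism detection, run-based measures are preferred over bag-of-words measures because reordering copied passages barely disturbs the runs.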

The above applications are designed to work with plain text files,
whereas you also wish to work with HTML files. Two approaches are
available:

1. Treat the HTML file as if it were a text file, in which case
   your comparisons will be based on similarity in the HTML markup
   as well as similarity in the text, or

2. Convert each HTML file to a text file (on the fly if necessary).
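The second approach can be prototyped with Python's standard library before reaching for a browser-based tool. This is a minimal sketch: it drops script and style content and collapses whitespace, but does not attempt the layout-aware rendering that lynx performs:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text content, skipping <script> and <style> blocks."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(" ".join(parser.parts).split())

html_to_text("<p>Hello <b>world</b></p><script>var x;</script>")
# → "Hello world"
```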

An easy way to convert HTML to text is to use a text-mode web browser
that has a command-line "dump to file" option. The lynx browser can do
this using the "-dump" option:

Lynx Information
http://www.lynx.browser.org/

You can try out the HTML-to-text functionality of the lynx browser
online. Given a URL it will display the text content of that web page:

Lynx Viewer
http://www.delorie.com/web/lynxview.html

I trust that this answer provides the information that you are
seeking. Please request clarification if this answer does not yet meet
your needs.


Google Search Strategy:

"compare two text files" "index of similarity"
http://www.google.com/search?q=%22compare+two+text+files%22+%22index+of+similarity%22

text similarity comparing OR compare OR comparison
http://www.google.com/search?q=text+similarity+comparing+OR+compare+OR+comparison

"text mode browser"
http://www.google.com/search?q=%22text+mode+browser%22


Regards,
eiffel-ga
 