Keywords: Automatic Data Extraction

Many web sites contain large sets of pages generated using a common template or layout. For example, Amazon lays out the author, title, comments, etc. in the same way in all its book pages. The values used to generate the pages (e.g., the author, title, ...) typically come from a database. In this paper, we study the problem of automatically extracting the database values from the web pages without any learning examples or other similar human input. We formally define the notion of a template, and propose a model that describes how values are encoded into pages using a template. We present an extraction algorithm that uses sets of words that have similar occurrence patterns in the input pages to construct the template. The constructed template is then used to extract values from the pages. We show experimentally that the extracted values make semantic sense in most cases.

For more information, please visit our website: http://www.knowlesys.com
Extracting Structured Data from Web Pages
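This article examines the problem of automatically extracting data from web sites whose pages share a common template layout. It proposes an approach that requires no human input: the template is built by analyzing sets of words that have similar occurrence patterns across the pages, and that template is then used to extract the database values from the web pages.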
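The idea of grouping words by occurrence pattern can be illustrated with a small sketch. This is a simplified illustration under stated assumptions, not the paper's extraction algorithm: pages are assumed to be whitespace-tokenized strings, "similar occurrence pattern" is narrowed to "identical per-page occurrence counts", and the names `occurrence_classes`, `extract_values`, `min_size`, and `min_support` are hypothetical.

```python
from collections import Counter, defaultdict


def occurrence_classes(pages, min_size=2, min_support=2):
    """Group tokens whose per-page occurrence counts are identical.

    Assumption: a token group is treated as template material only if it
    has at least `min_size` tokens and occurs in at least `min_support` pages.
    """
    counts = [Counter(page.split()) for page in pages]
    vocabulary = set().union(*counts)
    groups = defaultdict(list)
    for token in vocabulary:
        # Occurrence vector: how often the token appears in each page.
        vector = tuple(c[token] for c in counts)
        groups[vector].append(token)
    return {
        vector: sorted(tokens)
        for vector, tokens in groups.items()
        if len(tokens) >= min_size
        and sum(1 for n in vector if n > 0) >= min_support
    }


def extract_values(page, template_tokens):
    """Return the tokens of a page that are not part of the template."""
    return [tok for tok in page.split() if tok not in template_tokens]


# Toy input: three pages generated from the same hypothetical layout.
pages = [
    "<b> Title: </b> Dune <i> Author: </i> Herbert",
    "<b> Title: </b> Emma <i> Author: </i> Austen",
    "<b> Title: </b> Ulysses <i> Author: </i> Joyce",
]

classes = occurrence_classes(pages)
template = set(classes[(1, 1, 1)])
values = [extract_values(page, template) for page in pages]
```

On this toy input the only surviving class is the set of layout tokens that occur exactly once in every page, and the remaining tokens of each page are the candidate database values. Real pages would need an HTML-aware tokenizer and a way to distinguish different roles of the same token, which this sketch does not attempt.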



