Abstract—Web pages are often cluttered with extraneous information such as navigation bars, branding banners, JavaScript, and advertisements. This information can distract users from the content they are actually interested in and can reduce the effectiveness of many advanced web applications. Automatic content extraction has many applications, ranging from providing data for web mining to improving access to the web on mobile devices. In this paper, we propose ContentEx, a framework for automatic content extraction programs, which we use to organize the code of such programs and to facilitate the development of related solutions. We also describe how we extract content from forum pages within this framework to meet the requirements of our own application.
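Summary: This paper proposes ContentEx, a framework for automatic web content extraction, intended to help developers organize and develop automatic content extraction programs. The framework removes redundant information from web pages, such as navigation bars and advertisements, so that users can focus on the content they are actually interested in. The paper also describes how the framework is used to extract useful information from forum pages.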




