This note introduces a method for extracting the post body and the comments from a blog page.
Donglin Cao, Xiangwen Liao, Hongbo Xu, Shuo Bai. Blog Post and Comment Extraction Using Information Quantity of Web Format. In Proceedings of the 2008 Asia Information Retrieval Symposium (AIRS-2008), January 15-18, 2008, Harbin, China.
Abstract: With the development of the research on blogosphere, acquiring the post and comment from blog page becomes more important in improving the search performance. In this paper, we present a two-stage method. First, we combine the advantage of the vision information and the effective text information to locate the main text which represents the theme of blog page. Second, we use the information quantity of separator to detect the boundary between the post and comment. According to our experiments, this method achieves a good performance in extraction and improves the performance of blog search.
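In short: the paper proposes a two-stage method for accurately extracting the post and the comment sections from blog pages. It first uses visual information and text information to locate the main text, then detects the boundary between the post and the comments from the information quantity of the separators. The experiments indicate that the method improves blog search performance.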
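The abstract does not spell out how the "information quantity of separator" is computed, so the following is only a rough sketch of the second stage under stated assumptions, not the authors' algorithm. It assumes the main text has already been split into blocks, that each pair of neighbouring blocks is divided by one separator (an HTML tag name), and that information quantity means the Shannon self-information `-log2(p)` of that separator within the page. The function names and the sample data are hypothetical.

```python
import math
from collections import Counter


def separator_information(separators):
    """Map each distinct separator to its self-information -log2(p).

    ASSUMPTION: p is the relative frequency of the separator on this page;
    the paper may define information quantity differently.
    """
    counts = Counter(separators)
    total = len(separators)
    return {sep: -math.log2(n / total) for sep, n in counts.items()}


def find_boundary(separators):
    """Return the index of the most informative (rarest) separator.

    On ties, the first such separator wins.
    """
    info = separator_information(separators)
    return max(range(len(separators)), key=lambda i: info[separators[i]])


def split_post_and_comments(blocks, separators):
    """Split blocks at the boundary separator.

    separators[i] is the separator between blocks[i] and blocks[i + 1].
    """
    if len(blocks) != len(separators) + 1:
        raise ValueError("expected exactly one separator between adjacent blocks")
    i = find_boundary(separators)
    return blocks[: i + 1], blocks[i + 1:]


# Hypothetical page: four post paragraphs, a rule, then five comment items.
blocks = ["p1", "p2", "p3", "p4", "c1", "c2", "c3", "c4", "c5"]
separators = ["p", "p", "p", "hr", "li", "li", "li", "li"]

post, comments = split_post_and_comments(blocks, separators)
print(post)
print(comments)
```

The intuition behind this sketch is that separators repeated inside the post or inside the comment list carry little information, while the one that divides the two regions tends to be rare on the page. A real page would need a more careful separator model than a single tag name.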