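Two days ago on Railscasts I watched the screencast "Screen Scraping with ScrAPI", which introduces the Ruby library scrAPI and walks through an example of scraping content from other websites by working with the HTML DOM. The whole approach is simple and effective. scrAPI's HTML parsing works much like jQuery selectors: a selector such as html>body>div#container>div#articles>div.item>div.title can be used to pick elements out of an HTML structure like the one below.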
[Code] HTML code
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta http-equiv="Content-type" content="text/html; charset=utf-8">
  <title>This is the scrAPI collect demo page</title>
</head>
<body>
  <div id="header">
    <h1 id="demo">Demo</h1>
    <ul id="nav">
      <li><a href="/">Home</a></li>
      <li class="current"><a href="/articles">Articles</a></li>
      <li><a href="/about">About</a></li>
    </ul>
  </div>
  <div id="container">
    <div id="articles">
      <div class="item">
        <div class="title">
          <a href="/articles/show/1">Sample article title 1</a>
        </div>
        <div class="summary">
          There is the summary text 1.
        </div>
      </div>
      <div class="item">
        <div class="title">
          <a href="/articles/show/1">Sample article title 2</a>
        </div>
        <div class="summary">
          There is the summary text 2.
        </div>
      </div>
      <div class="item">
        <div class="title">
          <a href="/articles/show/1">Sample article title 3</a>
        </div>
        <div class="summary">
          There is the summary text 3.
        </div>
      </div>
    </div>
  </div>
</body>
</html>
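To make the link between the selector and the markup above more concrete, here is a minimal sketch in plain Ruby that splits the selector path from the introduction into one step per nesting level. It does not call scrAPI and is not how scrAPI works internally; the `Step` struct and the `parse_selector_path` helper are hypothetical names introduced only for this illustration, and the sketch handles nothing beyond `tag`, `tag#id`, and `tag.class` steps joined by `>`.

```ruby
# Illustrative only: NOT scrAPI's implementation or API.
# Step and parse_selector_path are hypothetical names for this sketch.
Step = Struct.new(:tag, :id, :klass)

# Splits a child-combinator selector ("a>b#id>c.class") into one Step
# per nesting level. Supports only tag, tag#id and tag.class steps.
def parse_selector_path(selector)
  selector.split(">").map do |part|
    m = part.strip.match(/\A([a-z][a-z0-9]*)(?:#([\w-]+))?(?:\.([\w-]+))?\z/i)
    raise ArgumentError, "unsupported selector step: #{part.inspect}" unless m
    Step.new(m[1], m[2], m[3])
  end
end

path  = "html>body>div#container>div#articles>div.item>div.title"
steps = parse_selector_path(path)

# Print the steps indented by depth, mirroring the nesting of the HTML above.
steps.each_with_index do |step, depth|
  label = step.tag.dup
  label << " id=#{step.id}" if step.id
  label << " class=#{step.klass}" if step.klass
  puts "#{'  ' * depth}#{label}"
end
```

Each printed line corresponds to one element on the way down from `<html>` to the `<div class="title">` blocks. Since the page contains three `div.item` elements, the last two steps match three times, once per article.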