Data Scraping with Ruby scrAPI

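This article introduces Ruby scrAPI, a library that scrapes content from other websites by working with the HTML DOM. An example shows how scrAPI selectors address elements in a nested HTML structure so that the elements you need can be extracted.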

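The day before yesterday I watched a Railscasts video tutorial, "Screen Scraping with ScrAPI", which introduces the Ruby scrAPI library. It shows how to scrape content from other websites by working with the HTML DOM, and the whole approach is simple and effective. scrAPI's HTML parsing works much like jQuery selectors: a selector such as html>body>div#container>div#articles>div.item>div.title can be used to pick elements out of an HTML structure like the following: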

[Code] HTML code

 
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta http-equiv="Content-type" content="text/html; charset=utf-8">
  <title>This is the scrAPI collect demo page</title>
</head>
<body>
  <div id="header">
    <h1 id="demo">Demo</h1>
    <ul id="nav">
      <li><a href="/">Home</a></li>
      <li class="current"><a href="/articles">Articles</a></li>
      <li><a href="/about">About</a></li>
    </ul>
  </div>
  <div id="container">
    <div id="articles">
      <div class="item">
        <div class="title">
          <a href="/articles/show/1">Sample article title 1</a>
        </div>
        <div class="summary">
          There is the summary text 1.
        </div>
      </div>
      <div class="item">
        <div class="title">
          <a href="/articles/show/1">Sample article title 2</a>
        </div>
        <div class="summary">
          There is the summary text 2.
        </div>
      </div>
      <div class="item">
        <div class="title">
          <a href="/articles/show/1">Sample article title 3</a>
        </div>
        <div class="summary">
          There is the summary text 3.
        </div>
      </div>
    </div>
  </div>
</body>
</html>
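
To make it concrete what a selector chain like div#container>div#articles>div.item>div.title picks out of this page, here is a minimal sketch. Note that it does not use scrAPI itself: as a stand-in it uses REXML (bundled with Ruby) and an XPath expression that mirrors the selector step by step. The names `SAMPLE_HTML`, `ITEM_PATH`, and `extract_articles` are my own for illustration, and the markup is a trimmed, well-formed copy of the article list above, since REXML is an XML parser and would not accept the unclosed `<meta>` tag.

```ruby
require 'rexml/document'

# Trimmed, well-formed copy of the article list from the sample page
# (illustrative; only the div#container subtree is kept).
SAMPLE_HTML = <<~XML
  <div id="container">
    <div id="articles">
      <div class="item">
        <div class="title"><a href="/articles/show/1">Sample article title 1</a></div>
        <div class="summary">
          There is the summary text 1.
        </div>
      </div>
      <div class="item">
        <div class="title"><a href="/articles/show/1">Sample article title 2</a></div>
        <div class="summary">
          There is the summary text 2.
        </div>
      </div>
      <div class="item">
        <div class="title"><a href="/articles/show/1">Sample article title 3</a></div>
        <div class="summary">
          There is the summary text 3.
        </div>
      </div>
    </div>
  </div>
XML

# XPath counterpart of the CSS chain div#container>div#articles>div.item.
# Unlike a CSS class selector, @class='item' only matches an exact
# attribute value, which is enough for this sample.
ITEM_PATH = "/div[@id='container']/div[@id='articles']/div[@class='item']"

# Returns one hash per article with its title, link and summary.
def extract_articles(markup)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, ITEM_PATH).map do |item|
    link    = REXML::XPath.first(item, "div[@class='title']/a")
    summary = REXML::XPath.first(item, "div[@class='summary']")
    {
      title:   link.text.strip,
      url:     link.attributes['href'],
      summary: summary.text.strip
    }
  end
end

ARTICLES = extract_articles(SAMPLE_HTML)
ARTICLES.each { |a| puts "#{a[:title]} (#{a[:url]}): #{a[:summary]}" }
```

Each step of the path narrows the search to the direct children of the previous match, which is the same idea as the `>` combinator in the selector. With scrAPI you would write the CSS selector directly instead of the XPath.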