[Ruby leaked footage] O lost lamb, let Teacher Chen's little gem secretary guide you back home


Tags: #Ruby #jQuery #Firebug #HTML #Blog #ViewUI


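This post describes a way to crawl Sina blog articles with Ruby and the hpricot library: collecting the article links, titles, bodies and publication times, then saving the data to local files.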

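This post is about one particular Sina blog. Last time I promised the user 林子大了什么X都有 that I would help move it over, so here is how it went.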

First, take a look at the page elements with Firebug.

So what patterns can we spot?

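The links to the individual articles are wrapped in a <div> whose class is articleTitle.
The pagination links follow a pattern, running from http://blog.sina.com.cn/s/indexlist_1233551893_1.html to http://blog.sina.com.cn/s/indexlist_1233551893_14.html
The body of each article page sits in the articleBody <div> (the script below selects it by id, not by class).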
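
The pagination pattern can be turned into a list of index-page URLs with a few lines of plain Ruby. This is only a sketch: the helper name index_page_urls and the constant name are made up for illustration, and the page count of 14 is simply the last page number observed above.

```ruby
# URL pattern observed on the blog's index pages; %d is the page number.
INDEX_URL_TEMPLATE = "http://blog.sina.com.cn/s/indexlist_1233551893_%d.html"

# index_page_urls is a hypothetical helper name used only in this sketch.
# It returns the index-page URLs for pages first_page..last_page.
def index_page_urls(last_page, first_page = 1)
  (first_page..last_page).map { |page| format(INDEX_URL_TEMPLATE, page) }
end

urls = index_page_urls(14)
puts urls.first
puts urls.last
```

Keeping the last page number as a parameter means the same helper still works if the blog grows past 14 index pages.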

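Next, a quick peek at web page parsing in Ruby.

Install hpricot, Ruby's HTML parsing library:
gem install hpricot
It is very handy: working with DOM elements is as concise as it is in jQuery. The whole script is just the following: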

#!/usr/bin/ruby
require 'hpricot'
require 'open-uri'
# Note: on Ruby 2.5+ call URI.open instead of open; open-uri's Kernel#open
# for URLs was removed in Ruby 3.0.

article_urls = []

# Pass 1: collect the article links from the 14 index pages.
1.upto(14) do |i|
  doc = Hpricot.parse(open("http://blog.sina.com.cn/s/indexlist_1233551893_#{i}.html"))
  (doc/"div[@class='articleTitle']/a[@target='_blank']").each do |f|
    article_urls << f.attributes['href']
  end
end

# Pass 2: fetch each article and save its title, time and body to a numbered file.
article_urls.each_with_index do |url, i|
  index = i + 1
  puts "now crawling: " + url
  doc = Hpricot.parse(open(url))
  title = (doc/"div[@class='articleTitle']/div/b").first.inner_html
  content = (doc/"div[@id='articleBody']").first.inner_html
  time = (doc/"span[@class='time']").first.inner_html

  # The block form closes the file even if one of the writes raises.
  File.open("D:\\sina_mockee\\#{index}.txt", "w") do |file|
    file.puts title
    file.puts time
    file.puts content
  end
end

puts "#{article_urls.length} URLs crawled."
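
To see what the link selector picks up without hitting the network, the same XPath-style expression can be tried on a small inline snippet. The sketch below swaps hpricot for REXML (which ships with Ruby) purely so that it runs without extra gems; the sample markup, the example.com URLs and the helper name extract_article_links are all invented for illustration, and the real Sina pages are not well-formed XML, so REXML would not be a drop-in replacement for the crawler itself.

```ruby
require 'rexml/document'

# Hypothetical, well-formed stand-in for one index page (not real Sina markup).
SAMPLE_INDEX_HTML = <<~HTML
  <html><body>
    <div class="articleTitle"><a target="_blank" href="http://example.com/post/1">First</a></div>
    <div class="articleTitle"><a target="_blank" href="http://example.com/post/2">Second</a></div>
    <div class="sidebar"><a target="_blank" href="http://example.com/ad">Ad</a></div>
  </body></html>
HTML

# extract_article_links is a made-up helper name for this sketch.
# It returns the href of every <a target="_blank"> directly inside
# a <div class="articleTitle">.
def extract_article_links(html)
  doc = REXML::Document.new(html)
  xpath = "//div[@class='articleTitle']/a[@target='_blank']"
  REXML::XPath.match(doc, xpath).map { |a| a.attributes['href'] }
end

puts extract_article_links(SAMPLE_INDEX_HTML)
```

The sidebar link is skipped because its parent <div> does not carry the articleTitle class, which is the same filtering the crawler relies on.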

The results are saved as text files. The next step is loading them into a database, which I won't go into here.
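
As a starting point for that next step, here is a rough sketch of reading the saved files back into records. It assumes the layout the crawler writes (title on line 1, time on line 2, body on the remaining lines), which breaks if a title ever contains a newline. The Article struct and the load_articles helper are hypothetical names, and the demo uses made-up data in a temporary directory instead of the real output folder.

```ruby
require 'tmpdir'

# Assumed record layout: line 1 = title, line 2 = time, the rest = body.
Article = Struct.new(:title, :time, :content)

# load_articles is a hypothetical helper; dir is wherever the .txt files live.
# Files are sorted by their numeric name so that 10.txt comes after 2.txt.
def load_articles(dir)
  paths = Dir.glob(File.join(dir, "*.txt")).sort_by { |p| File.basename(p, ".txt").to_i }
  paths.map do |path|
    title, time, *body = File.read(path).lines.map(&:chomp)
    Article.new(title, time, body.join("\n"))
  end
end

# Demo with invented data in a throwaway directory.
Dir.mktmpdir do |dir|
  File.write(File.join(dir, "1.txt"), "Hello\n2008-01-01 10:00\n<p>first</p>\n<p>more</p>\n")
  File.write(File.join(dir, "2.txt"), "World\n2008-01-02 11:00\n<p>second</p>\n")
  load_articles(dir).each { |a| puts "#{a.time}  #{a.title}" }
end
```

From here each Article could be handed to whatever database layer is in use; that part is left out because the original post does not describe it.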