Two extractors couldn't work together

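The author ran into a problem when using Scrubyt to scrape URLs from Google and Yahoo. When both search engines are scraped in the same script, the extractor defined later appears to be affected by the one defined earlier, which leads to incomplete results. The problem appears to be related to the order in which the XPath selectors are used.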


Hi, everyone,
  I have been enjoying Scrubyt for days, and it has worked great in most cases. However, a problem came up when I scraped URLs from Google and Yahoo in the same script. Here is my code:

 

require 'rubygems'
require 'scrubyt'

Scrubyt.logger = Scrubyt::Logger.new
query = 'ruby'

google_data = Scrubyt::Extractor.define do
  fetch 'http://www.google.com/ncr'
  fill_textfield 'q', query
  submit

  # retrieve the result links by XPath
  title "/html/body/div/div/div/a" do
    url "href", :type => :attribute
  end
end # end of Google extractor

# write the Google results to google.xml
google_file = File.open("google.xml", "w")
google_data.to_xml.write(google_file, 1)
google_file.close

yahoo_data = Scrubyt::Extractor.define do
  fetch 'http://search.yahoo.com'
  fill_textfield 'p', query
  submit

  # retrieve the result links by XPath
  title "/html/body/div/div/div/div/div/div/div/ol/li/div/h3/a" do
    url "href", :type => :attribute
  end
end # end of Yahoo extractor

# write the Yahoo results to yahoo.xml
yahoo_file = File.open("yahoo.xml", "w")
yahoo_data.to_xml.write(yahoo_file, 1)
yahoo_file.close

 

 

Running Environment: Ubuntu 7.04 + Netbeans 6.0 + Scrubyt

 

google.xml

<root>
    <title>
      <url>http://www.ruby-lang.org/</url>
    </title>
    <title>
      <url>http://www.ruby-lang.org/en/20020101.html</url>
    </title>
    ...
</root>

 

 

yahoo.xml

<root>
    <title>
      <url>http://rds.yahoo.com/_ylt=A0oGklhqbodHe08AchtXNyoA;_ylu=X3oDMTE5MXY5dDllBHNlYwNzcgRwb3MDMQRjb2xvA3NrMQR2dGlkA1lTMTk4XzgyBGwDV1Mx/SIG=11ff2e34s/EXP=1200144362/**http%3a//www.ruby-lang.org/en</url>
    </title>
    <title>
      <url>http://rds.yahoo.com/_ylt=A0oGklhqbodHe08AdBtXNyoA;_ylu=X3oDMTE5cHJpN25qBHNlYwNzcgRwb3MDMgRjb2xvA3NrMQR2dGlkA1lTMTk4XzgyBGwDV1Mx/SIG=12aq03736/EXP=1200144362/**http%3a//en.wikipedia.org/wiki/Ruby_programming_language</url>
    </title>
       ...
</root>
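
The Yahoo links above are redirect URLs on rds.yahoo.com, and the destination seems to sit after the final `/**` marker in percent-encoded form. That layout is only inferred from the two results shown, so the sketch below is an assumption rather than a documented Yahoo format. The helper name `yahoo_target` is hypothetical and not part of Scrubyt; it uses only Ruby's standard `uri` library.

```ruby
require 'uri'

# Hypothetical helper (not part of Scrubyt).
# Assumption: the real destination follows the last "/**" marker
# and is percent-encoded, as in the yahoo.xml output above.
def yahoo_target(url)
  marker = '/**'
  idx = url.rindex(marker)
  return url if idx.nil?          # no marker: treat it as a plain URL

  encoded = url[(idx + marker.length)..-1]
  URI.decode_www_form_component(encoded)
end

wrapped = 'http://rds.yahoo.com/_ylt=A0oGklhqbodHe08AchtXNyoA;' \
          '_ylu=X3oDMTE5MXY5dDllBHNlYwNzcgRwb3MDMQRjb2xvA3NrMQR2dGlkA1lTMTk4XzgyBGwDV1Mx' \
          '/SIG=11ff2e34s/EXP=1200144362/**http%3a//www.ruby-lang.org/en'

puts yahoo_target(wrapped)   # prints http://www.ruby-lang.org/en
```

A URL without the marker is returned unchanged, so the same helper can be applied to every `url` value regardless of which form Yahoo returned.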

 

If I switch the order of the two extractors, that is, define the Yahoo extractor first, the results change:


google.xml

<root/>

 

yahoo.xml

<root>
    <title>
      <url>http://www.ruby-lang.org/en</url>
    </title>
    <title>
      <url>http://en.wikipedia.org/wiki/Ruby_programming_language</url>
    </title>
    .....
</root>

 

It seems that the later extractor is influenced by the earlier one. Since the XPath I used for Yahoo is longer than the one for Google, the result from Google is empty when the Yahoo extractor is defined first.
  Why is that, and how can I overcome this problem? Thanks in advance.
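
One way to check the suspicion that state carries over between extractors is to run each definition in its own child process, so nothing held in memory can be shared. The sketch below is only a diagnostic idea, not a confirmed fix or a Scrubyt feature. The helper name `run_isolated` is hypothetical, and it assumes a platform where `Process.fork` is available (such as the Ubuntu setup above).

```ruby
# Hypothetical helper: runs the given block in a forked child process
# and returns true if the child finished without raising.
# Assumption: Process.fork is available (Linux/Unix, not Windows).
def run_isolated
  pid = Process.fork do
    begin
      yield
      exit!(0)                 # leave the child without running at_exit hooks
    rescue StandardError => e
      warn "child failed: #{e.message}"
      exit!(1)
    end
  end
  _, status = Process.wait2(pid)
  status.success?
end

# Intended use with the script above (left as comments, since it needs Scrubyt):
#   run_isolated { ...define the Google extractor and write google.xml... }
#   run_isolated { ...define the Yahoo extractor and write yahoo.xml... }
```

If both files come out complete when the extractors run in separate processes, regardless of order, that would support the idea that the first definition leaves state behind for the second.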
