Scrapy LinkExtractor

Using a LinkExtractor is very simple. Let's walk through an example: replacing the Selector in BooksSpider with a LinkExtractor to extract links. The code is as follows:

Python
import scrapy
from scrapy.linkextractors import LinkExtractor

class BooksSpider(scrapy.Spider):
    ...
    def parse(self, response):
        ...
        # Extract the next-page link. It lives under ul.pager > li.next > a,
        # e.g. <li class="next"><a href="catalogue/page-2.html">next</a></li>
        le = LinkExtractor(restrict_css='ul.pager li.next')
        links = le.extract_links(response)
        if links:
            next_url = links[0].url
            yield scrapy.Request(next_url, callback=self.parse)

The code above is explained as follows:

● Import LinkExtractor, which lives in the scrapy.linkextractors module.

● Create a LinkExtractor object, describing the extraction rule with one or more constructor parameters. Here a CSS selector expression is passed to the restrict_css parameter; it describes the region containing the next-page link (under li.next).

● Call the LinkExtractor object's extract_links method with a Response object. The method extracts links from the page contained in the Response according to the rule described at construction time, and returns a list whose elements are Link objects, one per extracted link.

● Since the page has only one next-page link, links[0] retrieves the Link object. A Link object's url attribute is already the absolute url of the linked page (no need to call response.urljoin), so it is used to build a Request object, which is then yielded.

This example should make the workflow of extracting links with a LinkExtractor clear.
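
Incidentally, a Link object carries a few attributes besides url (scrapy.link.Link also exposes text, fragment, and nofollow). A minimal interactive sketch, assuming response holds the catalogue page of http://books.toscrape.com fetched by the spider above:

Python
>>> le = LinkExtractor(restrict_css='ul.pager li.next')
>>> link = le.extract_links(response)[0]
>>> link.url     # already an absolute url
'http://books.toscrape.com/catalogue/page-2.html'
>>> link.text    # the anchor text
'next'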

6.2 Describing Extraction Rules

Next, let's learn how to describe extraction rules with the LinkExtractor constructor's parameters.

To have something to experiment with, we first build a small test environment: two HTML pages, each containing several links:

HTML
<!-- example1.html -->
<html>
  <body>
    <div id="top">
      <p>Some internal links:</p>
      <a class="internal" href="/intro/install.html">Installation guide</a>
      <a class="internal" href="/intro/tutorial.html">Tutorial</a>
      <a class="internal" href="../examples.html">Examples</a>
    </div>
    <div id="bottom">
      <p>Some external links:</p>
      <a href="http://stackoverflow.com/tags/scrapy/info">StackOverflow</a>
      <a href="https://github.com/scrapy/scrapy">Fork on Github</a>
    </div>
  </body>
</html>

<!-- example2.html -->
<html>
  <head>
    <script type='text/javascript' src='/js/app1.js'/>
    <script type='text/javascript' src='/js/app2.js'/>
  </head>
  <body>
    <a href="/home.html">Home</a>
    <a href="javascript:goToPage('/doc.html'); return false">Docs</a>
    <a href="javascript:goToPage('/example.html'); return false">Examples</a>
  </body>
</html>

Construct two Response objects from the two HTML documents above:

Python
>>> from scrapy.http import HtmlResponse
>>> html1 = open('example1.html').read()
>>> html2 = open('example2.html').read()
>>> response1 = HtmlResponse(url='http://example1.com', body=html1, encoding='utf8')
>>> response2 = HtmlResponse(url='http://example2.com', body=html2, encoding='utf8')

With the test environment in place, let's first note a special case: every parameter of the LinkExtractor constructor has a default value, and if you construct the object without passing any arguments (all defaults), it extracts every link on the page. The following code extracts all links from example1.html:

Python
>>> from scrapy.linkextractors import LinkExtractor
>>> le = LinkExtractor()
>>> links = le.extract_links(response1)
>>> [link.url for link in links]
['http://example1.com/intro/install.html',
 'http://example1.com/intro/tutorial.html',
 'http://example1.com/../examples.html',
 'http://stackoverflow.com/tags/scrapy/info',
 'https://github.com/scrapy/scrapy']

The parameters of the LinkExtractor constructor are described below, one by one:

● allow

Accepts a regular expression or a list of regular expressions, and extracts the links whose absolute urls match; if the parameter is empty (the default), all links are extracted.

Example: extract the links in example1.html whose paths begin with /intro:

Python
>>> from scrapy.linkextractors import LinkExtractor
>>> pattern = r'/intro/.+\.html$'
>>> le = LinkExtractor(allow=pattern)
>>> links = le.extract_links(response1)
>>> [link.url for link in links]
['http://example1.com/intro/install.html',
 'http://example1.com/intro/tutorial.html']

● deny

Accepts a regular expression or a list of regular expressions; the opposite of allow, it excludes the links whose absolute urls match.

Example: extract all external links from example1.html (i.e. exclude internal links):

Python
>>> from scrapy.linkextractors import LinkExtractor
>>> from urllib.parse import urlparse
>>> pattern = '^' + urlparse(response1.url).geturl()
>>> pattern
'^http://example1.com'
>>> le = LinkExtractor(deny=pattern)
>>> links = le.extract_links(response1)
>>> [link.url for link in links]
['http://stackoverflow.com/tags/scrapy/info',
 'https://github.com/scrapy/scrapy']

● allow_domains

Accepts a domain or a list of domains, and extracts only the links pointing to the specified domains.

Example: extract the links in example1.html that point to the github.com and stackoverflow.com domains:

Python
>>> from scrapy.linkextractors import LinkExtractor
>>> domains = ['github.com', 'stackoverflow.com']
>>> le = LinkExtractor(allow_domains=domains)
>>> links = le.extract_links(response1)
>>> [link.url for link in links]
['http://stackoverflow.com/tags/scrapy/info',
 'https://github.com/scrapy/scrapy']

● deny_domains

Accepts a domain or a list of domains; the opposite of allow_domains, it excludes the links pointing to the specified domains.

Example: extract the links in example1.html except those pointing to the github.com domain:

Python
>>> from scrapy.linkextractors import LinkExtractor
>>> le = LinkExtractor(deny_domains='github.com')
>>> links = le.extract_links(response1)
>>> [link.url for link in links]
['http://example1.com/intro/install.html',
 'http://example1.com/intro/tutorial.html',
 'http://example1.com/../examples.html',
 'http://stackoverflow.com/tags/scrapy/info']

● restrict_xpaths

Accepts an XPath expression or a list of XPath expressions, and extracts the links inside the regions selected by the expressions.

Example: extract the links under the <div id="top"> element in example1.html:

Python
>>> from scrapy.linkextractors import LinkExtractor
>>> le = LinkExtractor(restrict_xpaths='//div[@id="top"]')
>>> links = le.extract_links(response1)
>>> [link.url for link in links]
['http://example1.com/intro/install.html',
 'http://example1.com/intro/tutorial.html',
 'http://example1.com/../examples.html']

● restrict_css

Accepts a CSS selector or a list of CSS selectors, and extracts the links inside the regions selected by the selectors.

Example: extract the links under the <div id="bottom"> element in example1.html:

Python
>>> from scrapy.linkextractors import LinkExtractor
>>> le = LinkExtractor(restrict_css='div#bottom')
>>> links = le.extract_links(response1)
>>> [link.url for link in links]
['http://stackoverflow.com/tags/scrapy/info',
 'https://github.com/scrapy/scrapy']

● tags

Accepts a tag (string) or a list of tags, and extracts links from the specified tags; defaults to ['a', 'area'].

● attrs

Accepts an attribute (string) or a list of attributes, and extracts links from the specified attributes; defaults to ['href'].

Example: extract the links to the JavaScript files referenced by example2.html:

Python
>>> from scrapy.linkextractors import LinkExtractor
>>> le = LinkExtractor(tags='script', attrs='src')
>>> links = le.extract_links(response2)
>>> [link.url for link in links]
['http://example2.com/js/app1.js',
 'http://example2.com/js/app2.js']

● process_value

Accepts a callback function of the form func(value). If given, LinkExtractor calls it on every extracted link value (e.g. an a tag's href). Normally the callback returns a string (the processed result); to discard the link being processed, it returns None.

Example: in example2.html, some a tags' href attributes hold a snippet of JavaScript code that embeds the actual url of the linked page, so the values need processing. Extract all the actual links in example2.html:

Python
>>> import re
>>> def process(value):
...     m = re.search(r"javascript:goToPage\('(.*?)'", value)
...     # If it matches, extract and return the embedded url;
...     # otherwise return the value unchanged.
...     if m:
...         value = m.group(1)
...     return value
...
>>> from scrapy.linkextractors import LinkExtractor
>>> le = LinkExtractor(process_value=process)
>>> links = le.extract_links(response2)
>>> [link.url for link in links]
['http://example2.com/home.html',
 'http://example2.com/doc.html',
 'http://example2.com/example.html']
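
The description above also mentions returning None to discard a link. A hypothetical variant of process (not from the original text) that drops any javascript: href it cannot parse, instead of passing it through:

Python
>>> def process_strict(value):
...     # Same matching as process above, but an unparseable
...     # javascript: href is discarded by returning None.
...     m = re.search(r"javascript:goToPage\('(.*?)'", value)
...     if m:
...         return m.group(1)
...     if value.startswith('javascript:'):
...         return None   # discard this link
...     return value
...
>>> le = LinkExtractor(process_value=process_strict)

On example2.html the result is the same, since every javascript: href there matches the pattern; the difference would only show on a value such as javascript:void(0).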

 

This completes the tour of the LinkExtractor constructor's parameters. In practice, an extraction rule can combine several of them at once.
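
As a final illustration (a minimal sketch, not from the original text), restricting extraction to div#top while also filtering urls with allow yields the intersection of the two rules:

Python
>>> from scrapy.linkextractors import LinkExtractor
>>> le = LinkExtractor(restrict_css='div#top', allow=r'/intro/.+\.html$')
>>> links = le.extract_links(response1)
>>> [link.url for link in links]
['http://example1.com/intro/install.html',
 'http://example1.com/intro/tutorial.html']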



