Python, Week 2 (Day 10): My Python Growth Diary, Mastering Python Data Mining in One Month! (18) -mongodb...

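This post walks through selectors in the Scrapy framework, covering both XPath and CSS selectors. The examples show how to extract the text of page elements, match data with regular expressions, and work with relative XPath paths.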


1. First, import the Selector class
from scrapy.selector import Selector

2. Using selectors
Example: response.selector.xpath('//span/text()').extract()

    (1) Select the text content of the title tag
        response.selector.xpath('//title/text()')
        Two simpler shortcuts are available:
            response.xpath('//title/text()')
            response.css('title::text')
        Examples:
            response.css('img').xpath('@src').extract()
            response.xpath('//div[@id="images"]/a/text()').extract_first()
            response.xpath('//div[@id="not-exists"]/text()').extract_first(default='not-found')
    (2) Matching with regular expressions
        response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
        response.xpath('//a[contains(@href, "image")]/text()').re_first(r'Name:\s*(.*)')
    (3)Working with relative XPaths
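        divs = response.xpath('//div')
        # './/p' matches every <p> descendant of the selected divs
        for p in divs.xpath('.//p'):
            print(p.extract())
        # 'p' matches only <p> elements that are direct children of the divs
        for p in divs.xpath('p'):
            print(p.extract())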

Official example:
>>> links = response.xpath('//a[contains(@href, "image")]')
>>> links.extract()
['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
 '<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
 '<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
 '<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
 '<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']

>>> for index, link in enumerate(links):
...     args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
...     print('Link number %d points to url %s and image %s' % args)

Link number 0 points to url ['image1.html'] and image ['image1_thumb.jpg']
Link number 1 points to url ['image2.html'] and image ['image2_thumb.jpg']
Link number 2 points to url ['image3.html'] and image ['image3_thumb.jpg']
Link number 3 points to url ['image4.html'] and image ['image4_thumb.jpg']
Link number 4 points to url ['image5.html'] and image ['image5_thumb.jpg']