The core technique for extracting data from a page is HTML text parsing. In Python, the following modules are commonly used for this:
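BeautifulSoup | lxml |
---|---|
A very popular HTML parsing library with a simple, easy-to-use API, but relatively slow parsing. | An XML/HTML parsing library built on the C library libxml2; parsing is faster, but the API is relatively complex. |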
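Scrapy combines the strengths of both in its Selector class, which is built on lxml and offers a simplified API. In Scrapy, page data is extracted with Selector objects: first select the data you need with an XPath or CSS selector, then extract it. The following sections introduce how to use Selector objects.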
I. Selector objects
1.1 Creating a Selector
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
html = '''
<!DOCTYPE html>
<html lang="en">
<head>
<title>Scrapy Study</title>
</head>
<body>
<h1>Hello World</h1>
<h2>ayouleyang</h2>
<b>yangyou</b>
<ul>
<li>Python</li>
<li>Scrapy</li>
<li>html</li>
</ul>
<!-- closing tags are missing -->
'''
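Build a Selector from a Response object by passing it to the response parameter of the Selector constructor. The first argument of HtmlResponse is the page URL, so a placeholder URL is used here:
>>> result = HtmlResponse(url='http://www.example.com', body=html, encoding='utf-8')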
>>> selector = Selector(response = result)
>>> print(selector)
<Selector xpath=None data='<html lang="en">\n<head>\n <title>Scrap'>
>>>
1.2 Selecting data
Call the xpath or css method of a Selector object to select one or more parts of the document:
>>> selector_h1 = selector.xpath('//h1')
>>> print (selector_h1)
[<Selector xpath='//h1' data='<h1>Hello World</h1>'>]
>>> selector_li = selector.xpath('//li')
>>> print (selector_li)
[<Selector xpath='//li' data='<li>Python</li>'>,
<Selector xpath='//li' data='<li>Scrapy</li>'>,
<Selector xpath='//li' data='<li>html</li>'>]
>>>
The xpath and css methods return a SelectorList object. SelectorList supports the list interface, so its elements can be iterated over with a for statement:
>>> for li in selector_li:
...     print (li.xpath('./text()'))
...
[<Selector xpath='./text()' data='Python'>]
[<Selector xpath='./text()' data='Scrapy'>]
[<Selector xpath='./text()' data='html'>]
>>>
SelectorList objects also have xpath and css methods, so calls can be chained:
>>> lis = selector.xpath('.//ul').css('li').xpath('./text()')
>>> print (lis)
[<Selector xpath='./text()' data='Python'>,
<Selector xpath='./text()' data='Scrapy'>,
<Selector xpath='./text()' data='html'>]
>>>
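1.3 Extracting data
Call the following methods of a Selector or SelectorList object to extract the selected content:
- extract()
- re()
- extract_first() (SelectorList only)
- re_first() (originally SelectorList only; recent parsel versions also provide it on Selector)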
The extract method
>>> selector_li = selector.xpath('//li')
>>> print (selector_li)
[<Selector xpath='//li' data='<li>Python</li>'>,
<Selector xpath='//li' data='<li>Scrapy</li>'>,
<Selector xpath='//li' data='<li>html</li>'>]
>>>
>>> print (selector_li[0].extract())
<li>Python</li>
>>>
>>> li = selector.xpath('.//li/text()')
>>> print (li)
[<Selector xpath='.//li/text()' data='Python'>,
<Selector xpath='.//li/text()' data='Scrapy'>,
<Selector xpath='.//li/text()' data='html'>]
>>>
>>> print (li.extract())
['Python', 'Scrapy', 'html']
>>>
>>> print (li[0].extract())
Python
>>>
>>> print (li[1].extract())
Scrapy
Extracting the title text:
>>> title = selector.xpath('.//title/text()')
>>> print (title)
[<Selector xpath='.//title/text()' data='Scrapy Study'>]
>>> print (title.extract())
['Scrapy Study']
>>> print (title[0].extract())
Scrapy Study
>>>
Extracting only the text directly inside ul > li:
>>> html = '''
<ul>
<li>Python编程<b>价格:32.00元</b></li>
<li>精通Scrapy<b>价格:12.00元</b></li>
<li>html知识<b>价格:52.00元</b></li>
</ul>
'''
>>> selector = Selector(text=html)
>>> li = selector.xpath('.//ul/li/text()')
>>> print (li)
[<Selector xpath='.//ul/li/text()' data='Python编程'>,
<Selector xpath='.//ul/li/text()' data=