{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Selector的用法"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"前面介绍过利用BeautifulSoup和PyQuery以及正则表达式来提取网页数据,非常方便。而Scrapy也有自己的提取数据的方法,即Selector选择器。Select是基于lxml来构建的,支持XPath选择器、CSS选择器以及正则表达式,功能齐全,解析速度和准确度非常高。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"本来可以采用scrapy shell来调试选择器的使用方法。也可以直接使用Selector模块直接模拟。官网也提供了相应的方法:http://scrapy-chs.readthedocs.io/zh_CN/latest/topics/selectors.html 。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"关于XPath的语法和运算符可以参考:http://www.runoob.com/xpath/xpath-tutorial.html"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. 两种选择器"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"由于在response中使用XPath、CSS查询十分普遍,因此,Scrapy提供了两个实用的快捷方式: response.xpath() 及 response.css():"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[<Selector xpath='//title/text()' data='Example website'>]\n",
"[<Selector xpath='descendant-or-self::title/text()' data='Example website'>]\n"
]
}
],
"source": [
"from scrapy import Selector\n",
"html='''\n",
"<html>\n",
" <head>\n",
" <base href='http://example.com/' />\n",
" <title>Example website</title>\n",
" </head>\n",
" <body>\n",
" <div id='images'>\n",
" <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>\n",
" <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>\n",
" <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>\n",
" <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>\n",
" <a href='image5.html'>Name: My image 5 <br /><img src='