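I originally planned to scrape the jokes on Qiushibaike (糗事百科), but the program failed with: ERROR: Spider error processing <GET https://www.qiushibaike.com/text/> (referer: None)
Since some of my notes are written in the comments of the Qiushibaike spider, I am including that code here as well: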
import scrapy


class QiubaiSpider(scrapy.Spider):
    name = 'duanzi'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    # parse: parses the response
    # Extract the author name and the joke text from Qiushibaike
    def parse(self, response):
        div_list = response.xpath('//div[@id="content"]/div/div[2]')
        for div in div_list:
            # xpath() returns a list, but its elements are Selector objects
            # extract() pulls out the string stored in a Selector's data attribute
            author = div.xpath('./div[1]/a[2]/h2/text()')[0].extract()
            # Calling extract() on the list extracts the data string of every Selector in it
            content = div.xpath('./a[1]/div/span//text()').extract()  # a list of strings
            content = ''.join(content)  # join the list elements into one string
            print(author, content)
            break  # only process the first entry
The spider that scrapes the names of popular cities across the country runs without problems:
import scrapy


class CitynameSpider(scrapy.Spider):
    name = 'cityname'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['https://www.aqistudy.cn/historydata/']

    # Extract the names of the popular cities
    def parse(self, response):
        hot_list = response.xpath('//div[@class="bottom"]/ul/li')
        for li in hot_list:
            # hot_city_name = li.xpath('./a/text()')[0]  # without extract(): a Selector object
            hot_city_name = li.xpath('./a/text()')[0].extract()  # with extract(): a plain string
            print(hot_city_name)
Output without extract():
Output with extract():