- Spider-based full-site crawling:
    - Crawl the page data for every page number under a given section of a website.
    - Implementation options:
        - Add the URLs of all pages to be crawled to the start_urls list (not recommended)
        - Send follow-up requests manually:
            - yield scrapy.Request(url=new_url, callback=self.parse)
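The manual-request pattern above hinges on being able to compute each page's URL. A minimal sketch in plain Python — note that the `index_%d.html` pagination pattern is an assumption about how the site numbers its pages, not something confirmed by the source:

```python
# Sketch of building per-page URLs for manual request sending.
# The 'index_%d.html' pagination pattern is assumed, not verified
# against the live site.
FIRST_PAGE_URL = 'http://www.521609.com/tuku/mxxz/'
PAGE_URL_TEMPLATE = 'http://www.521609.com/tuku/mxxz/index_%d.html'

def page_url(page_num):
    """Return the URL for a given 1-based page number."""
    if page_num == 1:
        return FIRST_PAGE_URL  # the section's landing page has no number
    return PAGE_URL_TEMPLATE % page_num

# Inside a Spider's parse(), the next page would then be requested with:
#     yield scrapy.Request(url=page_url(self.page_num), callback=self.parse)
```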
Requirement: crawl the names of all photo albums under the celebrity photo (明星写真) section of the Xiaohua site.
url: http://www.521609.com/tuku/mxxz/
Steps:
- First, create the project: scrapy startproject xiaohuaPro
- Enter the project directory: cd xiaohuaPro
- Create the spider: scrapy genspider xiaohua xiaohua.com
- Edit settings.py:
# Scrapy settings for xiaohuaPro project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'xiaohuaPro'
SPIDER_MODULES = ['xiaohuaPro.spiders']
NEWSPIDER_MODULE = 'xiaohuaPro.spiders'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36 Edg/89.0.774.68'
ROBOTSTXT_OBEY = False  # do not enforce robots.txt for this demo
LOG_LEVEL = 'ERROR'  # show only error-level log output