scrapy
Scrapy is a framework built specifically for Python crawlers: it schedules requests asynchronously and runs them concurrently. You can build a concurrent crawler without Scrapy, but the result tends to be clumsy and a lot more work, while Scrapy makes it simple and ships many powerful features on top. If anything is unclear, the Chinese Scrapy documentation is at https://yiyibooks.cn/zomin/Scrapy15/index.html
scrapy-redis
scrapy-redis is a component built on top of Scrapy. It adds resumable crawling, URL deduplication, persistent crawling and distributed crawling, and it is very easy to use. First download the source from https://github.com/rmax/scrapy-redis; it only has three main files.
The part that matters here is the scrapy_redis package. Below it is used to crawl book information from JD (jd.com).
Creating the Scrapy project
1. Open a terminal (cmd) in the directory where the project should live and create it: scrapy startproject jingdong, where jingdong is the project name.
2. cd into the project and create the spider: scrapy genspider jdbook jd.com, where jdbook is the spider name and jd.com is the allowed domain. Always give the spider a domain restriction so it cannot wander off to unrelated URLs.
3. Copy the scrapy_redis package from the downloaded source into the project, because the scheduler and dedup settings below refer to the classes inside it.
Configuring settings.py
1. The first setting to change is ROBOTSTXT_OBEY: set it to False (the generated template sets it to True). With True, Scrapy fetches the site's robots.txt before crawling and obeys its rules, which can block exactly the pages we want.
2. Pretend to be a browser: set User-Agent and some default request headers.
These entries are commented out in the generated settings file; uncomment them and fill in values copied from a real browser. Cookies are not needed here. Cookies and proxy IPs would be configured in the downloader middleware, i.e. in middlewares.py, but JD's anti-crawling for this book data is not strict, so this crawl does not set them; add them yourself if you want.
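If you did want a proxy (or cookies), a minimal downloader-middleware sketch could look like the following; the class name and the proxy address are placeholders, not part of this project, and the class would still have to be enabled in DOWNLOADER_MIDDLEWARES:
# middlewares.py -- illustrative only, not used in this crawl
import random

class RandomProxyMiddleware(object):
    PROXIES = ["http://127.0.0.1:8888"]  # placeholder proxy pool

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(self.PROXIES)  # picked up by Scrapy's HttpProxyMiddleware
        # cookies could be attached here as well, e.g. request.cookies = {"key": "value"}
        return None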
3. Configure Redis and the persistence-related settings:
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter" # dedup class that filters out requests already seen
SCHEDULER = "scrapy_redis.scheduler.Scheduler" # use the Redis-backed scheduler/queue
SCHEDULER_PERSIST = True # keep the queue and fingerprints in Redis when the spider stops
REDIS_URL = "redis://127.0.0.1:6379"
DUPEFILTER_CLASS points to the dedup class, which makes sure a URL that has already been requested is not requested again. The rule, roughly: the URL is hashed into a fixed-length string, the "fingerprint", which is stored in Redis. Every new URL is hashed the same way; if its fingerprint is already in the database the URL is skipped, otherwise it is crawled and its fingerprint is stored.
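The real RFPDupeFilter fingerprints the whole request (method, URL, body) with sha1, but the idea can be illustrated with a simplified sketch that only hashes the URL; redis-py and the key name are assumptions here, not the actual scrapy_redis code:
import hashlib
import redis

r = redis.StrictRedis.from_url("redis://127.0.0.1:6379")

def seen_before(url, key="jdbook:dupefilter"):
    fp = hashlib.sha1(url.encode("utf-8")).hexdigest()  # the "fingerprint"
    added = r.sadd(key, fp)  # SADD returns 0 if the fingerprint is already in the set
    return added == 0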
SCHEDULER points to the request queue. The URLs to visit are wrapped into request objects and pushed into the Redis-backed queue, then popped out one by one to be downloaded; this is where the concurrent crawling happens.
SCHEDULER_PERSIST keeps the crawl state. Without this flag, the queue and fingerprints are cleared from Redis when the program finishes, so there is no persistence and no resuming from where the crawl stopped.
REDIS_URL is simply the address of the Redis server.
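With SCHEDULER_PERSIST enabled you can stop the spider with Ctrl+C and check that the state is still in Redis. By default scrapy-redis keeps the pending requests under "<spider>:requests" (a sorted set, assuming the default priority queue) and the fingerprints under "<spider>:dupefilter"; a quick sketch with redis-py:
import redis

r = redis.StrictRedis.from_url("redis://127.0.0.1:6379")
print(r.zcard("jdbook:requests"))    # requests still waiting to be crawled
print(r.scard("jdbook:dupefilter"))  # fingerprints of requests already seen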
4. To actually save the data, the item pipeline also has to be enabled:
ITEM_PIPELINES = {
    'jingdong.pipelines.JingdongPipeline': 300,
}
Analyzing the website
First open the book category page:
https://book.jd.com/booksort.html
Each major category has many sub-categories under it. Press F12 and check whether the data is actually in the response of this URL.
It is: the response contains the name and URL of every major category as well as the name and URL of every sub-category.
Extracting the major and minor book categories
First extract the names of the major categories:
all_list = response.xpath("//*[@id='booksort']/div[2]/dl/dt")  # major categories
for i in all_list:
    item = JingdongItem()
    item['all_class'] = i.xpath("./a/text()").extract_first()
The sub-categories sit in the <dd> element that directly follows each major category's <dt>, so they can be reached through the following-sibling axis.
Then extract the name and URL of each sub-category:
    class_list = i.xpath("./following-sibling::dd[1]/em")  # the <dd> right after this <dt> holds the sub-categories
    for j in class_list:
        item["next_class_name"] = j.xpath("./a/text()").extract_first()  # sub-category name
        item["next_class_url"] = "https:" + j.xpath("./a/@href").extract_first()
Following the sub-category URL to get part of the book information
Next, open a sub-category listing page and see how much of the data we need is there.
Then check the page source to confirm that the data really is in the HTML.
The listing contains the book title and URL, the author, the publisher name, the store name and URL, and the publication date, all of which we want, so they are extracted here. The remaining two fields, price and comment count, are fetched separately from the book's detail data (see below).
def next_parse(self, response):
    item = response.meta["item"]
    book_list = response.xpath("//div[@id='plist']/ul[@class='gl-warp clearfix']/li")
    for i in book_list:
        try:
            item['book_name'] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-name']/a/em/text()").extract_first().strip()  # strip spaces and newlines
        except AttributeError:  # extract_first() returned None
            item['book_name'] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-name']/a/em/text()").extract_first()
        item['book_url'] = "https:" + "".join(i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-name']/a/@href").extract())
        item['publisher_name'] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-store']/a/text()").extract_first()
        item['publisher_url'] = "https:" + "".join(i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-store']/a/@href").extract()) if item['publisher_name'] is not None else None
        try:
            item["publish_time"] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-date']/text()").extract_first().strip()
        except AttributeError:
            item["publish_time"] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-date']/text()").extract_first()
        item['author'] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-name']/span/a/text()").extract()
Getting the book price and comment count
Where do the price and comment count come from? They are not in the book page's HTML; they are loaded by JavaScript as JSON. Searching the network requests shows that the price comes from this URL: https://p.3.cn/prices/mgets?type=1&skuIds=J_11892005346&pdtk=&pduid=1551774170386597393748&pdpin=&pdbp=0&callback=jQuery6622062&_=1560704913535
Most of those parameters turn out to be unnecessary; the URL can be trimmed down to https://p.3.cn/prices/mgets?skuIds=J_11892005346
To get any book's price, just change the number after J_. That number (the sku id) is present on the sub-category listing page, so extracting it there is enough to request each book's price; the response contains both the list price and the actual selling price.
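The sku id is taken from the data-sku attribute of each book's <li> on the listing page and dropped into the trimmed URL; inside next_parse this looks like:
            data_sku = "".join(i.xpath("./div[@class='gl-i-wrap j-sku-item']/@data-sku").extract())
            yield scrapy.Request(url="https://p.3.cn/prices/mgets?skuIds=J_{}".format(data_sku), callback=self.parse_dateli, meta={"item": deepcopy(item)})
The callback below then reads the JSON that comes back: js[0]['m'] is the list price and js[0]['p'] the selling price.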
def parse_dateli(self, response):
    item = response.meta["item"]
    js = json.loads(response.body)
    item['original_price'] = js[0]['m']
    item['price'] = js[0]['p']
Extracting the comment count
The comment count is also delivered as JSON, so it can be located the same way the price endpoint was found; the details are not repeated here. Feel free to ask if anything is unclear.
def parse_comment(self, response):
    item = response.meta["item"]
    js = json.loads(response.text)
    item['comment'] = js['CommentsCount'][0]['CommentCount']
Setting up items
items.py declares the fields that the scraped data will be carried in (the pipeline that saves it comes next). Using it is simple: first define the fields:
import scrapy
class JingdongItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    all_class = scrapy.Field()
    next_class_name = scrapy.Field()
    next_class_url = scrapy.Field()
    book_name = scrapy.Field()
    book_url = scrapy.Field()
    comment = scrapy.Field()
    price = scrapy.Field()
    publisher_name = scrapy.Field()
    publisher_url = scrapy.Field()
    publish_time = scrapy.Field()
    author = scrapy.Field()
    original_price = scrapy.Field()
Then import the JingdongItem class from this file in the spider and create an instance to hold the data, e.g.
item = JingdongItem()
Remember that this instance is created inside the spider file; the full source below shows exactly where.
Setting up pipelines
pipelines.py is where the data gets saved. It could go into a MySQL database, into MongoDB, or into a CSV file; here it is written to CSV.
import pandas as pd
class JingdongPipeline(object):
    def open_spider(self, spider):
        print("spider started")

    def process_item(self, item, spider):
        print(item)
        data_list = []
        data_list.append(dict(item))
        data_frame = pd.DataFrame(data_list)
        data_frame.index.name = "id"
        # append each item to the CSV; header=False so a header row is not repeated on every append
        data_frame.to_csv("I:/crack/DATA/jdbook.csv", mode='a', index=False, header=False, encoding="utf_8_sig")
        print("row written")
        return item

    def close_spider(self, spider):
        print("spider finished")
Notes
Two things to watch out for:
1. Allowed domains. The sub-category pages, the price endpoint and the comment endpoint live on different domains, so each of those domains has to be added to allowed_domains.
2. The meta parameter of yield scrapy.Request. meta is how a value is handed on to the next callback, for example:
yield scrapy.Request(url=item["next_class_url"],callback=self.next_parse,meta={"item":deepcopy(item)})
This uses deepcopy from the copy module to copy the item before passing it along. Without the copy the data would get mixed up: Scrapy issues requests asynchronously, so the loop may already be filling in the third sub-category before the request carrying the first one has been processed, and every callback would end up seeing the same, latest values. Copying the item first prevents that.
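The effect is easy to reproduce with a plain dictionary: whoever holds a reference to the same object sees later overwrites, while a deepcopy keeps its own snapshot:
from copy import deepcopy

item = {"next_class_name": "小说"}
shared = item            # a pending request holding the same object
copied = deepcopy(item)  # a pending request holding its own copy

item["next_class_name"] = "文学"   # the loop has already moved on to the next sub-category
print(shared["next_class_name"])   # 文学 -- the earlier request's value was overwritten
print(copied["next_class_name"])   # 小说 -- the deepcopy kept the right value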
Spider source code
import scrapy
from copy import deepcopy
import re
import json
from jingdong.items import JingdongItem
class JdbookSpider(scrapy.Spider):
    name = 'jdbook'
    allowed_domains = ['jd.com', 'p.3.cn', 'club.jd.com']
    start_urls = ['https://book.jd.com/booksort.html']

    def parse(self, response):
        all_list = response.xpath("//*[@id='booksort']/div[2]/dl/dt")  # major categories
        for i in all_list:
            item = JingdongItem()
            item['all_class'] = i.xpath("./a/text()").extract_first()
            class_list = i.xpath("./following-sibling::dd[1]/em")  # the <dd> right after this <dt> holds the sub-categories
            for j in class_list:
                item["next_class_name"] = j.xpath("./a/text()").extract_first()  # sub-category name
                item["next_class_url"] = "https:" + j.xpath("./a/@href").extract_first()
                yield scrapy.Request(url=item["next_class_url"], callback=self.next_parse, meta={"item": deepcopy(item)})

    def next_parse(self, response):
        item = response.meta["item"]
        book_list = response.xpath("//div[@id='plist']/ul[@class='gl-warp clearfix']/li")
        for i in book_list:
            try:
                item['book_name'] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-name']/a/em/text()").extract_first().strip()  # strip spaces and newlines
            except AttributeError:  # extract_first() returned None
                item['book_name'] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-name']/a/em/text()").extract_first()
            item['book_url'] = "https:" + "".join(i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-name']/a/@href").extract())
            item['publisher_name'] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-store']/a/text()").extract_first()
            item['publisher_url'] = "https:" + "".join(i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-store']/a/@href").extract()) if item['publisher_name'] is not None else None
            try:
                item["publish_time"] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-date']/text()").extract_first().strip()
            except AttributeError:
                item["publish_time"] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-date']/text()").extract_first()
            item['author'] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-name']/span/a/text()").extract()
            data_sku = "".join(i.xpath("./div[@class='gl-i-wrap j-sku-item']/@data-sku").extract())
            yield scrapy.Request(url="https://p.3.cn/prices/mgets?skuIds=J_{}".format(data_sku), callback=self.parse_dateli, meta={"item": deepcopy(item)})
        next_page_url = "https://list.jd.com" + "".join(response.xpath("//*[@id='J_bottomPage']/span[@class='p-num']/a[@class='pn-next']/@href").extract())
        judge = "".join(response.xpath("//*[@id='J_bottomPage']/span[@class='p-num']/a[@class='pn-next']/@href").extract())
        if judge:  # an empty string means there is no next page
            yield scrapy.Request(url=next_page_url, callback=self.next_parse, meta={"item": deepcopy(item)})

    def parse_dateli(self, response):
        item = response.meta["item"]
        js = json.loads(response.body)
        item['original_price'] = js[0]['m']
        item['price'] = js[0]['p']
        id = js[0]["id"]
        id = "".join(re.findall(r'\d+', id))  # keep only the numeric sku id
        yield scrapy.Request(url="https://club.jd.com/comment/productCommentSummaries.action?referenceIds={}".format(id), callback=self.parse_comment, meta={"item": deepcopy(item)})

    def parse_comment(self, response):
        item = response.meta["item"]
        js = json.loads(response.text)
        item['comment'] = js['CommentsCount'][0]['CommentCount']
        yield deepcopy(item)
settings.py source code
# -*- coding: utf-8 -*-
# Scrapy settings for jingdong project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'jingdong'
SPIDER_MODULES = ['jingdong.spiders']
NEWSPIDER_MODULE = 'jingdong.spiders'
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter" # dedup class that filters out requests already seen
SCHEDULER = "scrapy_redis.scheduler.Scheduler" # use the Redis-backed scheduler/queue
SCHEDULER_PERSIST = True # keep the queue and fingerprints in Redis when the spider stops
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
LOG_LEVEL = "DEBUG"
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'Accept-Language': 'en',
}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'jingdong.middlewares.JingdongSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'jingdong.middlewares.JingdongDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'jingdong.pipelines.JingdongPipeline': 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
REDIS_URL = "redis://127.0.0.1:6379" # the Redis connection used by the scheduler and the dupefilter
items.py source code
import scrapy
class JingdongItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    all_class = scrapy.Field()
    next_class_name = scrapy.Field()
    next_class_url = scrapy.Field()
    book_name = scrapy.Field()
    book_url = scrapy.Field()
    comment = scrapy.Field()
    price = scrapy.Field()
    publisher_name = scrapy.Field()
    publisher_url = scrapy.Field()
    publish_time = scrapy.Field()
    author = scrapy.Field()
    original_price = scrapy.Field()
pipelines.py source code
import pandas as pd
class JingdongPipeline(object):
    def open_spider(self, spider):
        print("spider started")

    def process_item(self, item, spider):
        print(item)
        data_list = []
        data_list.append(dict(item))
        data_frame = pd.DataFrame(data_list)
        data_frame.index.name = "id"
        # append each item to the CSV; header=False so a header row is not repeated on every append
        data_frame.to_csv("I:/crack/DATA/jdbook.csv", mode='a', index=False, header=False, encoding="utf_8_sig")
        print("row written")
        return item

    def close_spider(self, spider):
        print("spider finished")