Scrapy Framework
1. Scrapy installation and environment dependencies
# 1. Before installing scrapy, install its dependencies, then install scrapy itself. The steps are:
(1). Install lxml: pip install lxml
(2). Install wheel: pip install wheel
(3). Install twisted: pip install <path to the twisted .whl file>
    (twisted has to be downloaded and installed locally; download it from
    http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted and pick the wheel that matches
    your Python version and system. Copy the file into a folder, open cmd in that folder
    with your Python environment active, and run pip install followed by the file path;
    typing "t" and pressing Tab will auto-complete the file name.)
(4). Install pywin32: pip install pywin32
(5). Install scrapy: pip install scrapy
    (Note: make sure every step above completes without errors; if one fails,
    look up the error message and fix it before moving on.)
(6). Verify the installation: run scrapy at the cmd prompt. Output such as
    "Scrapy 1.6.0 - no active project" means the installation succeeded.
2. Creating a project and related commands
# 1. Manually create a directory, e.g. test
# 2. Inside test, create a crawler project named spiderpro: scrapy startproject spiderpro (spiderpro is the project name)
# 3. Enter the project folder: cd spiderpro
# 4. Create a spider file: scrapy genspider <spider_name> <domain> (e.g. www.baidu.com, which limits the spider to pages under that domain)
# 5. Run a spider: scrapy crawl <spider_name>
# 6. Parsing note: extract_first() returns the target data; if an XPath is misspelled nothing is flagged and no error is raised, you simply get no data.
# 7. In scrapy, use extract_first() to take a single value from a selector list.
# 8. In scrapy, use extract() to take all values from a selector list.
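For illustration, a standalone sketch (using scrapy.Selector directly, outside any project) of the difference between the two methods:
from scrapy import Selector

html = '<ul><li>a</li><li>b</li><li>c</li></ul>'
sel = Selector(text=html)
print(sel.xpath('//li/text()').extract_first())   # 'a', only the first match
print(sel.xpath('//li/text()').extract())         # ['a', 'b', 'c'], all matches
print(sel.xpath('//p/text()').extract_first())    # None: no match, but no exception either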
3. Project layout
spiderpro
    spiderpro            # project package
        __init__.py
        spiders          # spider file directory
            __init__.py
            tests.py     # a spider file
        items.py         # defines the data structures for the scraped data
        middlewares.py   # middleware definitions
        pipelines.py     # pipelines, used for persistence
        settings.py      # configuration file
    venv                 # virtual environment directory
    scrapy.cfg           # scrapy deployment configuration
# Notes:
1). spiders: contains the Spider implementations; each Spider is a separate file
2). items.py: defines the Item data structure, i.e. which fields the scraped data is stored in
3). pipelines.py: defines the Item Pipeline implementations
4). settings.py: the project's global configuration
5). middlewares.py: defines the middlewares, both spider middlewares and downloader middlewares
6). scrapy.cfg: the scrapy project's configuration file; it records the path to the project settings and deployment-related information
4. Scrapy's five core components and the data flow
# Architecture:
1). Scrapy Engine: the engine; it handles communication, signals and data transfer between the Spiders, Item Pipeline, Downloader and Scheduler.
2). Scheduler: receives the Requests sent over by the engine, organizes and enqueues them, and hands them back when the engine asks for the next request.
3). Downloader: downloads every Request the engine sends it and returns the resulting Responses to the engine, which passes them on to the Spiders for processing.
4). Spiders: process all Responses, parse and extract the data needed for the Item fields, and submit follow-up URLs to the engine, which feeds them into the Scheduler again.
5). Item Pipeline: processes the Items produced by the Spiders, e.g. deduplication and persistence (writing to a database or a file; in short, saving the data).
6). Downloader Middlewares: components you can write to extend or customize the download behaviour.
7). Spider Middlewares: components that extend and customize the communication between the engine and the Spiders (i.e. the Responses going into the Spiders and the Requests coming out of them).
# Workflow:
spider --> engine --> scheduler --> engine --> downloader --> engine --> spider --> engine --> pipeline --> database
1). The spider sends a request to the engine; the engine forwards the request to the scheduler for scheduling.
2). The scheduler hands the next request back to the engine; the engine passes it to the downloader, going through the downloader middlewares on the way.
3). The downloader fetches the request from the server and returns the response to the engine; the engine passes the response back to the spider.
4). The spider hands the response to its parse method, which extracts the data and builds the items, then returns the items to the engine; the engine passes them to the pipeline.
5). The pipeline receives the items and persists them.
6). This cycle repeats until the crawler terminates.
# Note on __init__ (initializer) and __new__ (constructor): once the spider has received a response it instantiates an item object and stores the parsed data in its attributes, i.e. in memory; only after that is the data handed over to be written to the database.
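To tie the workflow to code, here is a minimal illustrative spider skeleton (the names are made up, not taken from the examples below): yielding a Request sends a URL back through the engine to the scheduler, while yielding an item sends data to the pipelines.
import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'                            # run with: scrapy crawl demo
    start_urls = ['http://example.com/']     # initial requests handed to the scheduler

    def parse(self, response):
        # the engine delivers each downloaded response to this callback
        for href in response.xpath('//a/@href').extract():
            # a yielded Request travels engine -> scheduler -> downloader -> back to a callback
            yield scrapy.Request(url=response.urljoin(href), callback=self.parse)
        # a yielded item (or dict) travels engine -> item pipelines for persistence
        yield {'url': response.url}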
5. Example: crawling the keke site
# items.py, the fields we need
import scrapy

class ProItem(scrapy.Item):
    img = scrapy.Field()
    title = scrapy.Field()
    image_url = scrapy.Field()
# spider file
import scrapy
from ..items import ProItem

class MyproSpider(scrapy.Spider):
    name = 'mypro'
    start_urls = ['http://www.keke289.com/']

    def parse(self, response):
        div_list = response.xpath('//article[contains(@class,"article")]')
        for i in div_list:
            title = i.xpath('./div/h2/a/text()').extract_first()
            href = i.xpath('./div/h2/a/@href').extract_first()
            src = i.xpath('./div/a/img/@lazy_src').extract_first()
            item = ProItem()
            item['title'] = title
            item['image_url'] = href
            item['img'] = src
            yield item
# pipelines.py, store into MongoDB
import pymongo

class ProPipeline(object):
    def process_item(self, item, spider):
        conn = pymongo.MongoClient('localhost', 27017)
        db = conn.keke
        table = db.kuke
        table.insert_one(dict(item))
        return item
# settings.py
ROBOTSTXT_OBEY = False
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36'
ITEM_PIPELINES = {'pro.pipelines.ProPipeline': 300}  # must point at this project's pipeline class; module prefix depends on the project name
6. Multi-page crawling with scrapy
1). Build all page URLs up front in start_urls, e.g.:
start_urls = ['http://www.009renti.com/evarenti/RenTiCaiHui/14_%s.html' % i for i in range(1, 3)]
2). Keep a page counter as a class attribute and yield the next page's request from parse:
page = 1
base_url = 'http://www.xiaohuar.com/list-1-%s.html'
# inside parse:
if self.page < 4:
    page_url = base_url % self.page
    self.page += 1
    yield scrapy.Request(url=page_url, callback=self.parse)
Full example (wallpaper spider):
import scrapy
from ..items import BizhiItem

class MybizhiSpider(scrapy.Spider):
    name = 'mybizhi'
    start_urls = ['http://sj.zol.com.cn/bizhi/mingxing/1.html']
    page = 1

    def parse(self, response):
        div_list = response.xpath('//li[@class="photo-list-padding"]')
        for div in div_list:
            title = div.xpath('./a/span/em/text()').extract_first()
            image_url = div.xpath('./a/img/@src').extract_first()
            detail_url = div.xpath('./a/@href').extract_first()
            item = BizhiItem()
            item['title'] = title
            item['image_url'] = image_url
            item['detail_url'] = detail_url
            if self.page < 4:
                self.page += 1
                url = 'http://sj.zol.com.cn/bizhi/mingxing/%s.html' % self.page
                yield scrapy.Request(url=url, callback=self.parse)
            yield item
7. Example: parsing a joke site (passing the item through meta)
import scrapy
from ..items import SkillItem

class MyskillSpider(scrapy.Spider):
    name = 'myskill'
    start_urls = ['http://www.jokeji.cn/?bmjmxa=ziqzh']

    def detail_parse(self, response):
        item = response.meta['item']
        # extract_first() takes a single value from the selector list;
        # extract() takes all of them as a list
        detail_url = response.xpath('//span[@id="text110"]/p/text()').extract()
        item['detail_url'] = ''.join(detail_url)
        yield item

    def parse(self, response):
        div_list = response.xpath('//div[@class="newcontent l_left"]/ul/li')
        for div in div_list:
            title = div.xpath('./a/text()').extract_first()
            link = div.xpath('./a/@href').extract_first()
            item = SkillItem()
            item['title'] = title
            item['link'] = 'http://www.jokeji.cn' + link
            yield scrapy.Request(url='http://www.jokeji.cn' + link, callback=self.detail_parse, meta={'item': item})
8. Downloading images with scrapy
1). items.py, define the fields
import scrapy

class BizhiItem(scrapy.Item):
    title = scrapy.Field()
    image_url = scrapy.Field()
    detail_url = scrapy.Field()
2). Spider file
import scrapy
from ..items import BizhiItem

class MybizhiSpider(scrapy.Spider):
    name = 'mybizhi'
    start_urls = ['http://sj.zol.com.cn/bizhi/mingxing/1.html']

    def pic_parse(self, response):
        item = response.meta['item']
        name = item['image_url'].split('/')[-1]
        content = response.body
        with open('./imgs/%s' % name, 'wb') as f:
            f.write(content)
        yield item

    def parse(self, response):
        div_list = response.xpath('//li[@class="photo-list-padding"]')
        for div in div_list:
            title = div.xpath('./a/span/em/text()').extract_first()
            image_url = div.xpath('./a/img/@src').extract_first()
            detail_url = div.xpath('./a/@href').extract_first()
            item = BizhiItem()
            item['title'] = title
            item['image_url'] = image_url
            item['detail_url'] = detail_url
            yield scrapy.Request(url=image_url, callback=self.pic_parse, meta={'item': item})
3). pipelines.py, store the data
import pymongo

class BizhiPipeline(object):
    def process_item(self, item, spider):
        conn = pymongo.MongoClient('localhost', 27017)
        db = conn.xxxxx
        table = db.yyyyy
        table.insert_one(dict(item))
        return item
4). settings.py
ROBOTSTXT_OBEY = False
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36'
ITEM_PIPELINES = {'bizhi.pipelines.BizhiPipeline': 300}  # must point at this project's pipeline class; module prefix depends on the project name
9. Tampering with requests and responses; dropping items
1). UA pool: keep a large set of User-Agents, intercept each request and swap in a different UA.
2). IP proxy pool: intercept each request and swap in a different proxy IP.
3). Cookie pool: swap in a different cookie per request.
4). Intercepting responses (dynamically loaded pages): fetch the page with selenium, swap the selenium-rendered response in for the one scrapy downloaded, and hand it to the engine and on to the spider.
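Dropping items is not shown in the later examples, so here is a minimal hedged sketch: a pipeline can discard an unwanted item by raising DropItem (the filter condition is only illustrative).
from scrapy.exceptions import DropItem

class FilterPipeline(object):
    def process_item(self, item, spider):
        # items without a title are discarded and never reach later pipelines
        if not item.get('title'):
            raise DropItem('missing title: %r' % item)
        return item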
10. Scrapy middlewares: types and roles
# Middleware types
- Downloader middleware: DownloaderMiddleware
- Spider middleware: SpiderMiddleware
# What middlewares are for
- Downloader middleware: intercepts requests and responses, tampers with requests and responses
- Spider middleware: intercepts requests, responses and pipeline items; tampers with requests and responses and processes items
# Main methods of a downloader middleware (a skeleton follows below):
process_request     # intercepts every normal (non-exception) request
process_response    # intercepts every response
process_exception   # intercepts requests that raised an exception
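A minimal downloader-middleware skeleton (the class name is illustrative; the return-value conventions follow scrapy's middleware contract):
class DemoDownloaderMiddleware(object):
    def process_request(self, request, spider):
        # called for every outgoing request; return None to continue normally,
        # or return a Response/Request to short-circuit the download
        return None

    def process_response(self, request, response, spider):
        # called for every response coming back from the downloader;
        # must return a Response (or a new Request to retry differently)
        return response

    def process_exception(self, request, exception, spider):
        # called when the download raised an exception (timeout, connection error, ...);
        # return None to let other middlewares handle it, or a new Request to retry
        return None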
11. Downloader middleware intercepting requests: using a proxy IP
1). Spider file
import scrapy

class DlproxySpider(scrapy.Spider):
    name = 'dlproxy'
    start_urls = ['https://www.baidu.com/s?wd=ip']

    def parse(self, response):
        with open('baiduproxy.html', 'w', encoding='utf-8') as f:
            f.write(response.text)
2). In middlewares.py, set the proxy inside the downloader middleware's process_request:
def process_request(self, request, spider):
    request.meta['proxy'] = 'http://111.231.90.122:8888'
    return None
3). settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'
ROBOTSTXT_OBEY = False
DOWNLOADER_MIDDLEWARES = {
    'proxy.middlewares.ProxyDownloaderMiddleware': 543,  # the downloader-middleware class holding process_request above
}
12. Implementing a UA pool in a downloader middleware
1). In middlewares.py
from scrapy import signals
from fake_useragent import UserAgent
import random

# build a small pool of Chrome user-agents
ua_chrome = UserAgent()
ua_pool = []
for i in range(10):
    ua = ua_chrome.chrome
    ua_pool.append(ua)

# the two methods below go inside the downloader-middleware class
def process_request(self, request, spider):
    request.headers['User-Agent'] = random.choice(ua_pool)
    return None

def process_response(self, request, response, spider):
    print('*' * 50)
    # request.headers['User-Agent'] shows which UA was actually sent
    print(request.headers['User-Agent'])
    print('*' * 50)
    return response
2). Settings to change (or comment out) in settings.py:
1. USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'
2. ROBOTSTXT_OBEY = False
3. DOWNLOADER_MIDDLEWARES = {
       'proxy.middlewares.ProxyDownloaderMiddleware': 543,
   }
3). In the spider, request the same URL several times so the rotating UA can be observed (start_urls requests are not dup-filtered, so each one goes through the middleware):
start_urls = ['https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu&wd=ip&rsv_pq=cf95e45f000b8d2b&rsv_t=74b1V5e7UWXPDK6YWqzjFSXv%2B9wpMSDHZrF4HMP0TnouyBZ4o6hj%2FuiRWgI&rqlang=cn&rsv_enter=1&rsv_dl=tb&rsv_sug3=2&rsv_sug1=1&rsv_sug7=100&rsv_sug2=0&inputT=1452&rsv_sug4=1453' for i in range(3)]

A simpler UA pool, generated on the fly:
from fake_useragent import UserAgent
for i in range(10):
    USER_AGENT = UserAgent().random
    print(USER_AGENT)
13. Integrating selenium with scrapy
1). In items.py
import scrapy

class NewsItem(scrapy.Item):
    title = scrapy.Field()
    image_url = scrapy.Field()
2). In the spider file
import scrapy
from ..items import NewsItem
from selenium import webdriver

class MynewsSpider(scrapy.Spider):
    name = 'mynews'
    start_urls = ['https://news.163.com/domestic/']
    # the browser is attached to the spider so the downloader middleware can reuse it
    browser = webdriver.Chrome(executable_path=r'D:\爬虫段位\day13\news\chromedriver.exe')

    def image_parse(self, response):
        item = response.meta['item']
        content = response.body
        name = item['image_url'].split('/')[-1].split('?')[0]
        with open('./imgs/%s' % name, 'wb') as f:
            f.write(content)
        yield item

    def parse(self, response):
        div_list = response.xpath('//div[contains(@class,"news_article")]')
        for div in div_list:
            title = div.xpath('./div/div/h3/a/text()').extract_first()
            image_url = div.xpath('./a/img/@src').extract_first()
            item = NewsItem()
            item['title'] = title
            item['image_url'] = image_url
            yield scrapy.Request(url=image_url, callback=self.image_parse, meta={'item': item})
3). In pipelines.py, store the data
import pymongo

class NewsPipeline(object):
    def process_item(self, item, spider):
        conn = pymongo.MongoClient('localhost', 27017)
        db = conn.news
        table = db.wynews
        table.insert_one(dict(item))
        return item
4). In middlewares.py, replace the response for the start URL with the selenium-rendered page
from scrapy import signals
from scrapy.http import HtmlResponse

# this method goes inside the downloader-middleware class
def process_response(self, request, response, spider):
    browser = spider.browser
    if response.url in spider.start_urls:
        browser.get(request.url)
        js = 'window.scrollTo(0, document.body.scrollHeight)'
        browser.execute_script(js)
        html = browser.page_source
        return HtmlResponse(url=browser.current_url, body=html, encoding='utf-8', request=request)
    return response
5). settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'
ROBOTSTXT_OBEY = False
DOWNLOADER_MIDDLEWARES = {
    'news.middlewares.NewsDownloaderMiddleware': 543,
}
ITEM_PIPELINES = {
    'news.pipelines.NewsPipeline': 300,
}
14. Persistence: scrapy with MongoDB
open_spider(self, spider): called when the spider is opened
close_spider(self, spider): called when the spider is closed
from_crawler(cls, crawler): a class method marked with @classmethod; it can read the crawler's settings
process_item(self, item, spider): interacts with the database to store the data; this method must be implemented
1). pipelines.py
import pymongo

class XiaoxiaoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def process_item(self, item, spider):
        # insert() is deprecated in newer pymongo; insert_one() is the current API
        self.db['myxiao'].insert(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
2). Spider file
import scrapy
from ..items import XiaoxiaoItem

class MyxiaoSpider(scrapy.Spider):
    name = 'myxiao'
    start_urls = ['http://duanziwang.com/']

    def parse(self, response):
        div_list = response.xpath('//article[@class="post"]')
        for div in div_list:
            title = div.xpath('./div/h1/a/text()').extract_first()
            cont = div.xpath('./div[2]/p/text()').extract()
            content = ''.join(cont)
            item = XiaoxiaoItem()
            item['title'] = title
            item['content'] = content
            yield item
3). settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
    'xiaoxiao.pipelines.XiaoxiaoPipeline': 300,
}
MONGO_URI = 'localhost'
MONGO_DB = 'xiaoxiao'
15. Persistence: scrapy with MySQL
1). pipelines.py
import pymysql

class MyXiaoxiaoPipeline(object):
    def __init__(self, host, database, user, password, port):
        self.host = host
        self.database = database
        self.user = user
        self.password = password
        self.port = port

    def open_spider(self, spider):
        self.client = pymysql.connect(host=self.host, user=self.user, password=self.password,
                                      database=self.database, charset='utf8', port=self.port)
        self.cursor = self.client.cursor()

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            host=crawler.settings.get('MYSQL_HOST'),
            database=crawler.settings.get('MYSQL_DATABASE'),
            user=crawler.settings.get('MYSQL_USER'),
            password=crawler.settings.get('MYSQL_PASSWORD'),
            port=crawler.settings.get('MYSQL_PORT')
        )

    def process_item(self, item, spider):
        data = dict(item)
        keys = ','.join(data.keys())
        values = ','.join(['%s'] * len(data))
        # builds e.g. "insert into myxiao (title,content) values (%s,%s)"
        sql = 'insert into %s (%s) values (%s)' % ('myxiao', keys, values)
        self.cursor.execute(sql, tuple(data.values()))
        self.client.commit()
        return item
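The pipeline above never closes the cursor or the connection; a small hedged addition that could be made to the same class:
def close_spider(self, spider):
    # release the cursor and the connection when the spider finishes
    self.cursor.close()
    self.client.close()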
2). In settings.py
ITEM_PIPELINES = {
    'xiaoxiao.pipelines.MyXiaoxiaoPipeline': 295,
}
MYSQL_HOST = 'localhost'
MYSQL_DATABASE = 'xiaoxiao'
MYSQL_USER = 'root'
MYSQL_PASSWORD = ''
MYSQL_PORT = 3306
ROBOTSTXT_OBEY = False
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'
16. Whole-site crawling with CrawlSpider
1). Create the project and the spider:
scrapy startproject projectname
scrapy genspider -t crawl spidername www.baidu.com
2).
- CrawlSpider is a spider class, a subclass of scrapy.Spider, with more functionality than a plain Spider.
- CrawlSpider's machinery:
  - Link extractor (LinkExtractor): extracts links according to the rule you specify
  - Rule parser (Rule): parses the responses for those links according to the rule you specify
3). Example:
# items.py
import scrapy

class JokeItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()

# spider file
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..items import JokeItem

class ZSpider(CrawlSpider):
    name = 'z'
    start_urls = ['http://xiaohua.zol.com.cn/lengxiaohua/']
    link = LinkExtractor(allow=r'/lengxiaohua/\d+.html')
    link_detail = LinkExtractor(allow=r'.*?\d+\.html')
    rules = (
        Rule(link, callback='parse_item', follow=True),
        Rule(link_detail, callback='parse_detail'),
    )

    def parse_item(self, response):
        pass

    def parse_detail(self, response):
        title = response.xpath('//h1[@class="article-title"]/text()').extract_first()
        content = response.xpath('//div[@class="article-text"]//text()').extract()
        content = ''.join(content)
        if title and content:
            item = JokeItem()
            item["title"] = title
            item["content"] = content
            print(dict(item))
            yield item
# pipelines.py
import pymongo

class JokePipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        self.db["joke"].insert(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
17. Incremental crawling
- Detect updates on the site and crawl only the new content.
- The core idea is deduplication, keyed either on:
  - the URL (shown in the example below)
  - a data fingerprint (a sketch follows at the end of this section)
# items.py
import scrapy

class MoveItem(scrapy.Item):
    title = scrapy.Field()
    lab = scrapy.Field()

# spider file: a Redis set records which detail links have already been seen
import scrapy
from ..items import MoveItem
from redis import Redis

class MymoveSpider(scrapy.Spider):
    name = 'mymove'
    start_urls = ['https://www.4567tv.co/list/index1.html']
    conn = Redis('localhost', 6379)

    def detail_parse(self, response):
        title = response.xpath('//div[@class="ct-c"]/dl/dt/text()').extract_first()
        lab = response.xpath('//div[@class="ee"]/text()').extract_first()
        item = MoveItem()
        item['title'] = title
        item['lab'] = lab
        yield item

    def parse(self, response):
        link = response.xpath('//div[contains(@class,"index-area")]/ul/li/a/@href').extract()
        for i in link:
            # sadd returns 1 if the link is new, 0 if it was already in the set
            ret = self.conn.sadd('link', i)
            if ret:
                print('New data found, crawling ---------------------------------')
                yield scrapy.Request(url='https://www.4567tv.co' + i, callback=self.detail_parse)
            else:
                print('No new data, nothing to crawl ###############################')
# pipelines.py
import pymongo

class MovePipeline(object):
    def process_item(self, item, spider):
        conn = pymongo.MongoClient('localhost', 27017)
        db = conn.move
        table = db.mv
        table.insert_one(dict(item))
        return item

# settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
    'move.pipelines.MovePipeline': 300,
}
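URL-based dedup misses pages whose URL stays the same while the content changes. A hedged sketch of the data-fingerprint alternative: hash the item's content and record the hash in a Redis set (the set name 'fingerprints' is illustrative).
import hashlib
from redis import Redis

conn = Redis('localhost', 6379)

def is_new_item(item):
    # identical content produces an identical fingerprint
    fp = hashlib.md5(str(dict(item)).encode('utf-8')).hexdigest()
    # sadd returns 1 only when this fingerprint has not been seen before
    return conn.sadd('fingerprints', fp) == 1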
19. Simple Mongo group/aggregate statistics displayed with Django
import pymongo
from django.shortcuts import render

conn = pymongo.MongoClient('localhost', 27017)
db = conn.fqxh
table = db.xh

def login(request):
    res = table.find()
    return render(request, 'aaa.html', locals())

def index(request):
    # sort by the times_date field, ascending
    res = table.find().sort([('times_date', pymongo.ASCENDING)])
    return render(request, 'index.html', locals())

def indexs(request):
    # group by times_date and sum each group's count field
    ret = table.aggregate([{'$group': {'_id': '$times_date', 'cc': {'$sum': '$count'}}}])
    li = []
    for i in ret:
        i['date'] = i['_id']
        li.append(i)
    return render(request, 'indexs.html', locals())

def total(request):
    # group by times_date and count the documents in each group
    res = table.aggregate([{'$group': {'_id': '$times_date', 'cc': {'$sum': 1}}}])
    li = []
    for i in res:
        i['date'] = i['_id']
        li.append(i)
    return render(request, 'ccc.html', locals())