Scrapy Framework Explained: From Project Creation to Distributed Crawling

Article link: https://blog.youkuaiyun.com/Python9724/article/details/131447520

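Scrapy workflow

Creating a project

scrapy startproject Tencent (Tencent is the project name)

Types of .py files

Creating a spider file from the template

From the project directory: D:\scrapy\Tencent>scrapy genspider tancent tancent.com

Run scrapy genspider tancent tancent.com (tancent is the spider name; tancent.com is the domain the spider is allowed to crawl)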

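Running the spider (from the project directory)

Option 1: run scrapy crawl name (name is the spider's name, i.e. the name attribute of the spider class, not the project name)

Option 2: create a run.py file

Put the following in run.py: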

from scrapy import cmdline

cmdline.execute('scrapy crawl baidu'.split())
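The run.py approach works because cmdline.execute takes the same argument list the scrapy command would receive. A minimal sketch of a slightly more flexible run.py follows; build_crawl_argv and run are hypothetical helper names, and the -o option (write scraped items to a file) is the standard scrapy crawl output flag.

```python
def build_crawl_argv(spider_name, output_file=None):
    """Build the argument list for 'scrapy crawl <spider_name>'."""
    argv = ["scrapy", "crawl", spider_name]
    if output_file:
        # -o asks Scrapy to export the scraped items to this file
        argv += ["-o", output_file]
    return argv


def run(spider_name, output_file=None):
    # Imported lazily so the helper above can be used without Scrapy installed
    from scrapy import cmdline

    cmdline.execute(build_crawl_argv(spider_name, output_file))


# Usage (from the project directory): run("baidu", "items.json")
```

Running the file from an IDE then behaves like typing the command in a terminal, which makes it easier to set breakpoints in the spider.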

Flow diagram

Scrapy selectors

Selectors provide two methods for selecting elements:

xpath(): selects using XPath syntax

css(): selects using CSS selector syntax

Shortcuts

response.xpath()

response.css()

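Both return a SelectorList (a list of selectors).

Extracting text:

selector.extract() returns a list of strings

selector.extract_first() returns the text of the first selector, or None if nothing matched

selector.get() is the newer name for extract_first()

selector.getall() is the newer name for extract()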

Nested selectors

Sometimes reaching an element takes several chained calls to the selection methods (.xpath() or .css()):

response.css('img').xpath('@src')

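Selector also has a .re() method that extracts data with a regular expression.

It returns a list of strings, not selectors, so it has to be the last call in a chain.

It is generally used after xpath() or css() to filter the extracted text.

re_first() returns only the first matching string.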

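Logging

Log settings

LOG_FILE: the file that log output is written to; if None, log messages are printed to the console

LOG_ENABLED: whether logging is enabled; default True

LOG_ENCODING: the log encoding; default utf-8

LOG_LEVEL: the minimum level to log; default DEBUG

LOG_FORMAT: the log format string, which can use the following placeholders: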

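%(levelno)s the numeric value of the log level

%(levelname)s the name of the log level

%(pathname)s the full path of the source file in which the logging call was made

%(filename)s the file name portion of that path

%(funcName)s the name of the function containing the logging call

%(lineno)d the source line number of the logging call

%(asctime)s the time the log record was created

%(thread)d the thread ID

%(threadName)s the thread name

%(process)d the process ID

%(message)s the log message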

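LOG_DATEFORMAT: the date format used for %(asctime)s

LOG_STDOUT: redirect standard output; default False. If True, everything the process writes to standard output is written to the log

LOG_SHORT_NAMES: short log names; default False. If True, the component name is not included in the output

Typical settings in a project: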

LOG_FILE = 'logfile_name'

LOG_LEVEL = 'INFO'
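LOG_FORMAT and LOG_DATEFORMAT use the placeholders of Python's standard logging module, so their effect can be previewed without Scrapy. The sketch below uses the format strings that Scrapy ships as defaults; the logger name demo_spider and the function render_demo are made up for the example.

```python
import io
import logging

# Scrapy's default values for these two settings
LOG_FORMAT = "%(asctime)s [%(name)s] %(levelname)s: %(message)s"
LOG_DATEFORMAT = "%Y-%m-%d %H:%M:%S"
LOG_LEVEL = "INFO"


def render_demo():
    """Log two messages through the format above and return the rendered text."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(LOG_FORMAT, LOG_DATEFORMAT))

    logger = logging.getLogger("demo_spider")  # hypothetical logger name
    logger.setLevel(LOG_LEVEL)
    logger.propagate = False                   # keep output out of the root logger
    logger.addHandler(handler)

    logger.debug("below LOG_LEVEL, so this line is dropped")
    logger.info("Crawled %d pages", 3)

    logger.removeHandler(handler)
    return stream.getvalue()


output = render_demo()
```

With LOG_LEVEL set to INFO, the DEBUG message never reaches the output, which is why INFO is a common choice for keeping log files small.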

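Extracting second-level (detail) pages

scrapy.http.Request is the base class for requests in Scrapy. Import it with from scrapy.http import Request, FormRequest. Its parameters are described in this section.

Request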

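    # Methods of a scrapy.Spider subclass. Assumes "import scrapy", an item class
    # MySpiderItem declaring movie_title, director, score and desc, and self.page = 0.
    def parse(self, response):
        # parse only has to handle extraction: the response object exposes xpath() directly
        node_list = response.xpath('//div[@class="info"]')
        if not node_list:
            # An empty list page means we are past the last page, so stop paginating
            return
        for node in node_list:
            # Title
            movie_title = node.xpath('./div/a/span/text()').get()
            # Director; non-breaking spaces are normalised to plain spaces
            # (assumption: the original replace() targeted '\xa0')
            director = node.xpath('./div/p/text()').get(default='').strip().replace('\xa0', ' ')
            # Score
            score = node.xpath('.//span[@class="rating_num"]/text()').get()

            tong = {}
            tong['movie_title'] = movie_title
            tong['director'] = director
            tong['score'] = score

            # Movie detail page; the partial data travels with the request in meta
            detail_url = node.xpath('./div/a/@href').get()
            yield scrapy.Request(detail_url, callback=self.get_detail, meta={"info": tong})
            # meta: {"info": {"movie_title": "肖生克的救赎", 'director': 'director info', 'score': '9.7'}}
        # Queue the next list page
        self.page += 1
        page_url = 'https://movie.douban.com/top250?start={}&filter='.format(self.page * 25)
        yield scrapy.Request(page_url, callback=self.parse)

    # Parses the detail page (the callback for second-level pages)
    def get_detail(self, response):
        item = MySpiderItem()
        # Pick up the fields collected on the list page
        info = response.meta.get("info")
        item.update(info)
        desc = response.xpath('//span[@property="v:summary"]/text()').get(default='').strip()
        item['desc'] = desc

        yield item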
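The pagination in parse derives each list-page URL from self.page * 25. The same arithmetic can be sketched outside Scrapy; page_urls is a hypothetical helper, and the totals of 250 entries at 25 per page are assumptions read off the "top250" URL and the multiplier in the code.

```python
PAGE_URL = "https://movie.douban.com/top250?start={}&filter="


def page_urls(total_items=250, page_size=25):
    """Return the list-page URLs, one per block of page_size entries."""
    return [PAGE_URL.format(start) for start in range(0, total_items, page_size)]


urls = page_urls()
```

The spider itself does not need to know the total: it keeps requesting the next offset until a page comes back with no matching nodes.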

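url (str): the URL of this request.

callback (callable): the function that will be called with the response to this request.

method (str): the HTTP method of this request. Defaults to 'GET'.

meta (dict): the initial values for the Request.meta attribute.

body (bytes or str): the request body. Empty if not passed.

headers (dict): the headers of this request.

cookies (dict or list of dicts): the request cookies.

encoding (str): the encoding of this request (default 'utf-8'). It is used to percent-encode the URL and to convert body to bytes when body is given as str.

priority (int): the priority of this request (default 0). Higher numbers mean higher priority.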