Scraping a Novel with the Scrapy Framework

First, create the project files. Open cmd and enter the following command:

scrapy startproject <project_name>

Then switch into the project directory:

cd <project_name>

Define the site to crawl:

scrapy genspider <spider_name> <start URL (domain)>

The process looks like this:

C:\Users\Administrator\Desktop\scrapy>scrapy startproject xiaoshuo_text
New Scrapy project 'xiaoshuo_text', using template directory 'c:\users\administrator\appdata\local\programs\python\python37\lib\site-packages\scrapy\templates\project', created in:
    C:\Users\Administrator\Desktop\scrapy\xiaoshuo_text

You can start your first spider with:
    cd xiaoshuo_text
    scrapy genspider example example.com

C:\Users\Administrator\Desktop\scrapy>cd xiaoshuo_text
C:\Users\Administrator\Desktop\scrapy\xiaoshuo_text>scrapy genspider xiaoshuo 81zw.com
Created spider 'xiaoshuo' using template 'basic' in module:
  xiaoshuo_text.spiders.xiaoshuo

The generated project files are as follows:
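Below is the standard layout Scrapy generates for a new project; the spider file name matches the genspider command above:

xiaoshuo_text/
    scrapy.cfg
    xiaoshuo_text/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            xiaoshuo.py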

With the project scaffolding ready, the first step is to configure the settings.py file. The main settings are:

(1) LOG_LEVEL = "WARNING", so that only warning-level (and higher) log messages are recorded;
(2) the crawler's request header via USER_AGENT, plus any login cookies in DEFAULT_REQUEST_HEADERS;
(3) enable the item pipeline via ITEM_PIPELINES (see the addition after the settings code below);
(4) write the run log to a file with LOG_FILE = "./log.log";
(5) set a download delay with DOWNLOAD_DELAY to lower the request rate.

# Scrapy settings for xiaoshuo_text project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html
from fake_useragent import UserAgent

BOT_NAME = 'xiaoshuo_text'

SPIDER_MODULES = ['xiaoshuo_text.spiders']
NEWSPIDER_MODULE = 'xiaoshuo_text.spiders'

# Only record log messages at WARNING level or above
LOG_LEVEL = "WARNING"

# Write the run log to a file
LOG_FILE = "./log.log"

ua = UserAgent()
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# Use a random User-Agent string generated by fake_useragent
USER_AGENT = ua.random

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# Wait between downloads to lower the request rate
DOWNLOAD_DELAY = 2
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
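  # A login Cookie can be added here if the site requires it (placeholder value only):
  # 'Cookie': 'your_cookie_string',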
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'xiaoshuo_text.middlewares.XiaoshuoTextSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'xiaoshuo_text.middlewares.XiaoshuoTextDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}
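The generated settings.py leaves ITEM_PIPELINES commented out, so to satisfy point (3) above it still has to be added. A minimal sketch, assuming the default pipeline class the template creates (the same XiaoshuoTextPipeline edited in step 3 below):

# Enable the item pipeline so that yielded items reach pipelines.py
ITEM_PIPELINES = {
    'xiaoshuo_text.pipelines.XiaoshuoTextPipeline': 300,
}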

Step 2: extract the data by writing the spider in xiaoshuo.py. For example, if the page to crawl is xxxx.html, set start_urls to that URL. The code is as follows:

import scrapy

class XiaoshuoSpider(scrapy.Spider):
    name = 'xiaoshuo'
    allowed_domains = ['xxxx.com']
    start_urls = ['xxxxx.html']

    def parse(self, response):
        # Extract the chapter title and content
        chapter_title = response.xpath("//h1/text()").extract_first()
        chapter_content = "".join(response.xpath("//*[@id='content']/text()").extract()).replace("\u3000\u3000","\n     ")

        # Build a dict and yield it so Scrapy passes it to the item pipeline
        data = {}
        data["title"] = chapter_title
        data["content"] = chapter_content
        # print(data)
        yield data

        # Get the link to the next chapter and keep crawling in a loop
        next_chapter = response.xpath('//div[@class="bookname"]/div[1]/a[3]/@href').extract_first()
        # base_url = "https://www.81zw.com/{}".format(next_chapter)
        if next_chapter and next_chapter.find(".html") != -1:
            yield scrapy.Request(response.urljoin(next_chapter), callback=self.parse)
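A quick way to check that the two XPath expressions above actually match the target page is Scrapy's interactive shell; this is only a sanity check, using the same placeholder URL as start_urls:

scrapy shell "xxxxx.html"
>>> response.xpath("//h1/text()").extract_first()
>>> response.xpath("//*[@id='content']/text()").extract()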

Step 3: process the data and save it locally. Edit pipelines.py to receive the data passed from xiaoshuo.py and write it to a file. The code is as follows:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class XiaoshuoTextPipeline:
    def open_spider(self, spider):
        self.file = open("wddf.txt", "w", encoding="utf-8")

    def process_item(self, item, spider):
        title = item.get("title")
        content = item.get("content")
        print(title)
        info = title + "\n" + "     " + content + "\n"
        self.file.write(info)
        # Flush the buffer so the data is written to disk immediately
        self.file.flush()
        return item

    def close_spider(self, spider):
        self.file.close()
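Note that the itemadapter import from the template is not actually used above. If you later switch from plain dicts to Scrapy Item classes, wrapping the item with ItemAdapter keeps process_item working for both; a minimal sketch:

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)  # works for plain dicts and Item objects alike
        info = adapter.get("title") + "\n" + "     " + adapter.get("content") + "\n"
        self.file.write(info)
        self.file.flush()
        return item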

Finally, run the crawl. There are two ways: the first is to open cmd and run the following command:

scrapy crawl xiaoshuo

The second is to create a new file, for example main.py, and run the following code:

from scrapy.cmdline import execute

if __name__ == '__main__':
    execute(["scrapy","crawl","xiaoshuo"])

That is all the code for scraping a novel with the Scrapy framework. The approach has been broken down step by step; if anything is unclear, feel free to leave a comment or message me.
