First, create the project files. Open cmd and run:
scrapy startproject <project_name>
Then change into the project directory:
cd <project_name>
Define the site to crawl by generating a spider:
scrapy genspider <spider_name> <start_url_domain>
The process looks like this:
C:\Users\Administrator\Desktop\scrapy>scrapy startproject xiaoshuo_text
New Scrapy project 'xiaoshuo_text', using template directory 'c:\users\administrator\appdata\local\programs\python\python37\lib\site-packages\scrapy\templates\project', created in:
    C:\Users\Administrator\Desktop\scrapy\xiaoshuo_text
You can start your first spider with:
    cd xiaoshuo_text
    scrapy genspider example example.com
C:\Users\Administrator\Desktop\scrapy>cd xiaoshuo_text
C:\Users\Administrator\Desktop\scrapy\xiaoshuo_text>scrapy genspider xiaoshuo 81zw.com
Created spider 'xiaoshuo' using template 'basic' in module:
    xiaoshuo_text.spiders.xiaoshuo
The project files are as follows:
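For reference, this is the standard layout that scrapy startproject and scrapy genspider produce, sketched from the commands above rather than copied from an actual run:
xiaoshuo_text/
    scrapy.cfg
    xiaoshuo_text/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            xiaoshuo.py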
With the project files ready, the first step is to configure settings.py. The main settings are:
(1) LOG_LEVEL = "WARNING", so that only warning-level and higher log messages are recorded;
(2) the request header USER_AGENT, plus DEFAULT_REQUEST_HEADERS for any cookies needed to log in (see the note inside the headers block below);
(3) ITEM_PIPELINES, to enable the item pipeline (see the snippet after the settings excerpt below);
(4) LOG_FILE = "./log.log", to store run-time messages in a file;
(5) DOWNLOAD_DELAY, to wait between downloads and lower the request frequency.
# Scrapy settings for xiaoshuo_text project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
from fake_useragent import UserAgent
BOT_NAME = 'xiaoshuo_text'
SPIDER_MODULES = ['xiaoshuo_text.spiders']
NEWSPIDER_MODULE = 'xiaoshuo_text.spiders'
# Only record log messages at warning level or above
LOG_LEVEL = "WARNING"
# Store run-time log messages in a file
LOG_FILE = "./log.log"
ua = UserAgent()
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# Use a random User-Agent string generated by fake_useragent
USER_AGENT = ua.random
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# Wait between downloads to lower the request frequency
DOWNLOAD_DELAY = 2
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
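# Note on point (2) above: the generated template does not add login cookies.
# As an illustrative sketch only (not from the original post), a cookie string
# copied from the browser could be sent with every request by disabling
# Scrapy's cookie middleware and adding a Cookie header, e.g.:
# COOKIES_ENABLED = False
# DEFAULT_REQUEST_HEADERS['Cookie'] = '<cookie string copied from the browser>'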
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'xiaoshuo_text.middlewares.XiaoshuoTextSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'xiaoshuo_text.middlewares.XiaoshuoTextDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
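The excerpt above ends before the item pipeline section of the generated settings.py. To satisfy point (3) of the list, the ITEM_PIPELINES block (commented out by default in the template) has to be enabled so that the pipeline from step three actually receives the yielded items; the class name is the one Scrapy generates in pipelines.py:
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'xiaoshuo_text.pipelines.XiaoshuoTextPipeline': 300,
}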
Step two is extracting the data by writing xiaoshuo.py. For example, if the page to crawl is xxxx.html, set start_urls to that URL. The code is as follows:
import scrapy
class XiaoshuoSpider(scrapy.Spider):
    name = 'xiaoshuo'
    allowed_domains = ['xxxx.com']
    start_urls = ['xxxxx.html']

    def parse(self, response):
        # Extract the chapter title and body text
        chapter_title = response.xpath("//h1/text()").extract_first()
        chapter_content = "".join(response.xpath("//*[@id='content']/text()").extract()).replace("\u3000\u3000", "\n ")
        # Build a dict and yield it; Scrapy passes it on to the item pipeline
        data = {}
        data["title"] = chapter_title
        data["content"] = chapter_content
        # print(data)
        yield data
        # Follow the link to the next chapter so the whole book is downloaded in a loop
        next_chapter = response.xpath('//div[@class="bookname"]/div[1]/a[3]/@href').extract_first()
        # base_url = "https://www.81zw.com/{}".format(next_chapter)
        # str.find() returns -1 when ".html" is absent, so only follow links that point to a chapter page
        if next_chapter.find(".html") != -1:
            yield scrapy.Request(response.urljoin(next_chapter), callback=self.parse)
Step three is processing the data and saving it locally. Edit pipelines.py to receive the items sent from xiaoshuo.py and write them to a file. The code is as follows:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
class XiaoshuoTextPipeline:
    def open_spider(self, spider):
        # Open the output file once when the spider starts
        self.file = open("wddf.txt", "w", encoding="utf-8")

    def process_item(self, item, spider):
        title = item.get("title")
        content = item.get("content")
        print(title)
        info = title + "\n" + " " + content + "\n"
        self.file.write(info)
        # Flush the write buffer so each chapter is persisted to disk immediately
        self.file.flush()
        return item

    def close_spider(self, spider):
        self.file.close()
Finally, run the Scrapy crawl. There are two ways to do this. The first is to open cmd in the project directory and enter:
scrapy crawl xiaoshuo
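As an optional aside that is not part of the original workflow, Scrapy's built-in feed exports can also dump the yielded items straight to a file from the same command; the file name here is only an example:
scrapy crawl xiaoshuo -o chapters.json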
The second is to create a new file, for example main.py, and run the following code:
from scrapy.cmdline import execute
if __name__ == '__main__':
    # Equivalent to running "scrapy crawl xiaoshuo" from the project directory
    execute(["scrapy", "crawl", "xiaoshuo"])
That covers all of the code for crawling a novel with the Scrapy framework. The steps above break the idea down piece by piece; if anything is unclear, feel free to leave a comment or message the author.