First, create the project files. Open cmd and run:
scrapy startproject <project_name>
Then change into the project directory:
cd <project_name>
Define the site to crawl by generating a spider:
scrapy genspider <spider_name> <start_url_domain>
The process looks like this:
C:\Users\Administrator\Desktop\scrapy>scrapy startproject xiaoshuo_text
New Scrapy project 'xiaoshuo_text', using template directory 'c:\users\administrator\appdata\local\programs\python\python37\lib\site-packages\scrapy\templates\project', created in:
    C:\Users\Administrator\Desktop\scrapy\xiaoshuo_text
You can start your first spider with:
    cd xiaoshuo_text
    scrapy genspider example example.com
C:\Users\Administrator\Desktop\scrapy>cd xiaoshuo_text
C:\Users\Administrator\Desktop\scrapy\xiaoshuo_text>scrapy genspider xiaoshuo 81zw.com
Created spider 'xiaoshuo' using template 'basic' in module:
    xiaoshuo_text.spiders.xiaoshuo
The project files are as follows:
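For reference, this is the standard layout that scrapy startproject and scrapy genspider produce, sketched from the commands above rather than copied from an actual run:
xiaoshuo_text/
    scrapy.cfg
    xiaoshuo_text/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            xiaoshuo.py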
With the project files ready, the first step is to configure settings.py. The main settings are:
(1) LOG_LEVEL = "WARNING", so that only warning-level and higher log messages are recorded;
(2) the request header USER_AGENT, plus DEFAULT_REQUEST_HEADERS for any cookies needed to log in (see the note inside the headers block below);
(3) ITEM_PIPELINES, to enable the item pipeline (see the snippet after the settings excerpt below);
(4) LOG_FILE = "./log.log", to store run-time messages in a file;
(5) DOWNLOAD_DELAY, to wait between downloads and lower the request frequency.
# Scrapy settings for xiaoshuo_text project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
from fake_useragent import UserAgent
BOT_NAME = 'xiaoshuo_text'
SPIDER_MODULES = ['xiaoshuo_text.spiders']
NEWSPIDER_MODULE = 'xiaoshuo_text.spiders'
# Only record log messages at warning level or above
LOG_LEVEL = "WARNING"
# Store run-time log messages in a file
LOG_FILE = "./log.log"
ua = UserAgent()
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# Use a random User-Agent string generated by fake_useragent
USER_AGENT = ua.random
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# Wait between downloads to lower the request frequency
DOWNLOAD_DELAY = 2
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
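# Note on point (2) above: the generated template does not add login cookies.
# As an illustrative sketch only (not from the original post), a cookie string
# copied from the browser could be sent with every request by disabling
# Scrapy's cookie middleware and adding a Cookie header, e.g.:
# COOKIES_ENABLED = False
# DEFAULT_REQUEST_HEADERS['Cookie'] = '<cookie string copied from the browser>'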
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'xiaoshuo_text.middlewares.XiaoshuoTextSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'xiaoshuo_text.middlewares.XiaoshuoTextDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
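The excerpt above ends before the item pipeline section of the generated settings.py. To satisfy point (3) of the list, the ITEM_PIPELINES block (commented out by default in the template) has to be enabled so that the pipeline from step three actually receives the yielded items; the class name is the one Scrapy generates in pipelines.py:
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'xiaoshuo_text.pipelines.XiaoshuoTextPipeline': 300,
}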
Step two is extracting the data by writing xiaoshuo.py. For example, if the page to crawl is xxxx.html, set start_urls to that URL. The code is as follows:
import scrapy
class XiaoshuoSpider(scrapy.Spider):
    name = 'xiaoshuo'
    allowed_domains = ['xxxx.com']
    start_urls = ['xxxxx.html']

    def parse(self, response):
        # Extract the chapter title and body text
        chapter_title = response.xpath("//h1/text()").extract_first()
        chapter_content = "".join(response.xpath("//*[@id='content']/text()").extract()).replace("\u3000\u3000", "\n ")
        # Build a dict and yield it; Scrapy passes it on to the item pipeline
        data = {}
        data["title"] = chapter_title
        data["content"] = chapter_content
        # print(data)
        yield data
        # Follow the link to the next chapter so the whole book is downloaded in a loop
        next_chapter = response.xpath('//div[@class="bookname"]/div[1]/a[3]/@href').extract_first()
        # base_url = "https://www.81zw.com/{}".format(next_chapter)
        # str.find() returns -1 when ".html" is absent, so only follow links that point to a chapter page
        if next_chapter.find(".html") != -1:
            yield scrapy.Request(response.urljoin(next_chapter), callback=self.parse)
Step three is processing the data and saving it locally. Edit pipelines.py to receive the items sent from xiaoshuo.py and write them to a file. The code is as follows:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
class XiaoshuoTextPipeline:
    def open_spider(self, spider):
        # Open the output file once when the spider starts
        self.file = open("wddf.txt", "w", encoding="utf-8")

    def process_item(self, item, spider):
        title = item.get("title")
        content = item.get("content")
        print(title)
        info = title + "\n" + " " + content + "\n"
        self.file.write(info)
        # Flush the write buffer so each chapter is persisted to disk immediately
        self.file.flush()
        return item

    def close_spider(self, spider):
        self.file.close()
Finally, run the Scrapy crawl. There are two ways to do this. The first is to open cmd in the project directory and enter:
scrapy crawl xiaoshuo
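As an optional aside that is not part of the original workflow, Scrapy's built-in feed exports can also dump the yielded items straight to a file from the same command; the file name here is only an example:
scrapy crawl xiaoshuo -o chapters.json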
The second is to create a new file, for example main.py, and run the following code:
from scrapy.cmdline import execute
if __name__ == '__main__':
    # Equivalent to running "scrapy crawl xiaoshuo" from the project directory
    execute(["scrapy", "crawl", "xiaoshuo"])
That covers all of the code for crawling a novel with the Scrapy framework. The steps above break the idea down piece by piece; if anything is unclear, feel free to leave a comment or message the author.