Scrapy: Opening the Web Scraping Chapter

Article link: https://blog.youkuaiyun.com/qq_24680545/article/details/145422042

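Scrapy

Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python. It crawls websites and extracts structured data from their pages. It has a wide range of uses, including data mining, monitoring, and automated testing, although it was originally designed for page scraping. For details on its architecture, a quick web search will turn up plenty of material.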

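Getting started

I would advise against scraping large companies' sites, but sites hosted on overseas servers are an option. This post uses the site of an overseas company: 包子漫画 (Baozi Manhua, baozimh.com).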

scrapy startproject firstpc
cd firstpc
scrapy genspider bzmh https://www.baozimh.com/

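The generated project has the following file structure:
(screenshot of the project directory tree)
The test file in the screenshot is one I created myself and can be ignored.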

Configuration

The project contains a settings.py file, and a few of its settings need to be changed.

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "firstpc (+http://www.yourdomain.com)"
USER_AGENT = "Mozilla/5.0"
# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 16

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3


# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
}

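USER_AGENT is the User-Agent string sent with each request. It matters: without a browser-like value, the site can easily tell that the requests come from a program rather than a person.
ROBOTSTXT_OBEY controls whether Scrapy obeys the site's robots.txt rules; it is set to False here.
CONCURRENT_REQUESTS is the maximum number of concurrent requests. Keep it low, otherwise you are more likely to be blocked. (The listing above leaves it at the default of 16.)
DOWNLOAD_DELAY is the delay, in seconds, between requests to the same site. A longer delay is safer; with no delay at all you may end up on a blacklist.
DEFAULT_REQUEST_HEADERS sets the default request headers, which are also needed.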
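As a rough illustration of the "keep concurrency low, keep the delay long" advice, the fragment below sketches a more conservative settings.py. All of these are standard Scrapy setting names, but the specific values are illustrative assumptions, not numbers taken from the original project or tuned for this site.

```python
# Hypothetical additions to settings.py; values are illustrative, not tuned.

CONCURRENT_REQUESTS = 4             # lower than the default of 16
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # also cap parallel requests per domain
DOWNLOAD_DELAY = 3                  # base delay (seconds) between requests to one site
RANDOMIZE_DOWNLOAD_DELAY = True     # Scrapy's default: wait 0.5x to 1.5x DOWNLOAD_DELAY

# AutoThrottle adjusts the delay based on how quickly the server responds.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 3           # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 30            # upper bound when the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # aim for about one request in flight per server
```

With AutoThrottle enabled, DOWNLOAD_DELAY still acts as the lower bound on the delay, so the two settings can be combined.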

Deciding what to extract

An Item defines what you want to extract, that is, the data structure. Here I extract each comic's name and URL, so items.py needs two fields:

import scrapy


class FirstpcItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    url = scrapy.Field()

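The spider

The main scraping logic goes in the parse method. I did not want to use Scrapy's built-in XPath selectors, so I use BeautifulSoup instead, mainly because it is easier to follow. The code goes in bzmh.py, which was generated by the scrapy genspider bzmh https://www.baozimh.com/ command and lives in the project's spiders folder.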

import scrapy
from bs4 import BeautifulSoup
from ..items import FirstpcItem

class BzmhSpider(scrapy.Spider):
    name = "bzmh"
    allowed_domains = ["www.baozimh.com"]
    start_urls = ["https://www.baozimh.com/classify"]

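    def parse(self, response):
        soup = BeautifulSoup(response.text, "html.parser")
        # Keep only <a> tags that carry both an href and a title attribute.
        links = soup.find_all("a", attrs={"href": True, "title": True})
        for link in links:
            # Create a fresh item per link so yielded items do not share state.
            item = FirstpcItem()
            item["name"] = link["title"]
            # Resolve the href against the page URL instead of concatenating strings.
            item["url"] = response.urljoin(link["href"])
            yield item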
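When turning an href into a full URL, resolving it against the page URL is more forgiving than plain string concatenation, because it copes with hrefs that are root-relative, path-relative, or already absolute. The sketch below uses only the standard library; the helper name absolute_url is hypothetical, and the sample hrefs are assumptions modelled on the URLs shown in this post. Inside a spider, Scrapy's response.urljoin(href) performs the same resolution against the response URL.

```python
from urllib.parse import urljoin

# The listing page the spider starts from.
PAGE_URL = "https://www.baozimh.com/classify"


def absolute_url(base: str, href: str) -> str:
    """Resolve href against base, the way a browser resolves a link."""
    return urljoin(base, href)


# A root-relative href is resolved against the site root.
root_relative = absolute_url(PAGE_URL, "/comic/wuliandianfeng-pikapi")
print(root_relative)  # https://www.baozimh.com/comic/wuliandianfeng-pikapi

# An href that is already absolute is returned unchanged,
# whereas "https://www.baozimh.com" + href would produce a broken URL.
already_absolute = absolute_url(
    PAGE_URL, "https://www.baozimh.com/comic/wuliandianfeng-pikapi"
)
print(already_absolute)  # https://www.baozimh.com/comic/wuliandianfeng-pikapi
```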

The pipeline

Since this is a first attempt at scraping, it is enough to print each item instead of saving it to a database, so pipelines.py only needs a print statement.

class FirstpcPipeline:
    def process_item(self, item, spider):
        print(item)
        return item

Running the spider with scrapy crawl bzmh prints output like this:

2025-02-02 20:06:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.baozimh.com/classify> (referer: None)
{'name': '武煉巔峰', 'url': 'https://www.baozimh.com/comic/wuliandianfeng-pikapi'}
{'name': 'カコミスル老師四格合集',
 'url': 'https://www.baozimh.com/comic/kakomisurulaoshisigeheji-kakomisuru'}
{'name': '中國驚奇先生（神鬼七殺令）',
 'url': 'https://www.baozimh.com/comic/zhongguoliangqixiansheng-quanyingsheng'}
{'name': '殺手古德',
 'url': 'https://www.baozimh.com/comic/shashougude-hangzhouyounuodongmanyouxiangongsi'}
{'name': '公交男女爆笑漫畫',
 'url': 'https://www.baozimh.com/comic/gongjiaonannubaoxiaomanhua-wuyiboruntong'}
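A pipeline only runs if it is registered in the ITEM_PIPELINES setting in settings.py (the generated file contains a commented-out entry for FirstpcPipeline). As a possible next step beyond printing, the sketch below writes each item to a JSON Lines file. The class name JsonLinesPipeline and the file name comics.jsonl are hypothetical and not part of the original project; the open_spider, close_spider, and process_item hooks are the standard pipeline methods.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline: write each scraped item as one JSON line."""

    def __init__(self, path="comics.jsonl"):
        # Scrapy instantiates the pipeline without arguments,
        # so the output path needs a default value.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) works for plain dicts and for scrapy.Item instances.
        # ensure_ascii=False keeps Chinese and Japanese titles readable.
        line = json.dumps(dict(item), ensure_ascii=False)
        self.file.write(line + "\n")
        return item


# To enable it, settings.py would need an entry along these lines
# (assuming the class is placed in firstpc/pipelines.py):
# ITEM_PIPELINES = {
#     "firstpc.pipelines.FirstpcPipeline": 300,
#     "firstpc.pipelines.JsonLinesPipeline": 400,
# }
```

For simple exports a custom pipeline is not strictly necessary: scrapy crawl bzmh -o comics.json uses Scrapy's built-in feed export to save the items.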