A Quick Introduction to Scrapy

Zhang Fei's Tech Blog

Last modified: 2024-04-03 18:03:28

First published: 2024-04-03 17:57:25

Article link: https://blog.youkuaiyun.com/qq_45878803/article/details/137353691

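This article lists the key components of the Scrapy framework: the engine, items, the scheduler, the downloader, spiders, item pipelines, and middlewares. It then walks through installing Scrapy, creating a project and a spider, writing and running the spider, and finally storing the scraped data.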

Introduction

Scrapy is built from the following components:

Engine (Scrapy Engine)

Item

Scheduler

Downloader

Spiders

Item Pipeline

Downloader Middlewares

Spider Middlewares

Scheduler Middlewares
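
To make the roles above concrete, here is a toy, Scrapy-free model of how the components hand work to each other. Everything in it (`FAKE_WEB`, `download`, `spider_parse`, `run_engine`, and the `item:`/`next:` token format) is made up for illustration. It is a sketch of the general data flow, not how Scrapy is implemented, and it leaves out the middlewares entirely.

```python
from collections import deque

# Toy stand-in for the network: URL -> page body (all values are made up).
FAKE_WEB = {
    "page1": "item:a next:page2",
    "page2": "item:b",
}


def download(url):
    """Downloader: fetch the body for a URL."""
    return FAKE_WEB[url]


def spider_parse(body):
    """Spider: turn a body into (kind, value) pairs: items and follow-up URLs."""
    for token in body.split():
        kind, value = token.split(":")
        yield kind, value


def run_engine(start_urls):
    """Engine: move requests and items between the other components."""
    scheduler = deque(start_urls)  # Scheduler: queue of pending URLs
    seen = set(start_urls)         # URLs already queued, to avoid repeats
    stored = []                    # Item pipeline: here it only collects items
    while scheduler:
        url = scheduler.popleft()
        for kind, value in spider_parse(download(url)):
            if kind == "item":
                stored.append(value)
            elif kind == "next" and value not in seen:
                seen.add(value)
                scheduler.append(value)
    return stored


result = run_engine(["page1"])
print(result)
```

The point of the sketch is only the division of labour: the spider never downloads anything and never stores anything, it just turns responses into items and new requests.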

1. Install Scrapy

pip install scrapy

2. Create a Scrapy project

scrapy startproject myspider

3. Create a spider

cd myspider

scrapy genspider itcast itcast.cn

4. Write the spider code

In the generated itcast.py, set the allowed domains and the start URL, then fill in the parse method:

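import scrapy


class ItcastSpider(scrapy.Spider):
    name = "itcast"  # spider name, used by `scrapy crawl`
    # Domains the spider may follow links into; includes the start URL's domain
    allowed_domains = ["itcast.cn", "itheima.com"]
    # start_urls = ["https://itcast.cn"]
    start_urls = ["https://www.itheima.com/teacher.html#ajavaee"]  # start URL

    def parse(self, response):
        # To inspect the raw page, dump the response body to a file:
        # with open('itcast.html', 'wb') as f:
        #     f.write(response.body)
        # Select every <div class="li_txt"> node on the page
        node_list = response.xpath('//div[@class="li_txt"]')
        # print(f'node count: {len(node_list)}')
        for node in node_list:
            temp = {}
            # extract_first()/extract() return plain strings rather than
            # Selector objects, so the result can be serialized to JSON
            temp['name'] = node.xpath('h3/text()').extract_first()
            temp['title'] = node.xpath('h4/text()').extract()  # list of strings
            temp['desc'] = node.xpath('p/text()').extract()  # list of strings
            # print(temp)
            # yield instead of return, so parse can keep producing items
            # after handing this one to the engine
            yield temp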
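
The extraction loop can be tried offline before running a crawl. The sketch below swaps Scrapy's selectors for the standard library's `xml.etree.ElementTree`, which understands a small XPath subset, and runs the same loop over a made-up, well-formed fragment. `SAMPLE` and its contents are assumptions; the real page's markup may differ, and ElementTree needs well-formed XML, which real HTML often is not.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like the nodes the spider selects.
SAMPLE = """
<root>
  <div class="li_txt"><h3>Teacher A</h3><h4>Senior Lecturer</h4><p>Teaches Java.</p></div>
  <div class="li_txt"><h3>Teacher B</h3><h4>Lecturer</h4><p>Teaches Python.</p></div>
  <div class="other"><h3>Not a teacher card</h3></div>
</root>
"""


def parse(markup):
    """Yield one dict per matching node, mirroring the spider's parse loop."""
    root = ET.fromstring(markup)
    # Same attribute predicate as the spider's XPath, in ElementTree's subset.
    for node in root.findall(".//div[@class='li_txt']"):
        yield {
            "name": node.findtext("h3"),                       # first match, like extract_first()
            "title": [el.text for el in node.findall("h4")],   # all matches, like extract()
            "desc": [el.text for el in node.findall("p")],
        }


items = list(parse(SAMPLE))
for item in items:
    print(item)
```

Because `parse` is a generator, each dict is handed over as soon as it is built, which is the same reason the spider uses `yield` rather than `return`.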

5. Run the crawl

scrapy crawl itcast

6. Store the data

settings.py


# Enable the MyspiderPipeline class. The number sets the order in which
# pipelines run: lower values run first. Add more pipeline classes here as needed.
ITEM_PIPELINES = {
    "myspider.pipelines.MyspiderPipeline": 300,
}
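
As a small illustration of the ordering rule, the fragment below registers a second pipeline and computes the order in which items would pass through them. `CleanPipeline` is a hypothetical class name used only for this example, not something Scrapy generates.

```python
# Hypothetical settings fragment: CleanPipeline is an assumed second pipeline class.
ITEM_PIPELINES = {
    "myspider.pipelines.MyspiderPipeline": 300,
    "myspider.pipelines.CleanPipeline": 200,
}

# Items pass through the pipelines in ascending order of these numbers.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```

With these values an item would reach `CleanPipeline` first and `MyspiderPipeline` second, so a cleaning step can run before the item is written to disk.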

pipelines.py

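import json


class MyspiderPipeline:
    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        # UTF-8 is needed because ensure_ascii=False writes non-ASCII text as-is.
        self.file = open('itcast.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # print(f'item: {item}')
        # Convert the item to a plain dict
        item = dict(item)
        # Serialize the dict; each line is one JSON object followed by a comma,
        # so the file as a whole is not a single valid JSON document
        json_data = json.dumps(item, ensure_ascii=False) + ',\n'
        # Write the serialized item to the file
        self.file.write(json_data)
        # Return the item so any later pipeline can process it
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: close the file
        self.file.close()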
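
If the output needs to be read back by a program, one possible variant is to write one JSON object per line (the JSON Lines convention) with no trailing comma. The sketch below is an assumption-laden example rather than part of the original project: the class name `JsonLinesPipeline`, the `path` parameter, the `.jl` file name, and the sample items are all made up. It drives the pipeline by hand, making the calls that Scrapy would normally make, so it runs without a crawl.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical variant: writes one JSON object per line (JSON Lines)."""

    def __init__(self, path="itcast.jl"):
        self.path = path

    def open_spider(self, spider):
        # Called when the spider starts
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # One self-contained JSON object per line, no trailing comma
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Called when the spider finishes
        self.file.close()


# Drive the pipeline by hand with made-up items.
path = os.path.join(tempfile.mkdtemp(), "itcast.jl")
pipeline = JsonLinesPipeline(path)
pipeline.open_spider(spider=None)
pipeline.process_item({"name": "张三", "title": ["讲师"]}, spider=None)
pipeline.process_item({"name": "Li", "title": []}, spider=None)
pipeline.close_spider(spider=None)

# Every line parses on its own, so the file can be read back directly.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(loaded)
```

The trade-off is that the file is a sequence of JSON documents rather than one JSON array, which suits line-by-line processing of large crawls.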
        