Scrapy (1): Basic Spider Operations

Article link: https://blog.youkuaiyun.com/qq_38336343/article/details/104713719

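This post walks through the basic steps of building a crawler with Scrapy: creating a project and a spider from the command line, extracting data with xpath and related methods (including how extract() and extract_first() behave), and finally saving the data through a pipeline.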


Scrapy Notes

1. Creating a spider

## Command-line method

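1. Create the project: scrapy startproject [project name]

2. Create the spider: change into the project directory and run: scrapy genspider [spider name] [domain to crawl]

Note: the spider name must not be the same as the project name.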

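3. Extracting data

Flesh out the spider by filling in the parse method in itcast.py, using xpath and similar methods to pull data out of the response.
extract() returns a list containing the matched data as strings.
extract_first() returns the first string in that list, or None if nothing matched.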

import scrapy


class ItcastSpider(scrapy.Spider):
    name = 'itcast'  # spider name
    allowed_domains = ['itcast.cn']  # domains the spider is allowed to crawl
    start_urls = ['http://www.itcast.cn/channel/teacher.shtml']  # first URL to request

    def parse(self, response):
        # Handles the response for each URL in start_urls.
        # Quick check: grab every teacher name in one call.
        # ret1 = response.xpath("//div[@class='tea_con']//h3/text()").extract()
        # print(ret1)

        # Group by <li> first, then query relative to each group.
        li_list = response.xpath("//div[@class='tea_con']//li")  # select every li node
        for li in li_list:
            item = {}
            item["name"] = li.xpath(".//h3/text()").extract_first()
            item["title"] = li.xpath(".//h4/text()").extract_first()
            # print(item)
            # parse may only yield Request objects, items (Item or dict), or None
            yield item  # handed on to the pipelines

4. Saving data

Use a pipeline to save the data

# In pipelines.py
class MyspiderPipeline:
    def process_item(self, item, spider):  # Scrapy calls this by name, so it must not be renamed
        item["hello"] = "world"
        # print(item)
        return item  # return the item so the next pipeline receives it

# Second pipeline
class MyspiderPipeline1:
    def process_item(self, item, spider):  # Scrapy calls this by name, so it must not be renamed
        print(item)
        return item
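
The two pipelines above only modify and print the item. As a minimal sketch of a pipeline that actually writes to disk, the class below appends each item to a JSON Lines file. The class name JsonLinesPipeline and the file name teachers.jsonl are assumptions made for this example; open_spider and close_spider are the optional hooks Scrapy calls on a pipeline when the spider starts and finishes.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline: writes one JSON object per line."""

    def __init__(self, path="teachers.jsonl"):  # file name is an assumption
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes: flush and close the file.
        self.file.close()

    def process_item(self, item, spider):
        # ensure_ascii=False keeps non-ASCII text readable in the file.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # pass the item on to any later pipeline
```

Like any other pipeline, it would only run after being listed in ITEM_PIPELINES in settings.py, for example under the key 'mySpider.pipelines.JsonLinesPipeline' if it lives in this project's pipelines.py.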


#########################################################
# The item pipelines must be enabled in settings.py
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'mySpider.pipelines.MyspiderPipeline': 300,  # lower number = closer to the engine = runs earlier
    'mySpider.pipelines.MyspiderPipeline1': 301,  # second pipeline, runs after the first
}
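
To make the effect of those numbers concrete, here is a plain-Python simulation of the ordering. It is not Scrapy's actual implementation: run_pipelines, AddHelloPipeline and RecordPipeline are hypothetical names for this sketch, which only mimics the rule that each item passes through the pipelines in ascending order of their number, each one receiving what the previous one returned.

```python
class AddHelloPipeline:
    """Stand-in for MyspiderPipeline: adds a field to the item."""

    def process_item(self, item, spider):
        item["hello"] = "world"
        return item


class RecordPipeline:
    """Stand-in for MyspiderPipeline1: records what it receives instead of printing."""

    def __init__(self):
        self.seen = []

    def process_item(self, item, spider):
        self.seen.append(dict(item))
        return item


def run_pipelines(item, pipelines_with_priority, spider=None):
    # Simulation only: sort by the configured number, lowest first,
    # and feed each pipeline the item returned by the previous one.
    ordered = sorted(pipelines_with_priority, key=lambda pair: pair[1])
    for pipeline, _priority in ordered:
        item = pipeline.process_item(item, spider)
    return item


recorder = RecordPipeline()
# Deliberately listed out of order; the numbers decide the sequence.
configured = [(recorder, 301), (AddHelloPipeline(), 300)]
result = run_pipelines({"name": "Alice"}, configured)
```

Because 300 sorts before 301, the recording pipeline already sees the hello field that the first one added. A pipeline that forgets to return the item would hand None to the next one in this simulation, which is why both process_item methods above end with return item.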