Getting Started with the Spider Class: the Core of Scrapy

The Spider class defines the workflow for crawling a site and the methods for parsing the results.

  • Define the actions for crawling a site
  • Parse the pages that were crawled

Demo

scrapy startproject scrapyspiderdome

cd scrapyspiderdome
scrapy genspider httpbin www.httpbin.org
# httpbin.py
import scrapy  
  
  
class HttpbinSpider(scrapy.Spider):  
    name = "httpbin"  
    allowed_domains = ["www.httpbin.org"]  
    start_urls = ["https://www.httpbin.org"]  
  
    def parse(self, response):  
        print("url", response.url)  
        print("request", response.request)  
        print("status", response.status)  
        print("headers", response.headers)  
        print("text", response.text)  
        print("meta", response.meta)

Run

scrapy crawl httpbin
  • url: the URL of the requested page
  • request: the Request object corresponding to this Response
  • status: the HTTP status code of the response
  • headers: the response headers
  • text: the response body
  • meta: extra information; parameters passed along with the Request are attached to the meta attribute

Note: no initial request is declared explicitly here. That is because Spider implements a default start_requests method for us, whose code is as follows:

def start_requests(self):  
    for url in self.start_urls:  
        yield Request(url, dont_filter=True)

The logic: read start_urls and yield a Request for each URL. No callback is specified, so it defaults to the parse method.
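For clarity, here is a minimal sketch of what that default is equivalent to, with the callback written out explicitly (dont_filter=True exempts these initial URLs from the duplicate filter):

from scrapy import Request

def start_requests(self):
    # equivalent explicit form: the callback defaults to self.parse anyway
    for url in self.start_urls:
        yield Request(url, callback=self.parse, dont_filter=True)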

Customizing the Initial Requests

Override the start_requests method. For example, to customize the request URLs and the callback method, rewrite start_requests as follows:

from typing import Iterable  
  
import scrapy  
from scrapy import Request  
  
  
class HttpbinSpider(scrapy.Spider):  
    name = "httpbin"  
    allowed_domains = ["www.httpbin.org"]  
    headers = {  
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36"  
    }  
    start_url = "https://www.httpbin.org/get"  
    cookies = {'name': 'xiaoding', 'age': '18'}  
  
    def start_requests(self) -> Iterable[Request]:  
        for offset in range(5):  
            url = self.start_url + f'?offset={offset}'  
            # pass the loop variable (not the string 'offset') via meta,
            # and route the response to parse_response through callback
            yield Request(url, headers=self.headers, cookies=self.cookies,
                          meta={'offset': offset}, callback=self.parse_response)
  
    def parse_response(self, response):  
        print("url", response.url)  
        print("request", response.request)  
        print("status", response.status)  
        print("headers", response.headers)  
        print("text", response.text)  
        print("meta", response.meta)
  • By overriding start_requests, we no longer rely on start_urls to generate URLs; instead we build them in a for loop
  • We assign custom headers to the Request
  • We attach cookies
  • callback: we declare a parse_response method and set the Request's callback to parse_response, so parse_response is called once the request succeeds
  • We set meta (see the sketch after this list for reading it back in the callback)
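As referenced above, a minimal sketch of reading the meta value back inside the callback; the 'offset' key is the one set in start_requests, and nothing beyond the standard Response API is assumed:

def parse_response(self, response):
    # meta set on the Request is carried over onto the Response
    offset = response.meta.get('offset')
    print(f"offset={offset} for {response.url}")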
# Run
scrapy crawl httpbin
# output omitted

POST Requests from a Spider
There are two main POST body formats:

  • Form Data
  • JSON

Correspondingly, there are two POST Request classes:

  • FormRequest
  • JsonRequest

An example:

import scrapy  
from scrapy.http import JsonRequest, FormRequest  
  
  
class HttpbinSpider(scrapy.Spider):  
    name = "httpbin"  
    allowed_domains = ["www.httpbin.org"]  
    start_url = "https://www.httpbin.org/post"  
    data = {'name': 'xiaoding', 'age': '18'}  
  
    def start_requests(self):  
        # FormRequest sends an application/x-www-form-urlencoded body
        yield FormRequest(self.start_url,  
                          callback=self.parse_response,  
                          formdata=self.data)  
        # JsonRequest serializes data as an application/json body
        yield JsonRequest(self.start_url,  
                          callback=self.parse_response,  
                          data=self.data)  
  
    def parse_response(self, response):  
        print("text", response.text)

Run output: just observe the values of the form and data fields.

2025-04-13 16:52:54 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
text {
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "age": "18",
    "name": "xiaoding"
  },
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en",
    "Content-Length": "20",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "www.httpbin.org",
    "User-Agent": "Scrapy/2.12.0 (+https://scrapy.org)",
    "X-Amzn-Trace-Id": "Root=1-67fb7b62-695bf647590aae95479c57ee"
  },
  "json": null,
  "origin": "222.89.54.117",
  "url": "https://www.httpbin.org/post"
}

2025-04-13 16:52:55 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <POST https://www.httpbin.org/post> (failed 1 times): 502 Bad Gateway
2025-04-13 16:52:55 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
text {
  "args": {},
  "data": "{\"age\": \"18\", \"name\": \"xiaoding\"}",
  "files": {},
  "form": {},
  "headers": {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en",
    "Content-Length": "33",
    "Content-Type": "application/json",
    "Host": "www.httpbin.org",
    "User-Agent": "Scrapy/2.12.0 (+https://scrapy.org)",
    "X-Amzn-Trace-Id": "Root=1-67fb7b63-6a9c16240584efb6548d92d0"
  },
  "json": {
    "age": "18",
    "name": "xiaoding"
  },
  "origin": "222.89.54.117",
  "url": "https://www.httpbin.org/post"
}
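Instead of printing the raw text, the JSON body httpbin returns can be parsed directly; a minimal sketch using response.json() (available on Scrapy's TextResponse since 2.2, so it applies to the 2.12.0 run above):

def parse_response(self, response):
    payload = response.json()  # parse the JSON body returned by httpbin
    # FormRequest data comes back under 'form'; JsonRequest data under 'json'
    print("form:", payload.get('form'))
    print("json:", payload.get('json'))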
