The Spider class defines how a site is crawled and how the crawled pages are parsed. It is responsible for:
- defining the actions used to crawl the site
- parsing the pages that were crawled
Demo
scrapy startproject scrapyspiderdome
cd scrapyspiderdome
scrapy genspider httpbin www.httpbin.org
# httpbin.py
import scrapy


class HttpbinSpider(scrapy.Spider):
    name = "httpbin"
    allowed_domains = ["www.httpbin.org"]
    start_urls = ["https://www.httpbin.org"]

    def parse(self, response):
        print("url", response.url)
        print("request", response.request)
        print("status", response.status)
        print("headers", response.headers)
        print("text", response.text)
        print("meta", response.meta)
Run it:
scrapy crawl httpbin
- url: the URL of the requested page
- request: the Request object that produced this Response
- status: the HTTP status code of the response
- headers: the response headers
- text: the response body as text
- meta: extra information; parameters attached to a Request's meta come back on this attribute (see the sketch after this list)
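For instance, meta is the usual way to carry data from one request into the next callback. A minimal sketch (ChainSpider and the follow-up URL are made up for illustration):

import scrapy
from scrapy import Request


class ChainSpider(scrapy.Spider):
    name = "chain"
    start_urls = ["https://www.httpbin.org/get"]

    def parse(self, response):
        # attach data to the follow-up request via meta
        yield Request("https://www.httpbin.org/headers",
                      callback=self.parse_next,
                      meta={"source_url": response.url})

    def parse_next(self, response):
        # the attached data comes back on response.meta
        print("came from:", response.meta["source_url"])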
Note: we never explicitly declared the initial requests here. That is because Spider provides a default start_requests implementation for us, which looks like this:
def start_requests(self):
    for url in self.start_urls:
        # dont_filter=True lets the initial requests bypass the duplicate filter
        yield Request(url, dont_filter=True)
Logic: it reads start_urls and yields a Request for each URL. No callback is specified, so the default callback, parse, is used.
Customizing the initial requests
Override the start_requests method. For example, to customize the request URLs and the callback, rewrite start_requests as follows:
from typing import Iterable

import scrapy
from scrapy import Request


class HttpbinSpider(scrapy.Spider):
    name = "httpbin"
    allowed_domains = ["www.httpbin.org"]
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36"
    }
    start_url = "https://www.httpbin.org/get"
    cookies = {'name': 'xiaoding', 'age': '18'}

    def start_requests(self) -> Iterable[Request]:
        for offset in range(5):
            url = self.start_url + f'?offset={offset}'
            # pass the current offset value (not the string 'offset') via meta,
            # and route the response to the custom callback
            yield Request(url,
                          headers=self.headers,
                          cookies=self.cookies,
                          callback=self.parse_response,
                          meta={'offset': offset})

    def parse_response(self, response):
        print("url", response.url)
        print("request", response.request)
        print("status", response.status)
        print("headers", response.headers)
        print("text", response.text)
        print("meta", response.meta)
- By overriding start_requests we no longer rely on start_urls to generate URLs; instead a for loop builds them.
- We set custom headers on each Request (a project-wide alternative is sketched after this list).
- We add cookies to each Request.
- callback: we declared a parse_response method and set each Request's callback to parse_response, so parse_response is invoked once the request succeeds.
- We set meta on each Request.
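If the same headers should apply to every request, they can also be configured project-wide in settings.py rather than per Request. A minimal sketch using Scrapy's USER_AGENT and DEFAULT_REQUEST_HEADERS settings (the values are just the ones from the example above):

# settings.py
# user agent applied to every request the project sends
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36"
# extra default headers merged into each request (still overridable per Request)
DEFAULT_REQUEST_HEADERS = {
    "Accept-Language": "en",
}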
# Run
scrapy crawl httpbin
# output omitted
POST requests in a Spider
There are two common POST payload formats:
- form data
- JSON
Scrapy provides a matching Request subclass for each:
- FormRequest
- JsonRequest
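Both are thin convenience wrappers over a plain Request: FormRequest URL-encodes the fields and sets Content-Type: application/x-www-form-urlencoded, while JsonRequest serializes the payload and sets Content-Type: application/json. A rough hand-rolled equivalent (a sketch of the idea, not Scrapy's actual implementation):

import json
from urllib.parse import urlencode

from scrapy import Request

# roughly what FormRequest does: URL-encode the fields into the body
form_req = Request("https://www.httpbin.org/post",
                   method="POST",
                   body=urlencode({"name": "xiaoding", "age": "18"}),
                   headers={"Content-Type": "application/x-www-form-urlencoded"})

# roughly what JsonRequest does: serialize the payload into the body
json_req = Request("https://www.httpbin.org/post",
                   method="POST",
                   body=json.dumps({"name": "xiaoding", "age": "18"}),
                   headers={"Content-Type": "application/json"})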
In practice, use the wrappers directly:
import scrapy
from scrapy.http import JsonRequest, FormRequest


class HttpbinSpider(scrapy.Spider):
    name = "httpbin"
    allowed_domains = ["www.httpbin.org"]
    start_url = "https://www.httpbin.org/post"
    data = {'name': 'xiaoding', 'age': '18'}

    def start_requests(self):
        # form-encoded POST: the payload goes through the formdata argument
        yield FormRequest(self.start_url,
                          callback=self.parse_response,
                          formdata=self.data)
        # JSON POST: the payload goes through the data argument
        yield JsonRequest(self.start_url,
                          callback=self.parse_response,
                          data=self.data)

    def parse_response(self, response):
        print("text", response.text)
Output: note how the form payload shows up under the form field and the JSON payload under the data/json fields:
2025-04-13 16:52:54 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
text {
"args": {},
"data": "",
"files": {},
"form": {
"age": "18",
"name": "xiaoding"
},
"headers": {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Encoding": "gzip, deflate",
"Accept-Language": "en",
"Content-Length": "20",
"Content-Type": "application/x-www-form-urlencoded",
"Host": "www.httpbin.org",
"User-Agent": "Scrapy/2.12.0 (+https://scrapy.org)",
"X-Amzn-Trace-Id": "Root=1-67fb7b62-695bf647590aae95479c57ee"
},
"json": null,
"origin": "222.89.54.117",
"url": "https://www.httpbin.org/post"
}
2025-04-13 16:52:55 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <POST https://www.httpbin.org/post> (failed 1 times): 502 Bad Gateway
2025-04-13 16:52:55 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
text {
"args": {},
"data": "{\"age\": \"18\", \"name\": \"xiaoding\"}",
"files": {},
"form": {},
"headers": {
"Accept": "application/json, text/javascript, */*; q=0.01",
"Accept-Encoding": "gzip, deflate",
"Accept-Language": "en",
"Content-Length": "33",
"Content-Type": "application/json",
"Host": "www.httpbin.org",
"User-Agent": "Scrapy/2.12.0 (+https://scrapy.org)",
"X-Amzn-Trace-Id": "Root=1-67fb7b63-6a9c16240584efb6548d92d0"
},
"json": {
"age": "18",
"name": "xiaoding"
},
"origin": "222.89.54.117",
"url": "https://www.httpbin.org/post"
}
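Rather than eyeballing response.text, the callback can parse these bodies directly. A minimal sketch using response.json(), which Scrapy's TextResponse has provided since version 2.2:

def parse_response(self, response):
    payload = response.json()  # parse the JSON body into a dict
    # httpbin echoes form fields under "form" and JSON payloads under "json"
    print("form:", payload["form"])
    print("json:", payload["json"])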