Python Web Scraping: the Requests Library

This article introduces the basics of web scraping with Python's requests library, including sending GET and POST requests and parsing response content, with worked examples of fetching a web page, submitting a search keyword, and downloading an image from the web.

1. The get() Method

r = requests.get(url, params=None, **kwargs)

url: the URL of the page to fetch
params: extra parameters to append to the URL, as a dict or byte stream
**kwargs: optional keyword arguments that control the request

Returns a Response object:

>>> import requests
>>> r = requests.get("http://www.baidu.com")
>>> type(r)
<class 'requests.models.Response'>
>>> r.headers
{'Content-Encoding': 'gzip', 'Transfer-Encoding': 'chunked', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Server': 'bfe/1.0.8.18', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:36 GMT', 'Connection': 'Keep-Alive', 'Pragma': 'no-cache', 'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Date': 'Sat, 03 Feb 2018 15:35:50 GMT', 'Content-Type': 'text/html'}

Attributes of the Response object

>>> print(r.status_code)
200
>>> r.encoding
'ISO-8859-1'
>>> r.apparent_encoding
'utf-8'
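
r.encoding is guessed from the HTTP response headers and falls back to ISO-8859-1 when the Content-Type header carries no charset, which is why it often disagrees with r.apparent_encoding, the encoding inferred from the page content itself. A common defensive pattern wraps the request in error handling and fixes the encoding before reading r.text; a minimal sketch (the function name get_html is mine):

import requests

def get_html(url):
    """Fetch a page and return its decoded text, or None on failure."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()                 # raises HTTPError on 4xx/5xx status codes
        r.encoding = r.apparent_encoding     # override the header-based guess
        return r.text
    except requests.RequestException:
        return None

html = get_html("http://www.baidu.com")
if html:
    print(html[:300])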

2. The HTTP Protocol
URL format:

http://host[:port][path]

host: a valid Internet host name or IP address
port: port number (80 by default for HTTP)
path: path of the requested resource

HTTP operations on resources:
GET: request the resource at the URL
HEAD: request only the header information of the resource
POST: append new data to the resource at the URL
The post() method in Requests:

>>> payload = {'key1':'value1', 'key2':'value2'}
>>> r = requests.post('http://httpbin.org/post',data = payload)
>>> print(r.text)
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "key1": "value1",
    "key2": "value2"
  },
>>> r = requests.post('http://httpbin.org/post',data = 'ABC')
>>> print(r.text)
{
  "args": {},
  "data": "ABC",
  "files": {},
  "form": {},

PUT: store a resource at the URL location, overwriting the resource that was there
The put() method in Requests:

>>> r = requests.put('http://httpbin.org/put',data = payload)
>>> print(r.text)
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "key1": "value1",
    "key2": "value2"
  },

PATCH: partially update the resource at the URL
DELETE: delete the resource stored at the URL

The advantage of PATCH over PUT is bandwidth: PATCH submits only the fields being changed, while PUT must resend the entire resource.
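
The remaining verbs map onto requests functions of the same names. A minimal sketch of head() and patch(), reusing the httpbin.org test service from the examples above:

import requests

# head() asks for the response headers only; the body is never transferred
r = requests.head('http://httpbin.org/get')
print(r.headers)
print(repr(r.text))     # '' : a HEAD response carries no body

# patch() submits only the fields being modified
r = requests.patch('http://httpbin.org/patch', data={'key1': 'new value'})
print(r.status_code)    # 200 if httpbin accepted the partial update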

3. The Robots Protocol
Robots protocol examples:
JD.com
https://www.jd.com/robots.txt

User-agent: * 
Disallow: /?* 
Disallow: /pop/*.html 
Disallow: /pinpai/*.html?* 
User-agent: EtaoSpider 
Disallow: / 
User-agent: HuihuiSpider 
Disallow: / 
User-agent: GwdangSpider 
Disallow: / 
User-agent: WochachaSpider 
Disallow: /

Baidu (https://www.baidu.com/robots.txt)

User-agent: Baiduspider
Disallow: /baidu
Disallow: /s?
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: Googlebot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
...
User-agent: *
Disallow: /
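
A crawler can check these rules programmatically. One possible approach, a minimal sketch using the standard library's urllib.robotparser (not part of requests), assuming jd.com still serves the rules quoted above:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://www.jd.com/robots.txt')
rp.read()                # download and parse the rules
print(rp.can_fetch('*', 'https://www.jd.com/'))           # True: the homepage is not disallowed for generic agents
print(rp.can_fetch('EtaoSpider', 'https://www.jd.com/'))  # False: EtaoSpider is banned from the whole site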

4. Scraping an Amazon Product Page

>>> r = requests.get('https://www.amazon.cn/gp/product/B071VSDKCF')
>>> r.status_code
200
>>> r.request.headers
{'Connection': 'keep-alive', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'User-Agent': 'python-requests/2.14.2'}

The headers above show the request announcing itself as python-requests, which some sites refuse to serve. Passing a browser-like User-Agent through the headers keyword argument changes what the server sees:

>>> kv = {'user-agent':'Mozilla/5.0'}
>>> url = 'https://www.amazon.cn/gp/product/B071VSDKCF'
>>> r = requests.get(url, headers = kv)
>>> r.request.headers
{'Connection': 'keep-alive', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'user-agent': 'Mozilla/5.0'}
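
headers is only one of the optional keyword arguments behind **kwargs in section 1; timeout and proxies are two others that crawlers commonly use. A minimal sketch (the proxy address is a made-up placeholder):

import requests

url = 'https://www.amazon.cn/gp/product/B071VSDKCF'
kv = {'user-agent': 'Mozilla/5.0'}

# timeout: abort if the server has not answered within 10 seconds
r = requests.get(url, headers=kv, timeout=10)
print(r.status_code)

# proxies: route traffic through a proxy; the address below is a made-up placeholder
# proxies = {'http': 'http://10.10.10.1:1080'}
# r = requests.get(url, headers=kv, proxies=proxies)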

5. Submitting a Keyword to Baidu

>>> kv = {'wd':'Python'}
>>> r = requests.get("http://www.baidu.com/s", params=kv)
>>> r.status_code
200
>>> r.request.url
'http://www.baidu.com/s?wd=Python'
>>> len(r.text)
315911
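
requests percent-encodes whatever is passed through params, so non-ASCII keywords need no manual escaping. A minimal sketch; the commented URL is what requests is expected to produce for a UTF-8 keyword:

import requests

# requests percent-encodes the params dict before appending it to the URL
r = requests.get("http://www.baidu.com/s", params={'wd': '爬虫'})
print(r.request.url)    # expected: http://www.baidu.com/s?wd=%E7%88%AC%E8%99%AB
print(r.status_code)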

6. Fetching and Saving an Image from the Web

>>> root = "D://"
>>> url = "https://www.nationalgeographic.com/content/dam/photography/rights-exempt/best-of-photo-of-the-day/2018/january/08_best-pod-january-18.adapt.1190.1.jpg"
>>> path = root + url.split('/')[-1]
>>> r = requests.get(url)
>>> r.status_code
200
>>> with open(path, 'wb') as f:
...     f.write(r.content)
...
The with statement closes the file automatically, so no explicit f.close() call is needed.
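
A more defensive version of the same download, written as a plain script: it creates the target directory if missing, skips files that already exist, and catches network errors. A sketch of one common structure, using the same url and root as above:

import os
import requests

url = "https://www.nationalgeographic.com/content/dam/photography/rights-exempt/best-of-photo-of-the-day/2018/january/08_best-pod-january-18.adapt.1190.1.jpg"
root = "D://"
path = root + url.split('/')[-1]         # name the file after the last URL segment

try:
    if not os.path.exists(root):
        os.mkdir(root)
    if os.path.exists(path):
        print("file already exists:", path)
    else:
        r = requests.get(url, timeout=30)
        r.raise_for_status()             # bail out on 4xx/5xx responses
        with open(path, 'wb') as f:
            f.write(r.content)           # r.content is the raw bytes of the body
        print("saved to", path)
except Exception as e:
    print("download failed:", e)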