1. The get() method
r = requests.get(url, params=None, **kwargs)
url: URL of the page to fetch
params: extra query parameters appended to the URL, as a dict or byte stream (optional)
**kwargs: optional keyword arguments that control the request (headers, timeout, etc.)
Returns a Response object.
>>> import requests
>>> r = requests.get("http://www.baidu.com")
>>> type(r)
<class 'requests.models.Response'>
>>> r.headers
{'Content-Encoding': 'gzip', 'Transfer-Encoding': 'chunked', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Server': 'bfe/1.0.8.18', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:36 GMT', 'Connection': 'Keep-Alive', 'Pragma': 'no-cache', 'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Date': 'Sat, 03 Feb 2018 15:35:50 GMT', 'Content-Type': 'text/html'}
Attributes of the Response object:
>>> print(r.status_code)
200
>>> r.encoding
'ISO-8859-1'
>>> r.apparent_encoding
'utf-8'
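Note the mismatch above: r.encoding is guessed from the HTTP headers ('ISO-8859-1' is the default when no charset is declared), while r.apparent_encoding is sniffed from the page content. A common pattern is to fall back to the content-based guess before reading r.text. A minimal sketch (the SimpleNamespace object is only a stand-in for a real Response, used here so the snippet runs without a network call):

```python
from types import SimpleNamespace

def fix_encoding(r):
    """If the header-based guess is missing or the ISO-8859-1 default,
    switch to the content-based guess before reading r.text."""
    if r.encoding is None or r.encoding.lower() == 'iso-8859-1':
        r.encoding = r.apparent_encoding
    return r.encoding

# Stand-in for the Baidu response in the session above (not a real request):
r = SimpleNamespace(encoding='ISO-8859-1', apparent_encoding='utf-8')
print(fix_encoding(r))  # 'utf-8'
```

With a real Response the same call works unchanged, since it only touches the encoding and apparent_encoding attributes.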
2. The HTTP protocol
URL format:
http://host[:port][path]
host: a legal Internet host domain name or IP address
port: port number (defaults to 80 for HTTP)
path: path of the requested resource
HTTP operations on resources:
GET: request the resource at the URL
HEAD: request only the header information of the resource at the URL
POST: append new data to the resource at the URL
The post() method in Requests:
>>> payload = {'key1':'value1', 'key2':'value2'}
>>> r = requests.post('http://httpbin.org/post',data = payload)
>>> print(r.text)
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "key1": "value1",
    "key2": "value2"
  },
  ...
}
>>> r = requests.post('http://httpbin.org/post',data = 'ABC')
>>> print(r.text)
{
  "args": {},
  "data": "ABC",
  "files": {},
  "form": {},
  ...
}
PUT: store a resource at the URL, overwriting whatever is there
The put() method in Requests:
>>> r = requests.put('http://httpbin.org/put',data = payload)
>>> print(r.text)
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "key1": "value1",
    "key2": "value2"
  },
  ...
}
PATCH: partially update the resource at the URL
DELETE: delete the resource stored at the URL
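The practical difference between PATCH and PUT is bandwidth: PATCH sends only the fields that change, while PUT must resend the complete resource or lose the omitted fields. The sketch below uses requests.Request(...).prepare() to build (but not send) both requests, so it runs offline; the resource fields are made up for illustration.

```python
import requests

# Hypothetical resource with two fields; only the email changes.
full = {'name': 'alice', 'email': 'alice@example.com'}
change = {'email': 'alice@new.example.com'}

# PUT must carry the complete resource to avoid losing the name field.
put_req = requests.Request('PUT', 'http://httpbin.org/put', data=full).prepare()
# PATCH carries only the modified field.
patch_req = requests.Request('PATCH', 'http://httpbin.org/patch', data=change).prepare()

print(put_req.method, put_req.body)
print(patch_req.method, patch_req.body)
```

Calling requests.patch(url, data=change) would send the same body over the network.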
3. The Robots protocol
Example robots.txt files:
JD.com
https://www.jd.com/robots.txt
User-agent: *
Disallow: /?*
Disallow: /pop/*.html
Disallow: /pinpai/*.html?*
User-agent: EtaoSpider
Disallow: /
User-agent: HuihuiSpider
Disallow: /
User-agent: GwdangSpider
Disallow: /
User-agent: WochachaSpider
Disallow: /
Baidu (https://www.baidu.com/robots.txt)
User-agent: Baiduspider
Disallow: /baidu
Disallow: /s?
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
User-agent: Googlebot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
...
User-agent: *
Disallow: /
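Python's standard library can evaluate rules like these with urllib.robotparser. One caveat: RobotFileParser matches Disallow paths as literal prefixes, so wildcard patterns such as /?* are not interpreted; the sketch below therefore uses a simplified version of the jd.com rules quoted above.

```python
from urllib.robotparser import RobotFileParser

# Simplified from the jd.com robots.txt above (wildcard rules reduced
# to plain prefixes, which is all robotparser understands).
jd_rules = """\
User-agent: EtaoSpider
Disallow: /

User-agent: *
Disallow: /pop/
"""

rp = RobotFileParser()
rp.parse(jd_rules.splitlines())
rp.modified()  # mark the rules as freshly loaded so can_fetch() trusts them

print(rp.can_fetch('EtaoSpider', 'https://www.jd.com/anything'))  # banned entirely
print(rp.can_fetch('MyCrawler', 'https://www.jd.com/pop/page.html'))
print(rp.can_fetch('MyCrawler', 'https://www.jd.com/'))
```

In a real crawler you would call rp.set_url('https://www.jd.com/robots.txt') and rp.read() instead of parsing an inline string.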
4. Scraping an Amazon product page
>>> r = requests.get('https://www.amazon.cn/gp/product/B071VSDKCF')
>>> r.status_code
200
>>> r.request.headers
{'Connection': 'keep-alive', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'User-Agent': 'python-requests/2.14.2'}
>>> kv = {'user-agent':'Mozilla/5.0'}
>>> url = 'https://www.amazon.cn/gp/product/B071VSDKCF'
>>> r = requests.get(url, headers = kv)
>>> r.request.headers
{'Connection': 'keep-alive', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'user-agent': 'Mozilla/5.0'}
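The session above suggests wrapping page fetches in a reusable function: send a browser-like User-Agent, call raise_for_status() so non-2xx codes become exceptions, and fix the encoding before reading the text. A minimal sketch (the fallback string and the 10-second timeout are my own choices, not part of the Requests API):

```python
import requests

def get_html(url, timeout=10):
    """Fetch a page and return its text, or a failure marker on any error."""
    try:
        r = requests.get(url,
                         headers={'user-agent': 'Mozilla/5.0'},
                         timeout=timeout)
        r.raise_for_status()              # raise HTTPError for 4xx/5xx responses
        r.encoding = r.apparent_encoding  # decode by the sniffed charset
        return r.text
    except requests.RequestException:     # covers connection, timeout, HTTP errors
        return 'access failed'
```

For example, get_html('https://www.amazon.cn/gp/product/B071VSDKCF') returns the page text on success and 'access failed' otherwise, instead of raising.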
5. Submitting a search keyword to Baidu
>>> kv = {'wd':'Python'}
>>> r = requests.get("http://www.baidu.com/s", params=kv)
>>> r.status_code
200
>>> r.request.url
'http://www.baidu.com/s?wd=Python'
>>> len(r.text)
315911
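params handles URL encoding for you, which matters once the keyword contains spaces or non-ASCII characters. You can inspect the URL that would be sent, without any network traffic, by preparing the request:

```python
import requests

# Preparing (not sending) a request shows the exact URL params would build.
simple = requests.Request('GET', 'http://www.baidu.com/s',
                          params={'wd': 'Python'}).prepare()
print(simple.url)  # same URL as r.request.url in the session above

# Non-ASCII keywords are percent-encoded automatically:
cjk = requests.Request('GET', 'http://www.baidu.com/s',
                       params={'wd': 'Python 爬虫'}).prepare()
print(cjk.url)
```

requests.get(url, params=kv) builds exactly this URL before sending it.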
6. Fetching and saving an image from the web
>>> root = "D://"
>>> url = "https://www.nationalgeographic.com/content/dam/photography/rights-exempt/best-of-photo-of-the-day/2018/january/08_best-pod-january-18.adapt.1190.1.jpg"
>>> path = root + url.split('/')[-1]
>>> r = requests.get(url)
>>> r.status_code
200
>>> with open(path, 'wb') as f:
...     f.write(r.content)
...
The with statement closes the file automatically, so no explicit f.close() is needed.
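The session above hard-codes the save path and assumes the request succeeded. A slightly more defensive sketch (the helper names and the './images' default directory are my own choices):

```python
import os
import requests

def image_path(url, root='./images'):
    """Name the local file after the last segment of the URL path."""
    return os.path.join(root, url.split('/')[-1])

def save_image(url, root='./images'):
    """Download url into root, skipping files that already exist."""
    os.makedirs(root, exist_ok=True)   # create the target directory if missing
    path = image_path(url, root)
    if not os.path.exists(path):       # skip files already downloaded
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        with open(path, 'wb') as f:    # binary mode: r.content is bytes
            f.write(r.content)
    return path
```

Calling save_image() with the National Geographic URL above would store the file as ./images/08_best-pod-january-18.adapt.1190.1.jpg.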