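Introduction
requests is a third-party HTTP library for Python (it is not part of the standard library and is installed with pip install requests). It handles sending HTTP requests and processing the responses, and code written with it is much more concise than the equivalent urllib code.
Request types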
- requests.post
- requests.get
- requests.delete
- requests.head
- requests.options
POST and GET requests are called in the same way, so GET is used for the examples here.
GET request methods
Method 1 (query parameters appended to the URL):
#-*- coding: utf-8 -*-
import requests
r = requests.get('http://www.python.org?name=user&age=18')
print(r.text)
Method 2 (query parameters passed as a dict through params):
#-*- coding: utf-8 -*-
import requests
data = {'name':'user','age':'18'}
r = requests.get('http://www.python.org',params=data)
print(r.text)
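The two methods should produce the same request URL. A minimal sketch to check this without sending anything, assuming the request is built with Request(...).prepare() (not otherwise used at this point in the text) and that a trailing slash is added to the URL so the two strings compare cleanly:

```python
from requests import Request

# Method 1: query string written directly into the URL
direct = Request('GET', 'http://www.python.org/?name=user&age=18').prepare()

# Method 2: query parameters passed as a dict
data = {'name': 'user', 'age': '18'}
via_params = Request('GET', 'http://www.python.org/', params=data).prepare()

# prepare() only builds the request; nothing is sent over the network
print(direct.url)
print(via_params.url)
same_url = direct.url == via_params.url
print(same_url)
```

The dict form is usually easier to maintain because requests takes care of URL-encoding the values.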
GET request examples
Common GET request
import requests
r = requests.get('http://www.python.org')
print("Type:",type(r))
print("Status:",r.status_code)
print("Response headers:",r.headers)
print("Response body type:",type(r.text))
print("Response body:",r.text)
print("Cookies:",r.cookies)
Type: <class 'requests.models.Response'>
Status: 200
Response headers: {'Server': 'nginx', 'Content-Type': 'text/html; charset=utf-8', 'X-Frame-Options': 'DENY', 'Via': '1.1 vegur, 1.1 varnish, 1.1 varnish', 'Content-Length': '49230', 'Accept-Ranges': 'bytes', 'Date': 'Fri, 17 May 2019 03:50:39 GMT', 'Age': '1597', 'Connection': 'keep-alive', 'X-Served-By': 'cache-iad2142-IAD, cache-tyo19921-TYO', 'X-Cache': 'HIT, HIT', 'X-Cache-Hits': '1, 2272', 'X-Timer': 'S1558065040.883277,VS0,VE0', 'Vary': 'Cookie', 'Strict-Transport-Security': 'max-age=63072000; includeSubDomains'}
Response body type: <class 'str'>
Response body: <html>...omitted...</html>
Cookies: <RequestsCookieJar[]>
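As shown above, the response body (r.text) is a string. When the body contains JSON, the json() method parses it into Python objects such as dicts and lists. If the body is not valid JSON, json() raises a decode error (json.decoder.JSONDecodeError), so the call should be guarded when the content type is not certain.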
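A minimal sketch of guarding the json() call. The helper names json_or_none and fake_response are hypothetical. To run offline, fake_response builds a Response by hand and sets a private attribute, which is a demo shortcut and not something to do in real code:

```python
import requests

def json_or_none(response):
    """Return the parsed JSON body, or None if the body is not valid JSON."""
    try:
        return response.json()
    except ValueError:  # the JSON decode errors raised here subclass ValueError
        return None

def fake_response(body: bytes) -> requests.Response:
    """Offline stand-in for a real response (sets a private attribute; demo only)."""
    resp = requests.Response()
    resp.status_code = 200
    resp._content = body
    return resp

parsed = json_or_none(fake_response(b'{"name": "user", "age": 18}'))
print(parsed, type(parsed))

not_json = json_or_none(fake_response(b'<html>not json</html>'))
print(not_json)
```

With a real response the call is simply json_or_none(requests.get(url)).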
GET request combined with the re module to filter content
#-*- coding: utf-8 -*-
import requests,re
# Request headers
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}
r = requests.get('https://www.douban.com/',headers=headers) # GET request
a_tag = re.findall("<a.*?</a>",r.text) # non-greedy regex, so each match ends at the nearest </a>
print("All <a> tags on the page:",a_tag)
All <a> tags on the page: ['<a target="_blank" class="lnk-book" href="https://book.douban.com">豆瓣读书</a>', '<a target="_blank" class="lnk-movie" href="https://movie.douban.com">豆瓣电影</a>', '<a target="_blank" class="lnk-music" href="https://music.douban.com">豆瓣音乐</a>',...omitted...]
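One detail worth knowing when matching tags with a regex: a greedy pattern such as <a.*</a> runs to the last </a> on a line, while the non-greedy <a.*?</a> stops at the first one. The sketch below shows the difference on a made-up one-line HTML string (the markup is an assumption for illustration):

```python
import re

# Two <a> tags on a single line, as often happens in minified HTML
html = '<a href="/book">Book</a> | <a href="/movie">Movie</a>'

greedy = re.findall(r"<a.*</a>", html)   # .* extends to the LAST </a> on the line
lazy = re.findall(r"<a.*?</a>", html)    # .*? stops at the FIRST </a>

print(len(greedy), greedy)
print(len(lazy), lazy)
```

The greedy pattern only appears to work when every tag happens to sit on its own line, because . does not match a newline by default.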
GET request for binary data (an image in this example)
#-*- coding: utf-8 -*-
import requests
r = requests.get('http://www.soso.com/soso/images/favicon_new.ico')
print("Extracted as text:",r.text)
print("Extracted as bytes:",r.content)
Extracted as text: h ( ��� ��� ��� ��� ������]���‰���±�������
Extracted as bytes: b'\x00\x00\x01\x00\x01\x00\x10\x10\x00\x00\x01\x00 \x00h\x04\x00\x00\x16\x00\x00\x00(\x00\x00\x00\x10\x00\x00\x00 \x00\x00\x00\x01'
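Since r.content holds the raw bytes, a natural next step is writing them to a file unchanged. A minimal sketch, where save_binary and download are hypothetical helper names and the output path is an assumption:

```python
from pathlib import Path
import requests

def save_binary(content: bytes, path: str) -> int:
    """Write raw bytes to path and return the number of bytes written."""
    return Path(path).write_bytes(content)

def download(url: str, path: str, timeout: float = 10) -> int:
    """Fetch url and store the body unchanged (hypothetical helper)."""
    r = requests.get(url, timeout=timeout)
    r.raise_for_status()                  # fail early on 4xx/5xx responses
    return save_binary(r.content, path)   # r.content is bytes, so nothing is decoded

# Example call (needs network access):
# download('http://www.soso.com/soso/images/favicon_new.ico', 'favicon.ico')
```

Using r.text here instead would decode the bytes as text first and corrupt the image.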
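POST request examples
Plain POST requests are not covered separately, because they only require replacing get with post. The examples below apply to most cases where data is submitted through a POST request.
Submitting data with a POST request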
import requests
data = {'name':'user','age':'18'}
r = requests.post('http://www.aaa.com',data=data)
Uploading a file with a POST request
import requests
with open('favicon.ico','rb') as f: # binary mode; the file is closed when the block ends
    r = requests.post('http://www.aaa.com',files={'file':f})
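To see what data= and files= actually put on the wire, the request can be prepared without being sent. This is a sketch under assumptions: it uses Request(...).prepare() for inspection only, and the in-memory bytes and the file name are placeholders standing in for a real file:

```python
from requests import Request

url = 'http://www.aaa.com'

# data= with a dict is sent as a URL-encoded HTML form
form = Request('POST', url, data={'name': 'user', 'age': '18'}).prepare()
print(form.headers['Content-Type'])
print(form.body)

# files= is sent as multipart/form-data; the bytes below stand in for a real file
upload = Request('POST', url, files={'file': ('favicon.ico', b'\x00\x00\x01\x00')}).prepare()
print(upload.headers['Content-Type'].split(';')[0])
print(type(upload.body))
```

The multipart Content-Type also carries a generated boundary string, which is why only the part before the semicolon is printed.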
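Getting cookies
Cookies are read from the cookies object of the response. The Cookie value can also be seen in the browser developer tools under Network > Headers.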
#-*- coding: utf-8 -*-
import requests
r = requests.get("https://www.zhihu.com/")
print("Cookies object:",r.cookies)
# Iterate over the cookies and print each one
for key,value in r.cookies.items():
    print("Cookie:",key + '=' + value)
Cookies object: <RequestsCookieJar[<Cookie _xsrf=ccDJCJBWZFXbSdOIqS5fVks34CVhFwQ8 for .zhihu.com/>, <Cookie tgw_l7_route=a37704a413efa26cf3f23813004f1a3b for www.zhihu.com/>]>
Cookie: _xsrf=ccDJCJBWZFXbSdOIqS5fVks34CVhFwQ8
Cookie: tgw_l7_route=a37704a413efa26cf3f23813004f1a3b
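Sending cookies with a request (this is only a small example. Many sites now use anti-scraping measures and also require an Authorization header. Anti-scraping is covered later, and this example only shows the syntax)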
#-*- coding: utf-8 -*-
import requests
headers={
    # the HTTP request header is named Cookie (singular), not Cookies
    'Cookie':'tgw_l7_route=116a747939468d99065d12a386ab1c5f; _xsrf=PZRQSNJ7BbRYSF3C43RmJqeJg5IDjQm6; tst=r; q_c1=3cd18195e35d4c0f8ec94ec70bbec158|1558078820000|1558078820000',
    'Host': 'www.aaa.com',
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36',
}
r = requests.get('https://www.aaa.com/',headers=headers)
print(r.text)
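Besides writing the header by hand, requests also accepts a cookies= argument that builds the Cookie header from a dict. A minimal sketch, assuming a single placeholder cookie and a shortened User-Agent, with the request prepared but not sent so the resulting header can be inspected:

```python
from requests import Request

# Placeholder cookie for illustration; real values come from the site
cookies = {'tst': 'r'}
headers = {'User-Agent': 'Mozilla/5.0'}

prepped = Request('GET', 'https://www.aaa.com/', headers=headers, cookies=cookies).prepare()

# requests generated the Cookie header from the dict
cookie_header = prepped.headers['Cookie']
print(cookie_header)
```

The same dict can be passed directly when sending, as in requests.get(url, headers=headers, cookies=cookies).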
Session persistence
Test without a session
#-*- coding: utf-8 -*-
import requests
r = requests.get('http://www.httpbin.org/cookies/set/number/123456')
print("First request (sets the cookie):",r.text)
r2 = requests.get('http://www.httpbin.org/cookies')
print("Another URL on the same domain:",r2.text)
First request (sets the cookie): {
  "cookies": {
    "number": "123456"
  }
}
Another URL on the same domain: {
  "cookies": {}
}
With a Session, cookies persist across requests, so they do not have to be attached to every request by hand
#-*- coding: utf-8 -*-
import requests
s = requests.Session() # keeps cookies between requests
r = s.get('http://www.httpbin.org/cookies/set/number/123456')
print("First request (sets the cookie):",r.text)
r2 = s.get('http://www.httpbin.org/cookies')
print("Another URL on the same domain:",r2.text)
First request (sets the cookie): {
  "cookies": {
    "number": "123456"
  }
}
Another URL on the same domain: {
  "cookies": {
    "number": "123456"
  }
}
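The mechanism behind this can be sketched offline: a Session owns a cookie jar and adds its contents to every request it prepares. In the sketch below the cookie is set by hand as a stand-in for a server's Set-Cookie response, and nothing is sent over the network:

```python
import requests

with requests.Session() as s:   # the with block closes the session afterwards
    # Stand-in for a cookie that a server would set via Set-Cookie
    s.cookies.set('number', '123456')

    # Any later request prepared through the session carries the cookie
    req = requests.Request('GET', 'http://www.httpbin.org/cookies')
    prepped = s.prepare_request(req)
    session_cookie_header = prepped.headers.get('Cookie')
    print(session_cookie_header)

# A request prepared outside the session has no Cookie header
plain = requests.Request('GET', 'http://www.httpbin.org/cookies').prepare()
plain_cookie_header = plain.headers.get('Cookie')
print(plain_cookie_header)
```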
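SSL warnings and skipping verification
To keep a request from failing because of an invalid SSL certificate configuration or a similar problem, skip certificate verification with verify=False and suppress the warning that urllib3 then emits. Note that verify=False turns certificate checking off entirely: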
#-*- coding: utf-8 -*-
import requests
from requests.packages import urllib3
urllib3.disable_warnings()
r = requests.get('https://www.python.org/',verify=False)
print(r.status_code)
Proxy settings
Basic proxy configuration
#-*- coding: utf-8 -*-
import requests
proxies = {
    "http":"http://10.0.0.10:3128",
    "https":"https://10.0.0.10:1080", # the scheme needs the colon: https://
}
requests.get("https://www.python.org",proxies=proxies)
Proxy configuration with HTTP Basic Auth
#-*- coding: utf-8 -*-
import requests
proxies = {
'http':'http://user:password@host:port',
'https':'http://user:password@host:port',
}
requests.get("https://www.python.org",proxies=proxies)
SOCKS proxy (requires installing 'requests[socks]' with pip)
#-*- coding: utf-8 -*-
import requests
proxies = {
'http':'socks5://user:password@host:port',
'https':'socks5://user:password@host:port',
}
requests.get("https://www.python.org",proxies=proxies)
Timeout settings
#-*- coding: utf-8 -*-
import requests
r = requests.get('http://www.python.org',timeout=10)
print(r.status_code)
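When the timeout is exceeded, requests raises an exception instead of returning a response. A minimal sketch of handling that, where get_status is a hypothetical helper, the (3.05, 10) values are just example limits, and the optional session parameter is an assumption added to make the function easy to test:

```python
import requests

def get_status(url, timeout=(3.05, 10), session=None):
    """Return the HTTP status code, or None if the request timed out.

    timeout=(connect, read) sets the two limits separately, in seconds;
    a single number applies the same limit to both.
    """
    http = session or requests
    try:
        return http.get(url, timeout=timeout).status_code
    except requests.exceptions.Timeout:   # covers ConnectTimeout and ReadTimeout
        return None

# Example call (needs network access):
# print(get_status('http://www.python.org'))
```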
Authentication
HTTPBasicAuth authentication
#-*- coding: utf-8 -*-
import requests
from requests.auth import HTTPBasicAuth
r = requests.get('http://localhost:5000',auth=HTTPBasicAuth('username','password'))
print(r.status_code)
requests treats a (username, password) tuple passed to auth as HTTPBasicAuth by default, so there is a more concise form
#-*- coding: utf-8 -*-
import requests
r = requests.get('http://localhost:5000',auth=('username','password'))
print(r.status_code)
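Both forms should end up producing the same Authorization header. A minimal sketch that checks this offline by preparing the requests without sending them (the use of Request(...).prepare() and base64 here is for inspection only):

```python
import base64
from requests import Request
from requests.auth import HTTPBasicAuth

url = 'http://localhost:5000'

# Explicit HTTPBasicAuth object versus the tuple shorthand
explicit = Request('GET', url, auth=HTTPBasicAuth('username', 'password')).prepare()
shorthand = Request('GET', url, auth=('username', 'password')).prepare()

auth_header = shorthand.headers['Authorization']
print(auth_header)
print(explicit.headers['Authorization'] == auth_header)

# The value is base64("username:password"): encoded, not encrypted
token = auth_header.split(' ', 1)[1]
decoded = base64.b64decode(token).decode()
print(decoded)
```

Because the credentials are only base64-encoded, Basic Auth is normally used over HTTPS.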
Other authentication (OAuth as an example; the requests_oauthlib module must be installed with pip first)
#-*- coding: utf-8 -*-
import requests
from requests_oauthlib import OAuth1
url = 'https://api.twitter.com/1.1/account/verify_credentials.json'
auth = OAuth1('YOUR_APP_KEY','YOUR_APP_SECRET','USER_OAUTH_TOKEN','USER_OAUTH_TOKEN_SECRET')
requests.get(url,auth=auth)
Prepared Request data structure
A request can be represented as a data structure, with all of its parameters carried by a PreparedRequest object
#-*- coding: utf-8 -*-
from requests import Request,Session
url = 'http://httpbin.org/post'
data = {'name':'user'}
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}
s = Session() # create a Session object
req = Request('POST',url,data=data,headers=headers) # create a Request object
prepped = s.prepare_request(req) # convert it to a PreparedRequest with the session's prepare_request()
r = s.send(prepped) # send the prepared request with s.send()
print(r.text)
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "name": "user"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Content-Length": "9",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36"
  },
  "json": null,
  "origin": "119.98.241.138, 119.98.241.138",
  "url": "https://httpbin.org/post"
}
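One practical use of a PreparedRequest is inspecting or adjusting a request before it is sent. The sketch below prepares the same POST and reads its attributes; the X-Demo header is a made-up name for illustration, and nothing is sent:

```python
from requests import Request, Session

s = Session()
req = Request('POST', 'http://httpbin.org/post', data={'name': 'user'})
prepped = s.prepare_request(req)

# Everything that will go on the wire is available as plain attributes
print(prepped.method)
print(prepped.url)
print(prepped.body)
print(prepped.headers['Content-Type'])
print(prepped.headers['Content-Length'])

# The prepared request can still be adjusted before calling s.send(prepped)
prepped.headers['X-Demo'] = 'example'   # hypothetical header, for illustration only
```

The body name=user is 9 bytes long, which matches the Content-Length value echoed back in the response above.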
Scraping example
Scraping Maoyan movie information
#-*- coding: utf-8 -*-
import requests,re,json

def get_one_page(url,headers):
    r = requests.get(url,headers=headers) # request the page
    return r.text

def video_name(html):
    # custom regex; a raw string keeps \d from being treated as a string escape
    pattern = re.compile(r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a.*?>(.*?)</a>.*?star">[\n]*(.*?)</p>.*?releasetime">(.*?)</p>',re.S)
    s = re.findall(pattern,html) # find all matches
    return s

def write_to_file(content):
    with open('result.txt','a',encoding='utf8') as f: # append mode; the file is created if missing
        for i in content: # iterate so each entry is stored on its own line
            f.write(json.dumps(i,ensure_ascii=False)+'\n') # write each entry as one JSON line

def main():
    url = 'https://maoyan.com/board/4'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}
    html = get_one_page(url,headers) # fetch the page
    con = video_name(html) # filter the content with the regex
    print("Matched content (rank / image URL / title / stars / release date):",con)
    write_to_file(con) # write to file

main()
["1", "https://p1.meituan.net/movie/20803f59291c47e1e116c11963ce019e68711.jpg@160w_220h_1e_1c", "霸王别姬", " 主演:张国荣,张丰毅,巩俐\n ", "上映时间:1993-01-01"]
["2", "https://p0.meituan.net/movie/283292171619cdfd5b240c8fd093f1eb255670.jpg@160w_220h_1e_1c", "肖申克的救赎", " 主演:蒂姆·罗宾斯,摩根·弗里曼,鲍勃·冈顿\n ", "上映时间:1994-09-10(加拿大)"]
["3", "https://p0.meituan.net/movie/289f98ceaa8a0ae737d3dc01cd05ab052213631.jpg@160w_220h_1e_1c", "罗马假日", " 主演:格利高里·派克,奥黛丽·赫本,埃迪·艾伯特\n ", "上映时间:1953-09-02(美国)"]
["4", "https://p1.meituan.net/movie/6bea9af4524dfbd0b668eaa7e187c3df767253.jpg@160w_220h_1e_1c", "这个杀手不太冷", " 主演:让·雷诺,加里·奥德曼,娜塔莉·波特曼\n ", "上映时间:1994-09-14(法国)"]
["5", "https://p1.meituan.net/movie/b607fba7513e7f15eab170aac1e1400d878112.jpg@160w_220h_1e_1c", "泰坦尼克号", " 主演:莱昂纳多·迪卡普里奥,凯特·温丝莱特,比利·赞恩\n ", "上映时间:1998-04-03"]
["6", "https://p0.meituan.net/movie/da64660f82b98cdc1b8a3804e69609e041108.jpg@160w_220h_1e_1c", "唐伯虎点秋香", " 主演:周星驰,巩俐,郑佩佩\n ", "上映时间:1993-07-01(中国香港)"]
["7", "https://p0.meituan.net/movie/46c29a8b8d8424bdda7715e6fd779c66235684.jpg@160w_220h_1e_1c", "魂断蓝桥", " 主演:费雯·丽,罗伯特·泰勒,露塞尔·沃特森\n ", "上映时间:1940-05-17(美国)"]
["8", "https://p0.meituan.net/movie/223c3e186db3ab4ea3bb14508c709400427933.jpg@160w_220h_1e_1c", "乱世佳人", " 主演:费雯·丽,克拉克·盖博,奥利维娅·德哈维兰\n ", "上映时间:1939-12-15(美国)"]
["9", "https://p1.meituan.net/movie/ba1ed511668402605ed369350ab779d6319397.jpg@160w_220h_1e_1c", "天空之城", " 主演:寺田农,鹫尾真知子,龟山助清\n ", "上映时间:1992"]
["10", "https://p0.meituan.net/movie/b0d986a8bf89278afbb19f6abaef70f31206570.jpg@160w_220h_1e_1c", "辛德勒的名单", " 主演:连姆·尼森,拉尔夫·费因斯,本·金斯利\n ", "上映时间:1993-12-15(美国)"]
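The extracted fields still carry leading whitespace, trailing newlines and label prefixes, as the output above shows. A minimal sketch of a cleanup step, run against a hand-written HTML snippet: the markup, the example.com URL and the helper names strip_label and to_record are all assumptions for illustration, and the real page may be structured differently:

```python
import re

# Same pattern as video_name() above, written as a raw string
pattern = re.compile(
    r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a.*?>(.*?)</a>'
    r'.*?star">[\n]*(.*?)</p>.*?releasetime">(.*?)</p>', re.S)

# Hypothetical markup shaped to fit the pattern; the real page may differ
sample_html = '''
<dd>
  <i class="board-index board-index-1">1</i>
  <img data-src="https://example.com/poster.jpg" alt="" />
  <p class="name"><a href="/films/1">霸王别姬</a></p>
  <p class="star">
      主演:张国荣,张丰毅,巩俐
  </p>
  <p class="releasetime">上映时间:1993-01-01</p>
</dd>
'''

def strip_label(text):
    """Trim whitespace and drop a leading 'label:' prefix (ASCII or full-width colon)."""
    return re.split('[:：]', text.strip(), maxsplit=1)[-1]

def to_record(match):
    """Turn one regex tuple into a dict with cleaned-up fields."""
    rank, image, title, stars, released = match
    return {
        'rank': int(rank),
        'image': image,
        'title': title.strip(),
        'stars': strip_label(stars),
        'released': strip_label(released),
    }

records = [to_record(m) for m in pattern.findall(sample_html)]
print(records)
```

Dicts with named keys also make the JSON lines written to result.txt easier to read back later than positional lists.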