The requests library

gs_every

License: CC 4.0 BY-SA

Categories: Notebook recommendations, Crawlers. Tags: url, crawler libraries, requests library

Article link: https://blog.youkuaiyun.com/s1h2e3n4g5/article/details/75258721

Install the requests library with: pip install requests
import requests

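r = requests.get(url)

Two objects are involved: requests.get builds a Request object that is sent to the server, and it returns a Response object.
The Response object (r) holds the content that the crawler gets back.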

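r.status_code  # HTTP status of the request: 200 means success; other codes such as 404 mean failure

r.text  # HTTP response body as a string, i.e. the page content at the url

r.encoding  # response encoding guessed from the HTTP headers
# if the headers carry no charset, the encoding is assumed to be ISO-8859-1

r.apparent_encoding  # response encoding inferred from the content itself (fallback encoding)

r.content  # HTTP response body in binary form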
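The difference between the header-based guess and the real encoding is easiest to see on a small offline sample. The sketch below is illustrative only: the header dictionaries and the page bytes are made up, and `requests.utils.get_encoding_from_headers` is used here as the helper behind the header-based guess.

```python
from requests.utils import get_encoding_from_headers

# Header with an explicit charset: the declared value is used.
declared = get_encoding_from_headers({"content-type": "text/html; charset=utf-8"})

# Text header without a charset: the guess falls back to ISO-8859-1.
fallback = get_encoding_from_headers({"content-type": "text/html"})

# Made-up page body stored as UTF-8 bytes, similar to what r.content holds.
content = "中文页面".encode("utf-8")

garbled = content.decode(fallback, errors="replace")  # decoded with the wrong codec
correct = content.decode("utf-8")                     # decoded with the right codec

print(declared, fallback)
print(garbled != correct)
```

This is why a page that omits its charset can come out garbled in r.text until r.encoding is set to a better value such as r.apparent_encoding.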

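requests exceptions

requests.ConnectionError  # network connection error, e.g. DNS lookup failure or a firewall refusing the connection

requests.HTTPError  # HTTP error

requests.URLRequired  # missing URL

requests.TooManyRedirects  # the maximum number of redirects was exceeded

requests.ConnectTimeout  # timed out while connecting to the remote server

requests.Timeout  # the request to the URL timed out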
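These exceptions form a hierarchy, so the order of the checks matters when telling them apart. The following is a minimal sketch, not a complete error-handling strategy: the `describe` helper and its labels are hypothetical, and the exceptions are created by hand so that the code runs without a network.

```python
import requests


def describe(exc):
    """Map an exception to a short label, checking the most specific class first."""
    if isinstance(exc, requests.ConnectTimeout):
        return "connect timeout"
    if isinstance(exc, requests.Timeout):
        return "timeout"
    if isinstance(exc, requests.ConnectionError):
        return "connection error"
    if isinstance(exc, requests.HTTPError):
        return "http error"
    if isinstance(exc, requests.RequestException):
        return "other requests error"
    return "not a requests error"


# Exceptions are instantiated directly instead of being triggered by real requests.
for exc in (requests.ConnectTimeout(), requests.Timeout(),
            requests.ConnectionError(), requests.HTTPError(),
            requests.TooManyRedirects(), ValueError()):
    print(type(exc).__name__, "->", describe(exc))
```

ConnectTimeout is both a ConnectionError and a Timeout, which is why it is tested before either of them. All of the classes listed above derive from requests.RequestException, so catching that one class covers them all.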

General code framework

# coding=utf-8
import requests


def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raises HTTPError for 4xx and 5xx status codes
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return "request failed"


if __name__ == '__main__':
    url = "http://www.baidu.com"
    print(getHTMLText(url))
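To check which status codes make raise_for_status() raise, without touching the network, a Response object can be built by hand. This is only a sketch for experimenting: `status_is_ok` is a hypothetical helper, and a hand-built Response has no body or URL.

```python
import requests


def status_is_ok(status_code):
    """Return True if raise_for_status() accepts the code, False if it raises."""
    r = requests.Response()      # empty response created locally, nothing is sent
    r.status_code = status_code
    try:
        r.raise_for_status()
        return True
    except requests.HTTPError:
        return False


for code in (200, 301, 404, 500):
    print(code, status_is_ok(code))
```

Only client errors (4xx) and server errors (5xx) raise HTTPError; success and redirect codes pass through.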

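HTTP and the requests library methods

HTTP is the Hypertext Transfer Protocol.
It uses URLs to identify and locate network resources. URL format:
http://host[:port][path]
host: a valid Internet domain name or IP address
port: port number, 80 by default
path: path of the requested resource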
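The three parts of the URL format can be pulled apart with the standard library. The sketch below assumes plain http URLs; the example.com addresses and the `split_http_url` helper are made up for illustration.

```python
from urllib.parse import urlsplit

DEFAULT_HTTP_PORT = 80  # used when the URL does not name a port


def split_http_url(url):
    """Return (host, port, path) for an http URL."""
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else DEFAULT_HTTP_PORT
    return parts.hostname, port, parts.path


print(split_http_url("http://www.example.com:8080/docs/index.html"))
print(split_http_url("http://www.example.com/docs/index.html"))
```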

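The requests functions correspond one-to-one to the HTTP methods:
GET <-> requests.get: request the resource at the URL
HEAD <-> requests.head: request only the header information of the resource
POST <-> requests.post: append new data to the resource at the URL
PUT <-> requests.put: store a resource at the URL, overwriting the existing resource
PATCH <-> requests.patch: change part of the resource at the URL
DELETE <-> requests.delete: delete the resource at the URL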

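requests.request(method, url, **kwargs)

method takes one of 7 values:
requests.request('GET', url, **kwargs)
and likewise 'HEAD', 'POST', 'PUT', 'PATCH', 'DELETE', 'OPTIONS'

r = requests.request('GET', url)
print(r.request.url)
This prints the URL that was actually submitted to the server.

**kwargs covers 13 optional parameters: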

1. params: dictionary or byte sequence, added to the url as query parameters

kv={'key1':'value1','key2':'value2'}
r=requests.request('GET','http://python123.io/ws',params=kv)
print(r.url)
#http://python123.io/ws?key1=value1&key2=value2

2. data: dictionary, byte sequence or file object, used as the body of the Request

kv={'key1':'value1','key2':'value2'}
r=requests.request('POST','http://python123.io/ws',data=kv)
body='body content'
r=requests.request('POST','http://python123.io/ws',data=body)

3. json: data in JSON format, used as the body of the Request

kv={'key1':'value1'}
r=requests.request('POST','http://python123.io/ws',json=kv)

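4. headers: dictionary of custom HTTP headers

hd={'user-agent':'Chrome/10'}
r=requests.request('POST','http://python123.io/ws',headers=hd)

5. cookies: dictionary or CookieJar, the cookies sent with the Request

6. auth: tuple, enables HTTP authentication

7. files: dictionary, used to transfer files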

fs={'file':open('data.xls','rb')}
r=requests.request('POST','http://python123.io/ws',files=fs)

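This can be used to upload a file to the URL.

8. timeout: timeout in seconds

r=requests.request('GET','http://www.baidu.com',timeout=10)

9. proxies: dictionary, sets the proxy servers to go through; login credentials can be included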

pxs={
'http':'http://user:pass@10.10.10.1:1234','https':'https://10.10.10.1:4321'
}
r=requests.request('GET','http://www.baidu.com',proxies=pxs)

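10. allow_redirects: True/False, default True; switch for following redirects

11. stream: True/False, default False; with False the response body is downloaded immediately, with True the download is deferred until the content is accessed

12. verify: True/False, default True; switch for verifying the SSL certificate

13. cert: path to a local SSL certificate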
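The effect of params, data, json and headers can be inspected without sending anything, by preparing a request and looking at what would go on the wire. This is a sketch under that assumption: the `prepare` helper is hypothetical, and a request prepared this way does not include the default headers that a real call adds.

```python
import json
import requests

URL = "http://python123.io/ws"
kv = {"key1": "value1", "key2": "value2"}


def prepare(method, **kwargs):
    """Build a PreparedRequest for URL without sending it."""
    return requests.Request(method, URL, **kwargs).prepare()


with_params = prepare("GET", params=kv)     # goes into the query string
with_data = prepare("POST", data=kv)        # form-encoded body
with_json = prepare("POST", json=kv)        # JSON body
with_headers = prepare("GET", headers={"user-agent": "Chrome/10"})

print(with_params.url)
print(with_data.body, with_data.headers["Content-Type"])
print(json.loads(with_json.body), with_json.headers["Content-Type"])
print(with_headers.headers["user-agent"])
```

params changes the URL, while data and json leave the URL alone and fill the body, each with its own Content-Type header.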

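requests.get(url, params=None, **kwargs)

∙ params: extra parameters for the url, as a dictionary or byte stream; optional
∙ **kwargs: the other 12 access-control parameters

requests.head(url, **kwargs)
∙ **kwargs: the 13 access-control parameters

Common usage

Accessing a site with modified headers: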

# coding=utf-8
import requests
url = 'https://www.amazon.cn/gp/product/B01M8L5Z3Y'
# Some sites inspect the headers, identify a crawler and refuse access;
# changing the headers can get the request through.
try:
    kv = {'user-agent': 'Mozilla/5.0'}
    r = requests.get(url, headers=kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[:3000])
except requests.RequestException:
    print('error')

Keyword search on Baidu and 360

import requests
keyword = 'python'
url = 'http://www.baidu.com/s'
try:
    kv = {'wd': keyword}
    r = requests.get(url, params=kv)
    r.encoding = r.apparent_encoding
    print(r.request.url)
    r.raise_for_status()
    print(len(r.text))
except requests.RequestException:
    print("error")

The keyword is passed to Baidu as a key-value pair.
For 360, change the key 'wd' to 'q'.
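The two search engines differ only in the endpoint and the name of the keyword parameter, so the URL can be built from a small table. This is a sketch: `build_search_url` is a hypothetical helper, and the 360 endpoint http://www.so.com/s is an assumption that does not appear in the text above.

```python
import requests

# Endpoint and keyword parameter per search engine.
# The 360 endpoint is an assumption, not taken from the article.
SEARCH_ENGINES = {
    "baidu": ("http://www.baidu.com/s", "wd"),
    "360": ("http://www.so.com/s", "q"),
}


def build_search_url(engine, keyword):
    """Return the search URL for keyword without sending a request."""
    base, key = SEARCH_ENGINES[engine]
    return requests.Request("GET", base, params={key: keyword}).prepare().url


print(build_search_url("baidu", "python"))
print(build_search_url("360", "python"))
```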

# Complete code for downloading an image
import requests
import os
url = "http://p4.so.qhimgs1.com/sdr/1365_768_/t010d55b47d360d1ea4.jpg"
root = "D://picture//"
path = root + url.split('/')[-1]

try:
    if not os.path.exists(root):
        os.mkdir(root)  # create the directory if it does not exist yet
    if not os.path.exists(path):
        r = requests.get(url)
        r.raise_for_status()  # do not save an error page as an image
        with open(path, 'wb') as f:
            f.write(r.content)
        print('File saved')
    else:
        print('File already exists')
except (requests.RequestException, OSError):
    print("Download failed")
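For large files, the stream parameter described earlier lets the body be written to disk in chunks instead of being held in memory as r.content. The following is a minimal sketch of that variant, not the article's method: `filename_from_url` and `download` are hypothetical helpers, and the chunk size and timeout values are arbitrary choices.

```python
import os
from urllib.parse import urlsplit

import requests


def filename_from_url(url):
    """Use the last path segment of the URL as the file name."""
    return os.path.basename(urlsplit(url).path)


def download(url, root, chunk_size=8192, timeout=30):
    """Stream the resource at url into the directory root and return the saved path."""
    os.makedirs(root, exist_ok=True)
    path = os.path.join(root, filename_from_url(url))
    with requests.get(url, stream=True, timeout=timeout) as r:
        r.raise_for_status()
        with open(path, "wb") as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)
    return path


# Only the file name is computed here; calling download() needs network access.
print(filename_from_url("http://p4.so.qhimgs1.com/sdr/1365_768_/t010d55b47d360d1ea4.jpg"))
```

Taking the name from the URL path rather than from url.split('/') also keeps a query string, if any, out of the file name.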

# IP address lookup
import requests
url = "http://www.ip138.com/ips138.asp?ip="
ip = '222.30.196.193'
try:
    r = requests.get(url + ip)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(len(r.text))
    print(r.text[-700:])
except requests.RequestException:
    print('Request failed')


# Lookup of the home region of a mobile phone number
import requests
url = "http://www.ip138.com:8080/search.asp?mobile="
mobile = '1593183'
try:
    r = requests.get(url + mobile)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(len(r.text))
    print(r.text[-700:])
except requests.RequestException:
    print('Request failed')

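# By watching how the url changes after a button is clicked, the url can be built directly, which saves clicking by hand

kv = {'k': 'v', 'x': 'y'}
r = requests.request('GET', 'http://python123.io/ws', params=kv)
print(r.url)

-> http://python123.io/ws?k=v&x=y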