Web Scraping Basics: requests and Urllib

Article link: https://blog.youkuaiyun.com/Zhang__ZQ/article/details/114747072

Web Scraping Basics, Part 1

Basic usage of requests
Introduction to the Urllib library
- What Urllib is

Basic usage of requests

import requests
response = requests.request('GET', 'https://weibo.com/')  # generic form; the helper functions below wrap it

requests.get('https://weibo.com/')  # GET request

requests.post('https://weibo.com/')  # POST request

requests.put('https://weibo.com/')  # PUT request (submits a full replacement of the data)

requests.delete('https://weibo.com/')  # DELETE request

requests.head('https://weibo.com/')  # HEAD request

requests.patch('https://weibo.com/')  # PATCH request (submits a partial modification of the data)

In practice, GET and POST are the two methods we use most often when scraping web pages.
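As a rough offline illustration of what these two methods put on the wire, the sketch below builds a GET and a POST request with requests.Request and prepare() without contacting any server. The query parameter q and the form field key are made-up values for illustration only; they are not part of the original example.

```python
import requests

# Build (but do not send) a GET request; `params` is appended to the URL.
# The parameter name "q" is a hypothetical example value.
get_req = requests.Request('GET', 'https://weibo.com/', params={'q': 'python'}).prepare()
print(get_req.method, get_req.url)

# Build (but do not send) a POST request; a dict passed as `data`
# becomes a form-encoded request body.
# The field name "key" is a hypothetical example value.
post_req = requests.Request('POST', 'https://weibo.com/', data={'key': 'value'}).prepare()
print(post_req.method, post_req.body)
print(post_req.headers['Content-Type'])
```

In short, GET carries its parameters in the URL, while POST carries them in the request body.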

>>> import requests
>>> response = requests.get('https://www.weibo.com/')
# requests.get returns a Response object,
# which stores the content of the server's response
>>> response
<Response [200]>
>>> response.text
	# (HTML source of the Weibo page)
>>> response.encoding
'ISO-8859-1'
# The encoding reported here is ISO-8859-1; the following code switches it to utf-8
>>> response.encoding = 'utf-8'
>>> response.encoding
'utf-8'
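To see why the encoding setting matters, here is a small sketch that needs no network access. It assumes the page body is UTF-8 text (the two-character string is just a stand-in for a real page) and decodes the same bytes in both ways, which is roughly what response.text does with response.content and response.encoding.

```python
# The same bytes decoded with two different encodings.
raw_bytes = '微博'.encode('utf-8')          # stand-in for response.content

garbled = raw_bytes.decode('ISO-8859-1')    # wrong encoding: unreadable characters
correct = raw_bytes.decode('utf-8')         # right encoding: the original text

print(repr(garbled))
print(correct)
```

This is the reason for setting response.encoding before reading response.text when the server does not declare a charset.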

Common attributes:

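>>> r.status_code  # HTTP status code; 200 means success, 404 means the resource was not found
>>> r.content  # response body as raw bytes
>>> r.text  # response body as a string, decoded using r.encoding
>>> r.headers  # response headers stored in a dict-like object whose keys are case-insensitive; r.headers.get(key) returns None if the key is missing
>>> # (don't confuse r.headers with the headers argument passed when sending a request: the former is an attribute of the Response object, the latter is a parameter you supply)
>>> r.encoding  # encoding taken from the HTTP response headers (it may be missing or wrong)
>>> r.apparent_encoding  # encoding guessed from the content itself (usually more reliable, but still only a guess)
>>> r.raw  # the underlying raw response, a urllib3 response object; pass stream=True to the request, then read it with r.raw.read()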

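One attribute deserves special mention: r.request.headers shows the headers of the HTTP request that was sent. Be careful not to confuse it with r.headers, which holds the response headers.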
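The case-insensitive behaviour of r.headers can be tried out offline. The sketch below uses requests.structures.CaseInsensitiveDict as a stand-in for r.headers; the header value is made up for illustration.

```python
from requests.structures import CaseInsensitiveDict

# Stand-in for r.headers; the header value here is made up for illustration.
headers = CaseInsensitiveDict({'Content-Type': 'text/html; charset=utf-8'})

print(headers['content-type'])      # lookup ignores the case of the key
print(headers.get('X-Missing'))     # .get() returns None for a missing key

try:
    headers['X-Missing']            # indexing a missing key raises KeyError
except KeyError:
    missing_raises = True
else:
    missing_raises = False
print(missing_raises)
```

So the "returns None" behaviour applies to .get(); plain indexing still raises KeyError, as with an ordinary dict.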

Common methods:

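>>> r.raise_for_status()  # raises requests.HTTPError for a failed request (4xx or 5xx status code)
Exceptions in the Requests library:
>>> requests.ConnectionError: network connection error, such as a DNS lookup failure or a refused connection
>>> requests.HTTPError: HTTP error
>>> requests.URLRequired: a URL is required but missing
>>> requests.TooManyRedirects: the maximum number of redirects was exceeded
>>> requests.ConnectTimeout: the attempt to connect to the remote server timed out
>>> requests.Timeout: the request timed out (covers both connect and read timeouts)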
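The method and exceptions above can be combined into a small helper. This is only a sketch: the function name safe_get and the default timeout of 5 seconds are illustrative choices, not something the Requests library prescribes.

```python
import requests

def safe_get(url, timeout=5):
    """Return the response body as text, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()      # turn 4xx/5xx responses into HTTPError
        return response.text
    except requests.Timeout:
        # Listed before ConnectionError because ConnectTimeout inherits from both.
        print('request timed out:', url)
    except requests.ConnectionError:
        print('could not connect:', url)
    except requests.HTTPError as exc:
        print('bad HTTP status:', exc)
    except requests.RequestException as exc:
        # Base class of the exceptions above; catches anything else requests raises.
        print('request failed:', exc)
    return None
```

A call such as safe_get('https://www.weibo.com/') then returns either the page text or None, so the caller does not need its own try/except.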

Applying requests:

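To fetch web pages with requests, one more step matters a great deal: setting the headers of the request.
A server can read the User-Agent field in the request headers to judge whether the request comes from a normal browser or from a crawler.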

>>> url = 'https://www.weibo.com/'
>>> headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'}
>>> response = requests.get(url, headers=headers)

With a browser-like User-Agent set this way, requests can usually fetch the page the same way a normal browser would.
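To make the difference concrete, the sketch below compares the User-Agent that requests sends by default with the one supplied above. It only prepares the request through a Session and does not send anything, so it runs offline.

```python
import requests

url = 'https://www.weibo.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'}

# What requests sends when no User-Agent is supplied.
default_ua = requests.utils.default_user_agent()
print(default_ua)

# Prepare (without sending) the request to inspect the final headers.
session = requests.Session()
prepared = session.prepare_request(requests.Request('GET', url, headers=headers))
print(prepared.headers['User-Agent'])
```

The default value identifies the client as python-requests, which is what makes a script easy for a server to recognise.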

Introduction to the Urllib library

What Urllib is

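Urllib is a module for working with URLs. It consists mainly of the following four submodules:
urllib.request: opens and reads URLs
urllib.error: contains the exceptions raised by urllib.request
urllib.parse: parses URLs
urllib.robotparser: parses a site's robots.txt file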
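As a brief offline sketch of two of these submodules, the example below splits and builds URLs with urllib.parse and checks a set of robots.txt rules with urllib.robotparser. The search path, query values, example.com URLs, and the robots rules are all hypothetical and chosen only for illustration.

```python
from urllib.parse import urlparse, urlencode, parse_qs
from urllib.robotparser import RobotFileParser

# urllib.parse: split a URL into its components (path and query are made up).
parts = urlparse('https://www.weibo.com/search?q=python&page=2')
print(parts.scheme, parts.netloc, parts.path, parts.query)

# urllib.parse: build a percent-encoded query string and decode it again.
query = urlencode({'q': '爬虫', 'page': 2})
print(query)
print(parse_qs(query))

# urllib.robotparser: evaluate in-memory rules (hypothetical robots.txt content).
rules = ['User-agent: *', 'Disallow: /private/']
robots = RobotFileParser()
robots.parse(rules)
allowed_public = robots.can_fetch('*', 'https://example.com/index.html')
allowed_private = robots.can_fetch('*', 'https://example.com/private/data.html')
print(allowed_public, allowed_private)
```

For a real site, RobotFileParser can instead load the rules with set_url() followed by read().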

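Basic usage of Urllib
The following shows how to send a GET request with the Urllib library
(again using the source of the Weibo home page as the example)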

>>> from urllib.request import urlopen
>>> html = urlopen('https://www.weibo.com/')
>>> response = html.read()
>>> print(response)
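Note that read() returns bytes, so the printed result is a bytes object rather than readable text. A slightly fuller sketch is given below; the function name fetch_html, the shortened User-Agent value, and the utf-8 fallback are illustrative assumptions rather than part of the original example.

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

def fetch_html(url, timeout=10):
    """Return the page as a decoded string, or None if the request fails."""
    # The User-Agent value is a shortened, illustrative placeholder.
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        with urlopen(req, timeout=timeout) as resp:
            # Use the charset declared by the server; fall back to utf-8 (an assumption).
            charset = resp.headers.get_content_charset() or 'utf-8'
            return resp.read().decode(charset, errors='replace')
    except HTTPError as exc:
        # Listed first because HTTPError is a subclass of URLError.
        print('HTTP error:', exc.code)
    except URLError as exc:
        print('failed to reach the server:', exc.reason)
    return None
```

Calling fetch_html('https://www.weibo.com/') would then return the page source as a str, or None when the request fails.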