Learning Web Scraping: urllib in Detail

云风Com

Article link: https://blog.youkuaiyun.com/weixin_46318370/article/details/108657997

Main program entry point

if __name__  == "__main__":
    print('hello')

urllib

GET requests

import urllib.request
# GET request
response = urllib.request.urlopen("http://www.baidu.com")
print(response)

Output

<http.client.HTTPResponse object at 0x000002A5EE9B8208>

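This is because urlopen returns a response object (an http.client.HTTPResponse) rather than the page itself. Call its .read() method to get the body, which comes back as bytes.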

print(response.read())

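To get readable text instead of raw bytes (and avoid garbled characters), decode the bytes with the decode method. Note that the body can only be read once per response; a second read() call returns empty bytes.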

print(response.read().decode('utf-8'))
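
The read-then-decode step can be sketched as follows. This is a minimal sketch, assuming Python 3.4+ so that urlopen accepts a data: URL. The data: URL and its payload are made up for illustration and are used only so the example runs without network access; the same calls apply to a response fetched over HTTP. It shows that read() returns bytes, that the charset can be taken from the response headers instead of being hard-coded, and that a second read() returns nothing.

```python
import urllib.request

# A data: URL stands in for a real page so the sketch runs offline.
# The payload is the percent-encoded UTF-8 bytes of "你好".
url = 'data:text/plain;charset=utf-8,%E4%BD%A0%E5%A5%BD'

with urllib.request.urlopen(url) as response:
    raw = response.read()          # bytes, not str
    # Use the charset declared in Content-Type, falling back to UTF-8
    charset = response.headers.get_content_charset() or 'utf-8'
    text = raw.decode(charset)
    leftover = response.read()     # the body has already been consumed

print(type(raw).__name__, text, leftover)
```

Falling back to UTF-8 is a guess that suits most modern pages; a page served in another encoding without a declared charset would still need that encoding passed explicitly.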

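POST requests
The following site is used for testing:
http://httpbin.org/
- Sending a POST request
  Running the code below raises an HTTP 405 (Method Not Allowed) error: without a data argument urlopen sends a GET request, and the /post endpoint only accepts POST.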

response = urllib.request.urlopen('http://httpbin.org/post')
print(response.read())

The correct approach is to pass the form data through the data parameter:

import urllib.parse

data = bytes(urllib.parse.urlencode({'hello': 'world'}), encoding='utf-8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read().decode('utf-8'))

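Explanation: urlencode turns the dictionary into a URL-encoded string, bytes() converts that string into a bytes payload, and passing the payload as the data argument makes urlopen send a POST request.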
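
The encoding steps can be inspected without sending anything. The sketch below is illustrative only: the extra 'q' parameter is a made-up value chosen to show how special characters are escaped, and the Request objects are built but never opened.

```python
import urllib.parse
import urllib.request

params = {'hello': 'world', 'q': 'a b&c'}   # 'q' is a hypothetical extra field

query = urllib.parse.urlencode(params)   # dict -> 'key=value&key=value' string
data = query.encode('utf-8')             # str -> bytes, which is what data= expects

# Building a Request does not touch the network, so the method can be inspected.
get_req = urllib.request.Request('http://httpbin.org/post')
post_req = urllib.request.Request('http://httpbin.org/post', data=data)

print(query)                                         # hello=world&q=a+b%26c
print(get_req.get_method(), post_req.get_method())   # GET POST
```

The presence of data is what switches the default method from GET to POST, which is why the earlier call without data fails against the /post endpoint.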

Timeout handling
A request like the following will time out, because the timeout is set to only 0.01 seconds:

response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.01)
print(response.read().decode('utf-8'))

The error raised is:

urllib.error.URLError: <urlopen error timed out>

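Use try/except to catch the error instead of letting the program crash:

import urllib.error

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.01)
    print(response.read().decode('utf-8'))
except urllib.error.URLError as e:
    # URLError also covers other failures; e.reason holds the actual cause
    print('time out')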
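Output (the first line comes from the earlier print(response) call in the same script, not from this snippet)

<http.client.HTTPResponse object at 0x0000017F5E19ED88>
time out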
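
Catching URLError and printing 'time out' treats every network failure as a timeout. A slightly more careful version might look like the sketch below. The helper name fetch_text and its return convention (None on timeout) are assumptions made for illustration, not part of urllib. It relies on two behaviours of the standard library: a timeout while connecting is wrapped in URLError with the cause in e.reason, whereas a timeout while waiting for the response can surface as socket.timeout directly.

```python
import socket
import urllib.error
import urllib.request


def fetch_text(url, timeout=5.0):
    """Return the decoded body of url, or None if the request timed out.

    Hypothetical helper: other errors (DNS failure, HTTP 4xx/5xx, ...)
    are re-raised so they are not mistaken for timeouts.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode('utf-8')
    except urllib.error.URLError as e:
        # Timeouts during the connection phase arrive wrapped in e.reason
        if isinstance(e.reason, socket.timeout):
            return None
        raise
    except socket.timeout:
        # Timeouts while waiting for or reading the response are raised directly
        return None
```

With this helper, a call such as fetch_text('http://httpbin.org/get', timeout=0.01) would be expected to return None rather than raise.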

Status code (response.status) and response headers (response.getheaders())

response = urllib.request.urlopen('http://httpbin.org/get')
print(response.status)

Output: for a successful request this is the status code 200.

The response headers can be obtained as follows:

response = urllib.request.urlopen('http://httpbin.org/get')
print(response.getheaders())

Partial output

[('Date', 'Fri, 18 Sep 2020 01:43:30 GMT'), 
('Content-Type', 'application/json'), ('Content-Length', '272')]

To get a single response header, use getheader with the header name:

response = urllib.request.urlopen('http://httpbin.org/get')
print(response.getheader('Content-Type'))
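
The three snippets above each open a new connection. A single request is enough to read the status and the headers together, as in the sketch below. The function name summarize_response and its return shape are assumptions for illustration; status, getheader and getheaders are the response attributes already used above.

```python
import urllib.request


def summarize_response(url, timeout=10):
    """Fetch url once and return (status, content_type, headers_dict)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        status = response.status   # e.g. 200
        # getheader returns the given default if the header is absent
        content_type = response.getheader('Content-Type', 'unknown')
        # getheaders() yields (name, value) pairs; a dict is easier to query,
        # at the cost of keeping only the last value of a repeated header
        headers = dict(response.getheaders())
    return status, content_type, headers
```

Calling summarize_response('http://httpbin.org/get') should give a status of 200 and a content type of application/json, matching the partial output shown earlier.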

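Simulating a real browser visit

Wrap the URL, together with the request headers (such as User-Agent) and any form data, in a Request object,
then pass that object to urlopen: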

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
}
data = bytes(urllib.parse.urlencode({'hello': 15}), encoding='utf-8')
req = urllib.request.Request(url=url, data=data, headers=headers, method='POST')
response = urllib.request.urlopen(req)
print(response.read().decode('utf-8'))
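
What a Request object carries can be checked before anything is sent. The sketch below is illustrative: the User-Agent string is shortened for readability, and nothing is transmitted because urlopen is never called. It also reads back the User-Agent that urllib would use by default, which is the value a site sees when no header is supplied.

```python
import urllib.parse
import urllib.request

url = 'http://httpbin.org/post'
# Shortened browser-like User-Agent, used here only for illustration
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
data = urllib.parse.urlencode({'hello': 15}).encode('utf-8')

req = urllib.request.Request(url=url, data=data, headers=headers, method='POST')

# Nothing has been sent yet; these calls only read back what the Request holds.
method = req.get_method()      # 'POST'
sent_body = req.data           # b'hello=15'
# Request stores header names capitalized, so the key is 'User-agent'
user_agent = req.get_header('User-agent')

# For comparison: the User-Agent urllib sends when none is supplied
default_agent = dict(urllib.request.build_opener().addheaders)['User-agent']

print(method, sent_body, user_agent)
print(default_agent)           # starts with 'Python-urllib/'
```

The default value identifies the client as a Python script, which is one reason some sites respond differently until a browser-like User-Agent is set.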

Now let's use the same approach to visit Douban:

url = 'https://www.douban.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
}
req = urllib.request.Request(url=url,  headers=headers)
response = urllib.request.urlopen(req)
print(response.read().decode('utf-8'))