DataWhale Team Study: Web Crawler Task 1

Article link: https://blog.youkuaiyun.com/weixin_37855575/article/details/98623048

Using the urllib library

1. Making a simple GET request with urlopen

import urllib.request
url = 'http://www.baidu.com'
response = urllib.request.urlopen(url)

print(type(response))  # print the type of response

response is an HTTPResponse object. Its main methods include read(), readinto(), getheader(name), getheaders(), and fileno().

response.read() returns the fetched page content as bytes (call .decode('utf-8') to get a string)
print(response.read().decode('utf-8'))

response.getheaders() -- returns all response headers

print(response.getheaders())

response.getheader('server') -- returns the value of a single header (here, Server)

print(response.getheader('server'))

response.status -- returns the status code

print(response.status)
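The calls above can be gathered into one helper. This is only a minimal sketch: the function name fetch_page and the 10-second timeout are assumptions made for illustration, not part of the original notes. It reads the body once (read() consumes the stream, so a second call returns empty bytes) and catches URLError so that a failed request does not crash the script.

```python
import urllib.error
import urllib.request


def fetch_page(url, timeout=10.0):
    """Return (status, headers, text) for a GET request, or None on failure.

    fetch_page is a hypothetical helper name used only in this sketch.
    """
    try:
        # The context manager closes the connection when the block exits.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
            headers = dict(response.getheaders())
            # read() consumes the body, so call it once and keep the result.
            text = response.read().decode("utf-8")
            return status, headers, text
    except urllib.error.URLError as exc:
        # HTTPError (4xx/5xx responses) is a subclass of URLError.
        print("request failed:", exc)
        return None
```

A call such as fetch_page('http://www.baidu.com') would then return the status code, a dictionary of headers, and the decoded page text in one step.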

2. Using Request - when the request needs Headers or other extra information

Constructor format:

Request object = urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

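url - the link; required, all other arguments are optional

data - must be of type bytes (a byte stream); if it is a dictionary, first encode it with urlencode() from the urllib.parse module, then turn the resulting string into bytes with .encode()

headers - a dictionary, passed in directly

origin_req_host - the host name or IP address of the requester

unverifiable - whether the request is unverifiable; defaults to False

method - a string naming the request method: GET, POST, PUT, and so on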
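To make the parameters concrete, here is a small sketch that builds a Request without sending it, so it runs offline. The form values and the shortened User-Agent string are assumptions chosen for illustration.

```python
from urllib import parse, request

# Hypothetical form fields, used only for illustration.
form = {"name": "germey", "age": "22"}

# urlencode() returns a str; data must be bytes, so encode it.
data = parse.urlencode(form).encode("utf-8")

req = request.Request(
    url="http://httpbin.org/post",
    data=data,
    headers={"User-Agent": "Mozilla/5.0"},
    method="POST",
)

print(req.get_method())  # POST
print(req.data)          # b'name=germey&age=22'
# Request stores header names capitalized ("User-agent"), so look them up that way.
print(req.get_header("User-agent"))  # Mozilla/5.0
```

Passing req to request.urlopen() would then send it. Nothing goes over the network until that call.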

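Let's crawl one of Lao Luo's Weibo posts and see what comes back...

A few things to note (picked up elsewhere):

1. Press F12 to open the developer console and switch the page to mobile view

2. Click the Network tab, then refresh the page

3. Click Headers to get the header parameters. If the request returns no data, try adding parameters such as user-agent, host, and cookie (when copying the headers, click Raw Headers, which makes them easier to copy)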
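The raw header text copied in step 3 has to become a dictionary before it can be passed as headers. A minimal sketch of that conversion follows; the helper name parse_raw_headers is an assumption, and the sample text is a shortened, illustrative copy of what the browser might show.

```python
def parse_raw_headers(raw):
    """Convert 'Name: value' lines copied from the browser into a dict.

    parse_raw_headers is a hypothetical helper name used only in this sketch.
    """
    headers = {}
    for line in raw.strip().splitlines():
        if ":" not in line:
            continue  # skip blank lines and anything without a name/value pair
        # Split on the first colon only, because values may contain colons.
        name, value = line.split(":", 1)
        headers[name.strip()] = value.strip()
    return headers


# Illustrative sample of pasted raw headers.
raw = """
Host: m.weibo.cn
User-Agent: Mozilla/5.0 (Windows NT 10.0;) Gecko/20100101 Firefox/60.0
Connection: keep-alive
"""
headers = parse_raw_headers(raw)
print(headers["Host"])  # m.weibo.cn
```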

The crawling code is as follows:

from urllib import request, parse
url='http://m.weibo.cn/detail/4399018165253970'
headers = {'Connection':'keep-alive',
           'host':'m.weibo.cn',
           'User-Agent':'Mozilla/5.0 (Windows NT 10.0;) Gecko/20100101 Firefox/60.0',
           'Cookie':'_T_WM=64487211353; MLOGIN=0; WEIBOCN_FROM=1110006030; M_WEIBOCN_PARAMS=uicode%3D20000061%26fid%3D4399018165253970%26oid%3D4399018165253970'}

req = request.Request(url=url, headers=headers,method='GET')
response = request.urlopen(req)
print(response.status)
print(response.read().decode('utf-8'))

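The request returns successfully. Once the data has been fetched, the next step is processing it, which involves techniques such as regular expressions.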
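As a taste of that processing step, the sketch below pulls the page title out of an HTML string with a regular expression. The HTML snippet is a made-up stand-in for a fetched page, not actual Weibo output.

```python
import re

# Made-up HTML standing in for response.read().decode('utf-8').
html = "<html><head><title>Example page</title></head><body>...</body></html>"

# Non-greedy match between the title tags; re.S lets '.' span newlines.
match = re.search(r"<title>(.*?)</title>", html, re.S | re.I)
title = match.group(1).strip() if match else None
print(title)  # Example page
```

Regular expressions work for simple, predictable fragments like this. For deeply nested markup an HTML parser is usually the safer choice.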

Using the requests library

1. GET request

The URL from the course handout returns JSON data, so call .json() on the response object to convert it into a dictionary

import requests
r = requests.get('http://httpbin.org/get')
print(r.text)

Once converted to a dictionary, the content can be looked up by key

print(type(r.text))
print(r.json())
print(type(r.json()))
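To show what looking up by key means without needing a network connection, the sketch below parses a JSON string with the standard json module. The string is a hypothetical, abbreviated stand-in for r.text, not real httpbin output.

```python
import json

# Hypothetical, abbreviated stand-in for the text held in r.text.
text = '{"args": {}, "headers": {"Host": "httpbin.org"}, "url": "http://httpbin.org/get"}'

# For a JSON body, r.json() produces the same kind of dict as json.loads(r.text).
data = json.loads(text)

print(type(text))               # <class 'str'>
print(type(data))               # <class 'dict'>
print(data["url"])              # http://httpbin.org/get
print(data["headers"]["Host"])  # httpbin.org
```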

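If the response is not JSON, calling json() raises an error. For example, the CSDN home page returns HTML, which the response object exposes through .text, so json() cannot be used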

r = requests.get('http://www.youkuaiyun.com')
print(r.text)

Calling json() here raises a parse error

r = requests.get('http://www.youkuaiyun.com')
print(r.json())
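One way to guard against that error is to wrap the call in try/except. The sketch below is an illustration only: the helper name json_or_none is an assumption, and it relies on the JSON decode errors raised by .json() being subclasses of ValueError.

```python
def json_or_none(response):
    """Return the parsed JSON body, or None if the body is not valid JSON.

    json_or_none is a hypothetical helper; response is any object with a
    .json() method, such as the object returned by requests.get().
    """
    try:
        return response.json()
    except ValueError:
        # JSON decode errors derive from ValueError, so HTML bodies land here.
        return None


# Usage (needs network access and the requests package):
#   r = requests.get('http://httpbin.org/get')
#   print(json_or_none(r))
```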

2. POST request

First, for the differences between GET and POST requests, see: https://www.cnblogs.com/logsharing/p/8448446.html

import requests
url = 'http://httpbin.org/post'
data = {'name': 'germey', 'age': '22'}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:60.0) Gecko/20100101 Firefox/60.0',
    'Host': 'httpbin.org'
    }
r = requests.post(url, headers=headers, data=data)
print(r.text)

This prints the returned result.

If the data argument is left out:

r = requests.post(url, headers=headers)
print(r.text)

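then the returned form field is empty.

Comparing the two, form holds exactly the parameters that were passed, which also confirms that the earlier POST request was sent successfully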
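The same check can be written as code rather than done by eye. This is a small sketch with a hypothetical helper, form_was_echoed, that compares the form field of the parsed response with the dictionary that was sent. The two sample dictionaries are illustrative stand-ins for r.json(), not captured output.

```python
def form_was_echoed(response_json, sent):
    """Return True if the echoed 'form' field equals the data that was sent.

    form_was_echoed is a hypothetical helper used only in this sketch.
    """
    return response_json.get("form") == sent


sent = {"name": "germey", "age": "22"}

# Illustrative stand-ins for r.json() with and without data= in the POST.
with_data = {"form": {"name": "germey", "age": "22"}, "url": "http://httpbin.org/post"}
without_data = {"form": {}, "url": "http://httpbin.org/post"}

print(form_was_echoed(with_data, sent))     # True
print(form_was_echoed(without_data, sent))  # False
```

With a live response this would be called as form_was_echoed(r.json(), data).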

---End
