Alibaba Micro-Certification: Python Crawlers - 优快云 Blog

Article link: https://blog.youkuaiyun.com/qq_42799459/article/details/100089533

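This article covers the key concepts of web crawling, including the role of the robots protocol and the HTTP request and response handling flow. It introduces Python's standard-library urllib package: the urlopen function in urllib.request, constructing and using the Request class, parsing URLs, and handling the User-Agent problem. It also covers URL encoding and decoding with urllib.parse and certificate handling with the ssl module.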


The robots protocol

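A site publishes a file, robots.txt, that tells crawlers what they may and may not crawl.
Examples:
Taobao
https://www.taobao.com/robots.txt
Mafengwo
http://www.mafengwo.com/robots.txt
The protocol is a gentlemen's agreement; it is not technically enforced.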
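
As a rough sketch of how a crawler can honor such a file, the standard library's urllib.robotparser can evaluate robots.txt rules. The rules, the crawler name `demo-bot`, and the `example.com` URLs below are made up for illustration; they are fed to the parser directly so the snippet runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not copied from any real site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# parse() accepts the file's lines; no HTTP request is made here.
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# "demo-bot" is an assumed crawler name used only for this example.
allowed_public = parser.can_fetch("demo-bot", "http://example.com/index.html")
allowed_private = parser.can_fetch("demo-bot", "http://example.com/private/data.html")
print(allowed_public, allowed_private)
```

For a live site, the usual pattern is to call `parser.set_url(...)` with the robots.txt address and then `parser.read()` before asking `can_fetch`.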

HTTP request handling and response handling

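urllib is a standard-library package.

urllib.request is for opening and reading URLs
urllib.error contains the exceptions raised by urllib.request
urllib.parse is for parsing URLs
urllib.robotparser is for parsing robots.txt files
Python 3 has only the single urllib package (Python 2's separate urllib and urllib2 were merged into it)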
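
To illustrate how urllib.request and urllib.error fit together, here is a minimal sketch of a fetch helper. The function name `fetch` and the 10-second default timeout are choices made for this example, not part of the original article.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except HTTPError as exc:
        # HTTPError is a subclass of URLError, so it has to be caught first.
        print("HTTP error:", exc.code, exc.reason)
    except URLError as exc:
        # Covers failures such as DNS errors, refused connections,
        # or an unsupported URL scheme.
        print("URL error:", exc.reason)
    return None
```

A call such as `fetch("http://www.bing.com")` would then return the page bytes on success and `None` on any of the errors above.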

The urllib.request module

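This module defines functions and classes for opening URLs (mostly HTTP) in settings that involve basic and digest authentication, redirects, cookies, and so on.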

The urlopen function

urllib.request.urlopen(url,data=None)

Opens a URL; url may be a string or a Request object.
data is the data to submit. If data is None the request is a GET; otherwise it is a POST.

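from urllib.request import urlopen

# Open a URL; urlopen returns a response object, which is file-like.
# The URL below redirects after it is requested.
response = urlopen("http://www.bing.com")  # GET request
print(response.closed)  # the response is still open at this point
with response:
    print(1, type(response))  # http.client.HTTPResponse, a file-like object
    print(2, response.status, response.reason)  # status code and reason
    print(3, response.geturl())  # the final URL after any redirects
    print(4, response.info())  # headers
    print(5, response.read())  # read the response body
print(response.closed)  # the with block has closed the response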
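
The GET-versus-POST rule can be checked without sending anything over the network. The sketch below builds two Request objects and asks each for its HTTP method; the `example.com` URL and the form field `q` are placeholders chosen for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

URL = "http://example.com/form"  # placeholder URL, nothing is sent to it

# No data: the request will be sent as a GET.
get_req = Request(URL)

# POST data must be bytes, so the form is URL-encoded and then encoded to UTF-8.
form = urlencode({"q": "python"}).encode("utf-8")
post_req = Request(URL, data=form)

print(get_req.get_method(), post_req.get_method())
```

Passing `data=form` straight to `urlopen(url, data=form)` selects POST in the same way.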

The User-Agent problem

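Build a User-Agent header so the request looks like it comes from a user's browser.
This is done with the Request class.

Request(url, data=None, headers={})
The constructor builds a request object and accepts an optional dictionary of headers. The data parameter determines whether the request is a GET or a POST.
add_header(key, val) adds a key-value pair to the headers.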

from urllib.request import urlopen, Request


class PaChong(object):
    """Crawler practice."""
    def __init__(self, url):
        self.start_list = []
        self.end_list = []
        self.url = url
        self.user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3879.0 Safari/537.36 Edg/78.0.249.1"
        self.req = Request(url, headers={
            "User-agent": self.user_agent,
        })
        # self.req.add_header("user-agent", self.user_agent)  # A second way to add a request header; add_header capitalizes the first letter of the key automatically


    def get_request(self):
        self.response = urlopen(self.req)
        print(self.response.closed)
        with self.response:
            print(1, type(self.response))  # http.client.HTTPResponse, a file-like object
            print(2, self.response.status, self.response.reason)  # status code and reason
            print(3, self.response.geturl())  # the final URL after any redirects
            print(4, self.response.info())  # headers
            print(5, self.response.read())  # read the response body
        print(self.response.closed)
        # Alternative to step 5 (it must run inside the with block): decode the body as text
        # content = self.response.read().decode("utf-8")
        # print(content)


bing = PaChong("http://www.bing.com/")

bing.get_request()
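
The capitalization behavior of add_header can be observed offline. In this small sketch the URL, the agent string `demo-agent/1.0`, and the extra `accept-language` header are illustrative values only; no request is sent.

```python
from urllib.request import Request

# Headers passed to the constructor go through add_header as well,
# so their keys are stored with only the first letter capitalized.
req = Request("http://example.com/", headers={"user-agent": "demo-agent/1.0"})
req.add_header("accept-language", "en")

# header_items() lists the headers as they are stored on the request.
print(req.header_items())
print(req.get_header("User-agent"))
```

Because lookups with `get_header` and `has_header` use the stored key, they need the capitalized form (`"User-agent"`), not the spelling that was passed in.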

The urllib.parse module

This module handles URL encoding and decoding.

from urllib import parse


def parse_test():
    d = {
        "id": 1,
        "name": "张三",
        "like": "play basketball",
        "http": "http://www.baidu.com/?parse=q&parse_2=李四"
    }
    u = parse.urlencode(d)  # encode the dict as a query string
    print(u)
    return u  # return the encoded string so it can be decoded below


u = parse_test()
# Output:
# id=1&name=%E5%BC%A0%E4%B8%89&like=play+basketball&http=http%3A%2F%2Fwww.baidu.com%2F%3Fparse%3Dq%26parse_2%3D%E6%9D%8E%E5%9B%9B

x = parse.unquote(u)  # decode the percent-escapes; unquote leaves "+" unchanged
print(x)
# Output:
# id=1&name=张三&like=play+basketball&http=http://www.baidu.com/?parse=q&parse_2=李四
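
Besides encoding and decoding, urllib.parse can split a URL into its components. The sketch below is a minimal illustration; the path `/s` and the parameter names `wd` and `like` are assumed for the example rather than taken from a real request.

```python
from urllib.parse import parse_qs, quote, unquote_plus, urlparse

# Illustrative URL; the path and parameter names are assumptions.
url = "http://www.baidu.com/s?wd=%E5%BC%A0%E4%B8%89&like=play+basketball"

parts = urlparse(url)          # split into scheme, netloc, path, query, ...
query = parse_qs(parts.query)  # decode the query string into a dict of lists

print(parts.scheme, parts.netloc, parts.path)
print(query)

# quote() percent-encodes a single value; unquote_plus() also turns "+" into a space.
print(quote("张三"))
print(unquote_plus("play+basketball"))
```

Unlike `unquote`, both `parse_qs` and `unquote_plus` treat `+` as an encoded space, which matches how `urlencode` writes spaces.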

The ssl module: ignoring untrusted certificates

import ssl
context = ssl._create_unverified_context()  # private helper; the returned context skips certificate verification
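
An equivalent context can be built from public ssl APIs and passed to urlopen through its context parameter. This is a sketch under that assumption; the helper names `make_insecure_context` and `fetch_insecure` are invented for the example. Disabling verification removes protection against man-in-the-middle attacks, so it is best kept to testing.

```python
import ssl
from urllib.request import urlopen


def make_insecure_context():
    """Build an SSL context that accepts untrusted certificates."""
    context = ssl.create_default_context()
    # check_hostname has to be turned off before verify_mode can be CERT_NONE.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context


def fetch_insecure(url, timeout=10):
    """Fetch a URL without verifying the server certificate."""
    with urlopen(url, context=make_insecure_context(), timeout=timeout) as response:
        return response.read()
```

A call such as `fetch_insecure("https://...")` against a host with a self-signed certificate would then succeed where the default context raises a certificate error.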

urllib3, a third-party library

It provides connection pooling.
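
A minimal sketch of the pooling idea, assuming urllib3 is installed (for example with `pip install urllib3`). The `example.com` URLs are placeholders and no request is sent; the snippet only shows that a PoolManager hands back the same pool for the same host.

```python
import urllib3

# A PoolManager keeps one connection pool per (scheme, host, port).
http = urllib3.PoolManager(num_pools=10)

# Looking up a pool does not open a connection yet.
pool_a = http.connection_from_url("http://example.com/a")
pool_b = http.connection_from_url("http://example.com/b")

# Both URLs share a host, so they share one pool and can reuse its connections.
print(pool_a is pool_b)
```

In everyday use the pool is implicit: `http.request("GET", url)` picks the right pool and reuses an open connection when one is available.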