Python Programming: A Roundup of Common urllib Techniques for Web Crawling

半只野指针

Last modified: 2024-01-30 11:59:58

Category: Python. Tags: python, web crawler, programming language

First published: 2024-01-30 11:52:23

Article link: https://blog.youkuaiyun.com/m0_74220316/article/details/135929778

Making basic network requests with the urllib library

Sending a network request with request

from urllib import request
from http.client import HTTPResponse

response: HTTPResponse = request.urlopen(url="http://pkc/vul/sqli/sqli_str.php")
print(response.getcode())
print(response.read().decode('utf-8'))

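Common HTTPResponse attributes and methods

Method/attribute	Description
`read(amt=None)`	Reads and returns the response body as bytes. If amt is omitted, the whole body is read; otherwise at most amt bytes are returned.
`readline(limit=-1)`	Reads and returns one line of the response body. If limit is given, at most that many bytes are read.
`readlines()`	Reads the rest of the response body and returns it as a list of lines.
`getheader(name, default=None)`	Returns the value of the named header, or default if the header is not present.
`getheaders()`	Returns a list of (header, value) tuples covering all response headers.
`status`	The status code of the response, for example 200 for success or 404 for not found.
`version`	The HTTP protocol version used by the server, as an integer: 10 for HTTP/1.0, 11 for HTTP/1.1.
`reason`	The reason phrase that accompanies the status code, for example "OK" for status 200.
`msg`	On a response returned by urlopen, the same reason phrase as `reason`. On a plain `http.client.HTTPResponse` it is an `http.client.HTTPMessage` holding the response headers.
`headers`	A dict-like object holding the response headers as key/value pairs.
`geturl()`	Returns the URL that was actually retrieved; after a redirect this is the final URL. Deprecated since Python 3.9 in favor of the `url` attribute.
`info()`	Returns the response headers as a dict-like message object. Deprecated since Python 3.9 in favor of `headers`.
`getcode()`	Returns the status code of the response, for example 200. Deprecated since Python 3.9 in favor of `status`.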
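
As a rough sketch of how these members behave, the snippet below starts a throwaway HTTP server on 127.0.0.1 and inspects the response it gets back. The server, the DemoHandler class and all variable names are assumptions made for this illustration and do not come from the original article; ProxyHandler({}) is only there so that a system proxy does not intercept the local request.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that always answers with a small text body."""

    def do_GET(self):
        body = b"hello"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet


# Port 0 lets the operating system pick a free port.
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

opener = build_opener(ProxyHandler({}))  # ignore any system proxy settings
with opener.open(url, timeout=5) as response:
    status = response.status
    reason = response.reason
    content_type = response.getheader("Content-Type")
    fallback = response.getheader("X-Missing", "n/a")  # default is returned
    header_names = [name for name, _ in response.getheaders()]
    final_url = response.url
    body = response.read().decode("utf-8")

server.shutdown()
server.server_close()

print(status, reason, content_type, fallback, final_url, body)
```

The attributes are read inside the with block because the connection is closed as soon as the block exits.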

Using the parameters of urlopen

def urlopen(
    url: str | Request,
    data: _DataType = None,
    timeout: float | None = ...,
    *,
    cafile: str | None = None,
    capath: str | None = None,
    cadefault: bool = False,
    context: SSLContext | None = None
) -> _UrlopenRet

The data parameter takes a bytes object. Once data is supplied, the request is sent as a POST instead of a GET.

from urllib import request
from http.client import HTTPResponse


bytes_data: bytes = bytes('Hello, World!', 'utf-8')
response: HTTPResponse = request.urlopen(url="https://httpbin.org/post", data=bytes_data)
print(response.read().decode('utf-8'))

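The bytes constructor used here takes two arguments: the string and its encoding. The encoding is required whenever the first argument is a str; 'Hello, World!'.encode('utf-8') is an equivalent spelling.

We can verify the code above against https://httpbin.org/. The service echoes back what it received in the POST request, so the console shows something like the following (abridged to save space):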
{
"args": {},
"data": "",
"files": {},
"form": {
 "Hello, World!": ""
},
 ...
}
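
The switch from GET to POST can be checked without sending anything, because a Request object reports the method it would use. The sketch below is only an illustration: the form fields are made up, and urlencode is used here as one common way to build a form body.

```python
from urllib.parse import urlencode
from urllib.request import Request

target = "https://httpbin.org/post"  # same test endpoint as above

# Without data the request is a GET; with data it becomes a POST.
get_request = Request(target)
form_body = urlencode({"name": "crawler", "page": "1"}).encode("utf-8")
post_request = Request(target, data=form_body)

print(get_request.get_method())   # GET
print(post_request.get_method())  # POST
print(form_body)                  # b'name=crawler&page=1'

# A str cannot be turned into bytes without naming an encoding.
try:
    bytes("Hello, World!")
    needs_encoding = False
except TypeError:
    needs_encoding = True
print(needs_encoding)             # True
```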

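The timeout parameter sets how long, in seconds, a blocking operation such as connecting may wait. If the connection attempt times out, urlopen raises a URLError (defined in urllib.error) whose reason is a socket.timeout instance. A timeout that happens later, while waiting for or reading the response, can surface as a bare socket.timeout instead. We usually test it with code like this: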

from urllib import request, error
from socket import timeout
from http.client import HTTPResponse


bytes_data: bytes = bytes('Hello, World!', 'utf-8')
try:
    response: HTTPResponse = request.urlopen(url="https://httpbin.org/post", data=bytes_data, timeout=0.1)
except error.URLError as e:
    if isinstance(e.reason, timeout):
        print("A connection timeout occurred while accessing the target website")
else:
    print("Access target state code is: ", response.status)
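
Because a timeout can arrive either wrapped in URLError or as a bare socket.timeout, it can help to funnel both cases through one check. The helper names below (is_timeout, fetch_status) are hypothetical and only sketch the idea; fetch_status is defined but not called here, since calling it needs network access.

```python
import socket
from typing import Optional
from urllib.error import URLError
from urllib.request import urlopen


def is_timeout(exc: BaseException) -> bool:
    """Return True if exc represents a timeout, wrapped in URLError or not."""
    if isinstance(exc, socket.timeout):
        return True
    return isinstance(exc, URLError) and isinstance(exc.reason, socket.timeout)


def fetch_status(url: str, seconds: float) -> Optional[int]:
    """Hypothetical helper: return the status code, or None on a timeout."""
    try:
        with urlopen(url, timeout=seconds) as response:
            return response.status
    except (URLError, socket.timeout) as exc:
        if is_timeout(exc):
            return None
        raise  # anything that is not a timeout is passed on unchanged


print(is_timeout(URLError(socket.timeout("timed out"))))  # True
print(is_timeout(URLError("connection refused")))         # False
```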

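The remaining parameters all relate to CA certificates and can be explored when they are needed. cadefault is deprecated; in fact cafile, capath and cadefault have all been deprecated since Python 3.6 in favor of context, and they were removed in Python 3.13.

context: must be an ssl.SSLContext instance; it specifies the SSL/TLS settings.
capath: the path to a directory of CA certificates.
cafile: the path to a single file containing CA certificates.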
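
A minimal sketch of the context route, assuming the system's trusted CA store is what you want: ssl.create_default_context() builds a verifying context, which is then handed to urlopen. The helper name fetch_status is hypothetical, and it is not called here because that would need network access.

```python
import ssl
from urllib.request import urlopen

# A context that verifies certificates against the system CA store.
context = ssl.create_default_context()

# To trust a private CA as well, a certificate file could be added like this
# (the file name is a placeholder, so the line is left commented out):
# context.load_verify_locations(cafile="my_ca.pem")


def fetch_status(url: str) -> int:
    """Hypothetical helper: open an https URL with the context above."""
    with urlopen(url, timeout=5, context=context) as response:
        return response.status


print(context.verify_mode == ssl.CERT_REQUIRED)  # True
print(context.check_hostname)                    # True
```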

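Sending a network request with a Request object

Compared with calling urlopen directly on a URL, going through a Request object is the more common pattern, because it gives finer control over the request. Here is the constructor of Request: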

Request.__init__(url: str, data: _DataType = None, headers: MutableMapping[str, str] = {}, origin_req_host: str | None = None, unverifiable: bool = False, method: str | None = None) -> None

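url: the URL to request. This is the only required argument; all the others are optional.
data: the request body. It must be bytes, so strings have to be encoded first.
headers: a dictionary of request headers. Headers can be passed to the constructor or added later with the add_header method.
origin_req_host: the host name or IP address of the original request that the user initiated, as defined by RFC 2965. It defaults to the host of url and is used for cookie handling.
unverifiable: whether the request is unverifiable in the RFC 2965 sense, meaning the user had no chance to approve the URL (for example, an image fetched automatically while rendering a page). The default is False. It influences cookie policy and has nothing to do with certificate verification; to change certificate checks, pass an ssl.SSLContext through the context parameter of urlopen.
method: a string naming the HTTP method, such as 'GET', 'POST' or 'HEAD'. urllib sends the string as given, and HTTP method names are case-sensitive, so write it in upper case; a server may reject a lower-case method with an error status, which surfaces as an HTTPError.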

from urllib.request import Request, urlopen
from urllib.error import URLError
from socket import timeout
from http.client import HTTPResponse


url: str = 'https://httpbin.org/post'
bytes_data: bytes = bytes("zhi yin ni tai mei", encoding='utf-8')
headers: dict[str, str] = { 'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36' }
request_object: Request = Request(url=url, data=bytes_data, headers=headers, method='POST')
try:
    response: HTTPResponse | None = urlopen(request_object, timeout=4)
except URLError as e:
    response = None
    if isinstance(e.reason, timeout):
        print("A connection timeout occurred while accessing the target website")
else:
    print("Access target state code is: ", response.status)
finally:
    if response:
        print(response.read().decode('utf-8'))

Adding header fields with add_header

from urllib.request import Request

url: str = 'https://httpbin.org/post'
bytes_data: bytes = bytes("zhi yin ni tai mei", encoding='utf-8')
request_object: Request = Request(url=url, data=bytes_data, method='POST')
request_object.add_header('User-Agent', 
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36')
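
One detail worth knowing when reading headers back: in CPython, Request stores each header name through str.capitalize(), so 'User-Agent' is kept as 'User-agent'. The sketch below, with a made-up agent string, shows the effect on has_header and get_header; nothing is sent over the network.

```python
from urllib.request import Request

request_object = Request("https://httpbin.org/post", data=b"k=v", method="POST")
request_object.add_header("User-Agent", "demo-crawler/0.1")
request_object.add_header("Accept-Language", "en-US")

# Header names are stored capitalized: first letter upper case, rest lower.
stored = dict(request_object.header_items())
print(stored)
# {'User-agent': 'demo-crawler/0.1', 'Accept-language': 'en-US'}

print(request_object.has_header("User-agent"))  # True
print(request_object.has_header("User-Agent"))  # False, lookup is exact
print(request_object.get_header("User-agent"))  # demo-crawler/0.1
```

On the wire this makes no difference, since HTTP header names are case-insensitive; it only matters when you query the Request object yourself.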

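Using Handler classes and OpenerDirector

Handler classes are the classes in urllib.request that inherit from BaseHandler. They implement the more advanced operations involved in network requests. The most commonly used ones, together with the password managers they work with, are:

Handler	Description
`HTTPDefaultErrorHandler`	Default handling for HTTP error responses: every such response is turned into an `HTTPError` exception.
`HTTPRedirectHandler`	Handles the various kinds of redirects that occur during a request.
`HTTPCookieProcessor`	Handles cookies.
`ProxyHandler`	Routes requests through the configured network proxies.
`HTTPPasswordMgr`	Keeps a table of user names and passwords. It is a helper object rather than a handler and is normally used together with `HTTPBasicAuthHandler`.
`HTTPBasicAuthHandler`	Handles the basic authentication that may be required when a connection is opened.
`HTTPPasswordMgrWithDefaultRealm`	Keeps a table of user names and passwords like `HTTPPasswordMgr`, and additionally treats a realm of None as a catch-all default realm.

OpenerDirector is the class in urllib.request that opens URLs. It offers a single generic interface, and you handle different kinds of URL requests by adding different handlers to it.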

Method	Description
`add_handler(handler)`	Adds a handler (an instance of `BaseHandler`) to the opener. A handler is an object that defines how a particular kind of URL request is processed. Common handlers include `HTTPHandler`, `HTTPSHandler` and `FTPHandler`.
`open(url, data=None, timeout=<default>)`	Opens the given URL, choosing a suitable handler according to the URL scheme. url may be a string or a `Request` object; a `Request` can carry extra information such as headers and the request method. Unlike urlopen, this method has no cafile, capath, cadefault or context parameters.
`error(proto, *args)`	Dispatches an error for the given protocol to the registered error handlers. For HTTP the default handling ends with an `HTTPError` being raised, which carries the status code (`code`), the reason (`reason`) and the response headers (`headers`).
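
To make add_handler and open a little more concrete, here is a small sketch of a custom handler. urllib calls a handler method named http_request on every outgoing HTTP request before it is sent, so the handler below uses that hook to stamp a header. The class names, the X-Crawler header and the local echo server are all assumptions made for this illustration.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import BaseHandler, ProxyHandler, build_opener


class TagRequestHandler(BaseHandler):
    """Hypothetical handler: adds a header to every outgoing HTTP request."""

    def http_request(self, request):
        request.add_unredirected_header("X-Crawler", "demo")
        return request


class EchoHandler(BaseHTTPRequestHandler):
    """Local test server that sends back the X-Crawler header it received."""

    def do_GET(self):
        body = self.headers.get("X-Crawler", "").encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet


server = HTTPServer(("127.0.0.1", 0), EchoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

opener = build_opener(ProxyHandler({}))  # default handlers, no system proxy
opener.add_handler(TagRequestHandler())  # register the custom handler

with opener.open(f"http://127.0.0.1:{server.server_port}/", timeout=5) as res:
    echoed = res.read().decode("utf-8")

server.shutdown()
server.server_close()
print(echoed)  # demo
```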

Setting up an authentication handler with a password manager

from urllib.request import HTTPPasswordMgrWithDefaultRealm
from urllib.request import HTTPBasicAuthHandler
from urllib.request import build_opener
from urllib.error import URLError
from http.client import HTTPResponse


default_username: str = 'username'
default_password: str = 'password'
user_define_url = 'https://httpbin.org/get' 

passwd_handler = HTTPPasswordMgrWithDefaultRealm()
passwd_handler.add_password(None, user_define_url, default_username, default_password)
auth_handler = HTTPBasicAuthHandler(passwd_handler)
opener = build_opener(auth_handler)

try:
    res: HTTPResponse = opener.open(user_define_url)
    html_document = res.read().decode('utf-8')
    print(html_document)
except URLError as e:
    print(e.reason)

The error module of urllib defines the exceptions raised during requests; the reason attribute of an exception holds the cause of the failure.
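
HTTPError is a subclass of URLError, so when both matter it has to be caught first. The sketch below shows the distinction against a throwaway local server; the server, the /missing path and the describe helper are assumptions made for this illustration.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.error import HTTPError, URLError
from urllib.request import ProxyHandler, build_opener


class StatusHandler(BaseHTTPRequestHandler):
    """Local test server: 404 for /missing, 200 for everything else."""

    def do_GET(self):
        if self.path == "/missing":
            self.send_error(404)
            return
        self.send_response(200)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the console quiet


server = HTTPServer(("127.0.0.1", 0), StatusHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}/"
opener = build_opener(ProxyHandler({}))  # ignore any system proxy settings


def describe(url: str) -> str:
    try:
        with opener.open(url, timeout=5) as response:
            return f"ok {response.status}"
    except HTTPError as exc:  # the server answered, but with an error status
        return f"http error {exc.code} {exc.reason}"
    except URLError as exc:   # no usable answer at all (refused, DNS, ...)
        return f"url error {exc.reason}"


ok_result = describe(base)
missing_result = describe(base + "missing")

server.shutdown()
server.server_close()
refused_result = describe(base)  # nothing is listening on the port any more

print(ok_result)
print(missing_result)
print(refused_result)
```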

Setting a proxy handler for the crawler

from urllib.request import ProxyHandler, build_opener
from urllib.error import URLError
from http.client import HTTPResponse


default_username: str = 'username'
default_password: str = 'password'
user_define_url = 'https://httpbin.org/get'

default_proxy: dict[str, str] = {'http': 'http://127.0.0.1:8080',
                                'https': 'http://127.0.0.1:8080'}
proxy_handler = ProxyHandler(default_proxy)
opener = build_opener(proxy_handler)
try:
    res: HTTPResponse = opener.open(user_define_url)
    html_document = res.read().decode('utf-8')
    print(html_document)
except URLError as e:
    print(e.reason)

Setting a handler for cookies

Handling and saving cookies

from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener

user_define_url: str = 'https://www.baidu.com'
user_url_cookies: CookieJar = CookieJar()
cookies_handler = HTTPCookieProcessor(user_url_cookies)
opener = build_opener(cookies_handler)

response = opener.open(user_define_url)
for cookie_item in user_url_cookies:
    print(cookie_item.name, "  ", cookie_item.value)

The code above prints the cookies that Baidu assigns automatically when the page is visited. We can also save them to a file:

from http.cookiejar import MozillaCookieJar
from urllib.request import HTTPCookieProcessor, build_opener

default_cookie_file: str = 'temp_cookies.txt'
user_define_url: str = 'https://www.baidu.com'

user_url_cookies: MozillaCookieJar = MozillaCookieJar(default_cookie_file)

cookies_handler = HTTPCookieProcessor(user_url_cookies)
opener = build_opener(cookies_handler)
response = opener.open(user_define_url)
user_url_cookies.save(ignore_discard=True, ignore_expires=True)
with open(default_cookie_file, 'r') as file:
    file_lines: list[str] = file.readlines()
    for line in file_lines:
        print(line)

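MozillaCookieJar is a class in Python's http.cookiejar module. It inherits from CookieJar through FileCookieJar and loads and saves cookies in the Mozilla cookies.txt file format. The file name is usually passed in when the object is created. The http.cookiejar module provides a general framework for handling HTTP cookies, and MozillaCookieJar is one concrete file-backed implementation of it (LWPCookieJar is another). Its save method takes the following parameters:
filename: the name of the file to save the cookies to, given as a string. If it is omitted, the file name supplied when the jar was created is used; if there is none either, ValueError is raised.
cookie_jar.save(filename='cookies.txt')
ignore_discard: if True, cookies marked to be discarded (session cookies) are saved as well. Defaults to False.
cookie_jar.save(ignore_discard=True)
ignore_expires: if True, cookies that have already expired are saved as well. Defaults to False.
cookie_jar.save(ignore_expires=True)
These parameters give you some flexibility when saving: setting them to True keeps cookies that would otherwise be skipped, so that the next load has more information to work with.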

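Using saved cookies

The load method reads saved cookies from a file and restores them into the CookieJar instance. The ignore_discard and ignore_expires parameters control whether cookies marked as discard or already expired are loaded as well. Both default to False.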

from http.cookiejar import MozillaCookieJar
from urllib.request import HTTPCookieProcessor, build_opener

default_cookie_file: str = 'temp_cookies.txt'
user_define_url: str = 'https://www.baidu.com'

user_url_cookies: MozillaCookieJar = MozillaCookieJar()
user_url_cookies.load(default_cookie_file, ignore_discard=True, ignore_expires=True)
cookies_handler = HTTPCookieProcessor(user_url_cookies)
opener = build_opener(cookies_handler)
response = opener.open(user_define_url)
print(response.status)
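
The effect of ignore_discard can be seen in a save and load round trip. The sketch below uses a throwaway local server that sets one session cookie and one cookie with a lifetime; the server, cookie names and file names are assumptions made for this illustration.

```python
import os
import tempfile
import threading
from http.cookiejar import MozillaCookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import HTTPCookieProcessor, ProxyHandler, build_opener


class CookieHandler(BaseHTTPRequestHandler):
    """Local test server that sets a session cookie and a persistent one."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Set-Cookie", "session=abc123; Path=/")
        self.send_header("Set-Cookie", "theme=dark; Max-Age=3600; Path=/")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the console quiet


server = HTTPServer(("127.0.0.1", 0), CookieHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

jar = MozillaCookieJar()
opener = build_opener(ProxyHandler({}), HTTPCookieProcessor(jar))
opener.open(f"http://127.0.0.1:{server.server_port}/", timeout=5).close()
server.shutdown()
server.server_close()

with tempfile.TemporaryDirectory() as tmp:
    default_file = os.path.join(tmp, "default.txt")
    full_file = os.path.join(tmp, "full.txt")
    jar.save(default_file)                    # session cookie is skipped
    jar.save(full_file, ignore_discard=True)  # session cookie is kept

    default_jar = MozillaCookieJar()
    default_jar.load(default_file)
    full_jar = MozillaCookieJar()
    full_jar.load(full_file, ignore_discard=True)

saved_by_default = sorted(cookie.name for cookie in default_jar)
saved_with_discard = sorted(cookie.name for cookie in full_jar)
print(saved_by_default)    # ['theme']
print(saved_with_discard)  # ['session', 'theme']
```

Note that ignore_discard=True is needed on load as well; a session cookie that made it into the file is otherwise skipped when reading it back.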

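Installing a global opener

install_opener is a function in the urllib.request module that installs a custom opener as the global default. Below we make a small change to the proxy example so that its opener becomes the global default. After that, every urlopen call in the program goes through this opener, while openers you build and call directly are unaffected: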

from urllib.request import ProxyHandler, build_opener, urlopen, install_opener
from urllib.error import URLError
from http.client import HTTPResponse


default_username: str = 'username'
default_password: str = 'password'
user_define_url = 'https://httpbin.org/get'

default_proxy: dict[str, str] = {'http': 'http://127.0.0.1:8080',
                                'https': 'http://127.0.0.1:8080'}
proxy_handler = ProxyHandler(default_proxy)
global_opener = build_opener(proxy_handler)

install_opener(opener=global_opener)

try:
    res: HTTPResponse = urlopen(user_define_url)
    html_document = res.read().decode('utf-8')
    print(html_document)
except URLError as e:
    print(e.reason)
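
The same mechanism works for any opener setting, not only proxies. As a sketch, the snippet below installs an opener with a custom User-Agent and confirms that a plain urlopen call picks it up; the agent string and the local echo server are assumptions made for this illustration.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener, install_opener, urlopen


class EchoAgentHandler(BaseHTTPRequestHandler):
    """Local test server that sends back the User-Agent it received."""

    def do_GET(self):
        body = self.headers.get("User-Agent", "").encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet


server = HTTPServer(("127.0.0.1", 0), EchoAgentHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

global_opener = build_opener(ProxyHandler({}))  # no system proxy
global_opener.addheaders = [("User-Agent", "demo-crawler/0.1")]
install_opener(global_opener)

# urlopen now goes through the installed opener.
with urlopen(url, timeout=5) as response:
    seen_user_agent = response.read().decode("utf-8")

server.shutdown()
server.server_close()
print(seen_user_agent)  # demo-crawler/0.1
```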

Parsing and encoding with urllib

Recognizing and splitting a URL with urlparse

from urllib.parse import urlparse

my_define_url: str = 'http://www.example.com/index.php;default?username=xx&passwd=xxx#comment'
parse_res = urlparse(my_define_url)
print(parse_res)
""" Output:
    ParseResult(scheme='http', netloc='www.example.com', path='/index.php', params='default'\
    , query='username=xx&passwd=xxx', fragment='comment')
"""

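The result is a named tuple (ParseResult), so it can be indexed or unpacked like a plain tuple. urlparse takes three parameters:

urlstring (required): the URL string to parse. This is the only required parameter.
scheme: the default scheme. It is used only when the URL string does not specify one (such as "http://" or "https://"); a scheme present in the URL always wins. The default is an empty string.
allow_fragments: whether to recognize the fragment identifier. If False, the fragment is not split off and stays part of the preceding component (the path, params or query). The default is True, which separates the fragment into its own field.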
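
A short sketch of the two optional parameters in action. One point that is easy to miss: the host is only recognized as netloc when the string contains '//'; otherwise everything lands in path. The example URLs are made up for illustration.

```python
from urllib.parse import urlparse

# The default scheme fills in only when the URL has none.
with_slashes = urlparse("//www.example.com/index.php", scheme="https")
print(with_slashes.scheme, with_slashes.netloc, with_slashes.path)
# https www.example.com /index.php

# Without '//' the host is not recognized and ends up in path.
without_slashes = urlparse("www.example.com/index.php", scheme="https")
print(repr(without_slashes.netloc), without_slashes.path)
# '' www.example.com/index.php

# A scheme in the URL itself takes priority over the default.
explicit = urlparse("http://www.example.com/", scheme="https")
print(explicit.scheme)  # http

# With allow_fragments=False the '#...' part stays in the path.
kept = urlparse("http://www.example.com/index.php#comment", allow_fragments=False)
print(kept.path, repr(kept.fragment))
# /index.php#comment ''
```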

Building a URL with urlunparse

urlunparse takes an iterable of exactly six items and assembles them into a URL in order (be careful not to mix up the order):

from urllib.parse import urlunparse

# urlunparse adds the missing leading '/' to the path when a netloc is present
my_url_subsection: tuple[str, ...] = ('http', 'www.example.com', 'index.php', 
                                 'default','username=xx&passwd=xxx', 'comment')
parse_res = urlunparse(my_url_subsection)
print(parse_res)
""" Output:
    http://www.example.com/index.php;default?username=xx&passwd=xxx#comment
"""

Splitting a URL with urlsplit

from urllib.parse import urlsplit

my_define_url: str = 'http://www.example.com/index.php;default?username=xx&passwd=xxx#comment'
parse_res = urlsplit(my_define_url)
print(parse_res)
""" Output:
    SplitResult(scheme='http', netloc='www.example.com', path='/index.php;default'\
    , query='username=xx&passwd=xxx', fragment='comment')
"""

This is similar to urlparse, except that params is not split out and stays in path. It also returns a named tuple (SplitResult).
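
Because the result is a named tuple, it offers more than the five fields: derived attributes such as hostname and port, the usual _replace method for making a modified copy, and geturl() for putting the pieces back together. A small sketch with a made-up URL:

```python
from urllib.parse import urlsplit, urlunsplit

original = "http://user:pw@www.example.com:8080/index.php;default?username=xx#comment"
parts = urlsplit(original)

print(parts.hostname, parts.port)      # www.example.com 8080
print(parts.username, parts.password)  # user pw
print(parts[2] == parts.path)          # True, index and name are equivalent

# _replace returns a modified copy; geturl() reassembles it.
next_page = parts._replace(query="page=2", fragment="")
print(next_page.geturl())
# http://user:pw@www.example.com:8080/index.php;default?page=2

print(urlunsplit(parts) == original)   # True
```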

Combining a URL with urlunsplit

The difference from urlunparse is that the iterable must have exactly five items.

from urllib.parse import urlunsplit

my_url_subsection: tuple[str, ...] = ('http', 'www.example.com', 'index.php;default',
                                 'username=xx&passwd=xxx', 'comment')
parse_res = urlunsplit(my_url_subsection)
print(parse_res)
""" Output:
    http://www.example.com/index.php;default?username=xx&passwd=xxx#comment
"""

Encoding data with urlencode

from urllib.parse import urlencode

data_dict: dict[str, str] = {
    'name': 'Super Kun Kun',
    'content': 'The way you walk right in front of me makes me so excited'
}

encode_data: str = urlencode(data_dict, encoding='utf-8')

print(encode_data)
""" Output:
    name=Super+Kun+Kun&content=The+way+you+walk+right+in+front+of+me+makes+me+so+excited
"""

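Joining URLs with urljoin

urljoin joins URLs. The first argument is the base URL and the second is a new URL, which may be relative or absolute. A relative URL is resolved against the base. For an absolute URL, whatever scheme and netloc the second argument supplies take priority, so if they differ from the base the result is simply the new URL. When the second argument supplies a path, the params, query and fragment of the base are dropped from the result.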

from urllib.parse import urljoin

print(urljoin('http://www.example.com', 'https://www.example.com'))
print(urljoin('http://www.example.com?submit=xxx', 'http://www.example.com/index.php'))
print(urljoin('http://www.example.com#submit', 'http://www.example.com/index.php'))
print(urljoin('http://www.example.com', '/index.php'))
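
For crawling, the most common use is turning the links found on a page into absolute URLs. The sketch below resolves a few typical link forms against one base; the base URL and links are made up for illustration.

```python
from urllib.parse import urljoin

base = "http://www.example.com/a/b.html?old=1#frag"

links = ["c.html", "../c.html", "/index.php", "?page=2",
         "https://other.example.org/x"]
resolved = {link: urljoin(base, link) for link in links}

for link, absolute in resolved.items():
    print(f"{link!r:32} -> {absolute}")
# 'c.html'                      -> http://www.example.com/a/c.html
# '../c.html'                   -> http://www.example.com/c.html
# '/index.php'                  -> http://www.example.com/index.php
# '?page=2'                     -> http://www.example.com/a/b.html?page=2
# 'https://other.example.org/x' -> https://other.example.org/x
```

A link that consists only of a query string keeps the base path, which is why '?page=2' still points at b.html.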

Recovering GET parameters as a dict with parse_qs

from urllib.parse import parse_qs

query_data: str = 'username=xx&password=xxx'
print(parse_qs(query_data))
# Output: {'username': ['xx'], 'password': ['xxx']}

Recovering GET parameters as a list of pairs with parse_qsl

from urllib.parse import parse_qsl

query_data: str = 'username=xx&password=xxx'
print(parse_qsl(query_data))
# Output: [('username', 'xx'), ('password', 'xxx')]
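
Real query strings often contain repeated keys, percent-encoded text and empty values. The sketch below, using a made-up query string, shows how parse_qs and parse_qsl treat them and how urlencode turns the result back into a query string.

```python
from urllib.parse import parse_qs, parse_qsl, urlencode

query = "tag=a&tag=b&q=%E4%BD%A0%E5%A5%BD+world&empty="

# Repeated keys are collected into one list; blank values are dropped.
as_dict = parse_qs(query)
print(as_dict)
# {'tag': ['a', 'b'], 'q': ['你好 world']}

# keep_blank_values=True keeps 'empty' as an empty string.
as_pairs = parse_qsl(query, keep_blank_values=True)
print(as_pairs)
# [('tag', 'a'), ('tag', 'b'), ('q', '你好 world'), ('empty', '')]

# urlencode accepts a list of pairs and reverses the decoding.
rebuilt = urlencode(as_pairs)
print(rebuilt == query)  # True

# For dict values that are lists, doseq=True emits one pair per item.
print(urlencode({"tag": ["a", "b"]}, doseq=True))  # tag=a&tag=b
```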

Handling Chinese characters in URLs with quote and unquote

from urllib.parse import quote, unquote

print(quote('你好世界'))
print(unquote('%E4%BD%A0%E5%A5%BD%E4%B8%96%E7%95%8C'))
# Output:
# %E4%BD%A0%E5%A5%BD%E4%B8%96%E7%95%8C
# 你好世界
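
quote has a few knobs that matter in practice: the safe parameter (which defaults to '/'), the encoding parameter for sites that do not use UTF-8, and the quote_plus variant used for form data. A short sketch with made-up inputs:

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

# '/' is left alone by default, so whole paths can be quoted.
print(quote("/path with space/"))  # /path%20with%20space/

# safe='' also escapes '/', useful for a single path segment.
print(quote("a/b", safe=""))       # a%2Fb

# quote_plus is meant for query values: spaces become '+'.
print(quote_plus("a b&c=d"))       # a+b%26c%3Dd
print(unquote_plus("a+b%26c%3Dd")) # a b&c=d

# Some older sites expect GBK instead of UTF-8.
gbk_text = quote("你好", encoding="gbk")
print(gbk_text)                           # %C4%E3%BA%C3
print(unquote(gbk_text, encoding="gbk"))  # 你好
```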

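Parsing robots.txt files

The robotparser module in urllib parses robots.txt files. Its RobotFileParser class does the parsing and takes a single optional argument, the URL of the robots.txt file. Its commonly used methods are:

Method	Description
`set_url()`	Sets the URL of the robots.txt file. If the URL was already passed in when the `RobotFileParser` object was created, there is no need to call this method.
`read()`	Fetches the robots.txt file and parses it. If neither this method nor `parse()` has been called, every later check returns False, so remember to call it. The method returns nothing; it only performs the fetch and parse.
`parse()`	Parses robots.txt content. The argument is a list of lines from a robots.txt file, which are analyzed according to the robots.txt syntax rules.
`can_fetch()`	Takes two arguments: the User-agent and the URL to fetch. Returns True or False depending on whether that crawler is allowed to fetch the URL.
`mtime()`	Returns the time the robots.txt file was last fetched and parsed. This is useful for long-running crawlers, which may need to re-fetch robots.txt periodically.
`modified()`	Sets the time the robots.txt file was last fetched and parsed to the current time. This is likewise helpful for long-running crawlers.

A possible robots.txt file, with comments explaining the syntax: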

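# Do not allow WebCrawler to crawl the site
User-agent: WebCrawler
Disallow: /

# Googlebot may access every page except the /private/ and /restricted/ directories
User-agent: Googlebot
Disallow: /private/
Disallow: /restricted/

# Bingbot may only access the /public/ directory and the /allowed-page.html page
User-agent: Bingbot
Allow: /public/
Allow: /allowed-page.html
Disallow: /

# Restrict the specific crawler "BadBot" to the /public/ directory
User-agent: BadBot
Allow: /public/
Disallow: /

# Forbid all other crawlers from accessing pages under /admin/
User-agent: *
Disallow: /admin/
# Crawl delay: wait at least 5 seconds between requests
# (kept inside the group above, because a blank line would end the group)
Crawl-delay: 5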

Next, we test this against a local lab server:

from urllib.robotparser import RobotFileParser

robot_txt_parser: RobotFileParser = RobotFileParser()

allow_agent: str = 'Googlebot'
disallow_agent: str = 'WebCrawler'
user_define_url: str = 'http://192.168.179.144'
default_url_robots: str = user_define_url + '/robots.txt'
robot_txt_parser.set_url(default_url_robots)
robot_txt_parser.read()
print(robot_txt_parser.can_fetch(allow_agent, user_define_url))		# prints True
print(robot_txt_parser.can_fetch(disallow_agent, user_define_url))	# prints False
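
The rules can also be checked without any server by feeding the text straight to parse(). The sketch below uses a shortened, made-up robots.txt and the placeholder host www.example.com; crawl_delay() is available from Python 3.6.

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: WebCrawler
Disallow: /

User-agent: Bingbot
Allow: /public/
Disallow: /

User-agent: *
Crawl-delay: 5
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())  # parse() takes a list of lines

site = "http://www.example.com"
checks = {
    ("WebCrawler", "/index.html"): parser.can_fetch("WebCrawler", site + "/index.html"),
    ("Bingbot", "/public/a.html"): parser.can_fetch("Bingbot", site + "/public/a.html"),
    ("Bingbot", "/private.html"): parser.can_fetch("Bingbot", site + "/private.html"),
    ("OtherBot", "/index.html"): parser.can_fetch("OtherBot", site + "/index.html"),
    ("OtherBot", "/admin/panel"): parser.can_fetch("OtherBot", site + "/admin/panel"),
}
for (agent, path), allowed in checks.items():
    print(f"{agent:10} {path:16} {allowed}")

# Crawlers without a group of their own fall back to the '*' group.
default_delay = parser.crawl_delay("OtherBot")
print(default_delay)  # 5
```

Python's parser applies the first rule in a group that matches the path, which is why the Allow lines for Bingbot come before its Disallow: / line.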