1. urllib: a Python built-in library
1.1 The request module
-
Import the module
from urllib import request
request.urlopen() returns a response object whose body is read as bytes
decode('utf-8')  # converts the bytes to str
-
One type and six methods
-
The one type:
<class 'http.client.HTTPResponse'>
-
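request.urlopen().read(num): reads up to num bytes; with no argument it reads the whole body
-
request.urlopen().readline(): reads a single line
-
request.urlopen().readlines(): reads line by line until the end and returns a list
-
request.urlopen().getheaders(): returns the response headers
-
request.urlopen().getcode(): returns the HTTP status code
-
request.urlopen().geturl(): returns the URL of the resource that was retrieved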
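The six methods above can be tried together in one short sketch. So that it runs without internet access, it starts a throwaway local server; `DemoHandler`, `inspect_response`, and the three-line body are made up for illustration and are not part of the original notes.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical local handler that always returns three lines of text."""

    def do_GET(self):
        body = "line one\nline two\nline three\n".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def inspect_response(url):
    """Call each of the six methods once and collect the results."""
    info = {}
    with request.urlopen(url) as resp:
        info["type"] = type(resp).__name__
        info["status"] = resp.getcode()
        info["url"] = resp.geturl()
        info["content_type"] = dict(resp.getheaders()).get("Content-Type")
        info["first_5_bytes"] = resp.read(5)        # read by bytes
        info["rest_of_line"] = resp.readline()      # rest of the first line
        info["remaining_lines"] = resp.readlines()  # everything that is left
    return info


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
demo_url = f"http://127.0.0.1:{server.server_port}/"

result = inspect_response(demo_url)
server.shutdown()
server.server_close()

for key, value in result.items():
    print(key, "->", value)
```

Note that the reads share one position in the body: after `read(5)`, `readline()` continues from the sixth byte instead of starting over.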
-
1.2 Downloading with urllib
request.urlretrieve(url, filename)
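A minimal sketch of `urlretrieve()`, assuming a local `file://` URL as a stand-in for a real download address so that it runs offline. The file names `source.txt` and `copy.txt` are hypothetical.

```python
import tempfile
from pathlib import Path
from urllib import request

# Create a small file that plays the role of the remote resource.
workdir = Path(tempfile.mkdtemp())
source = workdir / "source.txt"
source.write_text("hello urlretrieve", encoding="utf-8")

# First argument: the address to download; second: where to save it.
target = workdir / "copy.txt"
saved_path, headers = request.urlretrieve(source.as_uri(), str(target))

print(saved_path)
print(target.read_text(encoding="utf-8"))
```

`urlretrieve()` returns a tuple of the saved file name and the response headers, so the call can be checked without opening the file by hand.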
1.3 URL encoding
-
The parse module handles URL encoding
from urllib import request, parse
-
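parse.quote('凌凌漆'): percent-encodes the Chinese characters (each UTF-8 byte becomes %XX)
-
parse.urlencode(data): builds a query string and percent-encodes any Chinese in the parameters; data is usually a dict (a sequence of two-item tuples also works)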
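The two calls can be compared side by side in a short sketch. The parameter names `wd` and `page` are hypothetical and only serve to show the output format.

```python
from urllib import parse

name = '凌凌漆'

# quote() encodes a single string.
quoted = parse.quote(name)
print(quoted)  # %E5%87%8C%E5%87%8C%E6%BC%86

# urlencode() encodes a whole set of parameters and joins them with '&'.
params = {'wd': name, 'page': 1}
query = parse.urlencode(params)
print(query)  # wd=%E5%87%8C%E5%87%8C%E6%BC%86&page=1

# unquote() reverses quote(), which is handy when checking the result.
restored = parse.unquote(quoted)
print(restored)  # 凌凌漆
```

`quote()` fits a single path segment or value, while `urlencode()` is the usual choice for a full set of query parameters.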
1.4 Request(): customizing the request object
-
request.Request() lets you customize the request headers, the POST parameters, and so on
req = request.Request(url=my_url, headers=headers)
-
Open it with urlopen()
my_html = request.urlopen(req)
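A Request object can be inspected before it is sent, which helps confirm that the headers were attached. This is a minimal sketch: the address and the shortened User-Agent string are placeholders, and nothing is actually sent.

```python
from urllib import request

my_url = 'https://example.com/'          # placeholder address
headers = {'User-Agent': 'Mozilla/5.0'}  # shortened UA, for illustration only

req = request.Request(url=my_url, headers=headers)

print(req.full_url)        # https://example.com/
print(req.get_method())    # GET
print(req.header_items())  # [('User-agent', 'Mozilla/5.0')]
```

Request stores header names in capitalized form, so the key comes back as `User-agent` even though it was passed in as `User-Agent`.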
1.4.1 POST requests
-
When data is passed to request.Request(), the request is sent as a POST; data must be bytes
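The switch from GET to POST can be observed without sending anything. In this sketch the form field `kw` and the address are hypothetical; the point is only that supplying `data` changes the method.

```python
from urllib import parse, request

form = {'kw': '凌凌漆'}  # hypothetical form field

# urlencode() returns str, so encode it to bytes before passing it as data.
data = parse.urlencode(form).encode('utf-8')

get_req = request.Request('https://example.com/search')
post_req = request.Request('https://example.com/search', data=data)

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
print(post_req.data)          # b'kw=%E5%87%8C%E5%87%8C%E6%BC%86'
```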
1.5 Exceptions
-
HTTPError
-
URLError
-
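These are the two most commonly seen exceptions; both live in urllib.error
from urllib import request, error

try:
    req = request.Request(url, None, headers)
    # Create a handler object
    handler = request.HTTPHandler()
    # Create an opener object
    opener = request.build_opener(handler)
    # Call open()
    response = opener.open(req)
    content = response.read().decode('utf-8')
    print(content)
except error.HTTPError as e:
    # The server replied, but with an error status code
    print('Request failed with status', e.code)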
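Since HTTPError is a subclass of URLError, the order of the except clauses matters when both are caught. The following is a sketch under stated assumptions: `fetch` is a hypothetical helper, and the unsupported URL scheme is only there to trigger a URLError without touching the network.

```python
from urllib import error, request


def fetch(url, timeout=5):
    """Return (ok, text); ok is False when the request fails."""
    try:
        with request.urlopen(url, timeout=timeout) as resp:
            return True, resp.read().decode('utf-8')
    except error.HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        return False, f'HTTP error {e.code}: {e.reason}'
    except error.URLError as e:
        # No usable response: bad host, refused connection, unknown scheme.
        return False, f'URL error: {e.reason}'


print(issubclass(error.HTTPError, error.URLError))  # True

# An unsupported scheme raises URLError before any connection is made.
outcome = fetch('nosuchscheme://example.invalid/')
print(outcome)
```

Catching URLError first would also swallow every HTTPError, so the more specific class goes on top.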
1.6 Handlers
req = request.Request(url, None, headers)
# Create a handler object
handler = request.HTTPHandler()
# Create an opener object
opener = request.build_opener(handler)
# Call open()
response = opener.open(req)
content = response.read().decode('utf-8')
print(content)
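The handler, opener, open() chain can be exercised offline. This sketch assumes a `data:` URL as a stand-in for a real page, which works because build_opener() keeps its default handlers next to the one passed in.

```python
from urllib import request

handler = request.HTTPHandler()
opener = request.build_opener(handler)

# The opener holds our handler plus the default ones.
handler_names = [type(h).__name__ for h in opener.handlers]
print(handler_names)

# A data: URL carries its own content, so no network access is needed.
response = opener.open('data:text/plain;charset=utf-8,hello%20opener')
content = response.read().decode('utf-8')
print(content)  # hello opener
```

request.urlopen() goes through a default opener of the same kind, so building one by hand is mainly useful when extra handlers such as a proxy handler are needed.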
1.7 Proxies
-
request.ProxyHandler(proxies=proxies)
proxies = {
    'http': '61.216.185.88'
}
req = request.Request(url, None, headers)
handler = request.ProxyHandler(proxies=proxies)
opener = request.build_opener(handler)
response = opener.open(req)
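A proxy opener can be built and checked before any request goes through it. In this sketch the proxy address `127.0.0.1:8888` is hypothetical; the keys of the dict are URL schemes, and each value is usually written as host:port.

```python
from urllib import request

# Hypothetical proxy address; one entry per URL scheme to be proxied.
proxies = {
    'http': '127.0.0.1:8888',
    'https': '127.0.0.1:8888',
}

handler = request.ProxyHandler(proxies=proxies)
opener = request.build_opener(handler)

# The handler keeps the mapping it was given.
print(handler.proxies)

# Confirm that our proxy handler is the one installed in the opener.
proxy_handlers = [h for h in opener.handlers
                  if isinstance(h, request.ProxyHandler)]
print(len(proxy_handlers))  # 1
```

Only requests made with `opener.open()` use this proxy; plain `request.urlopen()` calls are unaffected unless the opener is installed globally.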
2. Handling anti-scraping measures
-
Set the User-Agent (UA)
headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36' }
-
Set the Cookie
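One simple way to send a cookie with urllib is to put it in the request headers next to the User-Agent. This is a minimal sketch: the cookie names and values, the shortened UA, and the address are all hypothetical, and a real cookie would normally be copied from a logged-in browser session.

```python
from urllib import request

# Hypothetical cookie values, for illustration only.
cookies = {'sessionid': 'abc123', 'lang': 'en'}

# A Cookie header is a list of name=value pairs separated by '; '.
cookie_header = '; '.join(f'{name}={value}' for name, value in cookies.items())

headers = {
    'User-Agent': 'Mozilla/5.0',  # shortened UA
    'Cookie': cookie_header,
}
req = request.Request('https://example.com/', headers=headers)

print(req.get_header('Cookie'))  # sessionid=abc123; lang=en
```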
3. XPath parsing
4. bs4 parsing
5. selenium
-
Install the Chrome driver, chromedriver.exe; for any other browser, install that browser's matching driver
-
Install selenium
pip install selenium
-
Configuration
from selenium import webdriver

# Keep the browser open after the script finishes
option = webdriver.ChromeOptions()
option.add_experimental_option("detach", True)
# Pass the options to Chrome (Selenium 4 uses options= instead of chrome_options=)
driver = webdriver.Chrome(options=option)
url = 'https://pic.netbian.com'
driver.get(url)
-
Related APIs
driver.forward(): go forward
driver.back(): go back
driver.refresh(): refresh the page
-
n = 3