A temporary workaround for how Python's urllib/parse.py handles links containing Chinese text

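While crawling web pages with Python's PyQuery library, a URL containing Chinese text caused a ValueError. This post records the error message and a temporary workaround: commenting out the part of the urllib.parse module that checks the URL after normalization.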

Crawling one particular web page failed.

The following error was raised:

    netloc '微信小程序:某某某' contains invalid characters under NFKC normalization
        f'))
      File "/usr/local/lib/python3.6/site-packages/pyquery/pyquery.py", line 714, in each
        if callback(func, i, element) is False:
      File "/usr/local/lib/python3.6/site-packages/pyquery/pyquery.py", line 132, in callback
        return func(*args[:func_code(func).co_argcount])
      File "/usr/local/lib/python3.6/site-packages/pyquery/pyquery.py", line 1690, in rep
        urljoin(base_url, attr_value.strip()))
      File "/usr/lib64/python3.6/urllib/parse.py", line 512, in urljoin
        urlparse(url, bscheme, allow_fragments)
      File "/usr/lib64/python3.6/urllib/parse.py", line 368, in urlparse
        splitresult = urlsplit(url, scheme, allow_fragments)
      File "/usr/lib64/python3.6/urllib/parse.py", line 441, in urlsplit
        _checknetloc(netloc)
      File "/usr/lib64/python3.6/urllib/parse.py", line 410, in _checknetloc
        "characters under NFKC normalization")
    ValueError: netloc '微信小程序:某某某' contains invalid characters under NFKC normalization

 

Debugging showed that parse.py fails when it processes links of the following form:

<a href="http://微信小程序:某某某某" target="_blank">
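The failure can be reproduced without PyQuery. The sketch below assumes that the colon after 微信小程序 in the original page was a fullwidth colon (U+FF1A), because an ASCII colon would not trigger this check; try_split is a hypothetical helper name used only for illustration.

```python
from urllib.parse import urlsplit

# Assumption: the original href used a fullwidth colon (U+FF1A), not ASCII ':'.
FULLWIDTH_COLON = "\uff1a"
bad_href = "http://微信小程序" + FULLWIDTH_COLON + "某某某某"


def try_split(url):
    """Return the parsed netloc, or the ValueError message if parsing fails."""
    try:
        return urlsplit(url).netloc
    except ValueError as exc:
        return "ValueError: " + str(exc)


# Raises inside _checknetloc on Python versions that include the check.
print(try_split(bad_href))

# A host made only of ordinary Chinese characters parses without error.
print(try_split("http://微信小程序/某某某某"))
```

On an interpreter that includes the NFKC check, the first call reports the ValueError and the second returns the netloc unchanged.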

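From what I could find,

unicodedata.normalize applies Unicode normalization (NFKC here) to the network location (netloc) part of the URL.

My guess is that the Chinese URL is what triggers the error. More precisely, ordinary Chinese characters are unchanged by NFKC; the check only raises when some character normalizes to one of the delimiters /?#@:, and the most likely culprit here is a fullwidth colon (U+FF1A) after 微信小程序, which NFKC turns into an ASCII colon.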
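To see which character is affected, each character of the netloc can be normalized on its own. This is a small sketch under the assumption that the link contained a fullwidth colon (U+FF1A); the variable names are illustrative.

```python
import unicodedata

# Assumption: the netloc contained a fullwidth colon (U+FF1A).
netloc = "微信小程序\uff1a某某某某"
normalized = unicodedata.normalize("NFKC", netloc)

# Collect every character that NFKC rewrites, paired with its replacement.
changed = [
    (ch, unicodedata.normalize("NFKC", ch))
    for ch in netloc
    if unicodedata.normalize("NFKC", ch) != ch
]

print(changed)            # only the fullwidth colon changes, becoming ':'
print(":" in normalized)  # True: a URL delimiter appears after normalization
```

The Chinese characters themselves pass through untouched; only the fullwidth colon is rewritten, and the resulting ASCII colon is what _checknetloc rejects.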

I found no fix online and did not have the energy to dig deeper, so for now I simply commented out this check.

Comment out lines 407 to 410 of parse.py:


vi /usr/lib64/python3.6/urllib/parse.py:

    394 def _checknetloc(netloc):
    395     if not netloc or not any(ord(c) > 127 for c in netloc):
    396         return
    397     # looking for characters like \u2100 that expand to 'a/c'
    398     # IDNA uses NFKC equivalence, so normalize for this check
    399     import unicodedata
    400     n = netloc.replace('@', '')   # ignore characters already included
    401     n = n.replace(':', '')        # but not the surrounding text
    402     n = n.replace('#', '')
    403     n = n.replace('?', '')
    404     netloc2 = unicodedata.normalize('NFKC', n)
    405     if n == netloc2:
    406         return
    407     # for c in '/?#@:':
    408     #     if c in netloc2:
    409     #         raise ValueError("netloc '" + netloc + "' contains invalid " +
    410     #                          "characters under NFKC normalization")
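Editing a file under /usr/lib64 affects every program that uses that interpreter, and the change may be overwritten by a package update. A less invasive sketch of the same workaround is to replace the check at runtime, only around the calls that need it. Note that _checknetloc is a private helper of urllib.parse, so depending on it is an assumption that may break in other Python versions, and netloc_check_disabled is a hypothetical name. The check was added as security hardening (CVE-2019-9636), so it seems prudent to disable it as narrowly as possible.

```python
import contextlib
import urllib.parse


@contextlib.contextmanager
def netloc_check_disabled():
    """Temporarily replace urllib.parse._checknetloc with a no-op.

    Relies on a private function; treat this as a stopgap, not an API.
    """
    original = urllib.parse._checknetloc
    urllib.parse._checknetloc = lambda netloc: None
    try:
        yield
    finally:
        urllib.parse._checknetloc = original
        # Drop results that were cached while the check was switched off.
        urllib.parse.clear_cache()


# Assumption: the problematic link used a fullwidth colon (U+FF1A).
bad_url = "http://微信小程序\uff1a某某某某/"

with netloc_check_disabled():
    parts = urllib.parse.urlsplit(bad_url)
    print(parts.netloc)
```

Outside the with block the original check is active again, so other code in the same process keeps the stricter behavior.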

 

I hope there is a better way to solve this.
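One candidate that leaves the standard library untouched is to treat such links as invalid and skip them. The sketch below is only an illustration of that idea: safe_urljoin is a hypothetical helper, http://example.com/page/ is a placeholder base URL, and the bad link again assumes a fullwidth colon (U+FF1A). The PyQuery code in the traceback calls urljoin internally, so using a wrapper like this means resolving the href values yourself rather than letting PyQuery do it.

```python
from urllib.parse import urljoin


def safe_urljoin(base_url, href):
    """Join href onto base_url; return None if urllib rejects the link."""
    try:
        return urljoin(base_url, href.strip())
    except ValueError:
        # Raised by urllib.parse for netlocs that fail the NFKC check.
        return None


base_url = "http://example.com/page/"  # placeholder base URL
hrefs = [
    "detail.html",
    "http://微信小程序\uff1a某某某某",  # assumed fullwidth colon
]

# Keep only the links that could be resolved.
absolute = [
    url
    for url in (safe_urljoin(base_url, href) for href in hrefs)
    if url is not None
]
print(absolute)  # ['http://example.com/page/detail.html']
```

Skipping the link keeps the security check in place, at the cost of dropping an href that was never a usable URL in the first place.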
