BeautifulSoup4 UserWarning

This article explains how to resolve the parser warning raised by BeautifulSoup and compares the available HTML parsers and their trade-offs, to help you choose the right one.


Error message:

/opt/ActivePython-2.7/lib/python2.7/site-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.



To get rid of this warning, change this:


 BeautifulSoup([your markup])


to this:


 BeautifulSoup([your markup], "lxml")


  markup_type=markup_type))
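The before/after shown in the warning can be reproduced directly: it is an ordinary Python UserWarning, so it can be captured with the standard `warnings` module. A minimal sketch (assumes BeautifulSoup 4 is installed):

```python
import warnings
from bs4 import BeautifulSoup

html = "<p>demo</p>"

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    BeautifulSoup(html)                   # no parser given: the warning fires
    BeautifulSoup(html, "html.parser")    # parser given explicitly: silent

for w in caught:
    print(w.category.__name__, str(w.message)[:60])
```

Only the first constructor call ends up in `caught`; naming the parser in the second call is exactly the fix the warning text suggests.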


Solution:

Pass the parser name explicitly when constructing the soup. The commonly used parsers are compared below:

Parser: Python standard library
Usage: BeautifulSoup(markup, "html.parser")
Advantages:
  • Built into Python, no extra install
  • Moderate speed
  • Reasonably lenient with bad markup
Disadvantages:
  • Poor tolerance of bad markup in versions before Python 2.7.3 / 3.2.2

Parser: lxml HTML parser
Usage: BeautifulSoup(markup, "lxml")
Advantages:
  • Very fast
  • Lenient with bad markup
Disadvantages:
  • Requires the lxml C library to be installed

Parser: lxml XML parser
Usage: BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml")
Advantages:
  • Very fast
  • The only parser that supports XML
Disadvantages:
  • Requires the lxml C library to be installed

Parser: html5lib
Usage: BeautifulSoup(markup, "html5lib")
Advantages:
  • Best error tolerance
  • Parses pages the same way a browser does
  • Produces valid HTML5
Disadvantages:
  • Very slow
  • External Python dependency (the html5lib package)
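The practical reason the warning matters is that these parsers repair malformed markup differently, so an implicitly chosen parser can change your program's output between machines. A small sketch that feeds the same broken markup to each parser (lxml and html5lib are optional installs, so missing ones are skipped):

```python
from bs4 import BeautifulSoup

# Deliberately malformed: unclosed <p> and <b> tags.
broken = "<p>first<p>second<b>bold"

for name in ("html.parser", "lxml", "html5lib"):
    try:
        soup = BeautifulSoup(broken, name)
    except Exception:  # bs4 raises FeatureNotFound if the parser isn't installed
        print(name, "-> not installed")
        continue
    print(name, "->", str(soup))
```

Each installed parser closes the dangling tags in its own way (html5lib additionally wraps the result in full <html><body> scaffolding), which is why pinning the parser name makes the script's behaviour reproducible.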

A related question:

import os
import requests
from bs4 import BeautifulSoup
import hashlib
import json

SAVE_DIR = r"D:\Onedrive\桌面\search engine"
DOCS_DIR = os.path.join(SAVE_DIR, "docs")
META_FILE = os.path.join(SAVE_DIR, "meta.json")
os.makedirs(DOCS_DIR, exist_ok=True)

def fetch(url):
    try:
        resp = requests.get(url, timeout=10)
        resp.encoding = resp.apparent_encoding
        html = resp.text
        soup = BeautifulSoup(html, "html.parser")
        title = soup.title.string.strip() if soup.title and soup.title.string else url
        for script in soup(["script", "style"]):
            script.decompose()
        text = soup.get_text(separator=" ", strip=True)
        summary = text[:200]
        return title, summary, text
    except Exception as e:
        print(f"Failed {url}: {e}")
        return url, "", ""

def save_doc(docid, text):
    path = os.path.join(DOCS_DIR, f"{docid}.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

if __name__ == "__main__":
    # List the URLs you want to crawl
    urls = [
        "https://www.python.org/",
        "https://zh.wikipedia.org/wiki/Python",
        "https://baike.baidu.com/item/Python/407313"
        # add more...
    ]
    meta = {}
    for url in urls:
        docid = hashlib.md5(url.encode()).hexdigest()
        title, summary, text = fetch(url)
        if text.strip():
            save_doc(docid, text)
            meta[docid] = {"url": url, "title": title, "summary": summary}
            print(f"Saved {url}")
    with open(META_FILE, "w", encoding="utf-8") as f:
        json.dump(meta, f, ensure_ascii=False, indent=2)
    print("Meta file written.")

Running the code produces:

D:\pycharm-professional-2024.3.4.exe\PythonProject\.venv\Scripts\python.exe D:\pycharm-professional-2024.3.4.exe\PythonProject\meta_generator.py
D:\pycharm-professional-2024.3.4.exe\PythonProject\.venv\Lib\site-packages\jieba\_compat.py:18: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
Traceback (most recent call last):
  File "D:\pycharm-professional-2024.3.4.exe\PythonProject\meta_generator.py", line 12, in <module>
    meta = json.load(f)
  File "D:\python\Lib\json\__init__.py", line 293, in load
    return loads(fp.read(), cls=cls, object_hook=object_hook, parse_float=parse_float, parse_int=parse_int, parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "D:\python\Lib\json\__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "D:\python\Lib\json\decoder.py", line 345, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "D:\python\Lib\json\decoder.py", line 363, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

How should this be fixed?
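The traceback comes from a `meta = json.load(f)` at line 12 of the asker's actual script (that line is not in the code shown above), and "Expecting value: line 1 column 1 (char 0)" means `meta.json` exists but is empty or not valid JSON. A hedged sketch of the usual guard, with `load_meta` and its path argument being illustrative names, not part of the original script:

```python
import json
import os

META_FILE = "meta.json"  # path assumed; the original uses os.path.join(SAVE_DIR, "meta.json")

def load_meta(path):
    """Return the saved metadata dict, or {} if the file is missing, empty, or corrupt."""
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        return {}
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except json.JSONDecodeError:
        # File exists but holds invalid JSON (e.g. an interrupted earlier run).
        return {}

meta = load_meta(META_FILE)
```

With this guard, a first run (no meta.json yet) starts from an empty dict instead of crashing, and the file is rewritten cleanly by the `json.dump` at the end of the script.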