A Brief Introduction to BeautifulSoup

jieleiping

License: CC 4.0 BY-SA

Category: python

Article link: https://blog.youkuaiyun.com/jieleiping/article/details/59056981

BeautifulSoup is a handy tool for parsing structured documents such as HTML and XML.
To install it with pip, run:
pip install beautifulsoup4
Then import the corresponding package in Python:
from bs4 import BeautifulSoup

1. Initializing BeautifulSoup
A BeautifulSoup object can be built from a markup string or from a local file, as shown below:
m1 = BeautifulSoup('<a href=\'www.baidu.com\'>baidu</a>', "html.parser")
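# Pass the parser here as well, and close the file once it has been parsed.
with open("index.html") as f:
    m2 = BeautifulSoup(f, "html.parser")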
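The second argument, "html.parser", names the parser explicitly. If it is omitted, BeautifulSoup still parses the document but emits a warning like this one: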
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
In addition, BeautifulSoup automatically closes tags that are left unpaired, for example:
print(BeautifulSoup("<a href='www.baidu.com'>baidu", "html.parser"))
The result is:
<a href="www.baidu.com">baidu</a>
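
As a small sketch of both initialization styles, the snippet below parses a string and then a file. The temporary file and its one-line markup are made up for illustration and simply stand in for index.html; the with blocks make sure the file handle is closed.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Parse from a string; naming the parser avoids the warning shown above.
from_string = BeautifulSoup("<a href='www.baidu.com'>baidu", "html.parser")

# Parse from a file object. The temporary file is a hypothetical stand-in
# for a real index.html.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write("<p>hello")
    with open(path, encoding="utf-8") as f:
        from_file = BeautifulSoup(f, "html.parser")

print(from_string)  # the missing </a> has been added
print(from_file)    # the missing </p> has been added
```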

2. BeautifulSoup objects
BeautifulSoup parses the document into a tree in which every node is a Python object. There are four kinds of objects: BeautifulSoup, Tag, NavigableString, and Comment.
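Take the following HTML as an example. The tags inside <body> did not survive in the original listing, so the <b> comment and the <p> and <a> wrappers with their attributes are an assumed reconstruction, chosen to match the code used in this article:
<html>
<head>
<title>This is title</title>
</head>
<body>
<b><!--Sports--></b>
<p class="sports" name="zuqiu">Football</p>
<p class="sports">basketball</p>
<a class="sports">volleyball</a>
<a class="sports">baseball</a>
</body>
</html>
The following code shows which part of the document each of the four Python objects corresponds to:
# NOTE: the markup inside <body> is an assumed reconstruction (see above).
html = """
<html>
<head>
<title>This is title</title>
</head>
<body>
<b><!--Sports--></b>
<p class="sports" name="zuqiu">Football</p>
<p class="sports">basketball</p>
<a class="sports">volleyball</a>
<a class="sports">baseball</a>
</body>
</html>
"""
sp = BeautifulSoup(html, "html.parser")
print(type(sp))                 # the whole document
print(type(sp.html))            # a tag
print(type(sp.title.string))    # the text inside <title>
print(type(sp.body.b.string))   # the comment inside <b>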
The output is:
<class 'bs4.BeautifulSoup'>
<class 'bs4.element.Tag'>
<class 'bs4.element.NavigableString'>
<class 'bs4.element.Comment'>
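(1) The BeautifulSoup object represents the entire document. It can be thought of as the root of the parse tree, with the other three kinds of nodes hanging below it.
(2) A Tag object represents a tag in the document together with the content it encloses. Two commonly used attributes are name, which returns the tag's name, and attrs, which returns a dictionary of all the tag's attributes.
(3) A NavigableString is the text inside a tag and can be obtained through the .string attribute.
(4) A Comment object corresponds to a comment in the document. Comment is a special type of NavigableString; bs4 defines similar NavigableString subclasses for CData, Declaration, and Doctype. One thing to note: printing a Comment object on its own drops some information, namely the comment markers <!-- and -->.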
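
The four points above can be checked with a short sketch. The markup, the id attribute, and the comment text are invented for illustration; note that class comes back as a list because HTML allows several classes on one tag.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical markup, not taken from the article's example page.
doc = '<p class="sports" id="s1">Football</p><b><!--a note--></b>'
soup = BeautifulSoup(doc, "html.parser")

tag = soup.p
print(tag.name)    # p
print(tag.attrs)   # {'class': ['sports'], 'id': 's1'}
print(tag.string)  # Football

note = soup.b.string
print(isinstance(note, Comment))          # True
print(isinstance(note, NavigableString))  # True, Comment is a subclass
print(note)        # a note (printed without the comment markers)
print(soup.b)      # <b><!--a note--></b> (markers kept inside the tag)
```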

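3. Document parsing
Since BeautifulSoup turns the document into a tree, the usual tree operations are available, for example the attributes parent, children, descendants, next_sibling, next_siblings, and previous_sibling. The rest of this section covers the most commonly used search methods.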
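
A minimal sketch of the navigation attributes just listed, using a made-up list with no whitespace between tags (whitespace would otherwise show up as extra text nodes among the siblings):

```python
from bs4 import BeautifulSoup

# Hypothetical markup chosen to keep the tree small.
doc = "<ul><li>Football</li><li>basketball</li><li>volleyball</li></ul>"
soup = BeautifulSoup(doc, "html.parser")

first = soup.li  # the first <li> in the document
print(first.parent.name)                              # ul
print(first.next_sibling.string)                      # basketball
print([li.string for li in first.next_siblings])      # ['basketball', 'volleyball']
print(first.next_sibling.previous_sibling is first)   # True
print([child.string for child in soup.ul.children])   # direct children only

# descendants walks the whole subtree: each <li> followed by its text node.
descendant_count = len(list(soup.ul.descendants))
print(descendant_count)                               # 6
```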
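(1) find_all(name, attrs, recursive, string, limit, **kwargs) finds all nodes that satisfy the given conditions. (The string argument was called text before Beautiful Soup 4.4.0.)
name is the tag name, e.g. sp.find_all('p');
attrs is a dictionary of attribute names and values, e.g. sp.find_all(attrs={"name": "zuqiu"});
recursive controls whether the search covers all descendants (the default, True) or only direct children (False);
limit caps the number of results returned, since sometimes only the first few matches are needed;
find_all returns a list of results; if no node matches, it returns an empty list.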
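
The parameters above can be tried out on a small, invented fragment. The sketch also shows why the name attribute has to go through attrs: the name keyword is already taken by the tag-name argument.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two <p> directly under <div>, one nested in <span>.
doc = (
    '<div><p name="zuqiu">Football</p>'
    "<span><p>basketball</p></span>"
    "<p>volleyball</p></div>"
)
soup = BeautifulSoup(doc, "html.parser")
div = soup.div

all_p = div.find_all("p")                      # searches every descendant
direct_p = div.find_all("p", recursive=False)  # direct children only
first_two = div.find_all("p", limit=2)         # stops after two matches
by_name_attr = soup.find_all(attrs={"name": "zuqiu"})
missing = soup.find_all("table")               # no match gives an empty list

print([p.string for p in all_p])
print([p.string for p in direct_p])
print([p.string for p in first_two])
print(by_name_attr)
print(missing)
```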

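(2) find(name, attrs, recursive, string, **kwargs)
find works like find_all with limit set to 1: it returns the first matching node, or None if nothing matches.
find_all('p', limit=1) locates the same node as find('p'), but returns it wrapped in a one-element list, whereas find returns the node itself.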
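
A short sketch of the difference in return types, on made-up markup. Because find can return None, it is worth checking the result before accessing attributes on it.

```python
from bs4 import BeautifulSoup

# Hypothetical two-paragraph document.
soup = BeautifulSoup("<p>Football</p><p>basketball</p>", "html.parser")

one = soup.find("p")                   # a single Tag
as_list = soup.find_all("p", limit=1)  # a list holding that Tag
missing = soup.find("table")           # None, not an empty list

print(one)                # <p>Football</p>
print(as_list)            # [<p>Football</p>]
print(as_list[0] == one)  # True

# Guard against None before using the result.
if missing is None:
    print("no <table> in this document")
```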

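Some examples, using the sp object built in section 2:
import re  # needed for the regular-expression search below

print(sp.find_all('p'))
# all nodes whose tag is p
print(sp.find_all(attrs={"class": "sports"}))
# all nodes whose class is sports
print(sp.find_all('p', attrs={"class": "sports"}))
# all nodes whose tag is p and whose class is sports
print(sp.find_all('a', attrs={"class": re.compile("sp")}))
# all nodes whose tag is a and whose class contains sp (a fuzzy search)
print(sp.find_all(class_="sports"))
# the class_ keyword also matches all nodes whose class is sports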
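
For readers who want to run the class searches without the page from section 2, here is a self-contained sketch on invented markup. It relies on the fact that a compiled regular expression is applied as a search, so "sp" matches both "sports" and "sp".

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup with one non-matching class for contrast.
doc = (
    '<p class="sports">Football</p>'
    '<a class="sports">basketball</a>'
    '<a class="sp">volleyball</a>'
    '<a class="news">baseball</a>'
)
soup = BeautifulSoup(doc, "html.parser")

# Fuzzy match: <a> tags whose class contains "sp".
fuzzy = soup.find_all("a", attrs={"class": re.compile("sp")})
# Exact match on the class name, regardless of tag.
exact = soup.find_all(class_="sports")

print([t.string for t in fuzzy])  # ['basketball', 'volleyball']
print([t.name for t in exact])    # ['p', 'a']
```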

For other usages, see the documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html