[Python Modules] 9. beautifulsoup4

Originally published 2025-12-08 20:44:06 · 1k views

License: CC 4.0 BY-SA

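BeautifulSoup4: A Systematic Study Guide

BeautifulSoup4 (BS4 for short) is a Python library for parsing HTML and XML documents. It turns markup into a tree of nodes that you can traverse, search, and modify. This guide covers six areas in turn: installation and setup, creating the soup object, searching for nodes, traversing the tree, reading and modifying nodes, and helper methods, with worked cases and summary tables for the core methods.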

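I. Getting Started

1. Installation

# Install the core library
pip install beautifulsoup4
# Install a parser (lxml is recommended: fast and tolerant of bad markup)
pip install lxml
# Alternative parsers: html.parser (built into Python), html5lib (the most lenient; needs its own pip install)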

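2. Parser Comparison

Parser	Pros	Cons	Usage
`html.parser`	Built into Python, nothing extra to install	Less tolerant of malformed HTML	`BeautifulSoup(doc, 'html.parser')`
`lxml`	Fastest of the three, tolerant of malformed HTML	Requires a separate install	`BeautifulSoup(doc, 'lxml')`
`html5lib`	Most lenient: parses malformed HTML the way a browser does (HTML5 parsing rules)	Slowest of the three	`BeautifulSoup(doc, 'html5lib')`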
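
As a rough illustration of why the parser choice matters, the sketch below feeds a deliberately broken fragment to the built-in `html.parser`. The fragment is made up for this example. The results noted in the comments for `lxml` and `html5lib` are the ones reported in the Beautiful Soup documentation and are not executed here, since those parsers may not be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical broken fragment: an unclosed <a> followed by a stray </p>
broken = "<a></p>"

soup = BeautifulSoup(broken, "html.parser")
print(soup)  # <a></a>

# html.parser does not wrap a fragment in <html>/<body>
print(soup.find("html"))  # None

# Per the Beautiful Soup documentation (not run here):
#   lxml     -> <html><body><a></a></body></html>
#   html5lib -> <html><head></head><body><a><p></p></a></body></html>
```

Because the three parsers can build different trees from the same bad markup, it helps to name the parser explicitly rather than rely on whichever one happens to be installed.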

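3. Initialization Example

First define a test HTML document. All later cases build on it and are meant to be run in order, because the later sections modify the tree.

from bs4 import BeautifulSoup

# Test HTML document (the sample text is intentionally left in Chinese)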
html_doc = """
<html>
<head><title>测试页面</title></head>
<body>
<p class="title"><b>测试标题</b></p>
<p class="content">测试内容1</p>
<p class="content">测试内容2</p>
<a href="https://example1.com" class="link">链接1</a>
<a href="https://example2.com" class="link">链接2</a>
<div id="container">
    <p>容器内的段落</p>
</div>
</body>
</html>
"""

# Create the BeautifulSoup object
soup = BeautifulSoup(html_doc, 'lxml')
# Pretty-print the tree (handy for inspecting the structure)
print(soup.prettify())

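II. Core: Searching for Nodes

Searching is the heart of BS4. There are two families of search tools: the built-in find methods and CSS selectors. Between them they cover most lookup needs.

1. find_all(): find every matching node

Syntax

soup.find_all(name=None, attrs={}, recursive=True, string=None, limit=None, **kwargs)

Parameters

Parameter	Type	Description
`name`	string / list / regex / function	Matches the tag name (e.g. `p`, `['p','a']`, `re.compile('^p')`)
`attrs`	dict	Matches tag attributes (e.g. `{'class':'content'}`)
`recursive`	bool	Whether to search all descendants (default True; False searches direct children only)
`string`	string / regex / list / function	Matches the node's text. Older code spells this argument `text`; `string` is the current name (since BS4 4.4.0)
`limit`	int	Maximum number of results to return (default: all)
`**kwargs`	-	Attribute filters passed by name (e.g. `class_='content'`; note the trailing underscore, since `class` is a Python keyword)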
Cases

import re

# Case 1: find all p tags
all_p = soup.find_all('p')
print("All p tags:", [p.text for p in all_p])  # ['测试标题', '测试内容1', '测试内容2', '容器内的段落']

# Case 2: find p tags with class "content" (two ways)
# Way 1: keyword argument
content_p1 = soup.find_all('p', class_='content')
# Way 2: attrs dict
content_p2 = soup.find_all('p', attrs={'class': 'content'})
print("p tags with class=content:", [p.text for p in content_p1])  # ['测试内容1', '测试内容2']

# Case 3: find a tags whose href contains "example" (regex)
a_reg = soup.find_all('a', href=re.compile('example'))
print("a tags whose href contains example:", [a['href'] for a in a_reg])  # ['https://example1.com', 'https://example2.com']

# Case 4: find text nodes containing "测试" (any tag); the result is a list of strings, not tags
text_node = soup.find_all(string=re.compile('测试'))
print("Text nodes containing 测试:", text_node)  # ['测试页面', '测试标题', '测试内容1', '测试内容2']

# Case 5: return only the first 2 p tags
limit_p = soup.find_all('p', limit=2)
print("First 2 p tags:", [p.text for p in limit_p])  # ['测试标题', '测试内容1']
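
The parameter table also allows a list or a function for `name`, and `recursive=False` for direct children, which the cases above do not exercise. The following is a minimal sketch of those three filters. The HTML snippet and the helper name `has_class_but_no_id` are made up for illustration, and `html.parser` is used so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical document, used only for this sketch
html = (
    '<div id="box">'
    '<p class="note">one</p>'
    '<span><p>two</p></span>'
    '<a href="/x">link</a>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")
box = soup.find("div", id="box")

# A list matches any of the listed tag names, in document order
names = [t.name for t in box.find_all(["p", "a"])]
print(names)  # ['p', 'p', 'a']

# A function receives each tag and returns True to keep it
def has_class_but_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")

with_class = [t.get_text() for t in soup.find_all(has_class_but_no_id)]
print(with_class)  # ['one']

# recursive=False looks at direct children only, so the nested <p> is skipped
direct_p = [t.get_text() for t in box.find_all("p", recursive=False)]
print(direct_p)  # ['one']

# True as the name matches every tag
direct_tags = [t.name for t in box.find_all(True, recursive=False)]
print(direct_tags)  # ['p', 'span', 'a']
```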

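2. find(): find the first matching node

Syntax

soup.find(name=None, attrs={}, recursive=True, string=None, **kwargs)

(The parameters are the same as for find_all(), except that there is no limit argument. find() returns only the first match, or None when nothing matches.)

Cases

# Find the first p tag with class "title"
first_title_p = soup.find('p', class_='title')
print("First p with class=title:", first_title_p.text)  # 测试标题

# Read the href attribute of the first a tag
first_a_href = soup.find('a')['href']
print("href of the first a tag:", first_a_href)  # https://example1.com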

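3. Shortcut search methods (the find_xxx family)

BS4 provides a family of shortcut methods that apply the same filters as find/find_all to parents, siblings, and the nodes before or after a given node. The singular forms return one node (or None); the plural forms return a list. BS4 has no `find_child()` or `find_children()` method; for direct children, pass `recursive=False` to `find()` / `find_all()`.

Method	Purpose	Related form
`find_parent()`	Nearest matching ancestor (the direct parent when called without filters)	first item of `node.find_parents(limit=1)`
`find_parents()`	All matching ancestors	-
`find_next_sibling()`	Next matching sibling	first item of `node.find_next_siblings(limit=1)`
`find_previous_sibling()`	Previous matching sibling	first item of `node.find_previous_siblings(limit=1)`
`find_next_siblings()`	All following matching siblings	-
`find_previous_siblings()`	All preceding matching siblings	-
`find_next()`	First matching node later in the document (any level)	first item of `node.find_all_next(limit=1)`
`find_all_next()`	All matching nodes later in the document	-
`find_previous()`	First matching node earlier in the document	first item of `node.find_all_previous(limit=1)`
`find_all_previous()`	All matching nodes earlier in the document	-
`find(recursive=False)`	First matching direct child	first item of `node.find_all(recursive=False, limit=1)`
`find_all(recursive=False)`	All matching direct children	-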

Cases

# Locate the first p tag with class "content"
content_p = soup.find('p', class_='content')

# Case 1: find its parent (body)
parent_node = content_p.find_parent()
print("Parent name:", parent_node.name)  # body

# Case 2: find its previous sibling tag (the p with class=title)
prev_sibling = content_p.find_previous_sibling()
print("Previous sibling text:", prev_sibling.text)  # 测试标题

# Case 3: find the direct child tags of div#container (the p tag)
container = soup.find('div', id='container')
children = container.find_all(recursive=False)
print("Direct children of the div:", [c.text for c in children])  # ['容器内的段落']
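
The shortcut methods accept the same filters as `find()`, which the cases above do not show. Here is a small sketch under that assumption. The markup is invented for this example and parsed with `html.parser`.

```python
from bs4 import BeautifulSoup

# Hypothetical document, used only for this sketch
html = (
    '<div id="outer"><section>'
    '<p id="start">first</p><span>skip</span><p>second</p>'
    '</section></div>'
    '<a href="/next">after</a>'
)
soup = BeautifulSoup(html, "html.parser")
start = soup.find("p", id="start")

# Nearest ancestor that is a <div> (the direct parent is <section>)
outer_id = start.find_parent("div")["id"]
print(outer_id)  # outer

# Next sibling that is a <p>, skipping the <span> in between
second = start.find_next_sibling("p").get_text()
print(second)  # second

# First <a> later in the document, regardless of nesting level
next_href = start.find_next("a")["href"]
print(next_href)  # /next

# Plural forms return a list, nearest ancestor first
ancestors = [t.name for t in start.find_parents(["section", "div"])]
print(ancestors)  # ['section', 'div']
```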

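4. CSS selectors: select() / select_one()

CSS selector syntax is another way to find nodes. It will feel familiar to front-end developers and is often more compact than chained find calls.

Syntax

# Return all matching nodes (a list)
soup.select(selector, namespaces=None, limit=None, **kwargs)
# Return the first matching node, or None
soup.select_one(selector, namespaces=None, **kwargs)

Common CSS selector rules

Selector type	Example	Meaning
Tag selector	`p`	All p tags
Class selector	`.content`	All tags with class=content
ID selector	`#container`	The tag with id=container
Descendant selector	`div p`	All p tags anywhere inside a div
Child selector	`div > p`	p tags that are direct children of a div
Attribute selector	`a[href]`	a tags that have an href attribute
Attribute value match	`a[href*="example"]`	a tags whose href contains "example"
Selector list	`p.title, p.content`	p tags with class=title or class=content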
Cases

# Case 1: ID selector (find #container)
container = soup.select_one('#container')
print("Text of the node with id=container:", container.text.strip())  # 容器内的段落

# Case 2: class selector (find every .content node)
content_nodes = soup.select('.content')
print("Text of nodes with class=content:", [n.text for n in content_nodes])  # ['测试内容1', '测试内容2']

# Case 3: descendant selector (p tags inside a div)
div_p = soup.select('div p')
print("Text of p tags inside div:", [p.text for p in div_p])  # ['容器内的段落']

# Case 4: attribute selector (a tags whose href starts with https)
https_a = soup.select('a[href^="https"]')
print("a tags whose href starts with https:", [a['href'] for a in https_a])  # ['https://example1.com', 'https://example2.com']

# Case 5: return only the first a tag
limit_a = soup.select('a', limit=1)
print("Text of the first a tag:", limit_a[0].text)  # 链接1
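III. Traversing the Tree

BS4 parses HTML into a tree, and the attributes below let you walk from a node to its children, parents, and siblings.

1. Children

Method/attribute	Type	Description
`contents`	list	All direct children (including text nodes and newlines)
`children`	iterator	All direct children, yielded one at a time (saves memory)
`descendants`	iterator	All descendants, visited recursively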

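Cases

# Locate the div#container node
container = soup.find('div', id='container')

# Case 1: contents (direct children, including newlines)
print("contents:", container.contents)  # ['\n', <p>容器内的段落</p>, '\n']

# Case 2: children (iterate over direct children)
print("Iterating children:")
for child in container.children:
    if child.name:  # skip newline strings (their name is None)
        print(child.text)  # 容器内的段落

# Case 3: descendants (every node below the div)
print("Iterating descendants:")
for desc in container.descendants:
    # Keep non-blank text nodes only; tags are skipped so their text is not printed twice
    if desc.name is None and desc.strip():
        print(desc.strip())  # 容器内的段落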

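2. Parents

Method/attribute	Type	Description
`parent`	node	The direct parent
`parents`	iterator	All ancestors, up to and including the BeautifulSoup object itself (whose name is `[document]`)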

Cases

# Locate the first a tag
a_node = soup.find('a')

# Case 1: parent (the direct parent, body)
print("Direct parent:", a_node.parent.name)  # body

# Case 2: parents (all ancestors)
print("All ancestors:")
for parent in a_node.parents:
    if parent.name:
        print(parent.name)  # body → html → [document]

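3. Siblings

Method/attribute	Type	Description
`next_sibling`	node / None	The next sibling (often a newline string)
`previous_sibling`	node / None	The previous sibling (often a newline string)
`next_siblings`	iterator	All following siblings
`previous_siblings`	iterator	All preceding siblings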

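Cases

# Locate the first p tag with class "content"
content_p1 = soup.find('p', class_='content')

# Case 1: next_sibling (here the next sibling is a newline string)
print("Next sibling:", repr(content_p1.next_sibling))  # '\n'

# Case 2: next_siblings (all following siblings, skipping newline strings)
print("Following siblings:")
for sib in content_p1.next_siblings:
    if sib.name:
        print(sib.get_text(strip=True))  # 测试内容2 → 链接1 → 链接2 → 容器内的段落

# Case 3: previous_sibling (here the previous sibling is a newline string)
print("Previous sibling:", repr(content_p1.previous_sibling))  # '\n'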
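
Since `.next_sibling` often lands on a whitespace string, one common way around it is to use `find_next_sibling()`, which only considers tags. A minimal sketch on a made-up list:

```python
from bs4 import BeautifulSoup

# Hypothetical list with newlines between the items
html = "<ul>\n<li>a</li>\n<li>b</li>\n</ul>"
soup = BeautifulSoup(html, "html.parser")
first = soup.find("li")

# The raw sibling is the newline between the two <li> tags
raw_sibling = first.next_sibling
print(repr(raw_sibling))  # '\n'

# find_next_sibling() skips strings and returns the next matching tag
next_item = first.find_next_sibling("li")
print(next_item.get_text())  # b

# The last item has no following <li>, so the result is None
last_next = next_item.find_next_sibling("li")
print(last_next)  # None
```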

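IV. Node Content and Attributes

1. Reading and changing text

Method/attribute	Description
`string`	The tag's single string. If the only child is another tag, `.string` follows it down; if the tag has several children, it is None
`text` / `get_text()`	All text under the tag, including text in child tags (recommended)
`stripped_strings`	Iterator over all strings with surrounding whitespace stripped; whitespace-only strings are skipped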

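Cases

# Locate the p tag with class "title" (it contains a b child tag)
title_p = soup.find('p', class_='title')

# Case 1: string (follows a single child tag; None when there are several children)
print("string:", title_p.string)  # 测试标题 (b is the only child, so .string passes through to it)
print("string of the b tag:", title_p.b.string)  # 测试标题
print("string of the div:", soup.find('div', id='container').string)  # None (the div has several children)

# Case 2: text / get_text() (all text)
print("text:", title_p.text)  # 测试标题
print("get_text():", title_p.get_text())  # 测试标题

# Case 3: stripped_strings (iterate with whitespace stripped)
print("stripped_strings:")
for s in title_p.stripped_strings:
    print(s)  # 测试标题

# Case 4: change the text
title_p.b.string = "新测试标题"
print("Text after the change:", title_p.text)  # 新测试标题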

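2. Reading and changing attributes

Method/attribute	Description
`attrs`	The attribute dict (e.g. `{'class': ['title']}`)
`tag['attr']`	The named attribute (raises KeyError if it is missing)
`tag.get('attr')`	The named attribute (returns None if it is missing; recommended)
`has_attr('attr')`	Whether the tag has the named attribute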

Cases

# Locate the first a tag
a_node = soup.find('a')

# Case 1: read attributes (three ways)
print("attrs dict:", a_node.attrs)  # {'href': 'https://example1.com', 'class': ['link']}
print("Direct lookup:", a_node['href'])  # https://example1.com
print("get():", a_node.get('class'))  # ['link']
print("get() with a default:", a_node.get('target', '_blank'))  # _blank

# Case 2: check whether an attribute exists
print("Has href:", a_node.has_attr('href'))  # True

# Case 3: change attributes
a_node['href'] = 'https://new-example.com'
a_node['class'] = 'new-link'
print("Attributes after the change:", a_node.attrs)  # {'href': 'https://new-example.com', 'class': 'new-link'}

# Case 4: delete an attribute
del a_node['class']
print("Attributes after the delete:", a_node.attrs)  # {'href': 'https://new-example.com'}
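
The `['link']` output above hints at a detail worth spelling out: when parsing HTML, BS4 treats `class` as a multi-valued attribute and returns a list, while most other attributes come back as plain strings. A small sketch on a made-up tag:

```python
from bs4 import BeautifulSoup

# Hypothetical tag with two class names
html = '<a class="btn primary" id="go" href="/go">Go</a>'
soup = BeautifulSoup(html, "html.parser")
a = soup.a

# class is multi-valued, so it is parsed into a list
classes = list(a["class"])
print(classes)  # ['btn', 'primary']

# id is single-valued and stays a string
tag_id = a["id"]
print(tag_id)  # go

# get() avoids KeyError for missing attributes
target = a.get("target")
print(target)  # None

# Appending to the list adds a class; the list is joined with spaces on output
a["class"].append("large")
rendered = str(a)
print('class="btn primary large"' in rendered)  # True
```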

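V. Modifying the Tree

BS4 can add, remove, and replace nodes in place, which is useful when you need to rewrite the HTML structure.

Method	Syntax	Description
`append()`	`tag.append(new_tag/text)`	Add text or a child tag at the end of the tag
`extend()`	`tag.extend([tag1, tag2])`	Add several child nodes at the end of the tag
`insert()`	`tag.insert(index, new_tag)`	Insert a child at the given position in `tag.contents`
`replace_with()`	`old_tag.replace_with(new_tag)`	Replace a node
`unwrap()`	`tag.unwrap()`	Remove the tag but keep its content
`clear()`	`tag.clear()`	Remove everything inside the tag
`extract()`	`tag.extract()`	Remove the node from the tree and return it
`decompose()`	`tag.decompose()`	Remove the node and destroy it (returns nothing)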

Cases

from bs4 import NavigableString

# Case 1: append text and a tag
container = soup.find('div', id='container')
container.append(NavigableString('新增文本'))  # append text (a plain str works too)
new_p = soup.new_tag('p')  # create a new tag
new_p.string = '新增段落'
container.append(new_p)  # append the tag
print("After append:", container.get_text(strip=True))  # 容器内的段落新增文本新增段落

# Case 2: insert a tag at index 1 (index 0 is the leading '\n', so it lands before the original p)
# new_tag() does not translate class_ to class, so pass class through attrs
insert_p = soup.new_tag('p', attrs={'class': 'inserted'})
insert_p.string = '插入的段落'
container.insert(1, insert_p)
print("After insert:", container.get_text(strip=True))  # 插入的段落容器内的段落新增文本新增段落

# Case 3: replace_with swaps one node for another
old_p = soup.find('p', class_='title')
new_title = soup.new_tag('h1')
new_title.string = '新标题'
old_p.replace_with(new_title)
print("Title after replace:", soup.find('h1').text)  # 新标题

# Case 4: unwrap removes a tag but keeps its text
h1 = soup.find('h1')
h1.clear()  # rebuild h1 as <h1><b>新标题</b></h1> so there is a b tag to unwrap
bold = soup.new_tag('b')
bold.string = '新标题'
h1.append(bold)
h1.b.unwrap()
print("After unwrap:", h1)  # <h1>新标题</h1>

# Case 5: extract removes a node and returns it
a2 = soup.find_all('a')[1]
extracted_a = a2.extract()
print("Remaining a tags:", [a.text for a in soup.find_all('a')])  # ['链接1']
print("Extracted a tag:", extracted_a.text)  # 链接2

# Case 6: clear empties a tag
content_p = soup.find('p', class_='content')
content_p.clear()
print("After clear:", repr(content_p.text))  # ''
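
The table lists `decompose()` next to `extract()`, but the cases only show `extract()`. The sketch below contrasts the two while cleaning up a made-up fragment: `decompose()` for nodes you never want to see again, `extract()` when you still need the removed node.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment containing a script and an ad paragraph
html = (
    '<div><script>track()</script>'
    '<p>Keep <span>me</span></p>'
    '<p class="ad">Ad</p></div>'
)
soup = BeautifulSoup(html, "html.parser")

# decompose(): remove the node and discard it
soup.find("script").decompose()
script_left = soup.find("script")
print(script_left)  # None

# extract(): remove the node but keep a reference to it
ad = soup.find("p", class_="ad").extract()
print(ad.get_text())  # Ad
print(ad.parent)  # None, the node is detached from the tree

# unwrap(): drop the <span> wrapper but keep its text in place
soup.find("span").unwrap()

result = str(soup)
print(result)  # <div><p>Keep me</p></div>
```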

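VI. Other Useful Helpers

Method	Syntax	Description
`prettify()`	`soup.prettify()`	Render the HTML with indentation for easier reading
`encode()`	`tag.encode('utf-8')`	Render the node as bytes in the given encoding
`decode()`	`tag.decode()`	Render the node as a str (the same text as `str(tag)`); to turn bytes back into a str, use the bytes object's own `.decode('utf-8')`
`new_tag()`	`soup.new_tag('a')`	Create a new tag (attributes may be passed as keywords: `new_tag('a', href='xxx')`; pass `class` through `attrs={'class': ...}`)
`original_encoding`	`soup.original_encoding`	The encoding detected for the input document (None when the input was already a str)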

Cases

# Case 1: prettify
print("Pretty-printed output:")
print(soup.prettify())

# Case 2: encode / decode
a_node = soup.find('a')
byte_data = a_node.encode('utf-8')
print("Bytes:", byte_data)  # b'<a href="https://new-example.com">\xe9\x93\xbe\xe6\x8e\xa51</a>'
str_data = byte_data.decode('utf-8')  # bytes.decode turns the bytes back into a str
print("String:", str_data)  # <a href="https://new-example.com">链接1</a>

# Case 3: create a new tag
# class is passed through attrs; a class_ keyword would not set the class attribute here
new_a = soup.new_tag('a', href='https://test.com', attrs={'class': 'test-link'})
new_a.string = '测试链接'
soup.body.append(new_a)
print("New a tag:", soup.find('a', class_='test-link').text)  # 测试链接