Parsing HTML and XML with BeautifulSoup

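This article shows how to use the BeautifulSoup library to parse HTML text: downloading and installing it, basic usage, inspecting and accessing tags, navigating parent, child, and sibling relationships, searching for tags by condition, and modifying tags.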

Parsing HTML text with BeautifulSoup

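I. Download and installation

Download: http://www.crummy.com/software/BeautifulSoup/

After downloading, unpack the archive, cd into the extracted directory, and run: python setup.py install

If pip is available, pip install beautifulsoup4 installs the same package.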

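Test:

#!/usr/bin/env python
# coding=utf-8
# Python 2.7.3
from bs4 import BeautifulSoup
# Name the parser explicitly; otherwise bs4 picks one itself and emits a warning
soup2 = BeautifulSoup("<html>HTML Data.</html>", "html.parser")
print(soup2.html.text)  # the text inside the tag
print(soup2.html)       # the whole tag

Output:

HTML Data.
<html>HTML Data.</html>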
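A quick way to confirm which copy of the package Python actually imports is to print its version before parsing anything. The sketch below is only an illustration; the variable names are chosen here and are not from the original article.

```python
import bs4
from bs4 import BeautifulSoup

# Report which version of the package was imported.
version = bs4.__version__
print("bs4 version: " + version)

# "html.parser" ships with Python, so no extra parser needs to be installed.
soup = BeautifulSoup("<html>HTML Data.</html>", "html.parser")
text = soup.html.text
print(text)
```

If the import fails with ImportError, the package was most likely installed for a different Python interpreter than the one running the script.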

II. Basic usage

An HTML document is made up of tags, tag attributes, and tag content, all arranged in a tree. BeautifulSoup's object model mirrors this tree structure.
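To make the tree structure concrete, the following sketch walks a small document and lists each tag indented by its depth. The helper name outline and the sample markup are assumptions made for this illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = ('<html><head><title>Page title</title></head>'
        '<body><p id="firstpara">One</p></body></html>')
soup = BeautifulSoup(html, "html.parser")

def outline(node, depth=0):
    """Return one indented line per tag, in document order."""
    lines = []
    for child in node.children:
        if isinstance(child, Tag):  # skip text nodes, keep tags only
            lines.append("  " * depth + child.name)
            lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(soup)
print("\n".join(tree_lines))
```

Each level of indentation in the printed outline corresponds to one parent/child step in the tree.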

2.1 Analyzing HTML and accessing its elements

For example:

#!/usr/bin/env python
# coding=utf-8
# Python 2.7.3
from bs4 import BeautifulSoup
soup2 = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup2.b
print(tag)           # the whole tag
print(tag.name)      # the tag name
print(tag.text)      # the tag content
print(tag['class'])  # an attribute (attributes are accessed like a dict; class is multi-valued, so a list comes back)

Output:

<b class="boldest">Extremely bold</b>
b
Extremely bold
[u'boldest']
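Because a tag's attributes behave like a dictionary, the usual dictionary idioms apply. The sketch below uses a made-up paragraph tag with an extra class attribute to show the common access patterns.

```python
from bs4 import BeautifulSoup

markup = '<p id="firstpara" class="intro lead" align="center">Hi</p>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.p

attrs = tag.attrs                  # all attributes as a dict
tag_id = tag["id"]                 # single-valued attribute: a string
classes = tag["class"]             # class is multi-valued: a list of strings
missing = tag.get("title")         # .get() returns None instead of raising KeyError
has_align = tag.has_attr("align")  # membership test

print(sorted(attrs))
print(tag_id)
print(classes)
```

Using tag.get() is the safer choice when an attribute may be absent, since tag["title"] would raise KeyError here.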

2.2 Getting a tag's parent, children, and siblings

The HTML to analyze:

<html>
    <head>
        <title>Page title</title>
    </head>
    <body>
        <p id="firstpara" align="center">
        This is paragraph<b>one</b>.
        </p>
        <p id="secondpara" align="blah">
        This is paragraph<b>two</b>.
        </p>
    </body>
</html>

#!/usr/bin/env python
# coding=utf-8
# Python 2.7.3
from bs4 import BeautifulSoup
soup2 = BeautifulSoup('<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is paragraph<b>one</b>.</p><p id="secondpara" align="blah">This is paragraph<b>two</b>.</p></body></html>', 'html.parser')
htmlTag = soup2.html

# html has two child tags, head and body; head and body are siblings
# contents is the list of a tag's children
print(htmlTag.contents[0])  # head. contents also holds text nodes (e.g. whitespace between tags), so an index may not land on the tag you expect; find is safer
print(htmlTag.head)         # head
print(htmlTag.contents[1])  # body
print(htmlTag.body)         # body

headTag = htmlTag.head
bodyTag = htmlTag.body
print(headTag.parent)                     # head's parent is html
print([p.name for p in headTag.parents])  # head's ancestor chain (parents is a generator, so collect it before printing)
print([p.name for p in bodyTag.parents])  # body's ancestor chain
print(bodyTag.parent)                     # body's parent is html

print(headTag.next_sibling)      # head's next sibling is body
print(headTag.previous_sibling)  # head has no previous sibling (None)
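The caveat about contents indices matters as soon as the HTML is indented, because the whitespace between tags becomes text nodes. The sketch below parses an indented variant of the sample (shortened for this illustration) and shows two ways to get only the child tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

pretty_html = """<html>
    <head><title>Page title</title></head>
    <body><p id="firstpara">One</p></body>
</html>"""
soup = BeautifulSoup(pretty_html, "html.parser")
html_tag = soup.html

# Index 0 is the whitespace before <head>, not <head> itself.
first = html_tag.contents[0]
raw_next = html_tag.head.next_sibling  # also a whitespace string

# Two ways to keep only the child tags.
child_tags = [c.name for c in html_tag.contents if isinstance(c, Tag)]
direct = [t.name for t in html_tag.find_all(True, recursive=False)]

# Naming the tag skips over the whitespace between siblings.
next_tag = html_tag.head.find_next_sibling("body")

print(repr(first))
print(child_tags)
print(next_tag.name)
```

With the single-line HTML used in the article there are no whitespace nodes, which is why contents[0] and contents[1] happen to be head and body there.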

2.3 Finding tags

BeautifulSoup provides the find and find_all methods for searching. find returns only the first matching tag, while find_all returns a list of every match.

#!/usr/bin/env python
# coding=utf-8
# Python 2.7.3
from bs4 import BeautifulSoup
soup2 = BeautifulSoup('<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is paragraph<b>one</b>.</p><p id="secondpara" align="blah">This is paragraph<b>two</b>.</p></body></html>', 'html.parser')
htmlTag = soup2.html

# Find p tags
print(soup2.find('p'))      # returns the first match
print(soup2.p)              # same as above

print(soup2.find_all('p'))  # both p tags
print(soup2('p'))           # same as above (calling the object is shorthand for find_all)

headTag = htmlTag.head

print(headTag.find_parents())  # like headTag.parents, but returned as a list
print(headTag.find_parent())   # same as headTag.parent
# The next four functions search among siblings under the same parent
print(headTag.find_next_siblings())      # cf. headTag.next_siblings
print(headTag.find_next_sibling())       # cf. headTag.next_sibling
print(headTag.find_previous_siblings())  # cf. headTag.previous_siblings
print(headTag.find_previous_sibling())   # cf. headTag.previous_sibling
# The next four functions search the whole document in parse order, regardless of parent/child relationships
print(headTag.find_all_next())      # cf. headTag.next_elements (this call returns tags only)
print(headTag.find_next())          # cf. headTag.next_element
print(headTag.find_all_previous())  # cf. headTag.previous_elements
print(headTag.find_previous())      # cf. headTag.previous_element
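One practical difference between find and find_all is what they return when nothing matches: find gives None, while find_all gives an empty list. A minimal sketch, using a trimmed copy of the sample HTML and a table tag that is deliberately absent:

```python
from bs4 import BeautifulSoup

html = ('<html><body>'
        '<p id="firstpara">This is paragraph<b>one</b>.</p>'
        '<p id="secondpara">This is paragraph<b>two</b>.</p>'
        '</body></html>')
soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")            # first match only
all_p = soup.find_all("p")          # every match
no_table = soup.find("table")       # no match: None
no_tables = soup.find_all("table")  # no match: empty list

# Looping over find_all is safe even when it is empty.
bold_texts = [b.get_text() for b in soup.find_all("b")]

# A find result should be checked before use.
table_text = no_table.get_text() if no_table is not None else ""

print(first_p["id"])
print(len(all_p))
print(bold_texts)
```

Forgetting the None check is a common source of AttributeError when a page does not contain the expected tag.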

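2.4 Conditional tag searches

The signature of find is find(name, attrs, recursive, text, **kwargs). The other search functions take similar arguments; see the documentation. Newer bs4 releases name the text argument string and keep text for backward compatibility.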
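The recursive and text arguments are easiest to understand by example. The sketch below assumes a bs4 release recent enough to accept the string spelling of the text argument; with an older release, text= would be used instead.

```python
import re
from bs4 import BeautifulSoup

html = ('<html><head><title>Page title</title></head><body>'
        '<p id="firstpara">This is paragraph<b>one</b>.</p>'
        '<p id="secondpara">This is paragraph<b>two</b>.</p>'
        '</body></html>')
soup = BeautifulSoup(html, "html.parser")

# recursive=False looks only at direct children; <b> is nested inside <p>.
direct_b = soup.body.find_all("b", recursive=False)
nested_b = soup.body.find_all("b")

# string (formerly text) matches on text content.
title_tag = soup.find("title", string="Page title")
# With no tag name, the matching strings themselves are returned.
strings = [str(s) for s in soup.find_all(string=re.compile("paragraph"))]

print(len(direct_b))
print(len(nested_b))
print(title_tag)
print(strings)
```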

1. Searching by tag name

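find(tagname)         # match the tag named tagname, e.g. find('head')
find(list)            # match any tag named in the list, e.g. find(['head', 'body'])
find(dict)            # match any tag named by a key of the dict, e.g. find({'head': True, 'body': True}); a BeautifulSoup 3 idiom, so the list form is the safer choice with bs4
find(re.compile(''))  # match tag names against a regex, e.g. find(re.compile('^p')) for names starting with p
find(lambda)          # match tags for which the function returns True, e.g. find(lambda tag: len(tag.name) == 1) for one-letter tag names
find(True)            # match every tag

(find_all works the same way)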
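The name filters listed above can be tried side by side. The following is a runnable sketch against a trimmed copy of the sample HTML; the helper names() exists only for this illustration, and the dict form is left out because the list form covers the same case.

```python
import re
from bs4 import BeautifulSoup

html = ('<html><head><title>Page title</title></head><body>'
        '<p id="firstpara">This is paragraph<b>one</b>.</p>'
        '<p id="secondpara">This is paragraph<b>two</b>.</p>'
        '</body></html>')
soup = BeautifulSoup(html, "html.parser")

def names(tags):
    """Reduce a result list to tag names for compact printing."""
    return [t.name for t in tags]

by_name = names(soup.find_all("head"))                           # exact name
by_list = names(soup.find_all(["head", "body"]))                 # any name in the list
by_regex = names(soup.find_all(re.compile("^p")))                # names starting with p
by_func = names(soup.find_all(lambda tag: len(tag.name) == 1))   # one-letter names
everything = names(soup.find_all(True))                          # every tag

for result in (by_name, by_list, by_regex, by_func, everything):
    print(result)
```

Note that a function passed as the name filter receives the whole tag object, not just its name.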

2. Searching by attributes (attrs)

find(id='xxx')                                         # the tag whose id attribute is xxx
find(attrs={'id': re.compile('xxx'), 'align': 'xxx'})  # id matches the regex and align is xxx
find(attrs={'id': True, 'align': None})                # has an id attribute but no align attribute
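The attribute filters can be exercised the same way. In the sketch below, the third paragraph (without an id) is an addition made for this illustration. For the "attribute is absent" case it uses a function filter, an assumption-light alternative to passing None, whose handling may vary between bs4 releases.

```python
import re
from bs4 import BeautifulSoup

html = ('<html><body>'
        '<p id="firstpara" align="center">one</p>'
        '<p id="secondpara" align="blah">two</p>'
        '<p align="left">three</p>'   # hypothetical extra paragraph with no id
        '</body></html>')
soup = BeautifulSoup(html, "html.parser")

# Keyword argument: exact attribute value.
by_id = soup.find(id="firstpara")

# attrs dict: regex on one attribute, exact value on another.
by_regex = soup.find_all(attrs={"id": re.compile("para$"), "align": "blah"})

# True: the attribute must be present, with any value.
has_id = soup.find_all("p", id=True)

# Function filter: paragraphs that lack an id attribute.
no_id = soup.find_all(lambda tag: tag.name == "p" and not tag.has_attr("id"))

print(by_id.get_text())
print([t["id"] for t in by_regex])
print(len(has_id))
print([t.get_text() for t in no_id])
```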