Python + requests: Parsing HTML with beautifulsoup4

SitVen

Article link: https://blog.youkuaiyun.com/weixin_43507959/article/details/108561607

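This article introduces the basics of parsing HTML with the Python library Beautiful Soup 4, including parser selection, installation, object types, and tree-searching methods such as `find_all` and `find`, with worked examples.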

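Beautiful Soup is a Python library for extracting data from HTML or XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document.

HTML parsers

The table below lists the main HTML parsers along with their advantages and disadvantages.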

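Parser	Typical usage	Advantages	Disadvantages
Python's html.parser	BeautifulSoup(markup, "html.parser")	1. Included in the Python standard library 2. Decent speed 3. Tolerant of malformed markup	1. Less tolerant of malformed markup in Python versions before 2.7.3 / 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	1. Very fast 2. Tolerant of malformed markup	1. Requires an external C library
lxml XML parser	BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")	1. Very fast 2. The only supported XML parser	1. Requires an external C library
html5lib	BeautifulSoup(markup, "html5lib")	1. Best tolerance of malformed markup 2. Parses pages the same way a web browser does 3. Produces valid HTML5	1. Very slow 2. Requires an external Python package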

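Execution speed is not a major concern here, so we mainly use the first option, html.parser. It ships with the Python standard library and works out of the box; the other parsers have to be installed separately.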
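As a minimal sketch of that choice, the helper below tries a preferred parser and falls back to html.parser when the preferred one is not installed. The name `make_soup` and the fallback policy are assumptions made for illustration, not part of Beautiful Soup; `FeatureNotFound` is the exception the library raises when a requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser.

    make_soup is a hypothetical helper, not a Beautiful Soup API.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; use the standard-library one.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello, <b>parser</b></p>")
print(soup.b.string)
```

Note that different parsers can build different trees from malformed markup, so a fallback like this is only reasonable when the input is well-formed or the differences do not matter.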

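Beautiful Soup 4

Installation

# Install with pip
pip install beautifulsoup4

Usage

Pass a document (an HTML string or an open file handle) to the BeautifulSoup constructor to turn it into a tree structure. The html.parser parser converts each node of the document into a Python object.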

from bs4 import BeautifulSoup
soup = BeautifulSoup(open("index.html", 'rb').read(), 'html.parser')   # from the contents of an HTML file
soup = BeautifulSoup("<html>data</html>", 'html.parser')               # from an HTML string

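Beautiful Soup converts a complex HTML document into a tree in which every node is a Python object. All of these objects fall into four kinds:

1. Tag: a tag object, corresponding to a tag in the original HTML document. For example:

soup = BeautifulSoup('<b id="sitven" class="title">sitven的博客</b>', 'html.parser')
tag = soup.b        # None if there is no <b> tag; the first one if there are several
print(type(tag))
# Output: <class 'bs4.element.Tag'>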

Name: every tag has a name

tag = soup.b
print(tag.name)
# Output:
# b

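A tag can have any number of attributes. For example, <b class="boldest"> has a "class" attribute whose value is "boldest". Tag attributes are read the same way as dictionary entries:

print(tag["id"])
# Output:
# sitven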

The attributes are also available directly as a dictionary through .attrs:

print(tag.attrs)
# Output:
# {'id': 'sitven', 'class': ['title']}
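The list around 'title' in the output above deserves a short illustration: Beautiful Soup treats class as a multi-valued attribute, so its value is returned as a list, while a single-valued attribute such as id comes back as a plain string. A minimal sketch with made-up markup:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration: one tag with two class names.
soup = BeautifulSoup('<p id="intro" class="title bold">text</p>', "html.parser")
tag = soup.p

print(tag["class"])            # ['title', 'bold']
print(tag["id"])               # intro
print(" ".join(tag["class"]))  # title bold
```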

Tag attributes can be added, removed, or modified, again just like dictionary entries:

# Add
tag['value'] = '张大款'
print(tag)
# Output:   <b class="title" id="sitven" value="张大款">sitven的博客</b>

# Modify
tag['class'] = 'zdk'
print(tag)
# Output:   <b class="zdk" id="sitven" value="张大款">sitven的博客</b>

# Delete
del tag['class']
print(tag)
# Output:   <b id="sitven" value="张大款">sitven的博客</b>

# Missing key
tag['class']
# Raises:   KeyError: 'class'

print(tag.get('class'))
# Output:   None
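Because indexing a missing attribute raises KeyError, it is often safer to test for the attribute first or to supply a default. The sketch below uses `has_attr()` and the optional default argument of `get()`; the markup and the "_self" default are made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a id="link1" href="/docs">docs</a>', "html.parser")
tag = soup.a

print(tag.has_attr("href"))        # True
print(tag.has_attr("target"))      # False
print(tag.get("target", "_self"))  # _self (the default, since target is absent)
```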

2. NavigableString: a string object

Strings usually sit inside tags. Beautiful Soup wraps the text within a tag in the NavigableString class.

b = '<b id="sitven" class="title" >sitven的博客</b>'
soup = BeautifulSoup(b, 'html.parser')
tag = soup.b
print(tag.string)
# Output:  sitven的博客
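One detail worth a quick sketch: .string only returns a value when the tag has a single child. When a tag mixes text and nested tags, .string is None and get_text() can be used to collect all the text instead. The markup below is made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello, <b>world</b>!</p>", "html.parser")

print(soup.b.string)      # world
print(soup.p.string)      # None, because <p> has several children
print(soup.p.get_text())  # Hello, world!

# Convert to a plain str when the text should outlive the parse tree.
text = str(soup.b.string)
print(type(text))         # <class 'str'>
```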

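3. BeautifulSoup

The BeautifulSoup object represents the document as a whole. Most of the time it can be treated as a Tag object, and it supports most of the methods used for navigating and searching the tree.

4. Comment: a comment object. Tag, NavigableString, and BeautifulSoup cover almost everything in an HTML or XML document, but a few special objects remain, such as comments.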

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string
print(type(comment))

# Output:   <class 'bs4.element.Comment'>

When it appears as part of an HTML document, however, a Comment is rendered with special formatting:

print(soup.b.prettify())
# <b>
#  <!--Hey, buddy. Want to buy a used parser?-->
# </b>
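Since Comment is a subclass of NavigableString, comments can be located by type. The sketch below finds every comment in a made-up snippet with a function filter and then removes them; this is one possible approach, not the only one.

```python
from bs4 import BeautifulSoup, Comment

markup = "<div><!--hidden note--><p>visible text</p></div>"
soup = BeautifulSoup(markup, "html.parser")

# A function passed as string= is called with each string in the document.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
print(comments)   # ['hidden note']

# Detach each comment from the tree.
for comment in comments:
    comment.extract()
print(soup)       # <div><p>visible text</p></div>
```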

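Searching the tree

Beautiful Soup defines many search methods. The two most commonly used are find() and find_all().

The examples below use the "Alice" document, saved as html_doc.html: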

<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>

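find_all() searches the descendant tags of the current tag, tests each one against the filter conditions, and returns all matching tags as a list.

Signature: find_all(name, attrs, recursive, string, limit, **kwargs)

find() searches the descendant tags of the current tag in the same way, but returns only the first matching tag, or None if nothing matches.

Signature: find(name, attrs, recursive, string, **kwargs)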
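The difference in return values matters most when nothing matches. A minimal sketch, using a shortened copy of the example document inlined as a string:

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first = soup.find("a")          # a single Tag
all_links = soup.find_all("a")  # a list of Tags
print(first["id"])              # link1
print(len(all_links))           # 2

# No <table> in the document: find() gives None, find_all() an empty list.
print(soup.find("table"))       # None
print(soup.find_all("table"))   # []

# Guard against None before using the result of find().
table = soup.find("table")
if table is not None:
    print(table.name)
```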

Parameter name: finds all tags whose name matches name (for example, soup.find_all(name="title") finds every tag named title). It accepts a string, a regular expression, a list, True, and other filter types.

from bs4 import BeautifulSoup
soup = BeautifulSoup(open("html_doc.html", 'rb').read(), 'html.parser')
print(soup.find_all("title"))
# Output: [<title>The Dormouse's story</title>]
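To illustrate the other filter types accepted by name, here is a small sketch with a list, a regular expression, and True. The markup is a made-up one-liner so the results are easy to check by eye.

```python
import re
from bs4 import BeautifulSoup

html_doc = ("<html><head><title>T</title></head>"
            "<body><p><b>bold</b> <a href='#'>link</a></p></body></html>")
soup = BeautifulSoup(html_doc, "html.parser")

# A list matches any of the given names.
print([t.name for t in soup.find_all(["a", "b"])])        # ['b', 'a']

# A regular expression is searched against each tag name.
print([t.name for t in soup.find_all(re.compile("^b"))])  # ['body', 'b']

# True matches every tag, but no text strings.
print([t.name for t in soup.find_all(True)])
# ['html', 'head', 'title', 'body', 'p', 'b', 'a']
```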

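Parameter attrs: searches for tags by the attributes defined in the attrs dictionary (for example, if the dictionary contains an "id" key, Beautiful Soup matches it against the id attribute of each tag).

from bs4 import BeautifulSoup
soup = BeautifulSoup(open("html_doc.html", 'rb').read(), 'html.parser')
# Find tags with id="link2" and class="sister"
print(soup.find_all(attrs={"id": "link2", "class": "sister"}))
# Output: [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]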
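Attribute filters can also be passed as keyword arguments instead of an attrs dictionary. Because class is a reserved word in Python, the keyword form is spelled class_. A minimal sketch, using two links from the example document inlined as a string:

```python
from bs4 import BeautifulSoup

html_doc = """
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")

by_dict = soup.find_all(attrs={"id": "link2"})  # dictionary form
by_kwarg = soup.find_all(id="link2")            # keyword form, same result
by_class = soup.find_all("a", class_="sister")  # class needs the trailing underscore

print(by_dict == by_kwarg)  # True
print(len(by_class))        # 2
```

The attrs dictionary remains useful for attribute names that are not valid Python identifiers, such as data-* attributes.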

Parameter string: searches the text content of the document. Like name, it accepts a string, a regular expression, a list, True, and other filter types.

Example: combined with name, find the <a> tags whose text is exactly "Lacie"

from bs4 import BeautifulSoup
soup = BeautifulSoup(open("html_doc.html", 'rb').read(), 'html.parser')
a = soup.find_all(name="a", string="Lacie")
print(a)
# Output: [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Example: matching with a regular expression

import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("html_doc.html", 'rb').read(), 'html.parser')
print(soup.find_all(string=re.compile("Dormouse")))
# Output:  ["The Dormouse's story", "The Dormouse's story"]
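The two string examples above match differently: a plain string must equal the whole text, while a regular expression can match part of it. A minimal sketch on a shortened, inlined version of the example links, which also shows a list filter:

```python
import re
from bs4 import BeautifulSoup

html_doc = ('<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> '
            'and <a id="link3">Tillie</a></p>')
soup = BeautifulSoup(html_doc, "html.parser")

# A plain string must match the entire text, so a fragment finds nothing.
print(soup.find_all(string="Lac"))                # []

# A regular expression matches anywhere inside the text.
print(soup.find_all(string=re.compile("Lac")))    # ['Lacie']

# A list matches any of its entries.
print(soup.find_all(string=["Elsie", "Tillie"]))  # ['Elsie', 'Tillie']
```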

Parameter limit: works like the LIMIT keyword in SQL. Once the number of results reaches limit, the search stops and the results are returned.

Example: three tags in the document match the search, but only two are returned because the result count is limited

from bs4 import BeautifulSoup
soup = BeautifulSoup(open("html_doc.html", 'rb').read(), 'html.parser')
print(soup.find_all("a", limit=2))
# Output: [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
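A related point that a short sketch makes concrete: find() behaves like find_all() with limit=1, except that it returns the tag itself rather than a one-element list. The markup below is a stripped-down, inlined version of the example links.

```python
from bs4 import BeautifulSoup

html_doc = ('<p><a id="link1">Elsie</a><a id="link2">Lacie</a>'
            '<a id="link3">Tillie</a></p>')
soup = BeautifulSoup(html_doc, "html.parser")

limited = soup.find_all("a", limit=1)
print(limited)                       # [<a id="link1">Elsie</a>]
print(soup.find("a") == limited[0])  # True
```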

Parameter recursive: when find_all() is called on a tag, Beautiful Soup examines all of that tag's descendants. To search only the tag's direct children, pass recursive=False.
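A minimal sketch of the difference, using only the head of the example document inlined as a string: the <title> tag is a grandchild of <html>, so it is found by the default recursive search but not when the search is restricted to direct children.

```python
from bs4 import BeautifulSoup

html_doc = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(html_doc, "html.parser")

# Default: all descendants of <html> are searched.
print(soup.html.find_all("title"))
# [<title>The Dormouse's story</title>]

# recursive=False: only direct children of <html> are considered.
print(soup.html.find_all("title", recursive=False))      # []
print(len(soup.html.find_all("head", recursive=False)))  # 1
```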