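1. Introduction to the bs4 library
What bs4 is
- Beautiful Soup is a library for extracting information from HTML and XML documents
Installing bs4
- pip install lxml
- pip install bs4
- lxml is an optional but fast parser; installing it lets the examples below use the 'lxml' parser. Without it, Beautiful Soup can still use Python's built-in html.parser. The package's official name on PyPI is beautifulsoup4; bs4 is a wrapper package that pulls it in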
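As a quick check that the installation worked, the sketch below parses a tiny snippet and falls back to the built-in html.parser when lxml is missing. The markup string and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Prefer lxml when it is installed; otherwise use the parser bundled with Python.
try:
    import lxml  # noqa: F401
    parser = "lxml"
except ImportError:
    parser = "html.parser"

soup = BeautifulSoup("<p>hello <b>bs4</b></p>", parser)
bold_text = soup.b.string
print(parser, bold_text)
```

If this prints the parser name followed by bs4, both the library and the chosen parser are usable.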
2. Using bs4
2.1 Quick start with bs4
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# The usual way to import bs4
from bs4 import BeautifulSoup
# Build the soup object
# soup = BeautifulSoup(html_doc) # GuessedAtParserWarning:so I'm using the best available HTML parser for this system ("lxml")
# bs4 needs a parser to turn the page string into an object: use lxml, or html.parser from the standard library
soup = BeautifulSoup(html_doc, 'lxml')
# Pretty-print the markup
print(soup.prettify())
# Get the first matching tag (soup.<tag>)
print(soup.p)
# Get the tag's name (soup.<tag>.name)
print(soup.p.name)
# Get the text inside the tag (soup.<tag>.string)
print(soup.p.string)
2.2 bs4 object types
- Tag: a tag
- BeautifulSoup: the object representing the whole document
- NavigableString: a navigable string
- Comment: a comment
title_tag = soup.title # soup.<tag>
print(type(title_tag)) # <class 'bs4.element.Tag'>
print(type(soup)) # <class 'bs4.BeautifulSoup'>
title_tag = soup.title.string # soup.<tag>.string
print(type(title_tag)) # <class 'bs4.element.NavigableString'>
html = '''
<div><!-- --></div>
'''
bs = BeautifulSoup(html, 'lxml')
# The tag's content is an HTML comment
print(type(bs.div.string)) # <class 'bs4.element.Comment'>
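Because Comment is a subclass of NavigableString, code that handles strings may pick up comments by accident. The sketch below, using a made-up snippet and the built-in html.parser (an assumption, to avoid depending on lxml), shows one way to separate the two.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

markup = "<div><!-- a note --><span>text</span></div>"
soup = BeautifulSoup(markup, "html.parser")

# Comment inherits from NavigableString, so test for Comment first.
is_subclass = issubclass(Comment, NavigableString)

# Walk every node in the tree and keep only the comment nodes.
comment_texts = [str(node) for node in soup.descendants if isinstance(node, Comment)]

# Ordinary strings are NavigableString instances that are not comments.
plain_texts = [
    str(node)
    for node in soup.descendants
    if isinstance(node, NavigableString) and not isinstance(node, Comment)
]
print(is_subclass, comment_texts, plain_texts)
```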
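3. Traversal
- HTML is a tree: nested <tag> ... </tag> pairs define parent-child relationships. The tree can be traversed in three directions: downward, upward, and sideways (between siblings)
3.1 Downward traversal
- .contents: a list holding all of the tag's direct children
- .children: an iterator over the direct children; like .contents, but intended for looping
- .descendants: an iterator over all descendants (children, grandchildren, and so on)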
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')
# .contents: a list holding all of the tag's direct children
r = soup.head.contents # [<title>The Dormouse's story</title>]
r1 = soup.body.contents
print(r, r1)
# .children: an iterator over the direct children; like .contents, but intended for looping
# r_tag = soup.body.children
# print(r_tag) # <list_iterator object at 0x00000216C0EFE0B8>
r_tags = soup.body.children
for r_tag in r_tags:
    print(r_tag)
# .descendants: an iterator over all descendants (like peeling an onion layer by layer)
# r_tag = soup.head.descendants
# print(r_tag) # <generator object descendants at 0x0000025A674A0A40>
r_tags = soup.head.descendants
for r_tag in r_tags:
    print(r_tag)
# <title>The Dormouse's story</title>
# The Dormouse's story
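To make the difference between direct children and all descendants concrete, here is a minimal sketch on a made-up list snippet, parsed with the built-in html.parser (an assumption, so that no extra wrapper tags are added). The variable names are illustrative only.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

markup = "<ul><li>one <b>1</b></li><li>two</li></ul>"
soup = BeautifulSoup(markup, "html.parser")
ul = soup.ul

# .children stops at the first level below <ul>.
direct_names = [child.name for child in ul.children if isinstance(child, Tag)]

# .descendants keeps going into nested tags and also yields text nodes.
descendant_names = [node.name for node in ul.descendants if isinstance(node, Tag)]
descendant_count = len(list(ul.descendants))

print(direct_names)       # ['li', 'li']
print(descendant_names)   # ['li', 'b', 'li']
print(descendant_count)   # 6: three tags plus the strings 'one ', '1' and 'two'
```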
3.2 Upward traversal
- .parent: the node's parent
- .parents: an iterator over all of the node's ancestors, nearest first
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')
# .parent: the node's parent
r_tag = soup.title.parent # <head><title>The Dormouse's story</title></head>
print(r_tag)
# .parents: all of the node's ancestors
r_tags = soup.title.parents
for r_tag in r_tags:
    print(r_tag)
    print('-----------')
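Printing whole ancestors produces a lot of output, so it can be easier to look at their names only. The following sketch uses a shortened, made-up document and the built-in html.parser (an assumption) to show the chain that .parents walks.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><head><title>T</title></head></html>", "html.parser")

# Walk upward from <title>; the last ancestor is the BeautifulSoup object itself,
# whose name is the special value "[document]".
ancestor_names = [parent.name for parent in soup.title.parents]
print(ancestor_names)  # ['head', 'html', '[document]']

# The BeautifulSoup object is the root, so it has no parent.
root_parent = soup.parent
print(root_parent)  # None
```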
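3.3 Sideways traversal
- next_sibling: the next sibling node, in document order
- previous_sibling: the previous sibling node
- next_siblings: an iterator over all following siblings, in document order
- previous_siblings: an iterator over all preceding siblings, nearest first
- Siblings include text nodes, so the neighbour of a tag is often a string such as a comma or a line break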
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# next_sibling: the next sibling node, in document order
html = '<a><b>afaef</b><c>faefaef</c></a>'
bs = BeautifulSoup(html, 'lxml')
# print(bs.prettify())
print(bs.b)
print(bs.b.next_sibling)
soup = BeautifulSoup(html_doc, 'lxml')
# next_siblings: all following siblings, in document order
# r_tag = soup.a
# print(r_tag)
# print(r_tag.next_siblings) # <generator object next_siblings at 0x000001C73A119BA0>
r_tags = soup.a.next_siblings
for r_tag in r_tags:
    print(r_tag)
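A common surprise with sibling navigation is that whitespace between tags counts as a sibling. The sketch below, on a made-up two-link snippet parsed with the built-in html.parser (an assumption), contrasts the raw next_sibling with find_next_sibling(), which skips to the next matching tag.

```python
from bs4 import BeautifulSoup

markup = "<p><a id='x'>A</a>\n<a id='y'>B</a></p>"
soup = BeautifulSoup(markup, "html.parser")
first = soup.a

# The node right after the first <a> is the line break between the two tags.
raw_next = first.next_sibling

# find_next_sibling("a") skips text nodes and returns the next <a> tag.
tag_next = first.find_next_sibling("a")

print(repr(raw_next))   # '\n'
print(tag_next["id"])   # y
```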
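3.4 .string, .strings and .stripped_strings
- string: the tag's text when the tag has exactly one child that is a string (or a single child tag that has one); otherwise None
- strings: a generator over every string inside the tag, including those of nested tags
- stripped_strings: like strings, but strips leading and trailing whitespace from each string and skips whitespace-only strings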
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')
# string: the text inside the tag
r_tag = soup.title.string
print(r_tag)
# strings: a generator over the text of every tag
r_tags = soup.strings
for r_tag in r_tags:
    print(r_tag)
# stripped_strings: like strings, but with surrounding whitespace removed
r_tags = soup.stripped_strings
for r_tag in r_tags:
    print(r_tag)
4. Searching the tree
- String filter
- List filter
- Regular expression filter
- True filter
- Function filter
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')
# String filter
t_tag1 = soup.a # here a is accessed as a Tag attribute
t_tag = soup.find('a') # 'a' is a string filter
t_Tag = soup.find_all('a') # get every <a> tag
print(t_Tag)
# List filter
t_tag = soup.find_all(['a', 'head'])
print(t_tag)
# Custom function filter
def demo(tag):
    return tag.has_attr('id')
print(soup.find_all(demo))
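The regular expression and True filters from the list above work the same way as the others. A minimal sketch, assuming a shortened made-up document and the built-in html.parser:

```python
import re

from bs4 import BeautifulSoup

markup = "<html><head><title>T</title></head><body><b>x</b><p>y</p></body></html>"
soup = BeautifulSoup(markup, "html.parser")

# Regular expression filter: keeps tags whose name matches the pattern.
b_names = [tag.name for tag in soup.find_all(re.compile("^b"))]

# True filter: matches every tag, but no text nodes.
all_names = [tag.name for tag in soup.find_all(True)]

print(b_names)    # ['body', 'b']
print(all_names)  # ['html', 'head', 'title', 'body', 'b', 'p']
```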
7 find() and find_all()
7.1 find_all()
- find_all() returns all matching tags as a list
- find() returns only the first match, or None if nothing matches
- Parameters of find_all()
def find_all(self, name=None, attrs={}, recursive=True, text=None,
limit=None, **kwargs)
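- name: the tag name
- attrs: the tag's attributes
- recursive: whether to search all descendants rather than only direct children; defaults to True
- text: the text content to match (called string since Beautiful Soup 4.4.0; text is the older name)
- limit: the maximum number of results to return
- **kwargs: arbitrary keyword arguments, treated as attribute filters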
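The parameters above can be tried out one at a time. The sketch below uses a trimmed copy of the sample document and the built-in html.parser (both assumptions made to keep it self-contained), and passes string= rather than text= for the text filter.

```python
from bs4 import BeautifulSoup

html_doc = """<p class="title"><b>The Dormouse's story</b></p>
<p class="story"><a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a></p>"""
soup = BeautifulSoup(html_doc, "html.parser")

by_name = soup.find_all("a")                         # name
by_attrs = soup.find_all("a", attrs={"id": "link2"})  # attrs
by_kwarg = soup.find_all(class_="sister")            # **kwargs (class_ because class is a keyword)
limited = soup.find_all("a", limit=2)                # limit
by_string = soup.find_all(string="Elsie")            # string (text in older versions)
top_level = soup.find_all("a", recursive=False)      # only direct children of soup, so no <a>
first = soup.find("a")                               # find() returns the first match only

print(len(by_name), by_attrs[0].string, len(by_kwarg))
print([a["id"] for a in limited], by_string, top_level, first["id"])
```

Note that matching on string alone returns the text nodes themselves, not the tags that contain them.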
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
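7.2 find_parent(), find_parents(), find_next_sibling() and find_next_siblings()
- find_parent(): returns the nearest matching ancestor
- find_parents(): returns all matching ancestors
- find_next_sibling(): returns the first matching following sibling
- find_next_siblings(): returns all matching following siblings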
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')
a_tag = soup.find('title')
print(a_tag.find_parent('head')) # <head><title>The Dormouse's story</title></head>
# find_parents returns a list
a_Tag = soup.find(string='Elsie')
print(a_Tag.find_parents('p'))
# [<p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>]
a_tag = soup.find('a')
print(a_tag.find_next_sibling()) # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
a_tag = soup.find('a')
print(a_tag.find_next_siblings()) # returns a list
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
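7.3 find_previous_siblings(), find_previous_sibling(), find_all_next() and find_next()
- find_previous_siblings(): returns all matching preceding siblings
- find_previous_sibling(): returns the first matching preceding sibling
- find_all_next(): returns all matching elements that come later in the document
- find_next(): returns the first matching element that comes later in the document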
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')
# Find the previous sibling tag
a_tag = soup.find(id='link3')
print(a_tag.find_previous_sibling()) # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# Find all previous sibling tags
a_Tag = soup.find(id='link3')
print(a_Tag.find_previous_siblings())
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
# Find all elements that come later in the document
a_Tag = soup.find(id='link3')
print(a_Tag.find_all_next())
# [<p class="story">...</p>]
# Find a single element that comes later in the document
a_tag = soup.p
print(a_tag.find_next('a'))
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
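8. Modifying the tree
- Change a tag's name and attributes
- Change a tag's string by assigning to .string; append() adds content to a tag, much like append() on a Python list
- decompose() removes a node from the tree and destroys it, which is handy for dropping unneeded or duplicated content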
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# def find_all(self, name=None, attrs={}, recursive=True, text=None,
# limit=None, **kwargs):
soup = BeautifulSoup(html_doc, 'lxml')
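# 1. Change a tag's name and attributes
a_tag = soup.find('a')
print(a_tag)
a_tag.name = 'w'
a_tag['id'] = '123'
a_tag['class'] = 'content'
print(a_tag)
# 2. Change the string
# Re-parse first: step 1 renamed the first <a>, so soup.a would now be the Lacie link
soup = BeautifulSoup(html_doc, 'lxml')
a_tag = soup.a
print(a_tag.string)
# Elsie
a_tag.string = 'hello'
print(a_tag.string)
# hello
# append() adds content to a tag, much like list.append()
soup = BeautifulSoup(html_doc, 'lxml') # start again from the original document
a_tag = soup.a
print(a_tag)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
a_tag.append(' hello world')
print(a_tag)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie hello world</a>
# 3. decompose() removes unneeded content
soup = BeautifulSoup(html_doc, 'lxml') # start again from the original document
a_Tag = soup.find(class_='title')
# print(a_Tag)
a_Tag.decompose()
print(soup)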
# <html><head><title>The Dormouse's story</title></head>
# <body>
#
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
# <p class="story">...</p>
# </body></html>
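This article covered installing and using bs4 to extract information from HTML and XML, the object types it exposes (Tag, NavigableString, Comment and BeautifulSoup), the three ways of traversing the HTML tree (downward, upward and sideways), searching the tree with string, list and other filters, the find_all, find_parent and related methods, and how to modify the tree and delete unwanted content.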