A Brief Look at Python Web Scraping: the bs4 Library

This article walks through installing and using the bs4 library to extract information from HTML and XML documents. It covers the object types bs4 produces (Tag, NavigableString, Comment, and the BeautifulSoup object itself), the three ways of traversing the HTML tree (downward, upward, and sideways between siblings), and searching the tree with string, list, regular-expression, True, and method filters. It also covers find_all(), find_parent() and their relatives, as well as modifying the document tree and removing unwanted content.

1. Introduction to the bs4 Library

What is bs4?

  • Beautiful Soup is a library for extracting information from HTML and XML documents

Installing bs4

  • pip install lxml
  • pip install bs4
  • lxml is the recommended parser backend for bs4; install it as well, or fall back to Python's built-in html.parser

2. Using bs4

2.1 Quick Start with bs4

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
# The usual way to import bs4
from bs4 import BeautifulSoup
# Create a BeautifulSoup object
# soup = BeautifulSoup(html_doc)  # GuessedAtParserWarning: no parser was specified, so bs4 picks the best available HTML parser ("lxml")
# bs4 needs a parser to turn the HTML string into an object tree: use lxml or the built-in html.parser
soup = BeautifulSoup(html_doc, 'lxml')

# Pretty-print the document
print(soup.prettify())

# Get a tag (soup.<tag>)
print(soup.p)

# Get a tag's name (soup.<tag>.name)
print(soup.p.name)

# Get the text inside a tag (soup.<tag>.string)
print(soup.p.string)
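Besides .name and .string, a tag's attributes can be read like dictionary keys. A minimal sketch (using the built-in html.parser so it runs without lxml):

```python
from bs4 import BeautifulSoup

html = '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
soup = BeautifulSoup(html, 'html.parser')

a_tag = soup.a
# Attributes can be read like dictionary keys
print(a_tag['href'])    # http://example.com/elsie
# .get() returns None instead of raising KeyError for a missing attribute
print(a_tag.get('id'))  # link1
# Multi-valued attributes such as class come back as a list
print(a_tag['class'])   # ['sister']
```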

2.2 bs4 Object Types

  • Tag: an HTML/XML tag
  • BeautifulSoup: the parsed document object
  • NavigableString: the navigable string inside a tag
  • Comment: an HTML comment
title_tag = soup.title  # soup.<tag>
print(type(title_tag))  # <class 'bs4.element.Tag'>

print(type(soup))  # <class 'bs4.BeautifulSoup'>

title_string = soup.title.string  # soup.<tag>.string
print(type(title_string))  # <class 'bs4.element.NavigableString'>

html = '''
<div><!-- --></div>
'''
bs = BeautifulSoup(html, 'lxml')
# When a tag's only content is an HTML comment, .string is a Comment object
print(type(bs.div.string))  # <class 'bs4.element.Comment'>

3. Traversal

  • An HTML document is a tree: nested <tag></tag> pairs form parent-child relationships. The tree can be traversed downward, upward, or sideways (between siblings)

3.1 Downward Traversal

  • .contents: a list of the tag's direct children
  • .children: an iterator over the direct children, similar to .contents, mainly used in loops
  • .descendants: an iterator over all descendants, for deep traversal
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')
# .contents: a list of the tag's direct children
r = soup.head.contents  # [<title>The Dormouse's story</title>]
r1 = soup.body.contents
print(r, r1)

# .children: an iterator over the direct children, similar to .contents, mainly used in loops
# r_tag = soup.body.children
# print(r_tag)  # <list_iterator object at 0x00000216C0EFE0B8>

r_tags = soup.body.children
for r_tag in r_tags:
    print(r_tag)

# .descendants: an iterator over all descendants, peeling the tree layer by layer like an onion
# r_tag = soup.head.descendants
# print(r_tag)  # <generator object descendants at 0x0000025A674A0A40>
r_tags = soup.head.descendants
for r_tag in r_tags:
    print(r_tag)
# <title>The Dormouse's story</title>
# The Dormouse's story

3.2 Upward Traversal

  • .parent: the node's direct parent
  • .parents: an iterator over all of the node's ancestors
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')

# .parent: the node's direct parent
r_tag = soup.title.parent  # <head><title>The Dormouse's story</title></head>
print(r_tag)

# .parents: all of the node's ancestors
r_tags = soup.title.parents
for r_tag in r_tags:
    print(r_tag)
    print('-----------')
    

3.3 Sideways (Sibling) Traversal

  • .next_sibling: the next sibling node, in document order
  • .previous_sibling: the previous sibling node, in document order
  • .next_siblings: an iterator over all following siblings, in document order
  • .previous_siblings: an iterator over all preceding siblings, in document order
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
# .next_sibling: the next sibling node, in document order
html = '<a><b>afaef</b><c>faefaef</c></a>'
bs = BeautifulSoup(html, 'lxml')
# print(bs.prettify())
print(bs.b)
print(bs.b.next_sibling)


soup = BeautifulSoup(html_doc, 'lxml')
# .next_siblings: all following siblings, in document order
# r_tag = soup.a
# print(r_tag)
# print(r_tag.next_siblings)  # <generator object next_siblings at 0x000001C73A119BA0>

r_tags = soup.a.next_siblings
for r_tag in r_tags:
    print(r_tag)
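One thing worth knowing about sibling traversal: the whitespace and punctuation between tags are text nodes in the tree, so .next_sibling often returns a string rather than the next tag. A minimal sketch:

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">
<a id="link1">Elsie</a>,
<a id="link2">Lacie</a> and
<a id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

first_a = soup.a
# The immediate sibling is the text node ',\n', not the next <a> tag
print(repr(first_a.next_sibling))
# find_next_sibling() skips over text nodes to the next matching tag
print(first_a.find_next_sibling('a'))  # <a id="link2">Lacie</a>
```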

3.4 .string, .strings and .stripped_strings

  • .string: the string inside a single tag
  • .strings: a generator that yields the strings of all descendant tags
  • .stripped_strings: like .strings, but with extra whitespace stripped
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')

# .string: the string inside a single tag
r_tag = soup.title.string
print(r_tag)

# .strings: a generator that yields the strings of all descendant tags
r_tags = soup.strings
for r_tag in r_tags:
    print(r_tag)

# .stripped_strings: like .strings, but with extra whitespace stripped
r_tags = soup.stripped_strings
for r_tag in r_tags:
    print(r_tag)

4. Searching the Tree

  • string filter
  • list filter
  • regular-expression filter
  • True filter
  • method (function) filter
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')

# String filter
t_tag1 = soup.a  # soup.a returns the first <a> Tag

t_tag = soup.find('a')  # 'a' here is a string filter
t_tags = soup.find_all('a')  # all <a> tags
print(t_tags)

# List filter
t_tag = soup.find_all(['a', 'head'])
print(t_tag)


# Custom method filter

def demo(tag):
    return tag.has_attr('id')

print(soup.find_all(demo))
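The regular-expression and True filters listed above have no example yet; a minimal sketch:

```python
import re
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body><b>bold</b><a id="link1">Elsie</a></body></html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# Regular-expression filter: every tag whose name starts with "b"
print([tag.name for tag in soup.find_all(re.compile('^b'))])  # ['body', 'b']
# True filter: matches every tag in the document
print([tag.name for tag in soup.find_all(True)])
```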

7. find() and find_all()

7.1 find_all()

  • find_all() returns all matching tags as a list
  • find() returns only the first match
  • The signature of find_all():
def find_all(self, name=None, attrs={}, recursive=True, text=None,
                 limit=None, **kwargs)
  • name: the tag name
  • attrs: the tag's attributes
  • recursive: whether to search recursively; defaults to True
  • text: match by text content
  • limit: the maximum number of results
  • **kwargs: keyword arguments used as attribute filters
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
7.2 find_parent(), find_parents(), find_next_sibling() and find_next_siblings()

  • find_parent(): find the nearest matching ancestor
  • find_parents(): find all matching ancestors
  • find_next_sibling(): find the next matching sibling
  • find_next_siblings(): find all following matching siblings
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')

a_tag = soup.find('title')
print(a_tag.find_parent('head'))  # <head><title>The Dormouse's story</title></head>

# find_parents() returns a list
a_tag = soup.find(text='Elsie')
print(a_tag.find_parents('p'))
# [<p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>]


a_tag = soup.find('a')
print(a_tag.find_next_sibling())  # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

a_tag = soup.find('a')
print(a_tag.find_next_siblings())  # returns a list
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]


7.3 find_previous_siblings(), find_previous_sibling(), find_all_next() and find_next()

  • find_previous_siblings(): find all preceding siblings
  • find_previous_sibling(): find the nearest preceding sibling
  • find_all_next(): find all elements that come after this one
  • find_next(): find the first matching element that comes after this one
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')
# Find the nearest preceding sibling
a_tag = soup.find(id='link3')
print(a_tag.find_previous_sibling())
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

# Find all preceding siblings
a_tag = soup.find(id='link3')
print(a_tag.find_previous_siblings())
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

# Find all elements after this one
a_tag = soup.find(id='link3')
print(a_tag.find_all_next())
# [<p class="story">...</p>]

# Find the first matching element after this one
a_tag = soup.p
print(a_tag.find_next('a'))
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>


8. Modifying the Document Tree

  • Change a tag's name and attributes
  • Change a tag's string by plain assignment; append() adds content to a tag much like Python's list.append()
  • decompose() removes a tag and its contents from the tree, handy for stripping unwanted or duplicated content
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')

# (each numbered step below assumes a freshly parsed soup)

# 1. Change a tag's name and attributes

a_tag = soup.find('a')
print(a_tag)

a_tag.name = 'w'
a_tag['id'] = '123'
a_tag['class'] = 'content'
print(a_tag)

# 2. Change a tag's string
a_tag = soup.a
print(a_tag.string)
# Elsie
a_tag.string = 'hello'
print(a_tag.string)
# hello

# append() adds content to a tag, much like list.append()

a_tag = soup.a
print(a_tag)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
a_tag.append(' hello world')
print(a_tag)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie hello world</a>

# 3. decompose() removes unwanted content
a_tag = soup.find(class_='title')
# print(a_tag)
a_tag.decompose()
print(soup)
# <html><head><title>The Dormouse's story</title></head>
# <body>
# 
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
# <p class="story">...</p>
# </body></html>
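As a complement to append(), bs4 can also create brand-new tags with new_tag() and attach them to the tree; a minimal sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello</p>', 'html.parser')

# new_tag() creates a tag that is not yet attached to the tree
new_b = soup.new_tag('b')
new_b.string = 'world'

# append() attaches it as the last child of <p>
soup.p.append(new_b)
print(soup)  # <p>Hello<b>world</b></p>
```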
