A Brief Look at Python Web Scraping: the bs4 Library

This article walks through installing and using the bs4 library to extract information from HTML and XML documents. It covers the object types bs4 produces (Tag, NavigableString, Comment, and the BeautifulSoup object itself), the three ways of traversing the HTML tree (downward, upward, and sideways between siblings), and searching the tree with string, list, regular-expression, True, and method filters. It also covers find_all(), find_parent() and their relatives, as well as modifying the document tree and removing unwanted content.

1. Introduction to the bs4 Library

What is bs4?

  • Beautiful Soup is a library for extracting information from HTML and XML documents

Installing bs4

  • pip install lxml
  • pip install bs4
  • lxml is the recommended parser backend for bs4; install it as well, or fall back to Python's built-in html.parser

2. Using bs4

2.1 Quick Start with bs4

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
# The usual way to import bs4
from bs4 import BeautifulSoup
# Create a BeautifulSoup object
# soup = BeautifulSoup(html_doc)  # GuessedAtParserWarning: no parser was specified, so bs4 picks the best available HTML parser ("lxml")
# bs4 needs a parser to turn the HTML string into an object tree: use lxml or the built-in html.parser
soup = BeautifulSoup(html_doc, 'lxml')

# Pretty-print the document
print(soup.prettify())

# Get a tag (soup.<tag>)
print(soup.p)

# Get a tag's name (soup.<tag>.name)
print(soup.p.name)

# Get the text inside a tag (soup.<tag>.string)
print(soup.p.string)
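Besides .name and .string, a tag's attributes can be read like dictionary keys. A minimal sketch (using the built-in html.parser so it runs without lxml):

```python
from bs4 import BeautifulSoup

html = '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
soup = BeautifulSoup(html, 'html.parser')

a_tag = soup.a
# Attributes can be read like dictionary keys
print(a_tag['href'])    # http://example.com/elsie
# .get() returns None instead of raising KeyError for a missing attribute
print(a_tag.get('id'))  # link1
# Multi-valued attributes such as class come back as a list
print(a_tag['class'])   # ['sister']
```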

2.2 bs4 Object Types

  • Tag: an HTML/XML tag
  • BeautifulSoup: the parsed document object
  • NavigableString: the navigable string inside a tag
  • Comment: an HTML comment
title_tag = soup.title  # soup.<tag>
print(type(title_tag))  # <class 'bs4.element.Tag'>

print(type(soup))  # <class 'bs4.BeautifulSoup'>

title_string = soup.title.string  # soup.<tag>.string
print(type(title_string))  # <class 'bs4.element.NavigableString'>

html = '''
<div><!-- --></div>
'''
bs = BeautifulSoup(html, 'lxml')
# When a tag's only content is an HTML comment, .string is a Comment object
print(type(bs.div.string))  # <class 'bs4.element.Comment'>

3. Traversal

  • An HTML document is a tree: nested <tag></tag> pairs form parent-child relationships. The tree can be traversed downward, upward, or sideways (between siblings)

3.1 Downward Traversal

  • .contents: a list of the tag's direct children
  • .children: an iterator over the direct children, similar to .contents, mainly used in loops
  • .descendants: an iterator over all descendants, for deep traversal
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')
# .contents: a list of the tag's direct children
r = soup.head.contents  # [<title>The Dormouse's story</title>]
r1 = soup.body.contents
print(r, r1)

# .children: an iterator over the direct children, similar to .contents, mainly used in loops
# r_tag = soup.body.children
# print(r_tag)  # <list_iterator object at 0x00000216C0EFE0B8>

r_tags = soup.body.children
for r_tag in r_tags:
    print(r_tag)

# .descendants: an iterator over all descendants, peeling the tree layer by layer like an onion
# r_tag = soup.head.descendants
# print(r_tag)  # <generator object descendants at 0x0000025A674A0A40>
r_tags = soup.head.descendants
for r_tag in r_tags:
    print(r_tag)
# <title>The Dormouse's story</title>
# The Dormouse's story

3.2 Upward Traversal

  • .parent: the node's direct parent
  • .parents: an iterator over all of the node's ancestors
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')

# .parent: the node's direct parent
r_tag = soup.title.parent  # <head><title>The Dormouse's story</title></head>
print(r_tag)

# .parents: all of the node's ancestors
r_tags = soup.title.parents
for r_tag in r_tags:
    print(r_tag)
    print('-----------')
    

3.3 Sideways (Sibling) Traversal

  • .next_sibling: the next sibling node, in document order
  • .previous_sibling: the previous sibling node, in document order
  • .next_siblings: an iterator over all following siblings, in document order
  • .previous_siblings: an iterator over all preceding siblings, in document order
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
# .next_sibling: the next sibling node, in document order
html = '<a><b>afaef</b><c>faefaef</c></a>'
bs = BeautifulSoup(html, 'lxml')
# print(bs.prettify())
print(bs.b)
print(bs.b.next_sibling)


soup = BeautifulSoup(html_doc, 'lxml')
# .next_siblings: all following siblings, in document order
# r_tag = soup.a
# print(r_tag)
# print(r_tag.next_siblings)  # <generator object next_siblings at 0x000001C73A119BA0>

r_tags = soup.a.next_siblings
for r_tag in r_tags:
    print(r_tag)
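One thing worth knowing about sibling traversal: the whitespace and punctuation between tags are text nodes in the tree, so .next_sibling often returns a string rather than the next tag. A minimal sketch:

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">
<a id="link1">Elsie</a>,
<a id="link2">Lacie</a> and
<a id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

first_a = soup.a
# The immediate sibling is the text node ',\n', not the next <a> tag
print(repr(first_a.next_sibling))
# find_next_sibling() skips over text nodes to the next matching tag
print(first_a.find_next_sibling('a'))  # <a id="link2">Lacie</a>
```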

3.4 .string, .strings and .stripped_strings

  • .string: the string inside a single tag
  • .strings: a generator that yields the strings of all descendant tags
  • .stripped_strings: like .strings, but with extra whitespace stripped
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')

# .string: the string inside a single tag
r_tag = soup.title.string
print(r_tag)

# .strings: a generator that yields the strings of all descendant tags
r_tags = soup.strings
for r_tag in r_tags:
    print(r_tag)

# .stripped_strings: like .strings, but with extra whitespace stripped
r_tags = soup.stripped_strings
for r_tag in r_tags:
    print(r_tag)

4. Searching the Tree

  • string filter
  • list filter
  • regular-expression filter
  • True filter
  • method (function) filter
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')

# String filter
t_tag1 = soup.a  # soup.a returns the first <a> Tag

t_tag = soup.find('a')  # 'a' here is a string filter
t_tags = soup.find_all('a')  # all <a> tags
print(t_tags)

# List filter
t_tag = soup.find_all(['a', 'head'])
print(t_tag)


# Custom method filter

def demo(tag):
    return tag.has_attr('id')

print(soup.find_all(demo))
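The regular-expression and True filters listed above have no example yet; a minimal sketch:

```python
import re
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body><b>bold</b><a id="link1">Elsie</a></body></html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# Regular-expression filter: every tag whose name starts with "b"
print([tag.name for tag in soup.find_all(re.compile('^b'))])  # ['body', 'b']
# True filter: matches every tag in the document
print([tag.name for tag in soup.find_all(True)])
```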

7. find() and find_all()

7.1 find_all()

  • find_all() returns all matching tags as a list
  • find() returns only the first match
  • The signature of find_all():
def find_all(self, name=None, attrs={}, recursive=True, text=None,
                 limit=None, **kwargs)
  • name: the tag name
  • attrs: the tag's attributes
  • recursive: whether to search recursively; defaults to True
  • text: match by text content
  • limit: the maximum number of results
  • **kwargs: keyword arguments used as attribute filters
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
7.2 find_parent(), find_parents(), find_next_sibling() and find_next_siblings()

  • find_parent(): find the nearest matching ancestor
  • find_parents(): find all matching ancestors
  • find_next_sibling(): find the next matching sibling
  • find_next_siblings(): find all following matching siblings
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')

a_tag = soup.find('title')
print(a_tag.find_parent('head'))  # <head><title>The Dormouse's story</title></head>

# find_parents() returns a list
a_tag = soup.find(text='Elsie')
print(a_tag.find_parents('p'))
# [<p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>]


a_tag = soup.find('a')
print(a_tag.find_next_sibling())  # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

a_tag = soup.find('a')
print(a_tag.find_next_siblings())  # returns a list
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]


7.3 find_previous_siblings(), find_previous_sibling(), find_all_next() and find_next()

  • find_previous_siblings(): find all preceding siblings
  • find_previous_sibling(): find the nearest preceding sibling
  • find_all_next(): find all elements that come after this one
  • find_next(): find the first matching element that comes after this one
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')
# Find the nearest preceding sibling
a_tag = soup.find(id='link3')
print(a_tag.find_previous_sibling())
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

# Find all preceding siblings
a_tag = soup.find(id='link3')
print(a_tag.find_previous_siblings())
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

# Find all elements after this one
a_tag = soup.find(id='link3')
print(a_tag.find_all_next())
# [<p class="story">...</p>]

# Find the first matching element after this one
a_tag = soup.p
print(a_tag.find_next('a'))
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>


8. Modifying the Document Tree

  • Change a tag's name and attributes
  • Change a tag's string by plain assignment; append() adds content to a tag much like Python's list.append()
  • decompose() removes a tag and its contents from the tree, handy for stripping unwanted or duplicated content
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')

# (each numbered step below assumes a freshly parsed soup)

# 1. Change a tag's name and attributes

a_tag = soup.find('a')
print(a_tag)

a_tag.name = 'w'
a_tag['id'] = '123'
a_tag['class'] = 'content'
print(a_tag)

# 2. Change a tag's string
a_tag = soup.a
print(a_tag.string)
# Elsie
a_tag.string = 'hello'
print(a_tag.string)
# hello

# append() adds content to a tag, much like list.append()

a_tag = soup.a
print(a_tag)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
a_tag.append(' hello world')
print(a_tag)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie hello world</a>

# 3. decompose() removes unwanted content
a_tag = soup.find(class_='title')
# print(a_tag)
a_tag.decompose()
print(soup)
# <html><head><title>The Dormouse's story</title></head>
# <body>
# 
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
# <p class="story">...</p>
# </body></html>
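As a complement to append(), bs4 can also create brand-new tags with new_tag() and attach them to the tree; a minimal sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello</p>', 'html.parser')

# new_tag() creates a tag that is not yet attached to the tree
new_b = soup.new_tag('b')
new_b.string = 'world'

# append() attaches it as the last child of <p>
soup.p.append(new_b)
print(soup)  # <p>Hello<b>world</b></p>
```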
