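Py3.x Web Crawlers, Chapter 3: More Ways to Extract Data - CSDN Blog

Article link: https://blog.youkuaiyun.com/weixin_41998371/article/details/109555280

This article covers efficient extraction of data from web pages and JSON using XPath, lxml, BeautifulSoup4, and JsonPath, including installation, usage examples, and a performance comparison.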


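Chapter 3: More Ways to Extract Data

3.1 XPath and lxml

When regular expressions become hard to write or maintain, you can parse the HTML document with lxml and then locate information with XPath syntax.

3.1.1 XML

3.1.2 XPath syntax

For more XPath syntax, see W3School.

3.1.3 lxml

1. Installation
lxml is an HTML/XML parser; its main job is parsing HTML/XML and extracting data from it.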

	pip install lxml -i https://pypi.tuna.tsinghua.edu.cn/simple

2. Usage

# Parse a string into an HTML document

from lxml import etree
text = '''
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a>
    </li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a>
    </ul>
</div>
'''

# Use etree.HTML to parse the string into an HTML document
html = etree.HTML(text)

# Serialize the document back into a string
result = etree.tostring(html).decode('utf-8')

print(result)
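
The sample string above leaves its last <li> unclosed, and etree.HTML repairs such markup while parsing. The following is a minimal sketch of that behaviour on a shorter fragment; the fragment and the variable names are illustrative and not part of the original example.

```python
from lxml import etree

# A deliberately broken fragment: the second <li> is never closed.
broken = '<ul><li class="item-0">first</li><li class="item-1">second</ul>'

root = etree.HTML(broken)                  # the HTML parser recovers instead of raising
fixed = etree.tostring(root).decode('utf-8')

print(root.tag)                # the parser wraps the fragment in a root element
print(fixed)                   # serialized, repaired markup
print(fixed.count('</li>'))    # number of closing </li> tags after repair
```

Printing the serialized result is a quick way to see what the parser actually built before writing XPath expressions against it.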

# lxml can also read content from a file
# Create hello.html and save it in the project's data\ directory
<!-- hello.html -->
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a>
    </li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a>
    </ul>
</div>

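# Read from a file
from lxml import etree

# Read the external file hello.html. The file is HTML with an unclosed <li>,
# so pass an HTML parser; the default XML parser would reject it.
html = etree.parse('./data/hello.html', etree.HTMLParser())
# pretty_print=True formats the output with indentation and line breaks
result = etree.tostring(html, pretty_print=True).decode('utf-8')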

print(result)
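
When reading from a file, the choice of parser matters: etree.parse defaults to a strict XML parser. The sketch below writes a small, hypothetical stand-in for hello.html to a temporary directory (so it does not depend on ./data/) and compares the default parser with etree.HTMLParser().

```python
import os
import tempfile
from lxml import etree

# Hypothetical stand-in for hello.html; the second <li> is left unclosed.
markup = ('<div><ul><li><a href="link1.html">first item</a></li>'
          '<li><a href="link2.html">second item</a></ul></div>')

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'hello.html')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(markup)

    try:
        etree.parse(path)                # default parser expects well-formed XML
        strict_ok = True
    except etree.XMLSyntaxError:
        strict_ok = False

    tree = etree.parse(path, etree.HTMLParser())   # lenient HTML parser
    links = tree.xpath('//li/a/@href')

print(strict_ok)
print(links)
```

If the file is known to be well-formed XML, the default parser is fine; for pages saved from the web, the HTML parser is usually the safer choice.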

3. XPath examples

from lxml import etree

# hello.html is not well-formed XML (one <li> is unclosed), so use the HTML parser
html = etree.parse('./data/hello.html', etree.HTMLParser())
# 1. Get all <li> elements
# [<Element li at 0x22ee5b8b580>, <Element li at 0x22ee5b8b480>, <Element li at 0x22ee5b8b5c0>, <Element li at 0x22ee5b8b600>, <Element li at 0x22ee5b8b640>]
result = html.xpath('//li')
print('1')
print(result)

# 2. Get the class attribute of every <li>
# ['item-0', 'item-1', 'item-inactive', 'item-1', 'item-0']
result = html.xpath('//li/@class')
print('2')
print(result)

# 3. Get the <a> under <li> whose href is link1.html
# [<Element a at 0x22ee5b8b680>]
result = html.xpath('//li/a[@href="link1.html"]')
print('3')
print(result)

# 4. Get all <span> elements under <li>
# Note: the hello.html listed above contains no <span>; the outputs shown for
# examples 4, 5 and 8 assume one <a> wraps its text in <span class="bold">.
# [<Element span at 0x22ee5b8b940>]
result = html.xpath('//li//span')
print('4')
print(result)

# 5. Get every class attribute inside the <a> elements under <li>
# ['bold']
result = html.xpath('//li/a//@class')
print('5')
print(result)

# 6. Get the href of the <a> in the last <li>
# ['link5.html']
result = html.xpath('//li[last()]/a/@href')
print('6')
print(result)

# 7. Get the text of the <a> in the second-to-last <li>
# ['fourth item']
result = html.xpath('//li[last()-1]/a/text()')
print('7')
print(result)

# 8. Get the tag name of the element whose class is "bold"
# span
result = html.xpath('//*[@class="bold"]')
print('8')
# Guard against an empty result so a missing element does not raise IndexError
print(result[0].tag if result else None)
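
The queries above can also be tried without a file on disk. The sketch below runs a few of them against an inline string; the markup is an assumption that mirrors hello.html but wraps the first link's text in <span class="bold">, so that the span-related queries have something to match.

```python
from lxml import etree

# Assumed markup: the hello.html list, plus a <span class="bold"> in the first link.
text = '''
<div>
    <ul>
        <li class="item-0"><a href="link1.html"><span class="bold">first item</span></a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''
html = etree.HTML(text)

classes = html.xpath('//li/@class')                  # attribute values of every <li>
bold = html.xpath('//li/a//@class')                  # class attributes inside the links
last_href = html.xpath('//li[last()]/a/@href')       # href in the last <li>
fourth = html.xpath('//li[last()-1]/a/text()')       # text in the second-to-last <li>
tags = [el.tag for el in html.xpath('//*[@class="bold"]')]  # tag names of matches

for name, value in [('classes', classes), ('bold', bold), ('last_href', last_href),
                    ('fourth', fourth), ('tags', tags)]:
    print(name, value)
```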

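3.2 BeautifulSoup4

BeautifulSoup4 extracts data from HTML by navigating the parsed tree or by using CSS-selector syntax.

3.2.1 Installation and import

# Install
pip install beautifulsoup4
# Import
from bs4 import BeautifulSoup

3.2.2 Usage

Basic usage

# Parse a string into an HTML document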
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a BeautifulSoup object to parse html, using lxml as the HTML parser
soup = BeautifulSoup(html, 'lxml')
# Pretty-print the contents of the soup object
print(soup.prettify())

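3.2.3 Detailed usage

3.2.3.1 The four kinds of BeautifulSoup4 objects

BeautifulSoup4 converts a complex HTML document into a tree in which every node is a Python object. All of these objects fall into four kinds:
(1) Tag
(2) NavigableString
(3) BeautifulSoup
(4) Comment

# BS4 examples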
from bs4 import BeautifulSoup
import re

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# Create the object
soup = BeautifulSoup(html, 'lxml')

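(1) Tag type

A Tag is simply an HTML tag. Its two most important attributes are name and attrs:

# Get the Tag object for <title>
print(soup.title)
# Get the type of the Tag object
print(type(soup.title))
# Get the first <p> Tag object
print(soup.p)
# Get the name of the <p> Tag
print(soup.p.name)
# Get all attributes of the <p> Tag as a dict
print(soup.p.attrs)
# Get the class attribute of the <p> Tag as a list
print(soup.p.attrs['class'])
# Modify an attribute of the <p> Tag
soup.p['class'] = "newClass"
print(soup.p)
# Delete an attribute of the <p> Tag
del soup.p['class']
print(soup.p)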

(2) NavigableString type

A NavigableString holds the text inside a tag.

# Get the text inside the <p> tag
print(soup.p.string)
print(type(soup.p.string))

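(3) BeautifulSoup type

A BeautifulSoup object represents the content of a whole document. Most of the time it can be treated as a Tag object; it is a special Tag whose type, name, and attributes can each be read.

# Type of the name attribute (str)
print(type(soup.name))
# Name
print(soup.name)
# Attributes
print(soup.attrs)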

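(4) Comment type

A Comment is a special kind of NavigableString; its output does not include the comment delimiters.

# The tag must contain only the comment, with no surrounding spaces or newlines;
# otherwise .string is None. The sample outputs below come from an <a> tag like this:
# <a class="mnav" href="http://news.baidu.com" name="tj_trnews"><!--新闻--></a>
# (with the html defined above, the first <a> holds <!-- Elsie --> and .string is ' Elsie ')
print(soup.a)
print(soup.a.string) # 新闻
print(type(soup.a.string)) # <class 'bs4.element.Comment'>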
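
Because a Comment prints like ordinary text, comment content can slip into extracted data unnoticed. The sketch below shows one way to tell the two apart with isinstance; the helper name visible_text is hypothetical, and the built-in 'html.parser' is used only to keep the example free of extra dependencies.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = '<p><a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, 'html.parser')

def visible_text(tag):
    """Return tag.string as str, or None if it is missing or an HTML comment."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)

# Map each link id to its visible text (None for the comment-only link).
results = {a['id']: visible_text(a) for a in soup.find_all('a')}
print(results)
```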

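3.2.3.2 Traversing the document tree

(1) .contents

Returns all child nodes of a Tag as a list.

# .contents returns all child nodes of <body> as a list
print(soup.body.contents)
# Use a list index to get a single element
print(soup.body.contents[1])

(2) .children

Returns an iterator over all child nodes of a Tag.

# .children yields the direct child nodes; it is an iterable
print(soup.body.children)
# .descendants yields all descendant nodes; it is an iterable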
print(soup.body.descendants)
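
Printing .children or .descendants only shows an iterator object, so in practice they are consumed in a loop. The sketch below, on a small illustrative document, collects tag names from both to show the difference between direct children and all descendants; text nodes are skipped with an isinstance check.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = ('<body><p class="title"><b>Title</b></p>'
        '<p class="story">Text <a id="link1">Lacie</a></p></body>')
soup = BeautifulSoup(html, 'html.parser')

# Direct children of <body>: only the two <p> tags.
child_names = [c.name for c in soup.body.children if isinstance(c, Tag)]

# All descendants, in document order: nested tags are included as well.
descendant_names = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

print(child_names)
print(descendant_names)
```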

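3.2.3.4 Searching the document tree

find_all(name, attrs, recursive, text, **kwargs)

(1) The name parameter

String filter

# String filter: name matches on the tag name
print(soup.find_all(name='b'))

List filter

# List filter: given a list, BeautifulSoup4 returns every node that matches any item in the list
print(soup.find_all(name=['a', 'b']))

Regular expression filter

# Regular expression filter: given a compiled pattern, BeautifulSoup4 matches tag names with its search() method
print(soup.find_all(name=re.compile("^b")))

Function filter

# Pass a function and match according to its return value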
def name_is_exists(tag):
    return tag.has_attr("name")

t_list = soup.find_all(name_is_exists)
for item in t_list:
    print(item)
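
The signature above lists more parameters than name. The following sketch exercises a few of them on a trimmed version of the sample document; it assumes a reasonably recent BeautifulSoup4, where the text argument is also available under the newer name string.

```python
import re
from bs4 import BeautifulSoup

html = '''<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
</body></html>'''
soup = BeautifulSoup(html, 'html.parser')

by_attrs = soup.find_all('a', attrs={'class': 'sister'})    # filter with an attrs dict
by_kw = soup.find_all(id='link2')                           # filter with a keyword argument
by_class = soup.find_all('a', class_='sister', limit=2)     # class_ avoids the keyword; limit caps results
by_text = soup.find_all(string=re.compile('Dormouse'))      # match text nodes instead of tags
direct = soup.html.find_all('title', recursive=False)       # only direct children of <html>

print([a['id'] for a in by_attrs])
print(by_kw[0].get_text())
print(len(by_class), len(by_text), direct)
```

With recursive=False the search does not descend into <head>, which is why the title is not found from <html>.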

3.2.3.5 CSS selectors

select() returns a list of all elements that match a CSS selector.
(1) Select by tag name

	print(soup.select('title'))

(2) Select by class name

	print(soup.select('.sister'))

(3) Select by id

	print(soup.select('#link1'))

(4) Combined selectors

	print(soup.select('p #link1'))

(5) Select by attribute

	print(soup.select('a[class="sister"]'))

(6) Get the text content

	print(soup.select('title')[0].get_text())
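
Since select() always returns a list, indexing with [0] fails when nothing matches. A small sketch of the alternative, select_one(), which returns the first match or None; the markup is a shortened, illustrative version of the sample document.

```python
from bs4 import BeautifulSoup

html = '''<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>'''
soup = BeautifulSoup(html, 'html.parser')

first = soup.select_one('a.sister')                       # first match only
hrefs = [a['href'] for a in soup.select('p.story > a')]   # child combinator, then read attributes
missing = soup.select_one('#no-such-id')                  # None rather than an IndexError

print(first['id'])
print(hrefs)
print(missing)
```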

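3.3 JsonPath

XPath and BS4 extract data from HTML, but some asynchronous requests return JSON strings. There are three ways to extract data from them:
(1) Extract it directly with regular expressions
(2) Use the json module to convert the JSON string into a dict and read the data by key
(3) Extract it directly with JsonPath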
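
As a rough illustration of the first two approaches, the sketch below pulls the same titles out of a small JSON string, once with a regular expression and once with the standard json module. The sample payload and variable names are made up for the example.

```python
import json
import re

# Illustrative JSON string, as it might arrive from an asynchronous request.
raw = ('{"store": {"book": ['
       '{"title": "Moby Dick", "price": 8.99}, '
       '{"title": "Sword of Honour", "price": 12.99}]}}')

# (1) Regular expression: works on the raw text, but is tied to its exact formatting.
titles_re = re.findall(r'"title":\s*"([^"]*)"', raw)

# (2) json module: parse into Python objects, then walk the keys.
data = json.loads(raw)
titles_json = [book['title'] for book in data['store']['book']]

print(titles_re)
print(titles_json)
```

The regular expression is shorter here but would break on escaped quotes inside a title, which is one reason to parse the string first.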

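3.3.1 Installation and checking results

	pip install jsonpath

You can check how a JsonPath expression evaluates at http://jsonpath.com/.

3.3.2 Usage

Key points of JsonPath syntax:

$ the root element of the document
@ the current element
.node_name or ['node_name'] matches a child node
[index] selects an element of an array
[start:end:step] array slice syntax
* wildcard; matches all members
.. recursive descent; matches all descendants of a member
(<expr>) evaluates an expression
?(<boolean expr>) filters data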

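XPath and JsonPath compared

XPath	JsonPath	Description
/	$	Document root element
.	@	Current element
/	. or []	Matches child elements
..	N/A	Matches the parent element; JsonPath has no such operator
//	..	Recursively matches all descendant elements
*	*	Wildcard; matches child elements
@	N/A	Matches attributes; JsonPath has no such operator
[]	[]	Subscript operator; selects elements by index. XPath indices start at 1, JsonPath indices at 0
|	[,]	Union operator; combines multiple results into one array. JsonPath accepts indices or names
N/A	[start:end:step]	Array slicing; not supported by XPath
[]	?()	Filter expression
N/A	()	Script expression, evaluated by the underlying script engine; not supported by XPath
()	N/A	Grouping; not supported by JsonPath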

Example: extracting data with JsonPath

from jsonpath import jsonpath

json_str = {
    "store": {
        "book": [
            {"category": "reference",
             "author": "Nigel Rees",
             "title": "Sayings of the Century",
             "price": 8.95
             },
            {"category": "fiction",
             "author": "Evelyn Waugh",
             "title": "Sword of Honour",
             "price": 12.99
             },
            {"category": "fiction",
             "author": "Herman Melville",
             "title": "Moby Dick",
             "isbn": "0-553-21311-3",
             "price": 8.99
             },
            {"category": "fiction",
             "author": "J. R. R. Tolkien",
             "title": "The Lord of the Rings",
             "isbn": "0-395-19395-8",
             "price": 22.99
             }
        ],
        "bicycle": {
            "color": "red",
            "price": 19.95
        }
    }
}

# Authors of all books
print(jsonpath(json_str, '$.store.book[*].author'))
# All authors
print(jsonpath(json_str, '$..author'))
# Everything in the store
print(jsonpath(json_str, '$.store.*'))
# All prices in the store
print(jsonpath(json_str, '$.store..price'))
# The third book
print(jsonpath(json_str, '$..book[2]'))
# The last book
print(jsonpath(json_str, '$..book[(@.length-1)]'))
print(jsonpath(json_str, '$..book[-1:]'))
# The first two books
print(jsonpath(json_str, '$..book[0,1]'))
print(jsonpath(json_str, '$..book[:2]'))
# Books that have an isbn field
print(jsonpath(json_str, '$..book[?(@.isbn)]'))
# Books priced below 10
print(jsonpath(json_str, '$..book[?(@.price<10)]'))
# All members
print(jsonpath(json_str, '$..*'))

XPath	JsonPath	Result
/store/book/author	$.store.book[*].author	The author node of every book
//author	$..author	All author nodes
/store/*	$.store.*	All nodes under store: the book array and the bicycle node
/store//price	$.store..price	All price nodes under store
//book[3]	$..book[2]	The third book node
//book[last()]	$..book[(@.length-1)] or $..book[-1:]	The last book node
//book[position()<3]	$..book[0,1] or $..book[:2]	The first two book nodes
//book[isbn]	$..book[?(@.isbn)]	Book nodes that have an isbn field
//book[price<10]	$..book[?(@.price<10)]	Book nodes with price < 10
//*	$..*	Recursively matches all descendant nodes

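3.4 Performance and choosing a tool

Regular expressions are general-purpose and the fastest of these options, but they are hard to write, so they are usually used for simple text data.
XPath and BeautifulSoup4 can both extract data from HTML. XPath can operate on part of a document, whereas BS4 loads the entire DOM, so it takes more time and performs worse than lxml; in exchange, BeautifulSoup4 supports CSS selectors and its syntax is comparatively simple.
JsonPath can only extract data that is valid JSON.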
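
To check the speed claims on your own machine, a rough timing sketch such as the one below can help. The generated markup, the function names, and the repeat count are all arbitrary choices for illustration; absolute numbers vary with input size, parser, and hardware, so no figures are quoted here.

```python
import re
import timeit
from bs4 import BeautifulSoup
from lxml import etree

# Synthetic page with 200 links, generated only for this comparison.
html = '<ul>' + ''.join(
    f'<li><a href="link{i}.html">item {i}</a></li>' for i in range(200)
) + '</ul>'

def with_regex():
    return re.findall(r'<a href="([^"]*)"', html)

def with_lxml():
    return etree.HTML(html).xpath('//a/@href')

def with_bs4():
    return [a['href'] for a in BeautifulSoup(html, 'lxml').select('a')]

# Each function parses and extracts from scratch on every call.
for fn in (with_regex, with_lxml, with_bs4):
    seconds = timeit.timeit(fn, number=20)
    print(f'{fn.__name__}: {seconds:.4f}s for 20 runs')
```

All three functions return the same list of links, so the comparison measures only how the work is done.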