XPath Basic Syntax

Latest recommended article published on 2025-04-30 20:07:51

Licensed under CC 4.0 BY-SA

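This article explains how to use XPath selectors, with worked examples of parsing an HTML document with lxml: locating specific elements, reading attributes and text, and combining multiple conditions in a query.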

# Using XPath selectors

html_str = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>xpath测试网页</title>
</head>
<body>
    <a href="http://www.baidu.com" class="first second third">百度一下</a>
    <div>
        这是一个块标签
        <a href="http://www.taobao.com" class="second one three">淘宝网</a>
        <a href="http://www.qq.com" class="third">QQ网</a>
        <a href="http://www.tencent.com" id="one" name="aa">腾讯网</a>
        <section>
            <p id="content_id_1234455">一段文本</p>
            <p id="content_id_1232w55">二段文本</p>
            <p id="content_id_122455">三段文本</p>
            <p id="123_haha">11段文本</p>
            <p id="333_haha">22段文本</p>
        </section>
        <span>
            <a>666</a>
            哈哈哈
            <p>999</p>
        </span>
    </div>
</body>
</html>
"""

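# The XPath queries below rely on lxml to parse the page source.
from lxml import etree

# Parse the HTML source into a document tree object.
# parser: supplies configuration for this parse (here, the input encoding).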
obj = etree.HTML(html_str, parser=etree.HTMLParser(encoding='utf8'))
# <class 'lxml.etree._Element'>
# print(type(obj))
# <Element html at 0x1d35c060208>
# print(obj)

# Get the title tag.
# //: XPath path syntax that searches for title tags anywhere inside obj.
title = obj.xpath('//title/text()')
print(title)
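
The `//` step above searches at any depth, while a single `/` only steps to direct children. The following is a minimal sketch of that difference, assuming a small hypothetical document (`snippet`) that is not part of the original page:

```python
from lxml import etree

# Hypothetical miniature document, used only for this illustration.
snippet = """
<div>
    <a href="/direct">direct child</a>
    <p><a href="/nested">nested descendant</a></p>
</div>
"""
root = etree.HTML(snippet)

# "/" after div steps only to direct children of div.
direct = root.xpath('//div/a/@href')
# "//" after div steps to descendants of div at any depth.
anywhere = root.xpath('//div//a/@href')

print(direct)    # ['/direct']
print(anywhere)  # ['/direct', '/nested']
```

Results come back in document order, so the direct child is listed before the nested link.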

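# Get the a tag with id='one' under div.
# /: selects direct children of div only, not deeper descendants.
# xpath() returns a list, so [0] takes the first match (and raises IndexError if nothing matched).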
a_ele = obj.xpath('//div/a[@id="one"]')[0]
print(a_ele)
# Starting from this a Element, read its href attribute and text content.
# ./: runs the query relative to the current element.
txt = a_ele.xpath('./text()')[0]
href = a_ele.xpath('./@href')[0]
name = a_ele.xpath('./@name')[0]
# name = a_ele.xpath('./@id')
# name = a_ele.xpath('./@href')
# name = a_ele.xpath('./@class')
print('=======',href, txt, name)
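
Because `xpath()` always returns a list, indexing with `[0]` fails with `IndexError` when nothing matches. One possible way to guard against that is sketched below; the helper `first` and the `snippet` document are hypothetical names introduced for illustration. The sketch also uses the element's own `get()` method and `text` attribute, which lxml provides as an alternative to `./@href` and `./text()`:

```python
from lxml import etree

# Hypothetical miniature document, used only for this illustration.
snippet = '<div><a href="http://www.tencent.com" id="one" name="aa">Tencent</a></div>'
root = etree.HTML(snippet)


def first(results, default=None):
    """Return the first XPath result, or default when the list is empty."""
    return results[0] if results else default


link = first(root.xpath('//div/a[@id="one"]'))
missing = first(root.xpath('//div/a[@id="no-such-id"]'))

# Element.get() reads an attribute; Element.text is the leading text node.
print(link.get('href'), link.text, link.get('name'))
print(missing)  # None, instead of an IndexError
```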

# a_ele = obj.xpath('//div/a[@id="one"]/@href')[0]

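# Get a tag whose class attribute holds several values.
# "=" compares the whole attribute string, so all classes must be listed in the same order.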
a = obj.xpath('//div/a[@class="second one three"]/text()')[0]
# print(a)

# contains() tests whether the attribute value contains the given substring.
a = obj.xpath('//div/a[contains(@class, "second")]/text()')[0]
# print(a)
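
Since `contains()` is a plain substring test, it can also match class names that merely include the search text. A common XPath 1.0 idiom pads the attribute with spaces so that only whole class tokens match. The sketch below assumes a hypothetical two-link document to show the difference:

```python
from lxml import etree

# Hypothetical miniature document, used only for this illustration.
snippet = """
<div>
    <a class="second one three">A</a>
    <a class="someone">B</a>
</div>
"""
root = etree.HTML(snippet)

# Substring test: "someone" also contains "one", so both links match.
substring_match = root.xpath('//a[contains(@class, "one")]/text()')

# Token test: wrap the normalized class list in spaces and look for " one ".
token_match = root.xpath(
    '//a[contains(concat(" ", normalize-space(@class), " "), " one ")]/text()'
)

print(substring_match)  # ['A', 'B']
print(token_match)      # ['A']
```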

# starts-with() matches on a prefix of the attribute value (note the hyphen; it is not startswith()).
p = obj.xpath('//div/section/p[starts-with(@id, "content_id_")]/text()')
print(p)

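# ends-with() belongs to XPath 2.0; lxml implements XPath 1.0, so this query raises an XPath error if uncommented.
# p = obj.xpath('//div/section/p[ends-with(@id, "_haha")]/text()')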
# print(p)
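
A suffix match can still be expressed in XPath 1.0 by comparing the tail of the attribute with `substring()` and `string-length()`. The following is a sketch under the assumption of a small hypothetical document; `$suffix` is an XPath variable that lxml binds from the keyword argument passed to `xpath()`:

```python
from lxml import etree

# Hypothetical miniature document, used only for this illustration.
snippet = """
<section>
    <p id="content_id_122455">three</p>
    <p id="123_haha">eleven</p>
    <p id="333_haha">twenty-two</p>
    <p id="haha_prefix_only">other</p>
</section>
"""
root = etree.HTML(snippet)

# Calling ends-with() directly is rejected by lxml's XPath 1.0 engine.
try:
    root.xpath('//p[ends-with(@id, "_haha")]')
    ends_with_supported = True
except etree.XPathError:
    ends_with_supported = False

# Emulate the suffix test: take the last len(suffix) characters of @id
# and compare them with the suffix.
expr = ('//p[substring(@id, string-length(@id) - string-length($suffix) + 1)'
        ' = $suffix]/text()')
result = root.xpath(expr, suffix="_haha")

print(ends_with_supported)  # False
print(result)               # ['eleven', 'twenty-two']
```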

# Get the text content of every tag under div.
# a = obj.xpath('//div//text()')
# print(a)
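
A `//text()` query returns every text node, including the whitespace between tags, so the raw list usually needs cleaning. The sketch below shows two possible approaches on a hypothetical document: filtering in Python, or letting XPath collapse everything into one string with `normalize-space()`:

```python
from lxml import etree

# Hypothetical miniature document, used only for this illustration.
snippet = """
<div>
    block text
    <a>link</a>
    <p><b>deep</b></p>
</div>
"""
root = etree.HTML(snippet)

# Raw text nodes, including whitespace-only strings between tags.
raw = root.xpath('//div//text()')

# Option 1: strip each node and drop the empty ones.
cleaned = [t.strip() for t in raw if t.strip()]

# Option 2: ask XPath for a single whitespace-normalized string.
joined = root.xpath('normalize-space(string(//div))')

print(cleaned)  # ['block text', 'link', 'deep']
print(joined)   # block text link deep
```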

# XPath positions are 1-based: with five p tags, p[5] and p[last()] select the same element.
# p_obj = obj.xpath('//div/section/p[5]/text()')
p_obj = obj.xpath('//div/section/p[last()]/text()')
print(p_obj)
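
Positional predicates can be combined with `last()` and `position()` for other selections as well. This is a short sketch on a hypothetical three-paragraph document:

```python
from lxml import etree

# Hypothetical miniature document, used only for this illustration.
snippet = "<section><p>one</p><p>two</p><p>three</p></section>"
root = etree.HTML(snippet)

first_p = root.xpath('//section/p[1]/text()')                 # positions start at 1
last_p = root.xpath('//section/p[last()]/text()')             # final p
second_to_last = root.xpath('//section/p[last()-1]/text()')   # one before the final p
first_two = root.xpath('//section/p[position() <= 2]/text()')  # a range of positions

print(first_p, last_p, second_to_last, first_two)
```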

# Predicates support multi-condition queries with and/or.
a = obj.xpath('//div/a[@id="one" and @name="aa"]/@href')
print(a)
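
To make the difference between `and`, `or`, and `not()` concrete, here is a small sketch on a hypothetical set of links (the `href` values are invented for illustration):

```python
from lxml import etree

# Hypothetical miniature document, used only for this illustration.
snippet = """
<div>
    <a href="/a" id="one" name="aa">A</a>
    <a href="/b" id="one">B</a>
    <a href="/c" name="aa">C</a>
    <a href="/d">D</a>
</div>
"""
root = etree.HTML(snippet)

# and: both conditions must hold.
both = root.xpath('//a[@id="one" and @name="aa"]/@href')
# or: at least one condition must hold.
either = root.xpath('//a[@id="one" or @name="aa"]/@href')
# not(): the attribute must be absent.
neither = root.xpath('//a[not(@id) and not(@name)]/@href')

print(both)     # ['/a']
print(either)   # ['/a', '/b', '/c']
print(neither)  # ['/d']
```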