XPath vs. BeautifulSoup4: A Comparison

Latest recommended article published on 2025-06-02 12:41:17


License: CC 4.0 BY-SA


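This article looks at how XPath and BeautifulSoup are used to scrape data from web pages. It covers the main features of each: node selection, attribute extraction, and iteration strategy for XPath, and tag filtering and text extraction for BeautifulSoup. It also compares their efficiency and the scenarios each one suits, as practical guidance for crawler developers.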


XPath

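1. A node-selecting query always returns a list: either a list with data or an empty list

2. XPath positions start at 1, not 0

3. An XPath query usually extracts one of two kinds of value:
-1. The text content of a given tag (e.g. getting the text with text())
-2. The value of a given attribute of a given tag (e.g. getting a link with @href)

Strings returned by XPath are Unicode strings.

4. If there are many values to extract, first get the list of all matching nodes, then iterate over it and extract the values from each node.
Predicates such as last() and position() narrow down which nodes end up in the list:

last(): counts back from the last node

//div[@id="page"]/a[last()-3]

position(): selects a range of positions

//div[@id="page"]/a[position()>4]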
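The two predicates above can be tried on a small document. The following is a minimal sketch, assuming a made-up pager with six links (the markup and the `/p1` ... `/p6` paths are hypothetical, not taken from any real site):

```python
from lxml import etree

# Hypothetical pager markup with six links, made up for this example
html = b"""
<div id="page">
  <a href="/p1">1</a><a href="/p2">2</a><a href="/p3">3</a>
  <a href="/p4">4</a><a href="/p5">5</a><a href="/p6">6</a>
</div>
"""
html_obj = etree.HTML(html)

# a[1] is the first link: XPath positions start at 1, not 0
first = html_obj.xpath('//div[@id="page"]/a[1]/text()')

# last() is the position of the final link, so last()-3 is the fourth from the end
fourth_from_end = html_obj.xpath('//div[@id="page"]/a[last()-3]/text()')

# position()>4 keeps every link after the fourth
after_fourth = html_obj.xpath('//div[@id="page"]/a[position()>4]/text()')

# A query that matches nothing still returns a list, just an empty one
missing = html_obj.xpath('//div[@id="nope"]/a/text()')

print(first, fourth_from_end, after_fourth, missing)
```

With six links, `last()` evaluates to 6, so `a[last()-3]` selects the third link.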

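# html_obj is the DOM object built with etree.HTML(html)
node_list = html_obj.xpath("//div[@class='f18 mb20']")

for node in node_list:
    item = {}
    # Join the text nodes that sit directly under the div
    item['text'] = " ".join(node.xpath("./text()"))
    # [0] raises IndexError if the div contains no <a>
    item['a_text'] = node.xpath("./a/text()")[0]
    item['link'] = node.xpath("./a/@href")[0]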
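The "get the node list, then iterate" pattern can be run end to end on inline markup. This is only a sketch under stated assumptions: the three `div` elements are invented sample data, and the guard against a missing `<a>` is an addition that the loop above does not have.

```python
from lxml import etree

# Invented sample markup; the third div deliberately has no link
html = b"""
<div class="f18 mb20">Hello <a href="/a1">first link</a> world</div>
<div class="f18 mb20">Second <a href="/a2">second link</a></div>
<div class="f18 mb20">No link here</div>
"""
html_obj = etree.HTML(html)

items = []
for node in html_obj.xpath("//div[@class='f18 mb20']"):
    # Relative queries (starting with ./) are evaluated against the current node
    texts = [t.strip() for t in node.xpath("./text()") if t.strip()]
    a_texts = node.xpath("./a/text()")
    hrefs = node.xpath("./a/@href")
    items.append({
        "text": " ".join(texts),
        # Every query returns a list, so check it before indexing with [0]
        "a_text": a_texts[0] if a_texts else None,
        "link": hrefs[0] if hrefs else None,
    })

for item in items:
    print(item)
```

Checking the list before taking `[0]` keeps one malformed row from aborting the whole crawl with an `IndexError`.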

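# Get the raw response body (pick the line that matches your HTTP library)
# html = response.read()     # urllib
html = response.content      # requests

# Import the etree module from the lxml library
from lxml import etree
# Build an HTML DOM object with etree.HTML()
html_obj = etree.HTML(html)
# Or parse a local file; passing etree.HTMLParser() tolerates HTML that is not well-formed XML
# html_obj = etree.parse("./baidu.html", etree.HTMLParser())
# Serialize the DOM object back to bytes
# html = etree.tostring(html_obj)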

node_list = html_obj.xpath("//div[@class='f18 mb20']/a/@href")
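To see what `etree.HTML`, `etree.parse`, and `etree.tostring` each return, here is a small sketch. The unclosed list markup is made up for the example, and an in-memory `io.BytesIO` stands in for a local file such as `./baidu.html`.

```python
import io

from lxml import etree

# Deliberately sloppy markup (the <li> and <ul> tags are never closed), made up for this example
html = b"<ul><li><a href='/x'>X</a><li><a href='/y'>Y</a>"

# etree.HTML returns the root <html> element and repairs the missing structure
html_obj = etree.HTML(html)
hrefs = html_obj.xpath("//li/a/@href")

# etree.parse reads from a file name or file object; HTMLParser makes it lenient
tree = etree.parse(io.BytesIO(html), etree.HTMLParser())
hrefs_from_tree = tree.xpath("//li/a/@href")

# tostring() returns bytes by default; encoding="unicode" returns a str instead
as_bytes = etree.tostring(html_obj)
as_text = etree.tostring(html_obj, encoding="unicode")

print(hrefs)
print(as_text)
```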

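Common BeautifulSoup4 matching methods:

1. find(): matches the first result on the page that satisfies the rule and returns it (None if nothing matches)
2. find_all(): matches every result on the page that satisfies the rule and returns them as a list
find() and find_all() take the same arguments
3. select(): matches every result on the page that satisfies the rule and returns them as a list (uses CSS selector syntax)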
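The three methods can be compared side by side. A minimal sketch, assuming an invented job list and using the built-in `html.parser` so that it runs without lxml (the article itself passes `"lxml"`):

```python
from bs4 import BeautifulSoup

# Invented sample markup for illustration
html = """
<ul>
  <li class="job"><a href="/1">Engineer</a></li>
  <li class="job"><a href="/2">Designer</a></li>
  <li class="note">Updated daily</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# find(): the first matching Tag
first_job = soup.find("li", {"class": "job"})

# find_all(): every matching Tag, as a list
all_jobs = soup.find_all("li", {"class": "job"})

# select(): same idea, but the rule is written as a CSS selector
job_links = soup.select("li.job > a")

# When nothing matches, find() gives None and find_all() gives an empty list
no_table = soup.find("table")
no_tables = soup.find_all("table")

print(first_job.a.text, len(all_jobs), [a["href"] for a in job_links])
```

Because `find()` can return `None`, chaining attribute access onto it (for example `soup.find("table").tr`) fails with an `AttributeError` when the tag is absent.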

# Paging: the start parameter increases by 10 per page (0, 10, 20, ...)
page = 0
url = "https://hr.tencent.com/position.php?&start=" + str(page * 10)


# soup is the BeautifulSoup object for the page at url
item_list = []
node_list = soup.find_all("tr", {"class" : ["even", "odd"]})

for node in node_list:
    # Look up the cells once per row instead of once per field
    td_list = node.find_all("td")
    item = {}
    item['position_name'] = td_list[0].a.text
    item['position_link'] = td_list[0].a.get("href")
    item['position_type'] = td_list[1].text
    item['people_number'] = td_list[2].text
    item['work_location'] = td_list[3].text
    item['publish_times'] = td_list[4].text
    item_list.append(item)
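Since the live page may not be reachable, the row-extraction loop can be exercised offline. The sketch below assumes a table shaped the way the loop expects (five cells per row, the first holding a link); every row value, the `position_detail.php?id=...` links, and the three-page range are invented for illustration.

```python
from bs4 import BeautifulSoup

# Paging sketch: start advances by 10 per page (three pages chosen arbitrarily)
base_url = "https://hr.tencent.com/position.php?&start="
page_urls = [base_url + str(start) for start in range(0, 30, 10)]

# Invented table standing in for one downloaded page
html = """
<table>
  <tr class="h"><td>Position</td><td>Type</td><td>Headcount</td><td>Location</td><td>Date</td></tr>
  <tr class="even"><td><a href="position_detail.php?id=1">Backend Engineer</a></td>
      <td>Tech</td><td>2</td><td>Shenzhen</td><td>2018-01-01</td></tr>
  <tr class="odd"><td><a href="position_detail.php?id=2">Product Manager</a></td>
      <td>Product</td><td>1</td><td>Beijing</td><td>2018-01-02</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Field names for cells 1..4, in column order
KEYS = ["position_type", "people_number", "work_location", "publish_times"]

item_list = []
for node in soup.find_all("tr", {"class": ["even", "odd"]}):
    tds = node.find_all("td")
    item = {
        "position_name": tds[0].a.text,
        "position_link": tds[0].a.get("href"),
    }
    for key, td in zip(KEYS, tds[1:]):
        item[key] = td.text.strip()
    item_list.append(item)

for item in item_list:
    print(item)
```

The header row carries class `h`, so the `["even", "odd"]` filter skips it without any extra check.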

XPath vs. bs4 in use:

import requests
from lxml import etree
from bs4 import BeautifulSoup
url = "https://hr.tencent.com/position.php?&start=10"
headers = {"User-Agent" : "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko"}
html = requests.get(url, headers=headers).content

html_obj = etree.HTML(html)
html_obj.xpath("//tr[@class='even']")
html_obj.xpath("//tr[@class='odd']")
html_obj.xpath("//tr[@class='even'] | //tr[@class='odd']")

soup = BeautifulSoup(html, "lxml")
# Find all tr tags
soup.find_all("tr")
len(soup.find_all("tr"))
# Find all tr tags with the given class
len(soup.find_all("tr", {"class" : "even"}))
len(soup.find_all("tr", {"class" : "odd"}))
len(soup.find_all("tr", {"class" : ["even", "odd"]}))

# Find all tr and tmm tags that share the same class values
len(soup.find_all(["tr", "tmm"], {"class" : ["even", "odd"]}))
# Find tags of any name by attribute
len(soup.find_all(attrs={"class" : ["even", "odd"]}))
# Find tags of any name by the class attribute
len(soup.find_all(class_ = ["even", "odd"]))

# Find all tags whose class is even or odd, using CSS selectors
len(soup.select(".even"))
len(soup.select(".even, .odd"))
len(soup.select("[class='even'], [class='odd']"))
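One difference worth checking when comparing the two libraries is how they treat an element with several classes. The sketch below uses three invented rows, one of which has `class="even highlight"`. The `contains(concat(...))` expression is a common XPath 1.0 idiom for token matching and is not something the article itself uses.

```python
from bs4 import BeautifulSoup
from lxml import etree

# Invented rows; the third one carries two classes
html = """
<table>
  <tr class="even"><td>a</td></tr>
  <tr class="odd"><td>b</td></tr>
  <tr class="even highlight"><td>c</td></tr>
</table>
"""
html_obj = etree.HTML(html)
soup = BeautifulSoup(html, "lxml")

# XPath @class='even' compares the whole attribute string, so row c is skipped
xpath_exact = html_obj.xpath("//tr[@class='even'] | //tr[@class='odd']")

# Token-style match in XPath 1.0: pad with spaces and look for ' even '
xpath_token = html_obj.xpath(
    "//tr[contains(concat(' ', normalize-space(@class), ' '), ' even ')]"
)

# bs4 treats class as a list of tokens, so row c matches as well
bs4_rows = soup.find_all("tr", {"class": ["even", "odd"]})
css_token = soup.select("tr.even, tr.odd")

# The attribute selector [class='even'] is an exact string match again
css_exact = soup.select("tr[class='even'], tr[class='odd']")

print(len(xpath_exact), len(xpath_token), len(bs4_rows), len(css_token), len(css_exact))
```

So the XPath and bs4 calls shown in this section return the same rows only as long as every row has exactly one class.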

Extracting text and attribute values with bs4:

import requests
from bs4 import BeautifulSoup
url = "https://hr.tencent.com/position.php?&start=10"
headers = {"User-Agent" : "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko"}
html = requests.get(url, headers=headers).content
soup = BeautifulSoup(html, "lxml")

node_list = soup.find_all("tr", {"class" : ["even", "odd"]})

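# Three ways to reach the cells of the first row
node_list[0].td                  # first td only
node_list[0].find_all("td")      # all td tags
node_list[0].select("td")        # all td tags, via CSS selector

# First cell, then the link inside it
node_list[0].select("td")[0]
node_list[0].select("td")[0].a

# Text of the link
node_list[0].select("td")[0].a.string      # None if the tag has more than one child
node_list[0].select("td")[0].a.text
node_list[0].select("td")[0].a.get_text()

# Attribute values of the link
node_list[0].select("td")[0].a.get("href")       # None if the attribute is missing
node_list[0].select("td")[0].a.attrs             # dict of all attributes
node_list[0].select("td")[0].a.attrs["href"]     # KeyError if the attribute is missing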
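The text and attribute accessors look interchangeable on simple tags but differ on edge cases. A minimal sketch, assuming an invented link that contains a nested `<b>` tag and has no `title` attribute:

```python
from bs4 import BeautifulSoup

# Invented cell: the link text is split around a nested <b> tag
html = '<td><a href="/job/1" target="_blank">Senior <b>Python</b> Engineer</a></td>'
soup = BeautifulSoup(html, "html.parser")
a = soup.td.a

# .string only works when the tag has a single child; here it has three
link_string = a.string
inner_string = a.b.string

# .text and get_text() concatenate all nested strings
link_text = a.text
link_text_clean = a.get_text(" ", strip=True)

# get() returns None for a missing attribute, attrs[...] raises KeyError
href = a.get("href")
title = a.get("title")
try:
    a.attrs["title"]
    missing_raises = False
except KeyError:
    missing_raises = True

print(link_string, inner_string, link_text, href, title, missing_raises)
```

When the markup is not under your control, `get_text()` and `get()` are the more forgiving choices, since neither fails on nested tags or absent attributes.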