Section 12 -- Crawler 03: Data Extraction (pyquery; jsonpath)

Published 2019-07-24 21:23:59

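Copyright notice: this is an original article by the author, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.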

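This article introduces the basics of the PyQuery library, including initialization, selecting nodes, and reading attributes and content, and uses examples to show how to scrape web page data. It also explains the concept of JsonPath, the use of Python's json module, and how JsonPath is applied in practice.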

Contents

1. pyquery
2. JsonPath

1. pyquery

1.1. Introduction & installation

If you already know CSS selectors and jQuery, there is a parsing library that will feel familiar: pyquery. Official documentation: https://pythonhosted.org/pyquery/

pip install pyquery

1.2. Usage

1. Initialization

From a string

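from pyquery import PyQuery as pq
html = "<html><head><title>hello</title></head></html>"  # any HTML string
doc = pq(html)
print(doc('title'))  # select elements by tag name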

From a URL

from pyquery import PyQuery as pq
doc = pq(url='http://www.baidu.com')
print(doc('title'))

From a file

from pyquery import PyQuery as pq
doc = pq(filename='demo.html')  # path to a local HTML file
print(doc('title'))  # select elements by tag name

2. Selecting nodes

Selecting the current node

from pyquery import PyQuery as pq
doc = pq(filename='demo.html')
doc('#main #top')

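Getting child nodes
- Spell out the path level by level in the selector passed to doc
- Select the parent tag first, then call its children method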

from pyquery import PyQuery as pq
doc = pq(filename='demo.html')
doc('#main #top').children()

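Getting the parent node
- Select the current node, then call its parent method
Getting sibling nodes
- Select the current node, then call its siblings method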

3. Getting attributes

from pyquery import PyQuery as pq
doc = pq(filename='demo.html')
a = doc('#main #top')
print(a[0].attrib['href'])  # via the underlying lxml element (index into the selection first)
print(a.attr('href'))  # via the PyQuery object

4. Getting content

from pyquery import PyQuery as pq
doc = pq(filename='demo.html')
div = doc('#main #top')
print(div.html())  # inner HTML of the selected node
print(div.text())  # text content only

5. Examples

from pyquery import PyQuery as pq
# 1. Load an HTML string, an HTML file, or a URL
d = pq("<html><title>hello</title></html>")
d = pq(filename='demo.html')  # path to a local HTML file
d = pq(url='http://www.baidu.com')  # note: the URL apparently has to be written out in full
 
# 2. html() and text(): get the corresponding HTML block or text block
p = pq("<head><title>hello</title></head>")
p('head').html()  # returns <title>hello</title>
p('head').text()  # returns hello
 
# 3. Get elements by HTML tag, </
