Python Web Scraping (2) - Basic Use of Parsing Libraries

Latest recommended article published on 2021-01-05 21:04:12


License: CC 4.0 BY-SA


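This article shows how to parse fetched web page data with regular expressions, lxml, BeautifulSoup, pyquery, and JsonPath, and demonstrates how to extract the information you need from HTML and JSON responses.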


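Once we have fetched a page's response content, we parse it and filter out the parts we actually want. The tools covered here are:

Regular expressions (re)
lxml
Beautiful Soup
pyquery
JsonPath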

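Sample response content

The examples below all use an excerpt of the response from http://quotes.toscrape.com/. Assume the response body is stored as a string in a variable named content.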
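As a stand-in for the excerpt, a minimal sketch of `content` might look like the following. The markup is a hand-written, abbreviated approximation (an assumption, not the real response); only the `<small class="author" itemprop="author">` tag is taken from the examples in this article, and the quote texts are placeholders.

```python
# Hypothetical, abbreviated stand-in for the page's HTML (not the real response).
content = """
<div class="quote">
    <span class="text" itemprop="text">First sample quote.</span>
    <span>by <small class="author" itemprop="author">Albert Einstein</small></span>
</div>
<div class="quote">
    <span class="text" itemprop="text">Second sample quote.</span>
    <span>by <small class="author" itemprop="author">J.K. Rowling</small></span>
</div>
"""

# Quick sanity check: how many author tags does the sample contain?
author_tag_count = content.count('<small class="author" itemprop="author">')
print(author_tag_count)
```

In a real crawler, `content` would be the body of the HTTP response rather than a literal string.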

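1. Regular expressions (re)

This approach uses re, the regular expression module built into Python.

For example, to extract the names of all the famous people on the page: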

import re
# Capture the text between the opening and closing author tags (non-greedy match).
pat = re.compile(r'<small class="author" itemprop="author">(.*?)</small>')
print(pat.findall(content))

Output: ['Albert Einstein', 'J.K. Rowling', 'Albert Einstein', 'Jane Austen', 'Marilyn Monroe', 'Albert Einstein', 'André Gide', 'Thomas A. Edison', 'Eleanor Roosevelt', 'Steve Martin']
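The same idea extends to pulling several fields at once. The sketch below pairs each quote with its author using named groups and `re.S` so that `.` can cross line breaks. The sample `content` and the `span class="text"` markup are assumptions made for illustration, not the real page.

```python
import re

# Hypothetical, abbreviated stand-in for the page's HTML.
content = """
<div class="quote">
    <span class="text" itemprop="text">First sample quote.</span>
    <span>by <small class="author" itemprop="author">Albert Einstein</small></span>
</div>
<div class="quote">
    <span class="text" itemprop="text">Second sample quote.</span>
    <span>by <small class="author" itemprop="author">J.K. Rowling</small></span>
</div>
"""

# re.S lets '.' match newlines, so the lazy '.*?' can bridge the two tags.
pat = re.compile(
    r'<span class="text" itemprop="text">(?P<text>.*?)</span>'
    r'.*?'
    r'<small class="author" itemprop="author">(?P<author>.*?)</small>',
    re.S,
)

pairs = [(m.group("text"), m.group("author")) for m in pat.finditer(content)]
print(pairs)
```

Patterns like this are tied to the exact attribute order and spacing in the markup, which is one reason the parser-based approaches in the following sections tend to be more robust.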

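2. lxml

lxml supports parsing with XPath. So what is XPath?

XPath uses path expressions to select nodes or node sets in an XML document. Nodes are selected by following a path or a series of steps. See: XPath parsing.

We reuse the example above. First install the lxml library (pip install lxml).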

from lxml import etree

html = etree.HTML(content)
authors = html.xpath("//small[@class='author']//text()")
print(authors)
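When several fields belong together, one common pattern is to select each container node first and then run relative XPath expressions (starting with `.`) inside it. The sketch below assumes a hand-written sample `content` with `div class="quote"` containers; that markup is an approximation for illustration.

```python
from lxml import etree

# Hypothetical, abbreviated stand-in for the page's HTML.
content = """
<div class="quote">
    <span class="text" itemprop="text">First sample quote.</span>
    <span>by <small class="author" itemprop="author">Albert Einstein</small></span>
</div>
<div class="quote">
    <span class="text" itemprop="text">Second sample quote.</span>
    <span>by <small class="author" itemprop="author">J.K. Rowling</small></span>
</div>
"""

html = etree.HTML(content)

rows = []
for quote in html.xpath("//div[@class='quote']"):
    # A leading '.' makes the expression relative to the current quote node.
    text = quote.xpath("./span[@class='text']/text()")[0]
    author = quote.xpath(".//small[@class='author']/text()")[0]
    rows.append((text, author))

print(rows)
```

Indexing with `[0]` assumes every quote has both fields; on real pages it is safer to check that the result list is non-empty first.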

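3. Beautiful Soup

Beautiful Soup is another Python library for parsing HTML and XML. Its main job is extracting the data we need from a web page.

First install Beautiful Soup: pip install beautifulsoup4. The example below passes "lxml" as the parser, so lxml must be installed as well.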

from bs4 import BeautifulSoup

soup = BeautifulSoup(content,"lxml")
authors = soup.select('small.author')
for author in authors:
    print(author.get_text())
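A slightly fuller sketch: select each quote container and read its fields with `select_one`. To avoid the extra lxml dependency, this version uses Python's built-in `"html.parser"`. The sample `content` and its `div.quote` / `span.text` markup are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, abbreviated stand-in for the page's HTML.
content = """
<div class="quote">
    <span class="text" itemprop="text">First sample quote.</span>
    <span>by <small class="author" itemprop="author">Albert Einstein</small></span>
</div>
<div class="quote">
    <span class="text" itemprop="text">Second sample quote.</span>
    <span>by <small class="author" itemprop="author">J.K. Rowling</small></span>
</div>
"""

soup = BeautifulSoup(content, "html.parser")

rows = []
for quote in soup.select("div.quote"):
    # select_one returns the first match inside this quote (or None if absent).
    text = quote.select_one("span.text").get_text()
    author = quote.select_one("small.author").get_text()
    rows.append({"text": text, "author": author})

print(rows)
```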

4. pyquery

pyquery's syntax is almost identical to that of jQuery on the front end.

from pyquery import PyQuery as pq

doc = pq(content)
authors = doc('small.author')
for author in authors.items():
    print(author.text())

5. JsonPath

JsonPath is typically used when the response body is JSON.

Syntax:

XPath	JSONPath	Result
`/store/book/author`	`$.store.book[*].author`	the authors of all books in the store
`//author`	`$..author`	all authors
`/store/*`	`$.store.*`	all things in store, which are some books and a red bicycle.
`/store//price`	`$.store..price`	the price of everything in the store.
`//book[3]`	`$..book[2]`	the third book
`//book[last()]`	`$..book[(@.length-1)]` `$..book[-1:]`	the last book in order.
`//book[position()<3]`	`$..book[0,1]` `$..book[:2]`	the first two books
`//book[isbn]`	`$..book[?(@.isbn)]`	filter all books with isbn number
`//book[price<10]`	`$..book[?(@.price<10)]`	filter all books cheaper than 10
`//*`	`$..*`	all Elements in XML document. All members of JSON structure.
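To make a few of these rows concrete, here is a rough plain-Python reading of what they select, using only the standard library. It is an illustration of the semantics under a trimmed-down version of the store data, not how any JsonPath library is implemented.

```python
# Trimmed-down version of the bookstore data, for illustration only.
books = [
    {"title": "Sayings of the Century", "price": 8.95},
    {"title": "Sword of Honour", "price": 12.99},
    {"title": "Moby Dick", "isbn": "0-553-21311-3", "price": 8.99},
    {"title": "The Lord of the Rings", "isbn": "0-395-19395-8", "price": 22.99},
]

third_book = books[2]                                # $..book[2]
last_book = books[-1:]                               # $..book[-1:]
first_two = books[:2]                                # $..book[:2]
with_isbn = [b for b in books if "isbn" in b]        # $..book[?(@.isbn)]
cheap = [b for b in books if b["price"] < 10]        # $..book[?(@.price<10)]

print(third_book["title"])
print([b["title"] for b in with_isbn])
print([b["title"] for b in cheap])
```

Note that JsonPath indices are zero-based like Python's, whereas XPath positions start at 1, which is why `//book[3]` corresponds to `$..book[2]`.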

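The example uses a sample JSON document.

We extract every author and every price. The jsonpath module can be installed with pip install jsonpath.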

import jsonpath
import json

json_str = '''
{ "store": {
    "book": [ 
      { "category": "reference",
        "author": "Nigel Rees",
        "title": "Sayings of the Century",
        "price": 8.95
      },
      { "category": "fiction",
        "author": "Evelyn Waugh",
        "title": "Sword of Honour",
        "price": 12.99
      },
      { "category": "fiction",
        "author": "Herman Melville",
        "title": "Moby Dick",
        "isbn": "0-553-21311-3",
        "price": 8.99
      },
      { "category": "fiction",
        "author": "J. R. R. Tolkien",
        "title": "The Lord of the Rings",
        "isbn": "0-395-19395-8",
        "price": 22.99
      }
    ],
    "bicycle": {
      "color": "red",
      "price": 19.95
    }
  }
}
'''

jc = json.loads(json_str)

jp = jsonpath.jsonpath(jc, '$..author')
print(jp)
jp = jsonpath.jsonpath(jc, '$.store..price')
print(jp)
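The `..` operator means recursive descent: it looks for a key at any depth. The standard-library sketch below shows the idea; `find_key` is a hypothetical helper written for this illustration and is not part of the jsonpath package. It uses a shortened version of the store data.

```python
import json

def find_key(node, key):
    """Yield every value stored under `key` anywhere inside `node`."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            # Keep descending: nested dicts and lists may hold the key too.
            yield from find_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)

# Shortened version of the store data, for illustration only.
data = json.loads("""
{ "store": {
    "book": [
      { "author": "Nigel Rees", "price": 8.95 },
      { "author": "Evelyn Waugh", "price": 12.99 }
    ],
    "bicycle": { "color": "red", "price": 19.95 }
  }
}
""")

authors = list(find_key(data, "author"))          # roughly $..author
prices = list(find_key(data["store"], "price"))   # roughly $.store..price

print(authors)
print(prices)
```

The prices include the bicycle's, because recursive descent does not stop at the book list.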