Parsing Data with Beautiful Soup

Latest recommended article published on 2022-03-09 09:12:51

weixin_33753003

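License: CC 4.0 BY-SA

Tags: python

Original link: http://blog.51cto.com/zjunzz/2363235

This article describes how to parse HTML documents with the BeautifulSoup library, covering installation, basic usage, tag lookup, extracting text content, and CSS selectors.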

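1. Introduction

Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying a parse tree. It is a toolkit that parses a document and hands you the data you want to extract. Because it is simple, a complete application takes very little code.

Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it. In that case you only need to state the original encoding.

Beautiful Soup works on top of Python parsers such as lxml and html5lib, so you can choose between different parsing strategies or trade flexibility for speed.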
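The encoding behaviour described above can be sketched roughly as follows. This is a minimal illustration, assuming a GBK-encoded byte string with no declared charset; the sample text and the variable names are made up for the example and do not come from the original article.

```python
from bs4 import BeautifulSoup

# Bytes in GBK with no <meta charset> declaration, so nothing in the
# document itself says which encoding was used.
raw = "<p>你好，世界</p>".encode("gbk")

# from_encoding states the original encoding explicitly.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

# Inside the tree, text is an ordinary Unicode string.
text = soup.p.string

# encode() serialises the document back to bytes, UTF-8 by default.
out = soup.encode()

print(text)
print(out)
```

When the document does declare its encoding, the from_encoding argument can usually be left out.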

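2. Installation

Download: https://pypi.python.org/pypi/beautifulsoup4/4.3.2

Official documentation:

http://beautifulsoup.readthedocs.org/zh_CN/latest

After installing the package (for example with pip install beautifulsoup4), import the class:

from bs4 import BeautifulSoup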

We create a string that the examples below will use:

html = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

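Create a BeautifulSoup object, naming the parser explicitly so that the result does not depend on which parsers happen to be installed:

soup = BeautifulSoup(html, "html.parser")

Next, print the contents of the soup object as formatted output:

print(soup.prettify())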
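The introduction notes that Beautiful Soup can sit on different parsers. As a rough sketch, the second constructor argument selects the parser: "html.parser" ships with Python, while "lxml" and "html5lib" are separate packages that must be installed first (this example assumes they are not installed and uses only the built-in one). The broken markup is invented for the illustration.

```python
from bs4 import BeautifulSoup

# Deliberately broken markup: neither <p> nor <b> is ever closed.
broken = "<p>Unclosed paragraph<b>bold text"

# "html.parser" is part of the standard library, so it is always available.
# Passing "lxml" or "html5lib" instead requires installing those packages.
soup = BeautifulSoup(broken, "html.parser")

# The parser repairs the tree, so the tags can still be navigated.
bold_text = soup.p.b.string

print(bold_text)
print(soup.prettify())
```

Different parsers may repair invalid markup in different ways, which is one reason to name the parser explicitly.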

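3.1 Finding tags

Print a tag directly:

print(soup.title)
# <title>The Dormouse's story</title>

print(soup.head)
# <head><title>The Dormouse's story</title></head>

print(soup.a)
# <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

print(soup.p)
# <p class="title" name="dromouse"><b>The Dormouse's story</b></p>

Accessing a tag name on soup retrieves the tag and its contents easily, which is much more convenient than regular expressions. Note, however, that it returns only the first matching tag in the whole document.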
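To illustrate the "first match only" behaviour, here is a small sketch that contrasts attribute-style access with find_all(), which collects every match. The shortened HTML snippet is an assumption made for the example.

```python
from bs4 import BeautifulSoup

html = """<p>
<a href="http://example.com/lacie" id="link2">Lacie</a>
<a href="http://example.com/tillie" id="link3">Tillie</a>
</p>"""

soup = BeautifulSoup(html, "html.parser")

# Attribute-style access stops at the first <a> in document order.
first_link = soup.a

# find_all() collects every <a>, not just the first one.
all_links = soup.find_all("a")
all_ids = [link["id"] for link in all_links]

print(first_link["id"])
print(all_ids)
```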

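A tag has two important properties, name and attrs. Let's look at each in turn.

print(soup.name)
# [document]

print(soup.head.name)
# head

The soup object itself is a special case: its name is [document]. For any other tag, the value is the tag's own name.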

print(soup.p.attrs)
# {'class': ['title'], 'name': 'dromouse'}

Here we print all the attributes of the p tag. The result is a dictionary.

To get a single attribute, index the tag by the attribute name. For example, to get its class:

print(soup.p['class'])
# ['title']
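A short sketch of attribute access, assuming a slightly modified p tag with two class names (the extra "intro" class is invented for the example). It shows why class comes back as a list and how get() avoids an error for a missing attribute.

```python
from bs4 import BeautifulSoup

html = '<p class="title intro" name="dromouse">The Dormouse\'s story</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

# class is multi-valued in HTML, so Beautiful Soup returns a list.
classes = p["class"]

# Other attributes come back as plain strings.
name = p["name"]

# p["id"] would raise KeyError; get() returns None (or a default) instead.
missing = p.get("id")
fallback = p.get("id", "no-id")

print(classes, name, missing, fallback)
```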

3.2 Getting text

Now that we have a tag, how do we get the text inside it? Simply use .string, for example:

print(soup.p.string)
# The Dormouse's story
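.string works when a tag has a single piece of content. The sketch below, using a shortened snippet made up for the example, shows two cases worth knowing about: a tag with several children, where .string is None and get_text() is the usual alternative, and a tag whose only child is a comment.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = """<p class="story">Sisters:
<a id="link1"><!-- Elsie --></a>,
<a id="link2">Lacie</a></p>"""
soup = BeautifulSoup(html, "html.parser")

# The <p> has several children, so .string cannot pick one and is None.
story_string = soup.p.string

# get_text() joins the text found beneath the tag into one string.
story_text = soup.p.get_text()

# A tag whose only child is a comment returns a Comment object from .string.
link1_string = soup.find(id="link1").string
is_comment = isinstance(link1_string, Comment)

print(story_string, is_comment)
print(story_text)
```

Checking the type before using .string helps avoid treating comment text as page content.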

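3.3 CSS selectors

In CSS, a tag name is written without decoration, a class name is prefixed with a dot, and an id is prefixed with #. We can filter elements in the same way with soup.select(), which returns a list.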
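Because select() always returns a list, a single element has to be taken out by index. As a sketch, select_one() (available in more recent Beautiful Soup 4 releases) returns the first match directly. The snippet and the ".brother" class are invented for the example.

```python
from bs4 import BeautifulSoup

html = """<div>
<a class="sister" id="link2">Lacie</a>
<a class="sister" id="link3">Tillie</a>
</div>"""
soup = BeautifulSoup(html, "html.parser")

# select() always returns a list, even when nothing matches.
sisters = soup.select(".sister")
nothing = soup.select(".brother")

# select_one() returns the first match, or None when there is none.
first = soup.select_one(".sister")
none_found = soup.select_one(".brother")

print(len(sisters), len(nothing), first["id"], none_found)
```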

3.3.1 Searching by tag name

print(soup.select('title'))
# [<title>The Dormouse's story</title>]

3.3.2 Searching by class name

print(soup.select('.sister'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

3.3.3 Searching by id

print(soup.select('#link1'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

3.3.4 Combined search

Combined search follows the same rules as combining tag names, class names, and ids in a CSS file. For example, to find the element whose id is link1 inside a p tag, separate the two parts with a space:

print(soup.select('p #link1'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

3.3.5 Direct child search

print(soup.select("head > title"))
# [<title>The Dormouse's story</title>]
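The difference between the space (descendant) and > (direct child) combinators is easy to miss. A minimal sketch, assuming a reduced version of the sample document in which the links sit inside a p tag:

```python
from bs4 import BeautifulSoup

html = """<html><body>
<p class="story"><a id="link1">Elsie</a> <a id="link2">Lacie</a></p>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")

# A space matches descendants at any depth below <body>.
descendants = soup.select("body a")

# ">" matches direct children only; the links sit inside <p>, not <body>.
direct_of_body = soup.select("body > a")
direct_of_p = soup.select("p > a")

print(len(descendants), len(direct_of_body), len(direct_of_p))
```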

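3.3.6 Searching by attribute

A selector can also include attributes, written in square brackets. The attribute and the tag belong to the same node, so there must be no space between them. With a space, the selector looks for a descendant of the tag instead and will not match the tag itself.

print(soup.select('a[class="sister"]'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.select('a[href="http://example.com/elsie"]'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

Attribute selectors can likewise be combined with the other forms above: parts that refer to different nodes are separated by a space, and parts that refer to the same node are not.

print(soup.select('p a[href="http://example.com/elsie"]'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]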
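In practice the matched tags are usually a means to an end: the text or the URL inside them. The following sketch combines a tag selector with the standard CSS prefix operator ^= and then reads the text and href of each result. The extra "/about" link is invented so that the filter has something to exclude.

```python
from bs4 import BeautifulSoup

html = """<p class="story">
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
<a href="/about" id="about">About</a>
</p>"""
soup = BeautifulSoup(html, "html.parser")

# [href^="..."] keeps only links whose href starts with the given prefix.
links = soup.select('p a[href^="http://example.com/"]')

# Each result is an ordinary tag, so text and attributes are read as usual.
pairs = [(a.get_text(), a["href"]) for a in links]

for text, href in pairs:
    print(text, href)
```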

Reposted from: https://blog.51cto.com/zjunzz/2363235