BeautifulSoup


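This article shows how to parse HTML documents with Python's BeautifulSoup library. It covers basic operations such as creating a BeautifulSoup object, reading tag contents and attributes, and traversing the document tree, and it demonstrates searching the document with find_all and CSS selectors.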


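If you are not yet comfortable writing regular expressions, there is an alternative called Beautiful Soup. With it, you can easily extract the content of HTML or XML tags.

1. Introduction to Beautiful Soup

Beautiful Soup is a Python library whose main purpose is extracting data from web pages.

Creating a Beautiful Soup object

First, create a string: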

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

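Create the BeautifulSoup object, naming the parser explicitly (lxml here, which must be installed separately)
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
You can also create the object from a local HTML file
with open('index.html') as f:
    soup = BeautifulSoup(f, 'lxml')

Print the contents of the soup object as formatted output
print(soup.prettify())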

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

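Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.
1. Tag
A Tag corresponds to a tag in the HTML: the markup together with the content it encloses.

Here is how to retrieve tags conveniently with Beautiful Soup: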

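print(soup.title)
#<title>The Dormouse's story</title>

Writing soup followed by a tag name retrieves that tag directly. Note that this returns only the first matching tag in the whole document; finding all matching tags is covered later.

A Tag has two important attributes: name and attrs.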

name

print(soup.name)
print(soup.head.name)
#[document]
#head

The soup object itself is special: its name is [document]. For any other tag, the value is the name of the tag itself.

attrs

print(soup.p.attrs)
#{'class': ['title'], 'name': 'dromouse'}

This prints all attributes of the p tag; the result is a dictionary.

To get a single attribute:

print(soup.p['class'])
#['title']

print(soup.p.get('class'))
#['title']

Attributes and contents can also be modified.
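
The text mentions that attributes and contents can be modified but gives no example. The following is a minimal sketch of one way to do it; the one-line document and the choice of the built-in html.parser are assumptions made for illustration, not part of the article's example.

```python
from bs4 import BeautifulSoup

# Assumed stand-in document, much smaller than the article's example.
soup = BeautifulSoup(
    '<p class="title" name="dromouse"><b>Old text</b></p>', 'html.parser'
)
p = soup.p

p['class'] = 'headline'   # overwrite an existing attribute
p['id'] = 'intro'         # add a new attribute
del p['name']             # remove an attribute
p.b.string = 'New text'   # replace the text inside <b>

print(p)
```

A Tag behaves like a dictionary for its attributes, so the usual item assignment and del work on it.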

NavigableString (a navigable string)
So far we have obtained the tag itself.

How do we get the text inside the tag? Use .string:

print(soup.p.string)
#The Dormouse's story

BeautifulSoup
A BeautifulSoup object represents the entire content of a document.

To get a feel for it, retrieve its type, name, and attributes:

print(type(soup.name))
#<class 'str'>
print(soup.name)
# [document]
print(soup.attrs)
#{} (an empty dictionary)

Comment

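A Comment is a special kind of NavigableString. When printed, its content does not include the comment delimiters.

print(soup.a)
print(soup.a.string)
print(type(soup.a.string))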

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
 Elsie
<class 'bs4.element.Comment'>

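Elsie is a comment, yet .string returns it with the comment delimiters removed, so it looks like ordinary text.

It is therefore best to check the type before using it:

import bs4
if isinstance(soup.a.string, bs4.element.Comment):
    print(soup.a.string)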
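
To make the type check reusable, it can be wrapped in a small helper. This is only a sketch: the function name visible_text, the two-link document, and the html.parser choice are all hypothetical and chosen for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Assumed stand-in document: one link holds a comment, the other real text.
soup = BeautifulSoup(
    '<a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a>', 'html.parser'
)

def visible_text(tag):
    """Hypothetical helper: return tag.string unless it is missing or a comment."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)

results = {a['id']: visible_text(a) for a in soup.find_all('a')}
print(results)
```

Using isinstance rather than comparing type() also covers any subclass of Comment.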

Traversing the document tree

(1) Direct children

.contents
Returns the tag's child nodes as a list

<head>
  <title>
   The Dormouse's story
  </title>
 </head>


print(soup.head.contents)
#[<title>The Dormouse's story</title>]

The result is a list,
so an element can be retrieved by index.

print(soup.head.contents[0])
#<title>The Dormouse's story</title>

.children
It does not return a list but an iterator; loop over it to reach every child node.

<head>
  <title>
   The Dormouse's story
  </title>
 </head>


print(soup.head.children)
#<list_iterator object at 0x7f71457f5710>

To get at the contents, iterate over it:

for child in soup.body.children:
    print(child)

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>


<p class="story">...</p>

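(2) All descendants

The .contents and .children attributes contain only a tag's direct children.
The .descendants attribute iterates recursively over all of a tag's descendants. As with .children, you need to loop over it to get the contents.

for child in soup.descendants:
    print(child)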
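
The difference between direct children and all descendants is easier to see side by side. The sketch below uses a small assumed document and the built-in html.parser; neither comes from the article.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumed stand-in document with two levels of nesting.
soup = BeautifulSoup('<div><p>One <b>two</b></p><p>three</p></div>', 'html.parser')
div = soup.div

# .children stops at the first level below <div>.
children = [child.name for child in div.children]

# .descendants walks the whole subtree, including text nodes.
descendants = [
    node.name if isinstance(node, Tag) else str(node)
    for node in div.descendants
]

print(children)
print(descendants)
```

Text nodes appear among the descendants, which is why the loop distinguishes Tag objects from strings.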

(3) Node content

.string attribute

If a tag contains no other tags, .string returns the text inside it. If a tag contains exactly one child tag, .string returns the text of that innermost tag.

<head>
  <title>
   The Dormouse's story
  </title>
 </head>



print(soup.head.string)
#The Dormouse's story
print(soup.title.string)
#The Dormouse's story

If a tag contains several child nodes, .string cannot tell which child's text to return, so the result is None.
print(soup.html.string)  # None
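
When .string returns None because a tag has several children, get_text() is one way to collect all the text instead. A minimal sketch, assuming a one-line document and the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Assumed stand-in document: <p> has three children (text, <b>, text).
soup = BeautifulSoup('<p>Hello <b>world</b>!</p>', 'html.parser')
p = soup.p

print(p.string)      # None, because <p> has more than one child
print(p.b.string)    # world
print(p.get_text())  # Hello world!
```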

(4) Multiple strings

.strings

Retrieves every string in the document; you need to loop over it.

for string in soup.strings:
    print(repr(string))
    # "The Dormouse's story"
    # '\n\n'
    # "The Dormouse's story"
    # '\n\n'
    # 'Once upon a time there were three little sisters; and their names were\n'
    # ',\n'
    # 'Lacie'
    # ' and\n'
    # 'Tillie'
    # ';\nand they lived at the bottom of a well.'
    # '\n\n'
    # '...'
    # '\n'
    # Elsie is a comment in this document, so .strings skips it.
    # Whitespace-only strings may differ slightly between parsers.

.stripped_strings

The strings returned may include many spaces and blank lines. Use .stripped_strings to drop whitespace-only strings and strip leading and trailing whitespace from the rest.

for string in soup.stripped_strings:
    print(repr(string))
    # "The Dormouse's story"
    # "The Dormouse's story"
    # 'Once upon a time there were three little sisters; and their names were'
    # ','
    # 'Lacie'
    # 'and'
    # 'Tillie'
    # ';\nand they lived at the bottom of a well.'
    # '...'
    # Elsie is a comment in this document, so it is not listed.

(5) Parent node
.parent attribute
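
The article gives no example for .parent, so here is a brief sketch of how it behaves. The shortened document and the html.parser choice are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Assumed stand-in document containing only the head of the article's example.
soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    'html.parser',
)
title = soup.title

print(title.parent.name)         # the tag that encloses <title>
print(title.string.parent.name)  # a string's parent is the tag that holds it
print(soup.html.parent.name)     # the top-level tag's parent is the soup object
print(soup.parent)               # the soup object itself has no parent
```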

(6) All ancestors

.parents attribute

The .parents attribute iterates upward through all of an element's ancestors.

content = soup.head.title.string
for parent in content.parents:
    print(parent.name)

title
head
html
[document]

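(7) Sibling nodes

Siblings are nodes at the same level as the current node, sharing the same parent.

.next_sibling attribute

Returns the next sibling of the node.

.previous_sibling

Does the opposite and returns the previous sibling.
If there is no such node, the result is None.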
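
A common surprise is that the next sibling of a tag is often a text node (punctuation or whitespace) rather than the next tag. The sketch below illustrates this with an assumed two-link document parsed by html.parser.

```python
from bs4 import BeautifulSoup

# Assumed stand-in document: two links separated by ", ".
soup = BeautifulSoup(
    '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>', 'html.parser'
)
first = soup.find('a', id='link1')

print(repr(first.next_sibling))            # the text between the two links
print(first.next_sibling.next_sibling)     # the second <a> tag
print(first.previous_sibling)              # None: nothing precedes the first link
print(first.find_next_sibling('a')['id'])  # skips text nodes and finds the next <a>
```

find_next_sibling() is convenient when only tags are of interest, since it ignores the text nodes in between.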

(8) All siblings

.next_siblings

.previous_siblings

These iterate over all following or preceding siblings of the current node.

for sibling in soup.a.next_siblings:
    print(repr(sibling))
    # ',\n'
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    # ' and\n'
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    # ';\nand they lived at the bottom of a well.'

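.next_element and .previous_element attributes

Unlike the sibling attributes, these are not limited to siblings. They step through all nodes regardless of level, including parents and children, in the order in which the document was parsed.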
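
The contrast between .next_sibling and .next_element can be sketched as follows. The tiny document and the html.parser choice are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Assumed stand-in document: <a> and <b> are siblings inside <p>.
soup = BeautifulSoup('<p><a>Elsie</a><b>Lacie</b></p>', 'html.parser')
a = soup.a

print(a.next_sibling)           # <b>Lacie</b>, the node at the same level
print(repr(a.next_element))     # 'Elsie', the string parsed right after <a> opened
print(a.previous_element.name)  # p, the parent, which was parsed just before <a>
```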

7. Searching the document tree

(1) find_all( name , attrs , recursive , text , **kwargs )

find_all() searches all tag descendants of the current tag and checks whether each one matches the filter.
Parameters
1) The name parameter
Finds all tags whose name matches name; string nodes are ignored automatically.
A. Passing a string

soup.find_all('b')
# [<b>The Dormouse's story</b>]

print(soup.find_all('a'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

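B. Passing a regular expression
Beautiful Soup filters tag names against the pattern using the regular expression's search() method.
The example below finds tags whose names start with b, so both <body> and <b> are found.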

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b
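
Because the pattern is applied with search semantics, an unanchored pattern matches anywhere in the tag name. The sketch below contrasts an anchored and an unanchored pattern on an assumed minimal document parsed by html.parser.

```python
import re
from bs4 import BeautifulSoup

# Assumed stand-in document with html, head, title, body and b tags.
soup = BeautifulSoup(
    '<html><head><title>T</title></head><body><b>x</b></body></html>',
    'html.parser',
)

# Anchored: only names that start with "b".
starts_with_b = [tag.name for tag in soup.find_all(re.compile('^b'))]

# Unanchored: any name that contains "t".
contains_t = [tag.name for tag in soup.find_all(re.compile('t'))]

print(starts_with_b)
print(contains_t)
```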

C. Passing a list
Beautiful Soup returns everything that matches any element of the list. The code below finds all <a> and <b> tags in the document.

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

D. Passing True

True matches any value. Passing it to find_all finds every tag in the document but returns no string nodes.
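
A short sketch of passing True, using an assumed small document and the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Assumed stand-in document with four nested tags and some text.
soup = BeautifulSoup('<html><body><p>Hi <b>there</b></p></body></html>', 'html.parser')

# True matches every tag; the text nodes "Hi " and "there" are not returned.
names = [tag.name for tag in soup.find_all(True)]
print(names)
```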

E. Passing a function
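
The article leaves this case without an example. As a sketch: find_all accepts a function that takes a tag and returns True for tags to keep. The function name has_class_but_no_id and the short document below are chosen for illustration, and html.parser is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# Assumed stand-in document: two <p> tags with a class only, one <a> with class and id.
soup = BeautifulSoup(
    '<p class="title">A</p>'
    '<a class="sister" id="link1">B</a>'
    '<p class="story">C</p>',
    'html.parser',
)

def has_class_but_no_id(tag):
    """Keep tags that define a class attribute but no id attribute."""
    return tag.has_attr('class') and not tag.has_attr('id')

matches = soup.find_all(has_class_but_no_id)
classes = [tag['class'] for tag in matches]
print(classes)
```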

2) Keyword arguments
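
The article does not elaborate on keyword arguments. As a hedged sketch: a keyword argument that is not one of find_all's named parameters is treated as a filter on the attribute of that name, and class_ is used because class is a reserved word in Python. The shortened document and html.parser are assumptions for illustration.

```python
import re
from bs4 import BeautifulSoup

# Assumed stand-in document based on two links from the article's example.
soup = BeautifulSoup(
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
    '<p class="story">...</p>',
    'html.parser',
)

by_id = soup.find_all(id='link2')                  # filter on the id attribute
by_href = soup.find_all(href=re.compile('elsie'))  # attribute values accept a regex too
by_class = soup.find_all('a', class_='sister')     # class_ filters on the class attribute

print(by_id)
print(by_href)
print(by_class)
```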

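8. CSS selectors

As in CSS, a tag name is written with no prefix, a class name is prefixed with a dot, and an id is prefixed with #.

soup.select() returns a list.

(1) Selecting by tag name

print(soup.select('title'))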
#[<title>The Dormouse's story</title>]

(2) Selecting by class name

print(soup.select('.sister'))

#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

(3) Selecting by id

print(soup.select('#link1'))

#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

(4) Combined selectors

To find the element with id link1 inside a p tag, separate the two parts with a space:

print(soup.select('p #link1'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

Direct child selection:

print(soup.select("head > title"))
#[<title>The Dormouse's story</title>]

(5) Selecting by attribute

A selector can also include an attribute condition, written in square brackets. The attribute and the tag belong to the same node, so there must be no space between them; otherwise nothing will match.

print(soup.select('a[class="sister"]'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.select('a[href="http://example.com/elsie"]'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

Parts that refer to different nodes are separated by a space; parts that refer to the same node are written without one.

print(soup.select('p a[href="http://example.com/elsie"]'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

The select method always returns its results as a list, so you can loop over them and call get_text() to read each element's text.

soup = BeautifulSoup(html, 'lxml')
print(type(soup.select('title')))
print(soup.select('title')[0].get_text())

for title in soup.select('title'):
    print(title.get_text())
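
Putting the selector pieces together, a typical use is to collect the text and href of every matching link. The sketch below assumes a shortened version of the example document and uses the built-in html.parser; select_one() returns only the first match instead of a list.

```python
from bs4 import BeautifulSoup

# Assumed stand-in document based on two links from the article's example.
soup = BeautifulSoup(
    '<p class="story">'
    '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> '
    '<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>'
    '</p>',
    'html.parser',
)

# Direct <a class="sister"> children of <p class="story">.
links = [(a.get_text(), a['href']) for a in soup.select('p.story > a.sister')]
print(links)

# select_one gives a single element, or None when nothing matches.
first = soup.select_one('#link3')
print(first.get_text())
```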