Python 3.6 BeautifulSoup for Beginners (Part 1)

Latest recommended article published on 2020-12-16 07:49:35

zhangziyuan123

License: CC 4.0 BY-SA

Tags: [python]

Article link: https://blog.youkuaiyun.com/W45DB041/article/details/81638607

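This article shows how to parse HTML documents with Python's BeautifulSoup library. Worked examples demonstrate how to get tag names, attributes, and text content, and how to find specific tags and extract links.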

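1. BeautifulSoup parses HTML, XML, and similar documents, which makes it easy to extract information from web pages. Below, BeautifulSoup is applied to a short HTML document:

The HTML code is as follows: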

html_doc = """
<html><head><title>The Dormouse's story</title></head>

<body>
<p class= "title"><b>The Dormouse's story</b></p>

<p class= "story">Once upon a time there were three little sisters;and their name were

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())

2. After running the Python code above, you can work with the resulting soup variable in the Python shell to inspect how html_doc was parsed:

================ RESTART: E:/python_program/Dormouse_story.py ================
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters;and their name were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
>>> soup.title()
[]
>>> soup.title
<title>The Dormouse's story</title>
>>> soup.title.name
'title'
>>> 
>>> soup.title.string
"The Dormouse's story"
>>> 
>>> soup.title.parent.name
'head'
>>> 
>>> soup.p
<p class="title"><b>The Dormouse's story</b></p>
>>> 

>>> soup.class
SyntaxError: invalid syntax
>>> soup.p[class]
SyntaxError: invalid syntax

>>> soup.p
<p class="title"><b>The Dormouse's story</b></p>

>>> soup.p['class']
['title']

>>> soup.p['class'][0]
'title'

>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
 
# Find all <a> tags in the document:
>>> soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]


>>> soup.find(id='link3')
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
>>> print(soup.get_text())

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters;and their name were

Elsie,
Lacieand
Tillie;
and they lived at the bottom of a well.
...
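
The two SyntaxError lines in the session above occur because class is a reserved keyword in Python, so it cannot appear as a bare name in soup.class or soup.p[class]. The following is a minimal sketch of attribute access that avoids the clash; the shortened html_doc is an assumption made for illustration and is not the article's full document.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the article's document (illustrative assumption).
html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# Attribute names are passed as strings, so the keyword clash never arises.
title_classes = soup.p['class']      # class is multi-valued, so this is a list
missing = soup.p.get('id')           # .get() returns None for an absent attribute

# When filtering by CSS class, bs4 uses the keyword argument class_ (trailing underscore).
story_paragraphs = soup.find_all('p', class_='story')

print(title_classes, missing, len(story_paragraphs))
```

Indexing with soup.p['id'] would raise KeyError here, so .get() is the safer choice when an attribute may be missing.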

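3. From the session above we can see that, for the soup object built by BeautifulSoup, soup.tag_name.name returns the name of the tag, while soup.tag_name returns the first matching tag itself, including its contents. In addition, get_text() extracts the text of the HTML document with all tags stripped out.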
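
The observations in item 3 can be collected into one small script. This is only a sketch on a condensed one-line version of the document (an assumption made for brevity). It also shows why soup.title() returned [] in the session: calling a tag is shorthand for find_all(), and the title contains no child tags.

```python
from bs4 import BeautifulSoup

# Condensed stand-in for the article's document (illustrative assumption).
html_doc = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p class=\"title\"><b>The Dormouse's story</b></p></body></html>"
)
soup = BeautifulSoup(html_doc, 'html.parser')

tag = soup.title                  # the first <title> tag, markup included
tag_name = tag.name               # the tag's name
tag_text = tag.string             # the single string inside the tag
parent_name = tag.parent.name     # the name of the enclosing tag
child_tags = soup.title()         # same as soup.title.find_all(): no child tags
all_text = soup.get_text()        # every string in the document, tags removed

print(tag, tag_name, tag_text, parent_name, child_tags)
print(all_text)
```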

4. <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a> turns the text Elsie into a hyperlink. Similar definitions can be seen when you inspect elements of a page in a browser.
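
Since each <a> tag carries its target in the href attribute, the links can be pulled out by combining find_all('a') with attribute access. The snippet below is a hedged sketch that uses only the three links from the example document; the variable name links is chosen here for illustration.

```python
from bs4 import BeautifulSoup

# Only the link paragraph from the example document.
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# Pair each link's visible text with its href attribute.
links = [(a.get_text(), a.get('href')) for a in soup.find_all('a')]

for text, url in links:
    print(text, '->', url)
```

Using a.get('href') rather than a['href'] keeps the loop from failing on an <a> tag that has no href.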