Getting Started with BeautifulSoup in Python 3.6 (Part 1)

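This post introduces parsing HTML documents with Python's BeautifulSoup library. Worked examples show how to get a tag's name, attributes, and text content, and how to find specific tags and extract links.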


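1. BeautifulSoup parses HTML, XML, and similar documents, which makes it easy to pull information out of web pages. Below, BeautifulSoup is applied to a short HTML document.

The HTML is as follows: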

html_doc = """
<html><head><title>The Dormouse's story</title></head>

<body>
<p class= "title"><b>The Dormouse's story</b></p>

<p class= "story">Once upon a time there were three little sisters;and their name were

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""
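from bs4 import BeautifulSoup

# Parse the document with Python's built-in HTML parser
soup = BeautifulSoup(html_doc, 'html.parser')

# Print the parse tree with one tag or string per line, indented by depth
print(soup.prettify())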

2. After running the code above, you can keep working with the resulting soup variable in the Python shell to inspect how html_doc was parsed:

================ RESTART: E:/python_program/Dormouse_story.py ================
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters;and their name were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
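# Calling a tag like a function is shorthand for find_all() on its descendants; <title> contains only text, so the result is empty:
>>> soup.title()
[]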
>>> soup.title
<title>The Dormouse's story</title>
>>> soup.title.name
'title'
>>> 
>>> soup.title.string
"The Dormouse's story"
>>> 
>>> soup.title.parent.name
'head'
>>> 
>>> soup.p
<p class="title"><b>The Dormouse's story</b></p>
>>> 

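# class is a reserved word in Python, so the next two attempts are syntax errors; the attribute name has to be quoted:
>>> soup.class
SyntaxError: invalid syntax
>>> soup.p[class]
SyntaxError: invalid syntax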

>>> soup.p['class']
['title']

>>> soup.p['class'][0]
'title'

>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
 
# Find all <a> tags in the document:
>>> soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]


>>> soup.find(id='link3')
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
>>> print(soup.get_text())

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters;and their name were

Elsie,
Lacieand
Tillie;
and they lived at the bottom of a well.
...
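
The two SyntaxError attempts in the session above can be avoided in a few ways. The following is a minimal sketch, assuming a shortened copy of the sample document; the variable names (p_classes, p_id, p_attrs, sisters) are made up for illustration. It shows dictionary-style access, the safer .get() lookup, the .attrs dictionary, and the class_ keyword that find_all() accepts because class itself is reserved.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample document (an assumption for this sketch)
html_doc = (
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
)
soup = BeautifulSoup(html_doc, "html.parser")

p = soup.p

# Dictionary-style access; class is multi-valued, so a list comes back
p_classes = p["class"]

# .get() returns None instead of raising KeyError when the attribute is absent
p_id = p.get("id")

# .attrs exposes every attribute of the tag as a dict
p_attrs = p.attrs

# find_all() takes class_ (with a trailing underscore) to filter by CSS class
sisters = soup.find_all("a", class_="sister")

print(p_classes, p_id, p_attrs, sisters)
```

Using p.get("id") rather than p["id"] is usually the safer choice when scraping real pages, since not every tag carries every attribute.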

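3. As the session shows, soup.tag_name returns the first matching tag in the document, complete with its markup and contents, while soup.tag_name.name returns just the name of that tag. The get_text() function strips all tags from the HTML document and returns only the text.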
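
The differences between these accessors can be seen side by side in a short script. This is only a sketch, assuming a trimmed version of the sample document; the variable names are illustrative, and the separator and strip arguments to get_text() are optional extras not used in the session above.

```python
from bs4 import BeautifulSoup

# Trimmed version of the sample document (an assumption for this sketch)
html_doc = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p class=\"title\"><b>The Dormouse's story</b></p></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

title_tag = soup.title          # the first <title> tag, as a Tag object
tag_name = title_tag.name       # just the tag's name
title_text = title_tag.string   # the single string inside the tag
missing = soup.table            # None, because no <table> exists

# Join the text fragments with a space and trim whitespace around each one
all_text = soup.get_text(separator=" ", strip=True)

print(title_tag)
print(tag_name)
print(title_text)
print(missing)
print(all_text)
```

Note that soup.tag_name returns None when the tag does not exist, so chained access such as soup.table.name would raise AttributeError.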

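4. <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a> turns the text Elsie into a hyperlink, with href holding the target URL. Similar definitions appear when you use "Inspect Element" on a page in a browser.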
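
Since each link's target lives in its href attribute, collecting every link in a document comes down to looping over find_all('a'). A minimal sketch, assuming just the three links from the sample document; the links list and the loop variables are illustrative names:

```python
from bs4 import BeautifulSoup

# Only the link paragraph from the sample document (an assumption for this sketch)
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Pair each link's visible text with its target URL
links = [(a.get_text(), a.get("href")) for a in soup.find_all("a")]

for text, url in links:
    print(text, url)
```

a.get("href") is used here instead of a["href"] so that an <a> tag without an href yields None rather than raising KeyError.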
