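Python Web Scraping: The BeautifulSoup Library

This post shows how to parse HTML documents with the BeautifulSoup library. It covers the basic concepts of Tag, NavigableString, and Comment, and the ways to traverse the HTML tag tree to extract the information you need.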


First, be clear about what BeautifulSoup is for: parsing HTML/XML pages.

Basic usage:

>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, "html.parser")  # parse with Python's built-in html.parser
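
If you want to follow along without a network connection, the same constructor accepts any HTML string. The sketch below parses a trimmed, hand-copied stand-in for the demo page; `DEMO_HTML` and `page_title` are names made up for this illustration, and the live page may differ from this copy.

```python
from bs4 import BeautifulSoup

# A trimmed stand-in for the demo page (an assumption for offline use;
# the real page at python123.io may contain more markup).
DEMO_HTML = """<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
</body></html>"""

# Same call as in the session above, just fed from a string literal.
soup = BeautifulSoup(DEMO_HTML, "html.parser")

page_title = soup.title.string
print(page_title)
```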

Tag

Tag: the most basic unit of information in a document, opened with <> and closed with </>.
Format: soup.<tag>

>>>soup.title
<title>This is a python demo page</title>
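
One thing worth knowing about soup.<tag>: it returns only the first matching tag, and None when there is no such tag. A minimal sketch on a made-up snippet (the `html` string is an assumption modelled on the demo page; `find_all` is the standard bs4 method for getting every match):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two <a> tags and no <table>.
html = '<p><a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a></p>'
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a      # only the first <a> in document order
missing = soup.table     # no <table> here, so this is None

# find_all returns every matching tag, not just the first one.
all_link_ids = [a["id"] for a in soup.find_all("a")]

print(first_link["id"], missing, all_link_ids)
```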

Tag name

Every Tag has a name.
Format: <tag>.name

>>> soup.a.name
'a'
>>> soup.a.parent.name  # name of the parent tag
'p'
>>> type(soup.a.name)
<class 'str'>          # the name is a plain string

Tag attrs (attributes)

Attributes: a tag's attributes, organized as a dictionary.
Format: <tag>.attrs

>>> tag=soup.a
>>> tag.attrs
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
>>> tag.attrs['class']   # look up the value for a key
['py1']
>>> type(tag.attrs)      # check the type of attrs
<class 'dict'>			# attrs is a dictionary
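
Because attrs is a dictionary, a tag can also be indexed directly, and a missing key behaves the way it does on a dict. The sketch below is only an illustration: the second class value `external` is invented to show that `class` comes back as a list.

```python
from bs4 import BeautifulSoup

# "external" is a hypothetical extra class added for this example.
html = ('<a class="py1 external" '
        'href="http://www.icourse163.org/course/BIT-268001" '
        'id="link1">Basic Python</a>')
soup = BeautifulSoup(html, "html.parser")
tag = soup.a

href = tag["href"]          # shorthand for tag.attrs["href"]
classes = tag["class"]      # multi-valued attribute, returned as a list
title = tag.get("title")    # missing attribute: None instead of KeyError
has_id = tag.has_attr("id")

print(href, classes, title, has_id)
```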

Tag NavigableString

NavigableString: the non-attribute string inside a tag, i.e. the text between <> and </>, which usually holds the content we want.
Format: <tag>.string

>>> soup.a.string
'Basic Python'
>>> soup.b.string
'The demo python introduces several python courses.'
>>> type(soup.a.string)
<class 'bs4.element.NavigableString'>   # the type is NavigableString
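
A caveat: .string only works when the tag has a single child. When a tag mixes text and child tags, .string is None, and `get_text()` is the usual way to collect all the text. A small sketch on a made-up paragraph (the `html` string is an assumption, not the demo page itself):

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph mixing plain text and two <a> tags.
html = ('<p class="course">Learn <a id="link1">Basic Python</a> and '
        '<a id="link2">Advanced Python</a>.</p>')
soup = BeautifulSoup(html, "html.parser")

link_text = soup.a.string       # the <a> has exactly one string child
para_string = soup.p.string     # the <p> has several children, so None
para_text = soup.p.get_text()   # concatenates every string under <p>

print(repr(link_text), para_string, repr(para_text))
```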

Tag Comment

Comment: the comment portion of a string inside a tag; it is a special subclass of NavigableString.
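
Since .string strips the <!-- --> markers, a comment and ordinary text look the same when printed; checking the type is how to tell them apart. A minimal sketch, assuming a two-tag snippet invented for this purpose:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical snippet: one tag holds a comment, the other plain text.
html = "<b><!--This is a comment--></b><p>This is not a comment</p>"
soup = BeautifulSoup(html, "html.parser")

b_string = soup.b.string   # a Comment; the <!-- --> markers are dropped
p_string = soup.p.string   # an ordinary NavigableString

b_is_comment = isinstance(b_string, Comment)
p_is_comment = isinstance(p_string, Comment)

print(b_string, type(b_string).__name__, b_is_comment)
print(p_string, type(p_string).__name__, p_is_comment)
```

When extracting text, filtering on `isinstance(s, Comment)` keeps commented-out content from leaking into the results.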


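Traversal

Downward traversal of the tag tree

.contents: a list of child nodes; stores all of <tag>'s direct children in a list
.children: an iterator over the child nodes; like .contents, used to loop over the direct children
.descendants: an iterator over all descendant nodes, used to loop over every node below the tag
Looping over these lets us collect the data block by block.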

>>> soup.head
<head><title>This is a python demo page</title></head>
>>> soup.head.contents
[<title>This is a python demo page</title>]
>>> soup.body.contents
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
>>> len(soup.body.contents)				# check the length
5
>>> soup.body.contents[1]
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>> for child in soup.body.children:
	print(child)



<p class="title"><b>The demo python introduces several python courses.</b></p>


<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>



>>> for child in soup.body.descendants:
	print(child)



<p class="title"><b>The demo python introduces several python courses.</b></p>
<b>The demo python introduces several python courses.</b>
The demo python introduces several python courses.


<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:


<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
Basic Python
 and 
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
Advanced Python
.
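
As the output above shows, the child and descendant lists include the '\n' strings between tags. If only the tags matter, one option is to filter with isinstance. This is a sketch on a shortened stand-in for the demo body (the `html` string is an assumption):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened, hypothetical version of the demo page's <body>.
html = (
    "<body>\n"
    '<p class="title"><b>The demo python introduces several python courses.</b></p>\n'
    '<p class="course">Courses: <a id="link1">Basic Python</a></p>\n'
    "</body>"
)
soup = BeautifulSoup(html, "html.parser")

# Counts the newline strings as well as the two <p> tags.
child_count = len(soup.body.contents)

# Keep only Tag objects, skipping NavigableString nodes such as '\n'.
child_tags = [c.name for c in soup.body.children if isinstance(c, Tag)]
descendant_tags = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

print(child_count, child_tags, descendant_tags)
```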

Upward traversal of the tag tree

.parent: the node's parent tag
.parents: an iterator over the node's ancestor tags, used to loop over the ancestors

>>> soup.title.parent
<head><title>This is a python demo page</title></head>
>>> soup.html.parent  				# the BeautifulSoup object, i.e. the whole document
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
>>> for parent in soup.a.parents:   # loop over the ancestors
	if parent is None:			# soup.parent is None
		print(parent) 			
	else:
		print(parent.name)

p
body
html
[document]
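
The same walk can be written as a list comprehension. In the sketch below (on a made-up one-link document) the last ancestor is the BeautifulSoup object, whose name is '[document]' and whose own parent is None, so the loop ends there.

```python
from bs4 import BeautifulSoup

# Hypothetical minimal document with a single link.
html = '<html><body><p><a id="link1">Basic Python</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")

# Names of every ancestor of the <a> tag, nearest first.
ancestor_names = [parent.name for parent in soup.a.parents]

# The BeautifulSoup object is the root of the tree and has no parent.
root_parent = soup.parent

print(ancestor_names, root_parent)
```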


Sideways (sibling) traversal of the tag tree

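.next_sibling: the next sibling node under the same parent, in document order
.previous_sibling: the previous sibling node under the same parent, in document order
.next_siblings: an iterator over all following sibling nodes
.previous_siblings: an iterator over all preceding sibling nodes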

>>> soup.a.next_sibling
' and '
>>> soup.a.next_sibling.next_sibling
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
for sibling in soup.a.next_siblings:
	print(sibling)					# loop over the following siblings
for sibling in soup.a.previous_siblings:
	print(sibling)					# loop over the preceding siblings
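
Note that siblings are nodes, not just tags: the text ' and ' between the two links is a sibling too. The sketch below, on a made-up paragraph, collects the siblings on both sides and uses bs4's `find_next_sibling` to jump straight to the next tag of a given name.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph modelled on the demo page's course list.
html = ('<p>Courses: <a id="link1">Basic Python</a> and '
        '<a id="link2">Advanced Python</a>.</p>')
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# Both iterators yield strings and tags alike, so convert to str to compare.
following = [str(s) for s in first.next_siblings]
preceding = [str(s) for s in first.previous_siblings]

# Skips over text nodes and returns the next sibling that is an <a> tag.
next_link = first.find_next_sibling("a")

print(following)
print(preceding)
print(next_link["id"])
```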

Pretty-printing: prettify()

Works on any tag.
Format: <tag>.prettify()

>>> print(soup.a.prettify())
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
 Basic Python
</a>
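
prettify() can also be called on the whole soup, and it returns an ordinary str that can be parsed again. A rough sketch on a made-up snippet; the exact indentation may vary between bs4 versions, so no expected output is shown here.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line snippet to pretty-print.
html = '<p><a id="link1">Basic Python</a></p>'
soup = BeautifulSoup(html, "html.parser")

# Returns a str with each tag and string on its own line.
pretty = soup.prettify()
print(pretty)

# The result is still valid markup, so it can be parsed again.
reparsed = BeautifulSoup(pretty, "html.parser")
link_text = reparsed.a.get_text(strip=True)
```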