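Purpose of the BeautifulSoup library: parsing HTML, XML, and similar documents
Basic usage:
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, "html.parser")  # "html.parser" selects the HTML parser built into Python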
Tag
Tag: the most basic unit of information in a document, opened with <...> and closed with </...>
Format: soup.<tag>
>>>soup.title
<title>This is a python demo page</title>
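Worth noting: soup.<tag> only returns the first matching tag, and it returns None when no such tag exists. A minimal sketch of that behaviour, using a short inline HTML string as a stand-in for the demo page (the string and the variable names are made up for illustration; find_all is a real BeautifulSoup method that these notes do not otherwise cover):

```python
from bs4 import BeautifulSoup

# Hypothetical inline document, so the example runs without a network request.
html = "<html><head><title>Demo</title></head><body><p>first</p><p>second</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p            # only the first <p> in document order
missing = soup.table        # there is no <table>, so this is None
all_p = soup.find_all("p")  # every <p> in the document, as a list

print(first_p)
print(missing)
print(len(all_p))
```

Checking for None before going further (for example before reading .name) avoids an AttributeError on pages that lack the tag.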
Tag name
Every Tag has a name
Format: <tag>.name
>>> soup.a.name
'a'
>>> soup.a.parent.name  # name of the parent tag
'p'
>>> type(soup.a.name)
<class 'str'>  # the name is a string
Tag attrs (attributes)
Attributes: the attributes of a tag, organized as a dictionary
Format: <tag>.attrs
>>> tag=soup.a
>>> tag.attrs
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
>>> tag.attrs['class']  # look up the value for a key
['py1']
>>> type(tag.attrs)  # check the type of attrs
<class 'dict'>  # attrs is a dictionary
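Since attrs is a dictionary, a missing key raises KeyError. A small sketch of the safer lookups, assuming the first link of the demo page copied into an inline string (tag[...], get and has_attr are real BeautifulSoup calls, but they are an addition to these notes):

```python
from bs4 import BeautifulSoup

# The first <a> of the demo page, inlined so the example is self-contained.
html = '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>'
tag = BeautifulSoup(html, "html.parser").a

href = tag["href"]           # shorthand for tag.attrs["href"]
classes = tag["class"]       # class can hold several values, so it comes back as a list
title = tag.get("title")     # missing attribute: None instead of a KeyError
has_id = tag.has_attr("id")  # True when the attribute is present

print(href, classes, title, has_id)
```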
Tag NavigableString
NavigableString: the non-attribute string inside a tag, i.e. the text between <...> and </...>; this is usually where the text we want lives
Format: <tag>.string
>>> soup.a.string
'Basic Python'
>>> soup.b.string
'The demo python introduces several python courses.'
>>> type(soup.a.string)
<class 'bs4.element.NavigableString'>  # the type is NavigableString
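One caveat: .string only gives a value when the tag has a single child; with mixed content it is None. A rough sketch, assuming a made-up one-line document (get_text is a real BeautifulSoup method that is not covered in these notes):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: <p> holds text, a nested tag, and more text.
html = "<p>Hello <b>world</b>!</p>"
soup = BeautifulSoup(html, "html.parser")

inner = soup.b.string      # <b> has exactly one string child
outer = soup.p.string      # <p> has three children, so .string is None
text = soup.p.get_text()   # concatenates every string below <p>

print(inner)
print(outer)
print(text)
```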
Tag Comment
Comment: the comment part of the string inside a tag; a special kind of NavigableString
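Because .string returns the comment text without the <!-- --> markers, a comment can be mistaken for ordinary text. A minimal sketch of telling the two apart by type, using a made-up two-tag document:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical document: <b> contains only a comment, <p> contains plain text.
html = "<b><!--This is a comment--></b><p>This is not a comment</p>"
soup = BeautifulSoup(html, "html.parser")

b_text = soup.b.string
p_text = soup.p.string

print(b_text, type(b_text))
print(p_text, type(p_text))

# Both print like plain strings, so check the type to filter comments out.
is_comment_b = isinstance(b_text, Comment)
is_comment_p = isinstance(p_text, Comment)
```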
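Traversal
Downward traversal of the tag tree
.contents: list of child nodes; stores all direct children of <tag> in a list
.children: iterator over child nodes; like .contents, used to loop over the direct children
.descendants: iterator over all descendant nodes, used to loop over the whole subtree
Looping over these lets us collect the nodes of a subtree one by one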
>>> soup.head
<head><title>This is a python demo page</title></head>
>>> soup.head.contents
[<title>This is a python demo page</title>]
>>> soup.body.contents
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
>>> len(soup.body.contents)  # number of child nodes, including the '\n' strings
5
>>> soup.body.contents[1]
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>> for child in soup.body.children:
...     print(child)
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
>>> for child in soup.body.descendants:
...     print(child)
<p class="title"><b>The demo python introduces several python courses.</b></p>
<b>The demo python introduces several python courses.</b>
The demo python introduces several python courses.
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
Basic Python
and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
Advanced Python
.
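The output above mixes tags with plain strings, including the '\n' between tags. A possible way to keep only the tags, sketched on a shortened copy of the demo body (the inline string and variable names are assumptions for illustration):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened stand-in for the demo page body, newlines included on purpose.
html = (
    "<body>\n"
    '<p class="title"><b>The demo python introduces several python courses.</b></p>\n'
    '<p class="course">Courses: <a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p>\n'
    "</body>"
)
soup = BeautifulSoup(html, "html.parser")

# Direct children only, skipping the newline strings.
child_tags = [c.name for c in soup.body.children if isinstance(c, Tag)]
# Every tag at any depth below <body>, in document order.
descendant_tags = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

print(len(soup.body.contents))
print(child_tags)
print(descendant_tags)
```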
Upward traversal of the tag tree
.parent: the parent tag of a node
.parents: iterator over a node's ancestor tags, used to loop over all ancestors
>>> soup.title.parent
<head><title>This is a python demo page</title></head>
>>> soup.html.parent  # the parent of <html> is the soup object itself, i.e. the whole document
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
>>> for parent in soup.a.parents:  # loop over all ancestors
...     if parent is None:  # defensive check; soup.parent is None
...         print(parent)
...     else:
...         print(parent.name)
p
body
html
[document]
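The ancestor names can also be collected into a list, for instance to build a path from the document root down to a tag. A small sketch on a reduced document (the inline string and the " > " separator are made up for illustration):

```python
from bs4 import BeautifulSoup

# Reduced stand-in for the demo page: one link nested inside <p>, <body>, <html>.
html = '<html><body><p><a id="link1">Basic Python</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")

# .parents walks from the nearest ancestor up to the soup object.
names = [parent.name for parent in soup.a.parents]
path = " > ".join(reversed(names))

print(names)
print(path)
print(soup.parent)  # the soup object has no parent
```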
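Sibling traversal of the tag tree
.next_sibling: the next sibling node in document order
.previous_sibling: the previous sibling node in document order
.next_siblings: iterator over all following sibling nodes
.previous_siblings: iterator over all preceding sibling nodes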
>>> soup.a.next_sibling
' and '
>>> soup.a.next_sibling.next_sibling
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
>>> for sibling in soup.a.next_siblings:  # loop over the following siblings
...     print(sibling)
>>> for sibling in soup.a.previous_siblings:  # loop over the preceding siblings
...     print(sibling)
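Siblings are nodes under the same parent, and they can be plain strings as well as tags (' and ' above is a string). A rough sketch of what the sibling iterators yield, using a shortened copy of the course paragraph (the inline string is an assumption; find_next_sibling is a real BeautifulSoup method not covered in these notes):

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the <p class="course"> paragraph of the demo page.
html = '<p>Courses: <a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>.</p>'
soup = BeautifulSoup(html, "html.parser")
first_link = soup.a

following = list(first_link.next_siblings)      # string, tag, string
preceding = list(first_link.previous_siblings)  # just the leading text
next_tag = first_link.find_next_sibling("a")    # skips strings, returns the next <a>

print(following)
print(preceding)
print(next_tag)
```

When only tags are of interest, find_next_sibling saves the type check that a loop over next_siblings would need.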
Pretty-printing with prettify()
Can be applied to a single tag as well
Format: <tag>.prettify()
>>> print(soup.a.prettify())
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
Basic Python
</a>
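prettify() works on the whole soup too, and it returns a new string rather than changing the tree. A minimal sketch comparing it with str(soup), using the title paragraph of the demo page as an inline string (variable names are made up for illustration):

```python
from bs4 import BeautifulSoup

# The title paragraph of the demo page, inlined so the example is self-contained.
html = '<p class="title"><b>The demo python introduces several python courses.</b></p>'
soup = BeautifulSoup(html, "html.parser")

pretty = soup.prettify()  # indented, with nodes spread over several lines
compact = str(soup)       # the tree serialized without added whitespace

print(pretty)
print(compact)
```

The added line breaks and indentation are meant for reading; for further processing it is probably safer to keep working with the soup object or with str(soup).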