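Basic Elements of the BeautifulSoup Library
1. Understanding the BeautifulSoup Library
BeautifulSoup is a library for parsing, traversing, and maintaining the "tag tree" of an HTML or XML document.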
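
To make "parse, traverse, maintain" concrete, here is a minimal sketch. The inline HTML string and the replacement values are made up for illustration and are not part of the demo page used later in this post.

```python
from bs4 import BeautifulSoup

# Hypothetical inline document, used only for illustration.
html = "<html><body><p class='intro'>Hello <b>world</b></p></body></html>"

# Parse: build the tag tree from the markup.
soup = BeautifulSoup(html, "html.parser")

# Traverse: find_all(True) matches every Tag, in document order.
tag_names = [tag.name for tag in soup.find_all(True)]
print(tag_names)

# Maintain: modify the tree in place.
soup.b.string = "BeautifulSoup"   # replace the text inside <b>
soup.p["class"] = "greeting"      # overwrite an attribute
print(soup.p)
```

The same `soup` object serves all three roles: it is the parsed tree, the entry point for navigation, and the thing you edit.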


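Importing the BeautifulSoup library
The BeautifulSoup library is also known as beautifulsoup4 (its package name on PyPI) or bs4 (the name used when importing it).
To use it in Python, add one of the following imports:
from bs4 import BeautifulSoup
# or
import bs4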
2. Parsers for the BeautifulSoup Library
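The parser is chosen by the second argument to `BeautifulSoup(...)`. The sketch below is an assumption-laden illustration rather than the content originally under this heading: it tries the parser names listed in the Beautiful Soup documentation (`html.parser`, `lxml`, `xml`, `html5lib`) and records which ones are usable on the current machine. Only `html.parser` ships with Python; the others need `lxml` or `html5lib` installed separately. The `markup` string is hypothetical.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical markup, used only for illustration.
markup = "<p>Parser demo</p>"

available = {}
for name in ("html.parser", "lxml", "xml", "html5lib"):
    try:
        soup = BeautifulSoup(markup, name)
    except FeatureNotFound:
        # Raised when the requested parser library is not installed.
        available[name] = False
        continue
    available[name] = True
    print(name, "->", soup.p.string)

print(available)
```

The rest of this post uses `html.parser`, which needs no extra installation.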

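3. Basic Elements of the BeautifulSoup Class
| Basic element | Description |
|---|---|
| Tag | A tag, the most basic unit of information organization; its start and end are marked with <> and </> |
| Name | The tag's name; the name of <p>…</p> is 'p'. Format: <tag>.name |
| Attributes | The tag's attributes, organized as a dictionary. Format: <tag>.attrs |
| NavigableString | The non-attribute string inside a tag, i.e. the string between <> and </>. Format: <tag>.string |
| Comment | The comment part of a string inside a tag; a special type that subclasses NavigableString |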
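
The table can be checked against a small example. This is a sketch on a made-up snippet (the HTML string, the `note` class, and the `p1` id are all hypothetical), and it also covers Comment, which the demo page does not contain.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical snippet, used only for illustration.
html = '<b><!--This is a comment--></b><p class="note" id="p1">This is not a comment</p>'
soup = BeautifulSoup(html, "html.parser")

p = soup.p
print(isinstance(p, Tag))     # Tag
print(p.name)                 # Name
print(p.attrs)                # Attributes, as a dict
print(type(p.string))         # NavigableString
print(type(soup.b.string))    # Comment

# Comment is a subclass of NavigableString.
print(isinstance(soup.b.string, NavigableString))
```

Note that `.string` returns the comment text without the `<!--` and `-->` markers, so a comment and an ordinary string look the same when printed. Checking the type is the way to tell them apart.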

4. Retrieving Basic HTML Elements with Code
We again use the demo.html page, at https://python123.io/ws/demo.html
from bs4 import BeautifulSoup
import requests
url = 'https://python123.io/ws/demo.html'
r = requests.get(url)
r.raise_for_status() # stop here if the request failed
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')
print(soup)
print(soup.title) # the title tag: <title>This is a python demo page</title>
tag = soup.a # the first a tag: <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
print(tag.name) # name of the a tag: a
print(tag.parent.name) # name of its parent, i.e. the enclosing tag: p
print(tag.parent.parent.name) # name one level further up: body
# Get the tag's attribute information
print(tag.attrs) # returns a dict: {'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
print(tag.attrs['class']) # index it like a dict; class can hold several values, so a list is returned: ['py1']
print(tag.attrs['href']) # http://www.icourse163.org/course/BIT-268001
print(type(tag.attrs)) # type of the attributes: <class 'dict'>
print(type(tag)) # type of the tag: <class 'bs4.element.Tag'>
print(soup.p.string) # The demo python introduces several python courses.
print(type(soup.p.string)) # <class 'bs4.element.NavigableString'>
Output:
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
<title>This is a python demo page</title>
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
a
p
body
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
['py1']
http://www.icourse163.org/course/BIT-268001
<class 'dict'>
<class 'bs4.element.Tag'>
The demo python introduces several python courses.
<class 'bs4.element.NavigableString'>
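
As a follow-up, here is an offline sketch that works on the same HTML shown in the output above, pasted in as a string so no network request is needed. It illustrates three details the script above does not touch: `find_all` to collect every `a` tag, `.get()` for attributes that may be missing, and the fact that `.string` is `None` when a tag has more than one child. The attribute name `target` is just an example of an attribute this page does not define.

```python
from bs4 import BeautifulSoup

# The demo page content, copied from the output above.
demo = """<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>"""

soup = BeautifulSoup(demo, "html.parser")

# Collect (id, href) for every <a> tag, not just the first one.
links = [(a.get("id"), a.get("href")) for a in soup.find_all("a")]
print(links)

# .get() returns None for a missing attribute, where attrs['target'] would raise KeyError.
missing = soup.a.get("target")
print(missing)

# The second <p> mixes text and <a> tags, so .string is None; get_text() joins all the text.
course = soup.find("p", class_="course")
print(course.string)
print(course.get_text())
```

This also explains why `soup.p.string` worked in the script above: the first `p` has a single `b` child, which in turn has a single string, so `.string` can follow that chain unambiguously.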