BeautifulSoup - CSDN Blog

Article link: https://blog.youkuaiyun.com/weixin_43670105/article/details/88928472

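BeautifulSoup: "Beautiful Soup, so rich and green"
A flexible and convenient web page parsing library that is efficient and supports several parsers.
With it, you can extract information from web pages without writing regular expressions.
Let's start with a sample page: https://python123.io/ws/demo.html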

Put simply, BeautifulSoup is a library for parsing, traversing, and maintaining a "tag tree" (for example, data in HTML or XML format).

import requests
from bs4 import BeautifulSoup

r = requests.get("https://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")    # the parser is html.parser
print(soup.prettify())

Output:

<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

As the output shows, the page was parsed correctly.

Beautiful Soup parsers
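The parser is chosen by the second argument to BeautifulSoup. As a minimal sketch, the example below parses a made-up inline HTML string (an assumption for illustration, not the demo page) with html.parser, which needs nothing beyond Python's standard library. Other parser names such as lxml and html5lib require separate installation and are not exercised here.

```python
from bs4 import BeautifulSoup

# Made-up HTML snippet used only for illustration.
html = '<p class="title"><b>Hello, soup</b></p>'

# The second argument names the parser; "html.parser" is backed by the standard library.
soup = BeautifulSoup(html, "html.parser")

print(soup.p.b.string)   # Hello, soup
```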

Beautiful Soup basic elements

Tag
Name
Attributes
NavigableString
Comment
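The five elements listed above can be seen side by side in a small sketch. The HTML string below is made up for illustration; the classes Tag, NavigableString, and Comment are imported from bs4.element.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Made-up HTML: one tag holding a comment, one tag holding plain text.
html = '<b><!--This is a comment--></b><p class="title">This is not a comment</p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p
print(isinstance(tag, Tag))                      # True: <p> is a Tag
print(tag.name)                                  # p
print(tag.attrs)                                 # {'class': ['title']}
print(isinstance(tag.string, NavigableString))   # True: text inside a tag
print(isinstance(soup.b.string, Comment))        # True: the comment node
```

Comment is a subclass of NavigableString, so checking the type is the reliable way to tell a comment from ordinary text, since printing either one shows only the text.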


print(soup.title)
print(soup.a)    # when the document contains several <a> tags, only the first one is returned

<title>This is a python demo page</title>
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

# get the tag name
print(soup.a.name)
print(soup.a.parent.name)    # name of a's parent tag
print(soup.a.parent.parent.name)  # name of a's grandparent tag
print(soup.a.parent.parent.parent.name)  # name of a's great-grandparent tag

a
p
body
html

# tag attributes
tag = soup.a
print(tag.attrs)   # prints a dict
print(tag.attrs['class'])

{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
['py1']
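Besides going through tag.attrs, a tag can be indexed like a dict. The sketch below uses a made-up <a> tag (the URL and class names are assumptions for illustration) and shows that class comes back as a list because it is a multi-valued attribute, while get() returns None for an attribute that is absent.

```python
from bs4 import BeautifulSoup

# Made-up tag for illustration; not taken from the demo page.
html = '<a class="py1 intro" href="http://example.com/course" id="link1">Basic Python</a>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.a

print(tag["href"])        # http://example.com/course
print(tag["class"])       # ['py1', 'intro']
print(tag.get("title"))   # None, because the tag has no title attribute
print("id" in tag.attrs)  # True
```

Indexing with a missing key, such as tag["title"], raises KeyError, so get() is the safer choice when an attribute may not exist.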

print(soup.a.string)
print(soup.title.string)

Basic Python
This is a python demo page
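The .string attribute works in the calls above because each tag holds exactly one piece of text. A minimal sketch with a made-up snippet shows what happens when a tag has several children: .string is None, and get_text() can be used to collect all of the text instead.

```python
from bs4 import BeautifulSoup

# Made-up HTML: <p> contains text, a tag, and more text.
html = '<p>Learn <a href="#">Basic Python</a> today</p>'
soup = BeautifulSoup(html, "html.parser")

print(soup.a.string)      # Basic Python
print(soup.p.string)      # None, because <p> has more than one child
print(soup.p.get_text())  # Learn Basic Python today
```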


print(soup.head.contents)   # child nodes of head
print(soup.body.contents)

[<title>This is a python demo page</title>]
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
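Note that .contents includes text nodes such as '\n' alongside tags. As a sketch of downward traversal on a made-up snippet, the example below compares .contents (a list of direct children), .children (an iterator over the same nodes), and .descendants (every node beneath the tag, at any depth).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up HTML without whitespace between tags, so no '\n' text nodes appear.
html = "<body><p><b>one</b></p><p>two</p></body>"
soup = BeautifulSoup(html, "html.parser")
body = soup.body

# Direct children only, as a list.
print(len(body.contents))   # 2

# Same nodes as .contents, but produced lazily by an iterator.
child_names = [child.name for child in body.children]
print(child_names)          # ['p', 'p']

# All nested nodes; keep only tags, skipping text nodes.
descendant_names = [node.name for node in body.descendants if isinstance(node, Tag)]
print(descendant_names)     # ['p', 'b', 'p']
```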


for parent in soup.a.parents:
    # .parents never yields None; the last item is the BeautifulSoup object, whose name is "[document]"
    print(parent.name)

p
body
html
[document]

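When iterating over all ancestors of a tag, the iteration eventually reaches the soup object itself, whose name is [document]. The soup object has no parent (soup.parent is None), so the iteration stops there.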
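The end of the ancestor chain can be checked directly. This is a minimal sketch on a made-up snippet, not the demo page.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration.
html = '<html><body><p><a href="#">link</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")

# Collect the name of every ancestor of the first <a> tag.
names = [parent.name for parent in soup.a.parents]
print(names)         # ['p', 'body', 'html', '[document]']

# The BeautifulSoup object is the root of the tree and has no parent.
print(soup.name)     # [document]
print(soup.parent)   # None
```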

Note: sibling (parallel) traversal only moves among nodes that share the same parent.
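A minimal sketch of sibling traversal, using a made-up paragraph modeled loosely on the demo page. Siblings include text nodes, so the node right after a tag is often a string rather than another tag; find_next_sibling() can be used to skip ahead to the next tag with a given name.

```python
from bs4 import BeautifulSoup

# Made-up HTML: two <a> tags and three text nodes under the same <p>.
html = '<p>Start <a id="link1">Basic</a> and <a id="link2">Advanced</a>.</p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

print(repr(first.previous_sibling))   # 'Start '
print(repr(first.next_sibling))       # ' and '

# Every node that follows the first <a> under the same parent.
following = list(first.next_siblings)
print(len(following))                 # 3: ' and ', the second <a>, and '.'

# Jump straight to the next sibling that is an <a> tag.
second = first.find_next_sibling("a")
print(second["id"])                   # link2
```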
