Python Web Scraping Study Notes, No.02

The Beautiful Soup Library

Quick exercise: fetching a page's source code

Beautiful Soup is a library for parsing, traversing, and maintaining a "tag tree".

import requests
r = requests.get("http://python123.io/ws/demo.html")
r.text
# Output: =========
demo = r.text
from bs4 import BeautifulSoup  # the bs4 library
soup = BeautifulSoup(demo, "html.parser")
# Output: =====
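
The fetch-then-parse steps above can be wrapped a little more defensively. The following is only a sketch: the helper names `fetch_html` and `page_title` and the 10-second timeout are assumptions made for illustration, not part of the original notes.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Return the decoded body of `url`, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    # Guess the encoding from the body rather than relying on headers alone.
    response.encoding = response.apparent_encoding
    return response.text


def page_title(html):
    """Parse `html` and return the text of its <title>, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.string if soup.title is not None else None
```

With network access, `page_title(fetch_html("http://python123.io/ws/demo.html"))` would fetch the demo page used in these notes and return its title text.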

How to use the BS library:

HTML <==> tag tree <==> BeautifulSoup class

from bs4 import BeautifulSoup  # BeautifulSoup is a class, imported from the bs4 package

soup = BeautifulSoup('<p>data</p>', 'html.parser')  # html.parser is the parser
soup2 = BeautifulSoup(open("D://demo.html"), 'html.parser')  # parse an HTML file on disk
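
As a minimal sketch of the two construction styles above, the example below parses the same markup once from a string and once from a file handle. The temporary directory and the file name `demo.html` are assumptions chosen so the example runs anywhere; the `with` block also closes the file, which a bare `open(...)` call does not do.

```python
import os
import tempfile

from bs4 import BeautifulSoup

html = "<html><body><p>data</p></body></html>"

# Parse directly from a string.
soup_from_string = BeautifulSoup(html, "html.parser")

# Parse from an open file handle; the with block closes the file afterwards.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.html")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    with open(path, encoding="utf-8") as f:
        soup_from_file = BeautifulSoup(f, "html.parser")

print(soup_from_string.p.string)  # data
print(soup_from_file.p.string)    # data
```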

Basic elements of the BeautifulSoup class

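Basic element: Description
Tag: a tag, the most basic unit for organizing information; it opens with <> and closes with </>
Name: the tag's name; the name of <p>...</p> is 'p'; accessed as <tag>.name
Attributes: the tag's attributes, organized as a dictionary; accessed as <tag>.attrs
NavigableString: the non-attribute string inside a tag, i.e. the text between <> and </>; accessed as <tag>.string
Comment: the comment portion of the string inside a tag; a special Comment type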
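
The elements listed above can be inspected directly. This is a small sketch on made-up markup (the `html` string is an assumption for illustration, not the demo page). Note that `class` is a multi-valued attribute, so its value comes back as a list.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical markup: one tag with text, one tag holding only a comment.
html = '<p class="title"><b>Hello</b></p><p><!-- a note --></p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p                         # Tag: the first <p> in the document
print(tag.name)                      # Name -> p
print(tag.attrs)                     # Attributes -> {'class': ['title']}
print(tag.b.string)                  # NavigableString -> Hello
print(type(tag.b.string).__name__)   # NavigableString

comment = soup.find_all("p")[1].string
print(type(comment).__name__)                # Comment
print(isinstance(comment, NavigableString))  # True
```

Because a Comment is also a NavigableString, `.string` alone does not tell text and comments apart; checking the type with `isinstance(value, Comment)` does.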

Displaying content more readably: prettify()

It is a very useful aid when debugging code that uses the bs4 library.

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, "html.parser")
>>> soup.prettify()  # adds a \n newline after each tag
Output: '<html>\n <head>\n  <title>\n   This is a python demo page\n  </title>\n </head>\n <body>\n  <p class="title">\n   <b>\n    The demo python introduces several python courses.\n   </b>\n  </p>\n  <p class="course">\n   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\n   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">\n    Basic Python\n   </a>\n   and\n   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">\n    Advanced Python\n   </a>\n   .\n  </p>\n </body>\n</html>'
>>> print(soup.prettify())  # print it
Output: <html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

prettify() can also be applied to an individual tag

>>> print(soup.a.prettify())
Output: <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
 Basic Python
</a>
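
To see what prettify() changes, it can help to compare it with the compact form produced by str(). The snippet below is a sketch on a shortened, hand-written fragment (the `html` string is an assumption modeled on the demo page, not its actual source).

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on the demo page's first link.
html = (
    '<p class="course">See '
    '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">'
    "Basic Python</a>.</p>"
)
soup = BeautifulSoup(html, "html.parser")

compact = str(soup.a)        # the tag on a single line, as in the source
pretty = soup.a.prettify()   # the same tag, one node per line, indented

print(compact)
print(pretty)
```

Both values are ordinary strings, so prettify() is meant for reading and debugging; the tag tree itself is left unchanged.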