First, we need a file named test_baidu.html; its contents are listed at the end of this post.
- Tag: a tag together with its contents (only the first matching tag is returned)
Run the following program:
from bs4 import BeautifulSoup
file = open('test_baidu.html','rb')
html = file.read()
bs = BeautifulSoup(html, "html.parser")
print(bs.title)
print(type(bs.title))
The output is:
<title>百度一下,你就知道</title>
<class 'bs4.element.Tag'>
This returns the tag together with its contents. To get only the contents, use:
print(bs.title.string)
This is the NavigableString type, covered next.
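Before moving on, here is a minimal sketch of the "first match only" behaviour of attribute-style access such as bs.title. The inline HTML is a hypothetical stand-in for test_baidu.html, chosen only to keep the example self-contained.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for test_baidu.html.
html = '<a href="/first">one</a><a href="/second">two</a>'
bs = BeautifulSoup(html, "html.parser")

first = bs.a                   # attribute access returns only the first <a>
same = bs.find("a")            # find() returns that same first match
all_links = bs.find_all("a")   # find_all() returns every match as a list

print(first.name, first["href"])
print(len(all_links))
```

If you need every matching tag rather than the first one, find_all is the method to reach for; it is covered further down.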
- NavigableString: the string inside a tag, accessed as bs.title.string
from bs4 import BeautifulSoup
file = open('test_baidu.html','rb')
html = file.read()
bs = BeautifulSoup(html, "html.parser")
print(bs.title.string)
print(type(bs.title.string))
Output:
百度一下,你就知道
<class 'bs4.element.NavigableString'>
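One caveat worth sketching: .string only works when a tag has a single child. The example below uses hypothetical inline HTML (not test_baidu.html) to show that .string is None for a tag with mixed children, while get_text() joins all nested strings.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML, used so the example is self-contained.
html = "<p>Hello</p><div><b>bold</b> and plain</div>"
bs = BeautifulSoup(html, "html.parser")

single = bs.p.string         # <p> has exactly one string child
mixed = bs.div.string        # <div> has two children, so .string is None
joined = bs.div.get_text()   # get_text() concatenates all nested strings

print(repr(single))
print(repr(mixed))
print(repr(joined))
```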
The following retrieves a tag's attributes as a dictionary via bs.link.attrs:
from bs4 import BeautifulSoup
file = open('test_baidu.html','rb')
html = file.read()
bs = BeautifulSoup(html, "html.parser")
print(bs.link.attrs)
Output:
{'rel': ['shortcut', 'icon'], 'href': '/favicon.ico', 'type': 'image/x-icon'}
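Individual attributes can also be read directly from the tag. The sketch below reuses the first link tag from test_baidu.html as an inline string; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# The first <link> tag from test_baidu.html, inlined for a self-contained example.
html = '<link rel="shortcut icon" href="/favicon.ico" type="image/x-icon"/>'
bs = BeautifulSoup(html, "html.parser")
link = bs.link

href = link["href"]                # dictionary-style access; raises KeyError if absent
rel = link["rel"]                  # rel is multi-valued, so it comes back as a list
missing = link.get("title")        # .get() returns None instead of raising
has_type = link.has_attr("type")   # True when the attribute is present

print(href, rel, missing, has_type)
```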
What if you want the whole document? That is what the next type is for.
- The bs4.BeautifulSoup type
from bs4 import BeautifulSoup
file = open('test_baidu.html','rb')
html = file.read()
bs = BeautifulSoup(html, "html.parser")
print(type(bs))
print(bs.name)
print(bs.attrs)
Output:
<class 'bs4.BeautifulSoup'>
[document]
{}
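As a rough illustration of why the document object can be treated much like a tag: BeautifulSoup is a subclass of Tag, and converting it to a string serializes the whole document. The inline HTML here is a hypothetical placeholder.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical one-tag document, used so the example is self-contained.
html = "<p>hi</p>"
bs = BeautifulSoup(html, "html.parser")

is_tag = isinstance(bs, Tag)   # the BeautifulSoup class subclasses Tag
doc_name = bs.name             # the special name of the document object
markup = str(bs)               # serialize the whole document back to a string

print(is_tag, doc_name, markup)
```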
- The Comment type
from bs4 import BeautifulSoup
file = open('test_baidu.html','rb')
html = file.read()
bs = BeautifulSoup(html, "html.parser")
print(bs.a)
print(bs.a.string)
print(type(bs.a.string))
Output:
<a class="mnav c-font-normal c-color-t" href="http://news.baidu.com" target="_blank"><!--新闻--></a>
新闻
<class 'bs4.element.Comment'>
As you can see, the comment delimiters are stripped by default and only the comment text is shown.
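Because a comment prints just like ordinary text, it can be useful to check the type before using the string. The sketch below is one possible way to do that; visible_string is a hypothetical helper name, and the inline HTML is a simplified stand-in for the two a tags in test_baidu.html.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Simplified stand-in: one link holding a comment, one holding real text.
html = '<a href="#"><!--news--></a><a href="#">news</a>'
bs = BeautifulSoup(html, "html.parser")

def visible_string(tag):
    """Return tag.string as str, or None if it is missing or a comment."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)

results = [visible_string(a) for a in bs.find_all("a")]
print(results)
```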
Application: traversing the document tree
The following prints all child nodes of head as a list:
from bs4 import BeautifulSoup
file = open('test_baidu.html','rb')
html = file.read()
bs = BeautifulSoup(html, "html.parser")
print(bs.head.contents)
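Besides .contents, a tag also offers .children and .descendants for traversal. The sketch below uses a hypothetical, whitespace-free head so the counts are easy to follow; in a real file such as test_baidu.html, the line breaks between tags also appear as string children.

```python
from bs4 import BeautifulSoup

# Hypothetical head without whitespace between tags.
html = "<head><title>Demo</title><meta charset='utf-8'/></head>"
bs = BeautifulSoup(html, "html.parser")

direct = bs.head.contents                                 # list of direct children
child_names = [child.name for child in bs.head.children]  # iterator over the same children
descendant_count = len(list(bs.head.descendants))         # children, grandchildren, and so on

print(direct)
print(child_names)
print(descendant_count)
```

Here .descendants also yields the string inside title, which is why it reports one more node than .contents.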
Next, several ways to search the document:
- The find_all method
from bs4 import BeautifulSoup
file = open('test_baidu.html','rb')
html = file.read()
bs = BeautifulSoup(html, "html.parser")
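# Searching the document
# (1) find_all: collect every a tag into a list
# String filter: only tags whose name matches the string exactly are returned;
# for example, the call below will not find span tags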
t_list = bs.find_all('a')
#print(t_list)
# (2) Regular expression: tag names are matched using the pattern's search() method
import re
t_list = bs.find_all(re.compile('a'))  # finds every tag whose name contains "a"
#print(t_list)
# (3) Function filter: pass in a function and search according to its criterion
def class_is_exist(tag):
    return tag.has_attr('class')
t_list = bs.find_all(class_is_exist)
print(t_list)
# (4) kwargs: specify attribute values directly as keyword arguments
t_list = bs.find_all(target="_blank")
print(t_list)
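# (5) The text parameter: search the strings inside tags (newer Beautiful Soup versions name this parameter string)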
t_list = bs.find_all(text='百度一下,你就知道')
print(t_list)
t_list = bs.find_all(text=['新闻', '百度一下,你就知道'])
print(t_list)
# Use a regular expression to find strings that contain specific text (here, any digit)
t_list = bs.find_all(text=re.compile(r"\d"))
print(t_list)
# (6) The limit parameter: return at most this many results
t_list = bs.find_all(text=re.compile(r"\d"), limit=2)
print(t_list)
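The filters above can be tried side by side on a small document. This is only a sketch: the inline HTML and variable names are hypothetical, and it uses string= (the newer name for text=) and class_ (needed because class is a Python keyword).

```python
import re
from bs4 import BeautifulSoup

# Hypothetical inline HTML, small enough to predict every result.
html = """
<a class="nav" href="/news" target="_blank">news</a>
<a class="nav" href="/map">map 123</a>
<span class="nav">span 456</span>
"""
bs = BeautifulSoup(html, "html.parser")

by_name = bs.find_all("a")                  # exact tag name
by_regex = bs.find_all(re.compile("^sp"))   # tag names starting with "sp"
by_attr = bs.find_all(target="_blank")      # keyword argument as attribute filter
by_class = bs.find_all(class_="nav")        # class_ filters on the class attribute
# Function filter: tags that have href but no target attribute.
by_func = bs.find_all(lambda t: t.has_attr("href") and not t.has_attr("target"))
# String filter with a regex, capped at one result by limit.
digits = bs.find_all(string=re.compile(r"\d"), limit=1)

print(len(by_name), len(by_regex), len(by_attr), len(by_class), len(by_func))
print(digits)
```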
- CSS selectors
from bs4 import BeautifulSoup
file = open('test_baidu.html', 'rb')
html = file.read()
bs = BeautifulSoup(html, "html.parser")
t_list = bs.select('title')
print(t_list)
The output is shown below; note that a list is returned here as well:
[<title>百度一下,你就知道</title>]
# Find by class name
t_list = bs.select('.show-weather')  # in CSS, a leading . denotes a class
print(t_list)
# Find by id
t_list = bs.select('#s_mod_weather')  # in CSS, a leading # denotes an id
print(t_list)
# Find by attribute
t_list = bs.select("div[class = 'show-weather']")  # div tags whose class attribute is 'show-weather'
print(t_list)
# Find by parent-child relationship
t_list = bs.select('div > div')  # div tags that are direct children of a div
print(t_list)
# Find by sibling relationship
t_list = bs.select('.city ~ .weather-mod-link')
print(t_list)
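To make the selector results easier to predict, here is a minimal sketch of the same selector styles on a tiny hypothetical document (the ids and class names below are made up for illustration). It also shows select_one, which returns the first match instead of a list.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML with one nested div, a span and a link.
html = """
<div id="weather" class="box">
  <div class="show-weather"><span class="city">Beijing</span><a class="link" href="#">more</a></div>
</div>
"""
bs = BeautifulSoup(html, "html.parser")

by_class = bs.select(".show-weather")              # class selector
by_id = bs.select("#weather")                      # id selector
by_attr = bs.select('div[class="show-weather"]')   # attribute selector
children = bs.select("div > div")                  # direct child of a div
siblings = bs.select(".city ~ .link")              # later sibling of .city
first = bs.select_one("span.city")                 # first match only, or None

print(len(by_class), len(by_id), len(by_attr), len(children), len(siblings))
print(first.get_text())
```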
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<meta content="always" name="referrer">
<meta name="theme-color" content="#2932e1">
<meta name="description" content="全球最大的中文搜索引擎、致力于让网民更便捷地获取信息,找到所求。百度超过千亿的中文网页数据库,可以瞬间找到相关的搜索结果。">
<link rel="shortcut icon" href="/favicon.ico" type="image/x-icon"/>
<link rel="search" type="application/opensearchdescription+xml" href="/content-search.xml" title="百度搜索"/>
<link rel="icon" sizes="any" mask href="//www.baidu.com/img/baidu_85beaf5496f291521eb75ba38eacbd87.svg">
<link rel="dns-prefetch" href="//dss0.bdstatic.com"/>
<link rel="dns-prefetch" href="//dss1.bdstatic.com"/>
<link rel="dns-prefetch" href="//ss1.bdstatic.com"/>
<link rel="dns-prefetch" href="//sp0.baidu.com"/>
<link rel="dns-prefetch" href="//sp1.baidu.com"/>
<link rel="dns-prefetch" href="//sp2.baidu.com"/>
<title>百度一下,你就知道</title>
<a href="http://news.baidu.com" target="_blank" class="mnav c-font-normal c-color-t"><!--新闻--></a>
<a href="http://news.baidu.com" target="_blank" class="mnav c-font-normal c-color-t">新闻</a>
<a href="http://news.baidu.com" target="_blank" class="mnav c-font-normal c-color-t">山西123</a>
<a href="http://news.baidu.com" target="_blank" class="mnav c-font-normal c-color-t">长治456</a>
<a href="http://news.baidu.com" target="_blank" class="mnav c-font-normal c-color-t">789</a>
<a class="city" href="//www.baidu.com/s?tn=baidutop10&rsv_idx=2&wd=%E5%A4%A9%E6%B0%94%E9%A2%84%E6%8A%A5"
target="_blank" class="weather-mod-link"></a>
<a class="country" href="//www.baidu.com/s?tn=baidutop10&rsv_idx=2&wd=%E5%A4%A9%E6%B0%94%E9%A2%84%E6%8A%A5"
target="_blank" class="weather-mod-link"></a>
<a class="shanxi-wather" href="//www.baidu.com/s?tn=baidutop10&rsv_idx=2&wd=%E5%A4%A9%E6%B0%94%E9%A2%84%E6%8A%A5"
target="_blank" class="weather-mod-link"> </a>
<div id="s_mod_weather" class="s-mod-weather s-isindex-wrap hide-weater">
<div class="weather-mod"><a class="city-wather"
href="//www.baidu.com/s?tn=baidutop10&rsv_idx=2&wd=%E5%A4%A9%E6%B0%94%E9%A2%84%E6%8A%A5"
target="_blank" class="weather-mod-link">
<div class="show-weather"><span class="show-city"><span class="show-city-name c-font-normal c-color-t"
data-key=""></span></span><span
class="show-icon"><img class="weather-icon"
src="https://dss1.bdstatic.com/5aV1bjqh_Q23odCf/static/superman/img/weather/icons/.png"
;/>