Python Web Scraping, Part 3: BeautifulSoup

1. BeautifulSoup (bs4)

1. Getting nodes:

Sample code:

html="""
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

html is the target string and 'lxml' is the parser (any other supported parser can be swapped in):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.title.string)  # get the text inside the tag
print(soup.head)          # get the tag itself
print(soup.p)             # when several identical tags exist, only the first one is matched
print(soup.title.name)    # get the tag name

Output:

The Dormouse's story

<head><title>The Dormouse's story</title></head>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

title

2. Getting attributes (the HTML content is the same as above):

The attrs property returns all of a tag's attributes as a dictionary.

Whether an individual value comes back as a list or a string depends on the attribute itself:

class, for example, can hold multiple values, so it is returned as a list, while name is returned as a plain string.

from bs4 import BeautifulSoup

# html is the same Dormouse's story fragment used in section 1
soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs)            # all attributes of the first <p>, as a dict
print(soup.p.attrs['name'])
print(soup.p.attrs['class'])
print(soup.p['class'])         # indexing the tag is a shortcut for attrs[...]
print(soup.p['name'])

Output:

{'class': ['title'], 'name': 'dromouse'}

dromouse

['title']

['title']

dromouse
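
A related convenience, shown here as a minimal sketch that is not in the original: tag.get() does the same lookup as indexing but returns None (or a supplied default) instead of raising a KeyError when the attribute is missing.

print(soup.p.get('name'))            # dromouse
print(soup.p.get('missing'))         # None -- no KeyError for absent attributes
print(soup.p.get('missing', 'n/a'))  # 'n/a' -- optional default value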

3. Nested calls

1. Nested calls

Sample HTML:

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""

You can reach an element's direct children or its full set of descendants.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)  # list of the first <p>'s direct children
print(soup.p.children)  # iterator over the same direct children

Output:

The first print shows a list of every direct child of the first <p> tag (tags and text nodes alike).

The second print only shows an iterator object such as <list_iterator object at 0x...>; it has to be iterated to reach the same children.

2. contents vs. children

Both give you the direct children: contents returns them as a plain list, while children returns a list iterator that has to be traversed to get at the individual items.

for i,j in enumerate(soup.p.children):
    print(i,j)

Output: the index of each direct child together with the child itself. Note that the text and whitespace between tags appear as children too, not just the three <a> tags.

0 
            Once upon a time there were three little sisters; and their names were
            
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4  
            and
            
5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6 
            and they lived at the bottom of a well.
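
If only the element children are wanted, without the whitespace/text nodes, a small sketch (not in the original) is to filter on .name, which is None for text nodes:

# keep only Tag children; text-node children have name == None
tag_children = [child for child in soup.p.children if child.name is not None]
print(tag_children)  # just the three <a> tags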

3. descendants

Gets the set of all descendants: children, grandchildren, and so on, including the text inside nested tags.

for i,j in enumerate(soup.p.descendants):
    print(i,j)

Output: the index and object of each of these descendants is printed.

0 
            Once upon a time there were three little sisters; and their names were
            
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <span>Elsie</span>
4 Elsie
5 

6 

7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
8 Lacie
9  
            and
            
10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
11 Tillie
12 
            and they lived at the bottom of a well.
        


4. Getting the parent node

Getting the parent of the <a> node: parent returns the single direct parent, while parents is a generator over all of its ancestors.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)   # the direct parent tag
print(soup.a.parents)  # a generator object; iterate it to see every ancestor

Iterating over all ancestor nodes:

for i,j in enumerate(soup.a.parents):
    print(i,j)
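
To keep the output readable, a minimal sketch (same HTML as above) prints only each ancestor's tag name; for the first <a> this gives p, body, html and finally the [document] object that wraps the whole soup:

for i, parent in enumerate(soup.a.parents):
    print(i, parent.name)  # p, body, html, [document]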

5. Getting sibling nodes

1. Get the next sibling of the <a> tag

print(soup.a.next_sibling)

2. Get the previous sibling of the <a> tag

print(soup.a.previous_sibling)

3. Get all following siblings

print(soup.a.next_siblings)
for i,j in enumerate(soup.a.next_siblings):
    print(i,j)

4. Get all preceding siblings

print(soup.a.previous_siblings)
for i,j in enumerate(soup.a.previous_siblings):
    print(i,j)
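
Note that next_siblings and previous_siblings are generators, and that the whitespace between tags counts as text-node siblings. A small sketch (same HTML as above) makes this visible:

siblings = list(soup.a.next_siblings)  # materialize the generator
for s in siblings:
    print(repr(s))                     # text nodes show up as plain strings such as '\n'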

6. Method selectors

Note: the find_all() examples in this section and the next use the <ul>/<li> HTML fragment shown in section 8.

print(soup.find_all(name='ul'))
print(soup.find_all(name='ul')[0])
  • Finds every <ul> tag in the document.
  • Finds all <ul> tags, then accesses the first element of the returned list.
for ul in soup.find_all(name='ul'):
    for li in ul.find_all(name='li'):
        print(li.string)

This loops over every <ul> tag, finds all the <li> tags inside each <ul>, and prints the text content of each <li>.
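
Because the Dormouse HTML has no <ul> tags, here is a self-contained sketch using the <ul>/<li> fragment from section 8 (an assumption about the HTML these examples were written against):

from bs4 import BeautifulSoup

html = '''
<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
</ul>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(name='ul'))      # every <ul> tag, as a list
print(soup.find_all(name='ul')[0])   # the first <ul> tag
for ul in soup.find_all(name='ul'):
    for li in ul.find_all(name='li'):
        print(li.string)             # Foo, Bar, Jay, Foo, Bar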

7. Attribute selectors

Get the elements whose id attribute is list-1; a list is returned.

print(soup.find_all(attrs={"id":"list-1"}))

Find all tags that have the attribute name="elements".

print(soup.find_all(attrs={"name":"elements"}))

Find all tags that have id="list-1".

print(soup.find_all(id='list-1'))

Find all tags that have class="list".

Because class is a reserved keyword in Python, the parameter is written with a trailing underscore: class_.

print(soup.find_all(class_='list'))

Regular expressions can be used to match text content in the page; a list of matching strings is returned (this requires import re).

print(soup.find_all(string=re.compile("Foo")))
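
As in the previous section, these calls assume the section 8 fragment; below is a self-contained sketch, including the import re needed for the regex example:

import re
from bs4 import BeautifulSoup

html = '''
<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
</ul>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={"id": "list-1"}))    # match by arbitrary attribute
print(soup.find_all(id='list-1'))               # keyword shorthand for the same query
print(soup.find_all(class_='list'))             # class_ because class is a keyword
print(soup.find_all(string=re.compile("Foo")))  # ['Foo'] -- matching text nodes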

8. CSS selectors

Note: test each print statement one at a time.

from bs4 import BeautifulSoup

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
# ".panel .panel-body" is a descendant selector: .panel-body elements inside .panel;
# ".panel, .panel-body" (with a comma) is a union of both selectors
print(soup.select(".panel .panel-body"))
print(soup.select(".panel, .panel-body"))

# tag and id selectors
print(soup.select("ul li"))
print(soup.select("#list-2"))

# nested select: get the data inside each element
for ul in soup.select("ul"):
    for li in ul.select("li"):
        print(li.string)

for ul in soup.select("ul"):
    print(ul['id'])
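
A small related sketch, assuming Beautiful Soup 4.4 or later (select_one is not used in the original): it returns only the first match instead of a list, which saves indexing with [0] when a single element is expected.

print(soup.select_one("#list-1 .element").string)  # Foo
print(soup.select_one("#list-2")['id'])            # list-2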
