Python Web Scraping: Information Markup and Information Extraction

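This article covers three data formats commonly encountered in Python web scraping: XML, JSON, and YAML. XML and JSON are used to store data; JSON requires keys to be double-quoted strings (string values are double-quoted as well) and does not support comments. YAML allows unquoted keys and values and expresses nesting through indentation. The article then uses examples to show how the `.find_all()` method parses a page and extracts all of its URL links, including searching by tag name and attribute value and controlling whether descendant nodes are searched recursively.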


Three forms of information markup:

XML
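
As a small illustration of XML markup, the sketch below parses a short document with the standard library's `xml.etree.ElementTree`. The `<course>` record and its `level` field are made up for this example; only the course name and id echo the demo page used later in the article.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: information is marked up with paired tags and attributes.
XML_TEXT = """\
<course id="BIT-268001">
  <name>Basic Python</name>
  <level>novice</level>
  <!-- comments are allowed in XML -->
</course>
"""

root = ET.fromstring(XML_TEXT)

course_id = root.get("id")            # attribute value
course_name = root.find("name").text  # text inside <name>...</name>
child_tags = [child.tag for child in root]

print(root.tag, course_id, course_name, child_tags)
```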

JSON 

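[Keys must be double-quoted strings; string values are double-quoted too, while numbers, true, false, and null are written without quotes]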

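Multiple values: put them in a square-bracketed list, e.g. "key": ["value1", "value2"]

Nesting: put key-value pairs inside curly braces, e.g. "key": {"subkey": "subvalue"}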

[JSON does not allow comments]
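
The JSON rules above can be checked with the standard library's `json` module. This is a minimal sketch: the record and the helper `is_valid_json` are hypothetical and exist only for this example.

```python
import json

# Hypothetical record showing multiple values (a list) and nesting (an object).
JSON_TEXT = """
{
  "name": "Basic Python",
  "tags": ["python", "novice"],
  "provider": {"school": "BIT", "online": true},
  "weeks": 9
}
"""

data = json.loads(JSON_TEXT)
print(data["tags"], data["provider"]["school"])


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Single-quoted keys and comments are both rejected by the parser.
single_quotes_ok = is_valid_json("{'name': 'Basic Python'}")
comment_ok = is_valid_json('{"name": "Basic Python"} // a comment')
print(single_quotes_ok, comment_ok)
```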

YAML 

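[Keys and values do not need double quotes]

Nesting

(expressed through indentation)

Parallel items: each item is introduced by -

Block data: introduced by |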
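
The YAML features listed above can be seen together in one small document. The sketch below assumes the third-party PyYAML package (`pip install pyyaml`), which is not mentioned in the original article; if it is not installed, the example only defines the text and skips parsing. The record itself is hypothetical.

```python
# Hypothetical record: unquoted keys and values, indentation for nesting,
# "-" for parallel items, and "|" for a literal block of text.
YAML_TEXT = """\
# Comments start with "#".
name: Basic Python
provider:
  school: BIT
  online: true
tags:
  - python
  - novice
description: |
  First line of the block.
  Second line of the block.
"""

try:
    import yaml  # PyYAML, a third-party package (assumed, not in the article)
except ImportError:
    yaml = None

# Parse only when PyYAML is available.
data = yaml.safe_load(YAML_TEXT) if yaml is not None else None

if data is not None:
    print(data["provider"]["school"])  # nested value
    print(data["tags"])                # parallel items become a list
    print(repr(data["description"]))   # block keeps its line breaks
```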

 

Information Extraction

 

 

Example: parse a page and extract all URL links

 

>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> for link in soup.find_all('a'):
...     print(link.get('href'))
...
http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001
>>>
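
The same extraction can be wrapped in a reusable function. The sketch below works on an inline, trimmed copy of the demo page so that it runs without network access; the function name `extract_links` and the extra `<a id="top">` tag without an `href` are additions made for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the demo page, plus one hypothetical <a> tag with no href.
DEMO_HTML = """\
<html><head><title>This is a python demo page</title></head>
<body>
<a id="top">no link here</a>
<p class="course">Python is a wonderful general-purpose programming language.
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
"""


def extract_links(html):
    """Return the href value of every <a> tag that has one."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True keeps only <a> tags that actually carry an href attribute,
    # so the result never contains None.
    return [a["href"] for a in soup.find_all("a", href=True)]


links = extract_links(DEMO_HTML)
for url in links:
    print(url)
```

With plain `soup.find_all('a')` and `link.get('href')`, as in the session above, a tag without an `href` would yield `None` instead of being skipped.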

.find_all(name, attrs, recursive, string): returns a list of all matching results

1. name: a search string matched against tag names

>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all(['a','b'])
[<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>>

Names of all tags (passing True matches every tag):

>>> for tag in soup.find_all(True):
...     print(tag.name)
...
html
head
title
body
p
b
p
a
a
>>>

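Find tags whose names contain b (here <b> and <body>). The pattern is applied with search(), so it matches anywhere in the name; use re.compile('^b') to require names that start with b.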

>>> import re
>>> for tag in soup.find_all(re.compile('b')):
...     print(tag.name)
...
body
b
>>>
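
Beautiful Soup applies a compiled pattern to tag names with `search()`, so an unanchored pattern such as `re.compile('b')` matches a `b` anywhere in the name. The demo page happens to contain only `<b>` and `<body>`, which hides the difference. The sketch below uses a small made-up page with a `<table>` tag to show it.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical page: "table" contains the letter b but does not start with it.
html = "<html><body><b>bold</b><table></table></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Unanchored pattern: any tag name containing "b".
contains_b = [tag.name for tag in soup.find_all(re.compile("b"))]

# Anchored pattern: only tag names that start with "b".
starts_with_b = [tag.name for tag in soup.find_all(re.compile("^b"))]

print(contains_b)
print(starts_with_b)
```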

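2. attrs: a search string matched against tag attribute values; a specific attribute can also be named

Find <p> tags whose class attribute is 'course' (a string passed as the second argument is matched against class):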

>>> soup.find_all('p','course')
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
>>>

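Find elements whose id attribute is "link1". A plain string must match the whole attribute value, so 'link' finds nothing; a regular expression matches part of the value: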

>>> soup.find_all(id='link1')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
>>>
>>> soup.find_all(id='link')
[]
>>>
>>> soup.find_all(id=re.compile('link'))
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>>
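
Attribute searches can also be written with explicit keywords, which some readers find clearer than the positional form. The sketch below runs on a trimmed inline copy of the demo page; `class_` (with a trailing underscore, because `class` is a Python keyword) and the `attrs` dictionary are standard Beautiful Soup arguments, while the regular expression is just one possible way to match both ids.

```python
import re

from bs4 import BeautifulSoup

# Trimmed copy of the demo page used in the article.
DEMO_HTML = """\
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language.
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
"""

soup = BeautifulSoup(DEMO_HTML, "html.parser")

# Same search as soup.find_all('p', 'course'), with the class named explicitly.
by_class = soup.find_all("p", class_="course")

# An attrs dictionary works for any attribute name.
by_attrs = soup.find_all(attrs={"id": "link1"})

# A regular expression matches part of the value; here "link" followed by digits.
by_regex = soup.find_all("a", id=re.compile(r"^link\d+$"))

print(len(by_class), by_attrs[0].string, [a["id"] for a in by_regex])
```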

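3. recursive: whether to search all descendants; defaults to True. With recursive=False only direct children are searched, so the call below returns an empty list because the <a> tags are not direct children of soup.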

>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all('a',recursive=False)
[]
>>>
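
To see what `recursive=False` does match, it helps to call it on nodes that have the wanted tags as direct children. The following sketch uses a trimmed inline copy of the demo page; the variable names are chosen for this example.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the demo page used in the article.
DEMO_HTML = """\
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language.
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
"""

soup = BeautifulSoup(DEMO_HTML, "html.parser")

# The only tag that is a direct child of the soup object is <html>.
top_level = [tag.name for tag in soup.find_all(True, recursive=False)]

# <body> has two <p> tags as direct children, but no <a> tags.
body_paragraphs = soup.body.find_all("p", recursive=False)
body_links = soup.body.find_all("a", recursive=False)

# The <a> tags are direct children of the <p class="course"> tag.
course_p = soup.find("p", class_="course")
course_links = course_p.find_all("a", recursive=False)

print(top_level, len(body_paragraphs), len(body_links), len(course_links))
```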

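4. string: a search string matched against the text between <tag> and </tag>. A plain string must match the whole text; a regular expression matches part of it and is case-sensitive by default.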

>>> soup
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
>>> soup.find_all(string = 'Basic Python')
['Basic Python']
>>>
>>> import re
>>> soup.find_all(string = re.compile('python'))
['This is a python demo page', 'The demo python introduces several python courses.']
>>>
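
Because the pattern is case-sensitive, `re.compile('python')` misses "Basic Python" and the other capitalized occurrences. A possible variation, sketched below on a trimmed inline copy of the demo page, passes `re.IGNORECASE` and uses each result's `.parent` to find the enclosing tag.

```python
import re

from bs4 import BeautifulSoup

# Trimmed copy of the demo page used in the article.
DEMO_HTML = """\
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language.
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
"""

soup = BeautifulSoup(DEMO_HTML, "html.parser")

# Case-sensitive search: only the lowercase occurrences match.
lower_only = soup.find_all(string=re.compile("python"))

# Case-insensitive search: "Python" matches as well.
any_case = soup.find_all(string=re.compile("python", re.IGNORECASE))

# Each result is a text node; .parent is the tag that contains it.
parent_tags = [text.parent.name for text in any_case]

print(len(lower_only), len(any_case), parent_tags)
```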

 
