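Three common information markup formats:
XML
JSON
[Keys must be double-quoted strings; string values are double-quoted too, while numbers, true, false, and null are written without quotes]
Multiple values: use square brackets, "key": ["value1", "value2"]
Nesting: use braces, "key": {"subkey": "subvalue"}
[JSON does not allow comments]
YAML
[Strings need no double quotes]
Nesting
(expressed by indentation)
Parallel items: one "- " per item
Whole-block data: introduced by |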
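The JSON and YAML rules above can be checked with a short sketch. The `course` record and its field names are hypothetical, made up only for illustration. The YAML version is kept as plain text because the Python standard library has no YAML parser.

```python
import json

# Hypothetical record, assembled only for illustration.
course = {
    "title": "Basic Python",            # string value: double-quoted in JSON
    "level": 1,                         # number: written without quotes
    "tags": ["python", "beginner"],     # multiple values: a JSON array
    "provider": {"site": "icourse163", "code": "BIT-268001"},  # nesting: a JSON object
}

json_text = json.dumps(course, indent=2)
print(json_text)

# Parsing the text back yields the original structure.
assert json.loads(json_text) == course

# JSON has no comment syntax, so a comment inside the text is a parse error.
try:
    json.loads('{"title": "Basic Python" /* not allowed */}')
    comment_accepted = True
except json.JSONDecodeError:
    comment_accepted = False
print("comment accepted:", comment_accepted)

# The same record written by hand as YAML: no quotes, indentation for
# nesting, and "- " for each item of a list.
yaml_text = """\
title: Basic Python
level: 1
tags:
  - python
  - beginner
provider:
  site: icourse163
  code: BIT-268001
"""
print(yaml_text)
```

Every key in the printed JSON is double-quoted, while the number 1 is not.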
Information extraction
Example: parse the page and extract every URL link
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> for link in soup.find_all('a'):
...     print(link.get('href'))
...
http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001
>>>
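The same extraction can be written as a small script. This is a minimal sketch that works on a local copy of the two anchors from the demo page, so it needs no network access. The helper name `extract_links` and the third anchor without an `href` are assumptions added for illustration.

```python
from bs4 import BeautifulSoup

# Local copy of the anchors from the demo page, plus one hypothetical
# anchor that has no href attribute.
DEMO_HTML = """
<p class="course">
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.
<a id="top">anchor without href</a>
</p>
"""


def extract_links(html):
    """Return the href value of every <a> tag that has one."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True keeps only tags that carry an href attribute, so the
    # result never contains None.
    return [a["href"] for a in soup.find_all("a", href=True)]


links = extract_links(DEMO_HTML)
for url in links:
    print(url)
```

Filtering with `href=True` differs from the loop above only when an `<a>` tag has no `href`: `link.get('href')` would print `None` for such a tag.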
.find_all()
1. name: a filter on tag names (a string, a list of strings, True, or a regular expression)
>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all(['a','b'])
[<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>>
Names of all tags (True matches every tag):
>>> for tag in soup.find_all(True):
...     print(tag.name)
...
html
head
title
body
p
b
p
a
a
>>>
Find tags whose names contain "b" (here <body> and <b>); use re.compile('^b') to require that the name starts with "b"
>>> import re
>>> for tag in soup.find_all(re.compile('b')):
...     print(tag.name)
...
body
b
>>>
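The difference between an unanchored and an anchored pattern only shows up when some tag name contains "b" without starting with it. The fragment below is a hypothetical one chosen for that reason ("table" contains "b"); it is a sketch of the name filters, not part of the demo page.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical fragment: "table" contains the letter b but does not start with it.
html = "<body><b>bold</b><table><tr><td>cell</td></tr></table></body>"
soup = BeautifulSoup(html, "html.parser")

# A compiled pattern is applied with search(), so 'b' matches anywhere in the name.
contains_b = [tag.name for tag in soup.find_all(re.compile("b"))]

# Anchoring with ^ restricts the match to names that start with b.
starts_with_b = [tag.name for tag in soup.find_all(re.compile("^b"))]

# True matches every tag; a list matches any of the listed names.
all_names = [tag.name for tag in soup.find_all(True)]
listed = [tag.name for tag in soup.find_all(["b", "td"])]

print(contains_b)
print(starts_with_b)
print(all_names)
print(listed)
```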
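2. attrs: a filter on tag attribute values; a specific attribute can be named as a keyword argument.
Find <p> tags whose class attribute is 'course' (a bare string in the second position is matched against class):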
>>> soup.find_all('p','course')
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
>>>
Find elements whose id attribute is "link1"
>>> soup.find_all(id = 'link1')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
>>>
>>> soup.find_all(id = 'link')  # a plain string must equal the whole value, so nothing matches
[]
>>>
>>> soup.find_all(id =re.compile('link'))
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>>
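The attribute filters above can also be spelled in a few equivalent ways. The sketch below assumes a shortened local copy of the demo page body (the first sentence of the course paragraph is abbreviated to "Courses:") and compares the `class_` keyword, the `attrs` dictionary, an exact string, a regular expression, and `True`.

```python
import re

from bs4 import BeautifulSoup

# Shortened local copy of the demo page body.
html = """<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>"""
soup = BeautifulSoup(html, "html.parser")

# class is a Python keyword, so the keyword argument is spelled class_.
# This finds the same tags as soup.find_all('p', 'course').
by_class = soup.find_all("p", class_="course")

# The attrs dictionary works for any attribute name.
by_dict = soup.find_all(attrs={"id": "link1"})

# A plain string must equal the whole attribute value.
exact = soup.find_all(id="link")

# A regular expression matches part of the value; True matches any value.
by_regex = soup.find_all(id=re.compile("^link"))
has_id = soup.find_all(id=True)

print(len(by_class), [t["id"] for t in by_dict], exact)
print([t["id"] for t in by_regex], [t["id"] for t in has_id])
```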
3. recursive: whether to search all descendants (default True); with False, only direct children are searched.
>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all('a',recursive=False)
[]
>>>
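The empty result comes from where the search starts: the only direct child of `soup` is `<html>`, and the `<a>` tags sit several levels below it. The sketch below, again on a shortened local copy of the demo page, runs the same non-recursive search from different starting tags.

```python
from bs4 import BeautifulSoup

# Shortened local copy of the demo page.
html = """<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")

# Direct children of the document: only <html>.
top_level = [t.name for t in soup.find_all(True, recursive=False)]

# No <a> is a direct child of the document, so this is empty.
direct_a = soup.find_all("a", recursive=False)

# Direct children of <body>: the two <p> tags.
body_children = [t.name for t in soup.body.find_all(True, recursive=False)]

# Starting from the course paragraph, both <a> tags are direct children.
course = soup.find("p", class_="course")
course_links = [a["id"] for a in course.find_all("a", recursive=False)]

print(top_level, direct_a, body_children, course_links)
```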
4. string: a filter on the text between <tag> and </tag>
>>> soup
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
>>> soup.find_all(string = 'Basic Python')
['Basic Python']
>>>
>>> import re
>>> soup.find_all(string = re.compile('python'))
['This is a python demo page', 'The demo python introduces several python courses.']
>>>
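The regular expression above is case-sensitive, which is why 'Basic Python' and 'Advanced Python' are missing from the result. A minimal sketch on a shortened local copy of the page body shows the effect of `re.IGNORECASE` and how to get from a matched string back to its enclosing tag through `.parent`.

```python
import re

from bs4 import BeautifulSoup

# Shortened local copy of the demo page body.
html = """<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language.
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>"""
soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the whole text node.
exact = soup.find_all(string="Basic Python")

# Case-sensitive pattern: only text containing lowercase "python".
lower = soup.find_all(string=re.compile("python"))

# Case-insensitive pattern: also matches "Python".
any_case = soup.find_all(string=re.compile("python", re.IGNORECASE))

# find_all(string=...) returns text nodes; .parent is the tag around one.
parent = exact[0].parent

print(exact)
print(len(lower), len(any_case))
print(parent.name, parent["href"])
```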