Web Scraping: Data Parsing (How to Decide Which Parsing Method Scraped Data Needs) - CSDN Blog

Article link: https://blog.youkuaiyun.com/Oliverchenxu/article/details/107087126

Data parsing

Regular expressions (string matching)

The re module (patterns and methods)

1) Syntax rules

https://blog.youkuaiyun.com/CareChere/article/details/52315728?

2) Testing a pattern

https://regex101.com/#javascript
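
As a rough illustration of how the re module's patterns and methods fit together, here is a minimal sketch. The HTML fragment and the pattern are hypothetical and chosen only for the demo; real pages usually need a more tolerant pattern.

```python
import re

# Hypothetical HTML fragment, used only for illustration.
html = '<a href="/page/1">First</a> <a href="/page/2">Second</a>'

# Non-greedy groups capture the href value and the link text.
pattern = re.compile(r'<a href="(.*?)">(.*?)</a>')

# findall returns a list of tuples, one tuple per match.
links = pattern.findall(html)

# search returns the first match object, or None when nothing matches.
first = pattern.search(html)
first_href = first.group(1) if first else None

# sub replaces every match; here it strips the tags and keeps the text.
text_only = re.sub(r'<[^>]+>', '', html)

print(links)
print(first_href)
print(text_only)
```

Regular expressions work well for small, regular fragments like this one; for nested HTML, a real parser such as the ones covered in the following sections is usually the safer choice.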

xpath

1) Installation

# Install lxml, a parsing library for HTML and XML: pip install lxml
from lxml import etree

2) Usage

1. Convert the raw data into a parseable element tree
xpath_data = etree.HTML(data)

2. Call the xpath method
result = xpath_data.xpath('/html/head/title//text()')

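3. xpath syntax 1. Child node (one level down): /
           2. Any depth (skips intermediate nodes): //
           3. Match a specific tag: //a[@attribute="value"]   (attribute, e.g. mon/class)
           4. Text wrapped by a tag: text(); link URL: @href
           5. xpath() returns a list
           6. XPath indices start at 1
           7. In '//a[2]', [] indexes sibling tags under the same parent; it does not count across parents
           8. Getting a path: 1. write it by hand  2. right-click in the browser and copy the XPath; the copied path usually needs editing
           9. Fuzzy match: //div[contains(@class,"a")]
           10. Next node at the same level: following-sibling::*[1]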
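
The syntax rules above can be tried out on a small page. This is a minimal sketch; the HTML string and its class names are made up for the example, and only `etree.HTML` and `xpath` come from the notes.

```python
from lxml import etree

# Hypothetical page, used only for illustration.
html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <div class="nav main">
      <a class="item" href="/a">A</a>
      <a class="item" href="/b">B</a>
      <a class="other" href="/c">C</a>
    </div>
  </body>
</html>
"""

tree = etree.HTML(html)

# Absolute path with /; text() returns the wrapped text. Always a list.
title = tree.xpath('/html/head/title/text()')

# // searches at any depth; [@class="item"] filters by attribute; @href reads it.
hrefs = tree.xpath('//a[@class="item"]/@href')

# Indices start at 1 and count siblings under the same parent.
second = tree.xpath('//div/a[2]/text()')

# contains() matches a substring of the attribute value.
fuzzy = tree.xpath('//div[contains(@class, "nav")]/a/text()')

# following-sibling::a[1] is the next <a> at the same level.
next_sibling = tree.xpath('//a[@href="/a"]/following-sibling::a[1]/text()')

print(title, hrefs, second, fuzzy, next_sibling)
```

Because `xpath` always returns a list, take `[0]` (after checking the list is not empty) when a single value is expected.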

4. Save the data as JSON
# Convert the list to a str
import json
data_str = json.dumps(self.data_list)
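
A minimal sketch of the save step as a standalone script. The records, the file name `scraped_data.json`, and the use of the temp directory are assumptions for the demo; `ensure_ascii=False` is an optional extra that keeps non-ASCII text readable in the output.

```python
import json
import os
import tempfile

# Hypothetical scraped records, used only for illustration.
data_list = [
    {"title": "First", "url": "/page/1"},
    {"title": "第二", "url": "/page/2"},
]

# list -> str; ensure_ascii=False writes non-ASCII characters as-is.
data_str = json.dumps(data_list, ensure_ascii=False, indent=2)

# Write the string to a file (the path is an arbitrary choice for the demo).
path = os.path.join(tempfile.gettempdir(), "scraped_data.json")
with open(path, "w", encoding="utf-8") as f:
    f.write(data_str)

# Read it back to confirm the round trip.
with open(path, encoding="utf-8") as f:
    restored = json.load(f)

print(restored == data_list)
```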

bs4

1) Installation

pip install beautifulsoup4
from bs4 import BeautifulSoup

2) Usage

Four main object types: BeautifulSoup, Tag, NavigableString, Comment
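
To see where each of the four types shows up, here is a small sketch. The one-line document is hypothetical, and `html.parser` (Python's built-in parser) is used here only to keep the demo free of extra dependencies.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical one-line document, used only for illustration.
soup = BeautifulSoup('<p class="x"><!-- note -->text</p>', 'html.parser')

p = soup.p                    # a Tag
children = list(p.children)   # a Comment followed by a NavigableString

print(type(soup).__name__)         # the whole document object
print(type(p).__name__)            # an element
print(type(children[0]).__name__)  # the HTML comment
print(type(children[1]).__name__)  # plain text inside the tag
```

Comment is a subclass of NavigableString, so an `isinstance(node, NavigableString)` check is true for comments too; test for Comment first when the two need to be told apart.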

1. Convert the document into a BeautifulSoup object
soup = BeautifulSoup(html_doc, 'lxml')

2. Parse the data
#  find: returns the first Tag object that matches the query
result = soup.find(name='p',attrs={"class": "story"})

#  find_all: returns a list of Tag objects
result = soup.find_all(name='p',attrs={"class": "story"})

#  select_one: CSS selector, returns the first match
result = soup.select_one('.sister')

#  select: CSS selector, returns a list
# class selector
result = soup.select('.sister')
# id selector
result = soup.select('#one')
# descendant selector
result = soup.select('head title')
# group selector
result = soup.select('title,.title')
# attribute selector
result = soup.select('a[id="link3"]')

#  Get the text wrapped by a tag (select returns a list, so index it first; get_text returns a str)
result = soup.select('.title')[0].get_text()

# Get an attribute of a tag
result = soup.select('#link1')[0].get('href')
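
The calls above can be run end to end on a small document. This is a sketch under assumptions: `html_doc` is a shortened, hypothetical stand-in for the page the notes parse, and `html.parser` replaces `'lxml'` only so the demo needs nothing beyond bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical document, used only for illustration.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find returns one Tag, or None when nothing matches.
first_story = soup.find(name="p", attrs={"class": "story"})
missing = soup.find(name="table")

# find_all and select both return lists of Tag objects.
sisters = soup.find_all(name="a", attrs={"class": "sister"})
names = [a.get_text() for a in sisters]

# Index into the list before reading text or attributes.
title_text = soup.select(".title")[0].get_text()
link1_href = soup.select("#link1")[0].get("href")

# select_one skips the indexing step and returns the first match.
link3 = soup.select_one('a[id="link3"]')

print(names, title_text, link1_href, link3.get_text(), missing)
```

Since `find` and `select_one` return None when nothing matches, check the result before calling `get_text` or `get` on it.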