IP proxies
If your IP gets banned, route your requests through a proxy.
http://www.goubanjia.com/
HttpConnectionPool errors
- Causes:
    - High-frequency requests in a short period got the IP banned
    - The connections in the HTTP connection pool have been exhausted
- Solutions:
    - Use a proxy
    - Add Connection: "close" to the request headers (see the sketch below)
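A minimal sketch of both fixes with requests; the proxy address below is only a placeholder, substitute a live one from a provider such as goubanjia:
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36",
    # Ask the server to close the connection after each response so stale
    # connections do not pile up in the pool
    "Connection": "close",
}
# Placeholder proxy, not a live one
proxies = {
    "https": "https://1.2.3.4:8888",
}
response = requests.get("https://www.baidu.com/", headers=headers, proxies=proxies, timeout=5)
print(response.status_code)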
Data parsing
Data parsing is what lets us build focused crawlers.
Ways to implement data parsing
- Regular expressions: fast to crawl with, but slow to write
- bs4
- xpath: the most broadly applicable
- pyquery
The general principle of data parsing
- The data we crawl is stored inside tags and in their attributes
- Locate the tag
- Extract its text or attribute values
Let's scrape some Qiushibaike videos with a simple regex
import requests
import re
import os
from urllib import request

dirName = './videos/'
if not os.path.exists(dirName):
    os.mkdir(dirName)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36"
}
url = 'https://www.qiushibaike.com/video/'
response = requests.get(url=url, headers=headers)
page_text = response.text
# Target markup: <source src="//qiubai-video.qiushibaike.com/0Y2D9HPF1XN2MA91_hd.mp4" type='video/mp4' />
ex = '<source src="(.*?)" type.*?>'
page_list_video = re.findall(ex, page_text, re.S)
for v in page_list_video:
    v = "https:" + v
    video_name = dirName + v.split("/")[-1]
    # Option 1: download with requests
    # response_video = requests.get(v, headers=headers).content  # bytes
    # with open(video_name, "wb") as fp:
    #     fp.write(response_video)
    # Option 2: download with urllib
    request.urlretrieve(url=v, filename=video_name)
bs4 parsing
How bs4 parsing works
- Instantiate a BeautifulSoup object and load the page source to be parsed into it
- Call the BeautifulSoup object's methods and attributes to locate tags and extract data
- Environment setup
    - pip install bs4
    - pip install lxml
Instantiating BeautifulSoup
- BeautifulSoup(fp, 'lxml'): loads the data of a locally stored HTML file into the BeautifulSoup instance
- BeautifulSoup(page_text, 'lxml'): loads page source fetched from the web into the BeautifulSoup instance
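A minimal sketch of the two instantiation forms; the local file name ./test.html is only a placeholder:
from bs4 import BeautifulSoup
import requests

# Local file (placeholder name)
with open('./test.html', 'r', encoding='utf-8') as fp:
    local_soup = BeautifulSoup(fp, 'lxml')

# Page source fetched from the web
page_text = requests.get('https://www.shicimingju.com/book/sanguoyanyi.html').text
web_soup = BeautifulSoup(page_text, 'lxml')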
Locating tags
- soup.tagName: locates the first tagName tag that appears
- Locating by attribute:
    - soup.find('tagName', attrName='value')
    - soup.find_all('tagName', attrName='value'): returns a list
    - (use class_ for the class attribute, since class is a Python keyword)
- Locating with selectors:
    - Hierarchy selectors: > means one level, a space means any number of levels
    - soup.select('selector')
    - soup.select('#feng')
    - soup.select('.tang > ul > li')
    - soup.select('.tang li')
- Extracting text
    - string: only the tag's own (direct) text
    - text: all text, including descendants'
    - a_tag = soup.select('#feng')[0].string
    - a_tag = soup.select('#feng')[0].text
- Extracting attributes
    - tagName['attrName']
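A quick demo of the operations above; the HTML snippet is made up for illustration:
from bs4 import BeautifulSoup

html = '''
<div class="tang">
  <ul>
    <li><a id="feng" href="http://example.com">Li Bai <span>(Tang)</span></a></li>
    <li><a href="http://example.com/2">Du Fu</a></li>
  </ul>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.a)                          # first <a> tag
print(soup.find('a', id='feng'))       # locate by attribute
print(soup.find_all('a'))              # all <a> tags, as a list
print(soup.select('.tang > ul > li'))  # > : one level per step
print(soup.select('.tang a'))          # space: any number of levels
print(soup.select('#feng')[0].string)  # direct text only -> None here (the tag has a child tag)
print(soup.select('#feng')[0].text)    # all text -> 'Li Bai (Tang)'
print(soup.select('#feng')[0]['href']) # attribute value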
Let's download Romance of the Three Kingdoms with BeautifulSoup
from bs4 import BeautifulSoup
import requests

url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
}
response_text = requests.get(url=url, headers=headers).text
soup = BeautifulSoup(response_text, 'lxml')
a_list = soup.select('.book-mulu > ul > li > a')
fp = open('./sg.txt', 'w', encoding='utf-8')
for a in a_list:
    title = a.string
    content_url = "https://www.shicimingju.com" + a['href']
    content_text = requests.get(url=content_url, headers=headers).text
    soup = BeautifulSoup(content_text, 'lxml')
    content = soup.find('div', class_="chapter_content").text
    fp.write(f"{title}\n{content}\n\n")
    print(f'{title} downloaded!')
fp.close()
xpath parsing
How xpath parsing works
- Instantiate an etree object and load the page source to be parsed into it
- Call the etree object's xpath method with different kinds of xpath expressions to locate tags and extract data
- Environment setup
    - pip install lxml
- Instantiating an etree object
    - etree.parse("text.html"): for a locally stored HTML file
    - etree.HTML(page_text): for page source fetched from the web
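A minimal sketch of the two instantiation forms; the local file name is only a placeholder, and an HTMLParser is passed so parse() tolerates real-world HTML that is not well-formed XML:
import requests
from lxml import etree

# Local file (placeholder name); the HTMLParser keeps parse() from choking on loose HTML
local_tree = etree.parse("text.html", etree.HTMLParser())

# Page source fetched from the web
page_text = requests.get("https://www.aqistudy.cn/historydata/").text
web_tree = etree.HTML(page_text)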
xpath expressions
- A leading / means the expression must locate tags level by level, starting from the root tag
    - tree.xpath('/html/body/div/p')
- A leading // means the expression can locate tags from any position
    - tree.xpath('//p')
- A non-leading / represents one level
- A non-leading // crosses multiple levels
    - tree.xpath('/html/body//p')
- The pipe | can join two expressions, which makes an xpath expression more broadly applicable
- Locating tags
    - By attribute: //tagName[@attrName="value"]
        - //div[@class='song']
    - By index: //tagName[index], where indices start at 1
- Extracting text:
    - /text(): direct text only
        - tree.xpath('//a[@id="feng"]/text()')[0]
    - //text(): all text
- Extracting attributes: /@attrName
    - tree.xpath('//a[@id="feng"]/@href')
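A quick demo of these expressions; the HTML snippet is made up for illustration:
from lxml import etree

html = '''
<html><body>
  <div class="song">
    <p>first</p><p>second</p>
    <a id="feng" href="http://example.com">Du Mu</a>
  </div>
</body></html>
'''
tree = etree.HTML(html)
print(tree.xpath('/html/body/div/p'))          # level by level from the root
print(tree.xpath('//p'))                       # from any position
print(tree.xpath('/html/body//a'))             # // crosses intermediate levels
print(tree.xpath("//div[@class='song']"))      # locate by attribute
print(tree.xpath('//p[1]/text()'))             # index starts at 1 -> ['first']
print(tree.xpath('//a[@id="feng"]/text()')[0]) # direct text -> 'Du Mu'
print(tree.xpath('//div//text()'))             # all text under the div
print(tree.xpath('//a[@id="feng"]/@href')[0])  # attribute value
print(tree.xpath('//p/text() | //a/text()'))   # | joins two expressions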
Let's scrape Qiushibaike text posts with xpath
import requests
from lxml import etree

url = 'https://www.qiushibaike.com/text/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
}
response_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(response_text)
div_list = tree.xpath("//div[@class='article block untagged mb15 typs_hot']")
for div in div_list:
    author = div.xpath("./div/a[2]/h2/text()")[0]  # a relative xpath, scoped to this div
    content = div.xpath("./a[1]/div/span/text()")  # <br> tags split the text into several pieces
    content = "".join(content)
    print(author, content)
Joining two expressions with the pipe |
import requests
from lxml import etree
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36"
}
url = "https://www.aqistudy.cn/historydata/"
response_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(response_text)
citys = tree.xpath("//div[@class='bottom']/ul/li/a/text() | //div[@class='bottom']/ul/div[2]/li/a/text()")
print(citys)
Fixing garbled Chinese text
data.encode("iso-8859-1").decode("gbk")  # requests often guesses iso-8859-1, so round-trip back to the raw bytes and decode with the page's real encoding (gbk here)
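A short sketch of the fix in context, using the netbian page from the example below (its real encoding is gbk):
import requests

response = requests.get("http://pic.netbian.com/4kmeinv/index.html")
garbled = response.text  # decoded as iso-8859-1, so Chinese characters are garbled

# Fix a single string: go back to the raw bytes, then decode with the real encoding
fixed = garbled.encode("iso-8859-1").decode("gbk")

# Or fix the whole response up front by telling requests the real encoding
response.encoding = "gbk"
fixed_page = response.text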
Let's scrape some pictures of beautiful women
import requests
from lxml import etree
import os

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36"
}
dirName = './mv'
if not os.path.exists(dirName):
    os.mkdir(dirName)
url = "http://pic.netbian.com/4kmeinv/index_%d.html"
for page in range(1, 10):
    if page == 1:
        new_url = 'http://pic.netbian.com/4kmeinv/index.html'  # the first page has no page number
    else:
        new_url = url % page
    response_text = requests.get(url=new_url, headers=headers).text
    tree = etree.HTML(response_text)
    a_list = tree.xpath("//div[@class='slist']/ul/li/a")
    for i in a_list:
        pic_path = "http://pic.netbian.com/" + i.xpath("./img/@src")[0]
        pic_name = i.xpath("./b/text()")[0] + ".jpg"
        pic_name = pic_name.encode("iso-8859-1").decode("gbk")  # fix the garbled Chinese file name
        pic = requests.get(url=pic_path, headers=headers).content
        with open(dirName + "/" + pic_name, "wb") as fp:
            fp.write(pic)