2020-11-19

lempoo

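CC 4.0 BY-SA license

Article link: https://blog.youkuaiyun.com/lempoo/article/details/109816412

This post covers how to fetch image data from web pages, with a detailed look at scraping images from static pages, in particular high-resolution images from the 站长素材 site (sc.chinaz.com). It explains image lazy loading as an anti-scraping mechanism and introduces the general principle of data parsing, including how to locate and extract page content with bs4 and xpath. It also walks through examples such as scraping a novel and 糗事百科, and shows how to use cookies and proxies in a scraper.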

How do you scrape image data?

Two approaches
Approach 1: requests
Approach 2: urllib

Example
import requests
headers = {
'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
}
# Approach 1: requests
url = 'https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1605757358956&di=9acd82a8edea5911205b117425b7b4c6&imgtype=0&src=http%3A%2F%2Fa0.att.hudong.com%2F30%2F29%2F01300000201438121627296084016.jpg'
img_data = requests.get(url=url,headers=headers).content  # read .content, not .text
# .content returns the response body as bytes
with open('./cat.jpg','wb') as fp:  # the image is bytes, so open in binary mode with no encoding argument
    fp.write(img_data)


# Approach 2: urllib
import urllib.request  # a bare "import urllib" does not reliably load the request submodule
url = 'https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1605757358956&di=9acd82a8edea5911205b117425b7b4c6&imgtype=0&src=http%3A%2F%2Fa0.att.hudong.com%2F30%2F29%2F01300000201438121627296084016.jpg'
urllib.request.urlretrieve(url=url,filename='./cat_1.jpg')  # note: the module is request, with no trailing s

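The main difference between the two approaches
Whether the request headers can be spoofed. urllib.request.urlretrieve takes no headers argument, so it cannot; with urllib you would have to build a urllib.request.Request yourself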
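
As a rough sketch of that last point, the snippet below builds a urllib request that carries a browser-like User-Agent. The names build_request and download, the shortened User-Agent string and the example.com URL are assumptions made for illustration; only the final lines run here, and they do not touch the network.

```python
import urllib.request

# Hypothetical, shortened User-Agent string used only for this example
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'

def build_request(url):
    """Return a Request object that carries a browser-like User-Agent."""
    return urllib.request.Request(url, headers={'User-Agent': USER_AGENT})

def download(url, filename):
    """Download url to filename, sending the spoofed header."""
    with urllib.request.urlopen(build_request(url)) as resp, open(filename, 'wb') as fp:
        fp.write(resp.read())

# Build the request without sending it and inspect the header it would carry
req = build_request('https://example.com/cat.jpg')
print(req.get_header('User-agent'))
```

Note that Request stores header names capitalized, which is why the lookup uses 'User-agent'.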

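Parsing data from static pages

Regular expressions
bs4 (*)
xpath (*)
pyquery

In the browser developer tools, both the Elements panel and the Network panel can show the source of the current page. What is the difference?
Elements: shows the complete page source after every request has loaded and rendering has finished, including whatever js, css and so on contribute
If the page has dynamically loaded data, that data is visible in Elements
If the data is not dynamically loaded, you can work out the parsing from what Elements shows

Network: the page source shown under Response is only what that one selected request returned; it does not include anything added later by js, css and so on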

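Scraping high-resolution images from 站长素材: analysis

Anti-scraping mechanism: image lazy loading. The img tag uses a pseudo-attribute (src2 on this site) in place of the real one. Only when a given event fires, here the image scrolling into the visible area, is the pseudo-attribute dynamically turned into the real attribute name.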

import re
import os
dirName = 'ImgLibs'
if not os.path.exists(dirName):
    os.mkdir(dirName)

url = 'https://sc.chinaz.com/tupian/renwusuxie.html'
# Fetch the whole page
page_text = requests.get(url=url,headers=headers).text
# Next, parse the page to pull out the image URLs
# Match every image block on the page; src2 is the lazy-loading pseudo-attribute
ex = '<div class="box picblock col3".*?<img src2="(.*?)" alt.*?</div>'
# Collect the image paths
img_src = re.findall(ex,page_text,re.S)
for src in img_src:
    src = 'https:'+src
    # An image is binary data, so read .content rather than .text
    img_data = requests.get(url=src,headers=headers).content
    img_name = src.split('/')[-1]
    img_path = dirName+'/'+img_name
    with open(img_path,'wb') as fp:
        fp.write(img_data)
    print(img_name,'downloaded!')
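
The extraction step can be tried offline. The sketch below runs a simplified pattern over a made-up HTML fragment; SAMPLE_HTML, the scpic.example.com host and the helper name lazy_image_urls are assumptions for illustration, not the site's real markup.

```python
import re

# Hypothetical HTML fragment shaped like the blocks the regex above targets
SAMPLE_HTML = '''
<div class="box picblock col3">
  <a href="/tupian/1.htm"><img src2="//scpic.example.com/a/1_s.jpg" alt="sketch 1"></a>
</div>
<div class="box picblock col3">
  <a href="/tupian/2.htm"><img src2="//scpic.example.com/a/2_s.jpg" alt="sketch 2"></a>
</div>
'''

def lazy_image_urls(page_text, scheme='https:'):
    """Return absolute image URLs read from the src2 pseudo-attribute."""
    pattern = r'<img\s+src2="(.*?)"'
    return [scheme + src for src in re.findall(pattern, page_text)]

urls = lazy_image_urls(SAMPLE_HTML)
print(urls)
```

Because the raw page source still holds src2, a pattern written against src would find nothing here, which is the whole effect of the lazy-loading trick.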

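The general principle of data parsing

HTML is responsible for displaying data.
Where is the data that HTML displays stored?
	Between the tags, as their text
	In the tags' attributes
The general principle of data parsing
	Locate the tag
	Extract the data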

bs4

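Installation: pip install bs4, pip install lxml

How bs4 parsing works

1 Instantiate a BeautifulSoup object and load the page source to be parsed into it
2 Call the object's methods and attributes to locate tags and extract data
Instantiating the BeautifulSoup object

BeautifulSoup(fp,'lxml'): parses an HTML file stored locally (fp is an open file object)
BeautifulSoup(page_text,'lxml'): parses page source requested from the internet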

from bs4 import BeautifulSoup
fp = open('./test.html','r',encoding='utf-8')
soup = BeautifulSoup(fp,'lxml')
soup  # note: evaluating soup shows the page source

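Locating

Locating by tag
soup.tagName: only finds the first occurrence of that tag
e.g. soup.title  # object, dot, tag name

Locating by attribute:
Format: soup.find('tagName', attrName='value')  # note that the tag name is a string
1 find('tagName',attrName='value'): only finds the first matching tag
  e.g. soup.find('div',class_='song')

# class takes a trailing underscore here because class is a Python keyword

2 find_all('tagName',attrName='value'): finds every tag that matches

e.g. soup.find_all('div',class_='tang')
# again, class takes a trailing underscore

Locating by selector:
soup.select('selector')
e.g. soup.select('#feng')

Hierarchy selectors:
Greater-than sign: one level (direct child)

e.g. soup.select('.tang > ul > li')  # returns every li at that level, each with the tags nested inside it

Space: any number of levels (descendant)
e.g. soup.select('.tang li')

Extracting values: locate first, then extract

Getting the text inside a tag

string: only returns the text directly inside the tag
e.g. soup.find('a',id='feng').string

text: returns all the text under the tag, nested tags included
e.g. soup.find('div',class_='tang').text

Getting an attribute value

soup.img['src']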
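
To make the difference between string and text concrete, here is a small sketch on made-up markup. The HTML and the example.com links are assumptions that only borrow the class and id names used above, and html.parser is used so the snippet does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup reusing the class and id names from the examples above
html = '''
<div class="tang">
  <ul>
    <li><a id="feng" href="http://example.com/feng">Feng</a></li>
    <li><a href="http://example.com/du"><b>Du</b> Fu</a></li>
  </ul>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')  # stdlib parser, so lxml is not required here

first_a = soup.find('a', id='feng')
second_a = soup.select('.tang > ul > li > a')[1]

direct = first_a.string    # this tag holds a single text node
mixed = second_a.string    # None, because the tag holds a <b> tag plus text
all_text = second_a.text   # text of the tag and of everything nested in it
href = first_a['href']     # attribute access works like a dict lookup

print(direct, mixed, all_text, href)
```

When a tag has more than one child, string gives None instead of raising, so text is usually the safer choice for nested content.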

Scraping a novel: https://www.shicimingju.com/book/sanguoyanyi.html

headers_sanguo = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
    'Connection':'close'  # the valid header value is close, not closed

}
main_url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
page_text = requests.get(main_url,headers=headers_sanguo).text
fp = open('./sanguo.txt','w',encoding='utf-8')
soup = BeautifulSoup(page_text,'lxml')
# Parse out: chapter titles and detail-page URLs
a_tag_list = soup.select('.book-mulu > ul > li > a')
for a in a_tag_list:
    title = a.string  # chapter title
    # Build the detail-page URL
    detail_url = 'https://www.shicimingju.com'+a['href']
    # Fetch the chapter page
    page_text_detail = requests.get(detail_url,headers=headers_sanguo).text
    soup_detail = BeautifulSoup(page_text_detail,'lxml')
    # Parse out: chapter text
    content = soup_detail.find('div',class_='chapter_content').text
    fp.write(title+':'+content+'\n')
    print(title,'downloaded!')
fp.close()  # flush and release the output file once every chapter is written
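Error handling: HTTPSConnectionPool
Causes:
1. Too many requests were sent in a short time, so the server banned the requesting IP
2. Too many requests were sent in a short time and finished connections were not released right away, so the requests exhausted the connection objects in the HTTPS connection pool
Fixes:
1. Use a proxy
2. Close each connection as soon as it has been used, by sending this request header
Connection: close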
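
Beyond the two fixes listed, a complementary tactic (an addition here, not something the original post uses) is to pause and retry when a request fails. The sketch below takes the fetching function as a parameter so it can be exercised without a network; fetch_with_retry and flaky_fetch are hypothetical names.

```python
import time

def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url); on an exception wait, then try again, up to `retries` attempts."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as err:  # deliberately broad in this sketch
            last_error = err
            if attempt < retries:
                time.sleep(delay * attempt)  # wait a little longer after each failure
    raise last_error

# Stand-in for a flaky server: fails twice, then succeeds
calls = {'n': 0}

def flaky_fetch(url):
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('connection pool exhausted')
    return 'page for ' + url

result = fetch_with_retry(flaky_fetch, 'https://example.com/ch1', delay=0)
print(result, calls['n'])
```

With requests, the fetch argument could be something like lambda u: requests.get(u, headers=headers_sanguo).text.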

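Parsing with xpath

Installation: pip install lxml

How xpath parsing works

1. Instantiate an etree object and load the data to be parsed into it
etree.parse(fileName)  for a local file
etree.HTML(page_text)  for page source fetched from the internet

e.g.
from lxml import etree
tree = etree.parse('./test.html')  # uses the XML parser by default; pass etree.HTMLParser() as the second argument if the file is not well-formed

2. Call the etree object's xpath method with different kinds of xpath expressions to locate tags and extract data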

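Locating

Locating by tag
Leading /: the path must start from the root tag
Non-leading /: one level
Non-leading //: any number of levels
Leading //: works like a relative path; the tag can be located from anywhere in the document
tree.xpath('/html/head/meta')
tree.xpath('//meta')
tree.xpath('/html//meta')
On the test page, all three return the same result



Locating by attribute:
//tagName[@attrName="value"]
e.g. tree.xpath('//div[@class="tang"]')


Locating by index:
//tagName[index]    index starts at 1
e.g. tree.xpath('//div[@class="tang"]/ul/li[2]')
or
tree.xpath('//li[2]')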

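Extracting data

Getting tag text
/text(): the text directly inside the tag
//text(): all the text under the tag, nested tags included
e.g.
tree.xpath('//a[@id="feng"]//text()')
tree.xpath('//div[@class="tang"]//text()')
tree.xpath('//a[@id="feng"]//text()')[0]  # xpath returns a list; index it to get the string


Getting attribute values
/@attrName
e.g.
Get the href attribute value of the a tag
tree.xpath('//a[@id="feng"]/@href')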
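
The locating and extraction rules can be checked on a small in-memory page. In the sketch below the HTML and the example.com links are made up for illustration; only the class and id names are borrowed from the examples above.

```python
from lxml import etree

# Hypothetical page reusing the class and id names from the examples above
html = '''
<html><body>
<div class="tang">
  <ul>
    <li><a id="feng" href="http://example.com/feng">Feng <b>Tang</b></a></li>
    <li><a href="http://example.com/du">Du Fu</a></li>
  </ul>
</div>
</body></html>
'''
tree = etree.HTML(html)

# /text() stops at the tag's own text; //text() also descends into <b>
direct_text = tree.xpath('//a[@id="feng"]/text()')
all_text = tree.xpath('//a[@id="feng"]//text()')

# Indexes start at 1, so li[2] is the second list item
second_li = tree.xpath('//div[@class="tang"]/ul/li[2]/a/text()')[0]

# /@href returns the attribute value as a string inside a list
href = tree.xpath('//a[@id="feng"]/@href')[0]

print(direct_text, all_text, second_li, href)
```

Every xpath call returns a list, even for a single match, which is why the last two lines index with [0].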

Parsing 糗事百科

https://duanziwang.com/

Scrape the content of the first page

url = 'https://duanziwang.com/'
response = requests.get(url,headers=headers)
response.encoding = 'utf-8'  # set the response encoding to fix garbled text
page_text = response.text
tree = etree.HTML(page_text)
# Global parsing: run against the whole tree
article_list = tree.xpath('/html/body/section/div/div/main/article')
for article in article_list:
    # Local parsing: run against one article element
    # The leading ./ stands for the element that xpath is called on
    title = article.xpath('./div[1]/h1/a/text()')[0]
    content = article.xpath('./div[2]/pre/code/text()')[0]
    print(title,content)

Scraping multiple pages

# A generic URL template used to build the URLs of the other pages
url = 'https://duanziwang.com/page/%d/index.html'
for page in range(1,6):
    if page == 1:
        new_url = 'https://duanziwang.com/'
    else:
        new_url = url % page
    print('Scraping page %d' % page)
    response = requests.get(new_url,headers=headers)
    response.encoding = 'utf-8'  # set the response encoding to fix garbled text
    page_text = response.text
    tree = etree.HTML(page_text)
    # Global parsing: run against the whole tree
    article_list = tree.xpath('/html/body/section/div/div/main/article')
    for article in article_list:
        # Local parsing: run against one article element
        # The leading ./ stands for the element that xpath is called on
        title = article.xpath('./div[1]/h1/a/text()')[0]
        content = article.xpath('./div[2]/pre/code/text()')[0]
        print(title,content)
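
The page-one special case can be folded into a small helper, which keeps the loop body free of URL logic. This is only a sketch: page_url, BASE_URL and PAGE_TEMPLATE are names introduced here, and the URLs are the ones used in the loop above.

```python
BASE_URL = 'https://duanziwang.com/'
PAGE_TEMPLATE = 'https://duanziwang.com/page/%d/index.html'

def page_url(page):
    """Return the URL for a 1-based page number; page 1 is the site root."""
    if page < 1:
        raise ValueError('page numbers start at 1')
    return BASE_URL if page == 1 else PAGE_TEMPLATE % page

urls = [page_url(p) for p in range(1, 4)]
print(urls)
```

The scraping loop would then read for page in range(1,6): new_url = page_url(page).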

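Baidu AI workflow

Go to ai.baidu.com
Log in to the console
Navigate to the specific feature module
Create an application
Open the SDK documentation and follow its sample code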
from aip import AipSpeech

""" Your APPID, AK and SK (fill in the values from your own application) """
APP_ID = 'your-app-id'
API_KEY = 'your-api-key'
SECRET_KEY = 'your-secret-key'

client = AipSpeech(APP_ID, API_KEY, SECRET_KEY)

text = '先生,电梯仍在正常运行,只不过您进的那是电话间'
result  = client.synthesis(text, 'zh', 1, {
    'vol': 5,
    'per':4
})

# On success the call returns the audio as bytes; on failure it returns a dict (see the error codes in the SDK documentation)
if not isinstance(result, dict):
    with open('aaa.mp3', 'wb') as f:
        f.write(result)

Scraping images

url: http://pic.netbian.com/4kmeinv/

url = 'http://pic.netbian.com/4kmeinv/index_%d.html'
for page in range(1,6):
    if page == 1:
        new_url = 'http://pic.netbian.com/4kmeinv/'
    else:
        new_url = url % page

    response = requests.get(new_url,headers=headers)
    response.encoding = 'gbk'
    page_text =response.text

    tree = etree.HTML(page_text)
    li_list = tree.xpath('//*[@id="main"]/div[3]/ul/li')
    for li in li_list:
        name = li.xpath('./a/img/@alt')[0]+'.jpg'
        img_src = 'http://pic.netbian.com'+li.xpath('./a/img/@src')[0]
        print(name,img_src)
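Notes on xpath expressions
Do not put the tbody tag in an xpath expression: browsers insert tbody when they render a table, so it is often absent from the raw page source
The pipe operator | can be used in an xpath expression
//div/p[1] | //a/div/p/img/p[1]
Purpose: it makes the xpath expression more general, because either branch may match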
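
A small sketch of the pipe operator at work: two made-up page layouts keep the same piece of text in different places, and one union expression covers both. The layouts, the expression EXPR and the helper headline are assumptions for illustration.

```python
from lxml import etree

# Two hypothetical page layouts that keep the headline in different places
layout_a = '<html><body><div class="hot"><p>headline A</p></div></body></html>'
layout_b = '<html><body><section><p class="title">headline B</p></section></body></html>'

# Either branch of the union may match; the other branch simply returns nothing
EXPR = '//div[@class="hot"]/p/text() | //section/p[@class="title"]/text()'

def headline(page_text):
    """Return the first text node matched by either branch, or None."""
    found = etree.HTML(page_text).xpath(EXPR)
    return found[0] if found else None

print(headline(layout_a), headline(layout_b))
```

This is useful when a site serves slightly different markup across pages and a single path would miss some of them.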

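cookie

Cookies can serve as an anti-scraping mechanism

Ways to handle cookies
1 Manually
Copy the cookie from the packet-capture tool straight into the headers dict
2 Automatically
Use a session object.
requests.Session() returns a session object.
A session object can call get and post to send requests, just like requests
If a request sent through the session produces a cookie, the cookie is automatically stored in the session object.
Once the cookie is stored in the session object, the next request sent with that session carries the cookie

Question: when cookies are handled automatically, how many times is the session object used at least?
Twice. The first call captures the cookie and stores it in the session; after that, the cookie-carrying session is used to send the remaining requests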
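
The carry-over can be observed without sending anything. In the sketch below the cookie is stored by hand as a stand-in for the first visit, and the follow-up request is only prepared, not sent; the cookie name, its value and the example.com URLs are assumptions for illustration.

```python
import requests

sess = requests.Session()

# Stand-in for the first visit: store a cookie in the session by hand
# (a real first request would do this from the Set-Cookie response header)
sess.cookies.set('token', 'abc123', domain='example.com', path='/')

# Build, but do not send, a follow-up request through the same session
req = requests.Request('GET', 'https://example.com/api/list.json')
prepared = sess.prepare_request(req)

# The session has attached its stored cookie to the outgoing request
cookie_header = prepared.headers.get('Cookie')
print(cookie_header)
```

A plain requests.get call has no such jar behind it, which is why the first approach needs the cookie pasted into headers by hand.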

Scraping news data from 雪球网 https://xueqiu.com/

import requests
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
}
# Without a cookie the data cannot be fetched
url = 'https://xueqiu.com/statuses/hot/listV2.json?since_id=-1&max_id=132719&size=15'
requests.get(url,headers=headers).json()
{'error_description': '遇到错误，请刷新页面或者重新登录帐号后再试',
 'error_uri': '/statuses/hot/listV2.json',
 'error_data': None,
 'error_code': '400016'}
 
# Fetch the data through a session
sess = requests.Session()
first_url = 'https://xueqiu.com/'
sess.get(first_url,headers=headers)  # visit the home page first so the cookie is captured into sess

url = 'https://xueqiu.com/statuses/hot/listV2.json?since_id=-1&max_id=132719&size=15'
sess.get(url,headers=headers).json()

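Proxies

What is a proxy
  A proxy server
What a proxy server does
  Forwards requests and responses
Proxy types
  http: forwards requests and responses that use the HTTP protocol
  https: forwards HTTPS
Proxy anonymity levels
	Transparent
	Anonymous
	High anonymity
Why does a scraper need a proxy?
If too many requests are sent in a short time, the server blacklists the requesting IP, and that IP can no longer send requests to the server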

Getting the machine's original IP address

from lxml import etree
url = 'https://www.sogou.com/web?query=ip'
page_text = requests.get(url,headers=headers).text
tree = etree.HTML(page_text)
tree.xpath('//*[@id="ipsearchresult"]/strong/text()')[0]
Result:
'123.112.22.134\xa0\xa0\xa0\n\n                未知来源\n        '
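
The raw result carries non-breaking spaces, newlines and a trailing label (未知来源, "unknown source"). A minimal sketch for keeping just the address, assuming an IPv4 result; extract_ipv4 is a hypothetical helper name.

```python
import re

# The raw xpath result shown above
raw = '123.112.22.134\xa0\xa0\xa0\n\n                未知来源\n        '

def extract_ipv4(text):
    """Return the first dotted-quad found in text, or None."""
    match = re.search(r'\d{1,3}(?:\.\d{1,3}){3}', text)
    return match.group(0) if match else None

ip = extract_ipv4(raw)
print(ip)
```

The pattern only checks the shape of the address and does not validate that each part is at most 255.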

Using a proxy and checking the proxy's IP

Pass proxies to the get or post method
proxies={'https':'https://ip:port'}

Getting proxies
代理精灵: http://http.zhiliandaili.cn/


import requests
url = 'https://www.sogou.com/web?query=ip'
page_text = requests.get(url,headers=headers,proxies={'https':'https://106.45.105.2:41942'}).text
tree = etree.HTML(page_text)
tree.xpath('//*[@id="ipsearchresult"]/strong/text()')[0]
'106.45.105.2\xa0\xa0\xa0\n\n                未知来源\n        '

Building a proxy pool

Fetch all the extracted proxy IPs and wrap each one in a proxies dict; remember to prepend https://
import random
all_proxies = []  # the proxy pool
proxy_url = 'http://t.ipjldl.com/index.php/api/entry?method=proxyServer.generate_api_url&packid=1&fa=0&fetch_key=&groupid=0&qty=4&time=1&pro=&city=&port=1&format=html&ss=5&css=&dt=1&specialTxt=3&specialJson=&usertype=2'
page_text = requests.get(proxy_url,headers=headers).text
tree = etree.HTML(page_text)
ip_list = tree.xpath('//body//text()')
for ip in ip_list:
    dic = {'https': 'https://'+ip}  # a fresh dict for each proxy
    all_proxies.append(dic)
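
The pool-building step can be tried on its own. This sketch also strips whitespace and skips empty entries, which the text nodes returned by //body//text() may contain; build_proxy_pool is a hypothetical helper, and the addresses come from a documentation-only range, so they are not real proxies.

```python
import random

def build_proxy_pool(addresses, scheme='https'):
    """Turn 'ip:port' strings into a list of requests-style proxies dicts."""
    pool = []
    for addr in addresses:
        addr = addr.strip()
        if addr:  # skip blank text nodes
            pool.append({scheme: '%s://%s' % (scheme, addr)})
    return pool

# Hypothetical addresses from a documentation range, not real proxies
raw_lines = ['192.0.2.10:8080', ' 192.0.2.11:3128\n', '']
all_proxies = build_proxy_pool(raw_lines)

# Pick one proxies dict at random for the next request
chosen = random.choice(all_proxies)
print(all_proxies, chosen)
```

Each entry is already in the shape that the proxies argument of requests.get expects.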

Using the proxy pool above on 校花网

url = 'http://www.521609.com/tuku/index_%d.html'
all_data = []
for page in range(2,52):
    print('Scraping page %d ...' % page)
    new_url = url % page
    page_text = requests.get(new_url,headers=headers,proxies=random.choice(all_proxies)).text
    tree = etree.HTML(page_text)
    li_list = tree.xpath('/html/body/div[4]/div[3]/ul/li')
    for li in li_list:
        img_url = 'http://www.521609.com'+li.xpath('./a/img/@src')[0]
        img_data = requests.get(img_url,headers=headers).content
        all_data.append(img_data)
print(len(all_data))

If an image request gets redirected, fetch it one more time