Python Crawler 120 Source DIY – 001 – Scraping Desktop Wallpapers
Target site for this scrape: http://www.netbian.com/fengjing/, which hosts plenty of HD wallpapers. The listing pages show only preview thumbnails; the real HD image sits two clicks deeper, so the crawler assembles each HD image address itself and downloads from there.
Reference post:
https://blog.youkuaiyun.com/hihell/article/details/117024328?utm_source=app&app_version=4.16.0&code=app_1562916241&uLinkId=usr1mkqgl919blen
The original post used regular expressions with the re module and only downloaded the preview images; I reworked it on that basis to fetch the HD versions. This write-up is for reference only and offers one possible approach.
Tools:
- Python 3
- the requests library
- the parsel library
Approach:
- Analyze the page. Every preview image on a listing page sits under the list container, and each image maps to an abbreviated address such as "<a href="/desk/23791.htm…".
- Use parsel to parse out the initial addresses; the result is mixed with addresses we do not need (a minimal demo follows the snippet).
url = 'http://www.netbian.com/fengjing/'
response = requests.get(url)
sel = parsel.Selector(response.content.decode('gbk'))
lists = sel.css('.list li a::attr(href)').extract()  # grab the initial addresses
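To make the selector concrete, here is a self-contained parsel sketch run against a made-up HTML fragment that mimics the page's list container (the second href is an invented example of an unwanted link):
import parsel

# made-up fragment mimicking the .list container on the page
html = '''
<div class="list"><ul>
  <li><a href="/desk/23791.htm"><img src="s.jpg"></a></li>
  <li><a href="http://www.netbian.com/" target="_blank">ad</a></li>
</ul></div>
'''
sel = parsel.Selector(text=html)
print(sel.css('.list li a::attr(href)').extract())
# ['/desk/23791.htm', 'http://www.netbian.com/']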
- Filter out the unneeded addresses.
Opening a preview shows that the HD image page lives at http://www.netbian.com/desk/23791-1920x1080.htm, so startswith() can pick out the suitable links; one function cleans the addresses and reassembles them into a new list:
# clean and reassemble the addresses
def clearurl(lists):
    nurls = []
    for i in lists:
        if i.startswith('/desk/'):
            i = wurl + i[:-4] + '-1920x1080.htm'
            nurls.append(i)
    return nurls
The result is a list of HD page addresses such as http://www.netbian.com/desk/23791-1920x1080.htm.
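As a quick sanity check of clearurl (a sketch; wurl = 'http://www.netbian.com' as in the full script, and the second entry is just a non-/desk/ path that should be filtered out):
wurl = 'http://www.netbian.com'
print(clearurl(['/desk/23791.htm', '/fengjing/index_2.htm']))
# ['http://www.netbian.com/desk/23791-1920x1080.htm']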
- Parse the HD image page, pull out the final image address, and download it.
The image address comes straight from the page source:
gqurls = clearurl(lists)
response = requests.get(gqurls[0])  # testing with the first address only
sel = parsel.Selector(response.content.decode('gbk'))
gpic = sel.css('td a::attr(href)').extract_first()
image = requests.get(gpic).content
with open('../eg001/' + '1.jpg', 'wb') as f:
    f.write(image)
With that, the first image goes from parsing to download end to end.
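Two failure modes worth guarding against, though the original code skips them: extract_first() returns None when the selector matches nothing, and requests does not raise on HTTP errors by default. A slightly defensive sketch of the same test download:
gqurls = clearurl(lists)
response = requests.get(gqurls[0])
response.raise_for_status()  # fail loudly on HTTP errors
sel = parsel.Selector(response.content.decode('gbk'))
gpic = sel.css('td a::attr(href)').extract_first()
if gpic:  # None means the selector found nothing
    with open('../eg001/1.jpg', 'wb') as f:
        f.write(requests.get(gpic).content)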
- Parse the pagination to collect more image addresses.
Watching how the listing URL changes from page to page gives:
http://www.netbian.com/fengjing/index.htm
http://www.netbian.com/fengjing/index_2.htm
http://www.netbian.com/fengjing/index_3.htm
…
http://www.netbian.com/fengjing/index_205.htm
Apart from the first page, which carries no index number, the addresses follow a regular pattern, so they can be rebuilt as a list:
def urls():
    url_list = ['http://www.netbian.com/fengjing/index_{}.htm'.format(i) for i in range(2, 206)]
    url_list.insert(0, 'http://www.netbian.com/fengjing/')
    return url_list
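A quick check of the generated list:
ulist = urls()
print(len(ulist))   # 205 entries, one per page
print(ulist[:3])
# ['http://www.netbian.com/fengjing/',
#  'http://www.netbian.com/fengjing/index_2.htm',
#  'http://www.netbian.com/fengjing/index_3.htm']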
- Wrap the sel object in a helper.
The Selector has to be rebuilt for every page we parse, so factor it into a function:
# build a Selector for a URL
def t_sel(url):
    response = requests.get(url)
    sel = parsel.Selector(response.content.decode('gbk'))
    return sel
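The original post never reports the site blocking requests, but if the default requests User-Agent were ever rejected, a precautionary variant of the helper could pass a browser-like header and a timeout (an assumption, not the author's code):
# variant of t_sel with a browser-like UA and a timeout
def t_sel(url):
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers, timeout=10)
    return parsel.Selector(response.content.decode('gbk'))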
Complete source code:
#!/usr/bin/env python
# coding=utf-8
'''
No. 001
Wallpaper scraper
http://www.netbian.com/fengjing/
'''
import os

import requests
import parsel

url = 'http://www.netbian.com/fengjing/'
wurl = 'http://www.netbian.com'

# build the list of paginated URLs
def urls():
    # limited to the first two pages here for testing; use range(2, 206) for all 205 pages
    url_list = ['http://www.netbian.com/fengjing/index_{}.htm'.format(i) for i in range(2, 3)]
    url_list.insert(0, url)
    return url_list

# build a Selector for a URL
def t_sel(url):
    response = requests.get(url)
    sel = parsel.Selector(response.content.decode('gbk'))
    return sel

# clean the raw links and assemble the HD page addresses
def clearurl(lists):
    nurls = []
    for i in lists:
        if i.startswith('/desk/'):
            i = wurl + i[:-4] + '-1920x1080.htm'
            nurls.append(i)
    return nurls

# download every HD image in the list
def savepic(gqurls):
    for g_url in gqurls:
        sel = t_sel(g_url)
        gpic = sel.css('td a::attr(href)').extract_first()
        image = requests.get(gpic).content
        # g_url[28:-4] is the '23791-1920x1080' part of the page address
        with open('../eg001/' + str(g_url[28:-4]) + '.jpg', 'wb') as f:
            f.write(image)

if __name__ == '__main__':
    os.makedirs('../eg001/', exist_ok=True)  # make sure the output directory exists
    ulist = urls()
    for url in ulist:
        sel = t_sel(url)
        lists = sel.css('.list li a::attr(href)').extract()  # grab the initial addresses
        gqurls = clearurl(lists)
        savepic(gqurls)
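One courtesy the original omits: pausing between listing pages so the crawl stays gentle on the server. A minimal, assumed variant of the main loop:
import time

if __name__ == '__main__':
    for url in urls():
        sel = t_sel(url)
        lists = sel.css('.list li a::attr(href)').extract()
        savepic(clearurl(lists))
        time.sleep(1)  # pause one second between listing pages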