Qiantu (千图网): http://www.58pic.com
First, analyze the page structure:
Page 1: http://www.58pic.com/haibaomoban/0/id-0.html
Page 2: http://www.58pic.com/haibaomoban/0/id-1.html
…
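The pagination pattern can be sketched as a small helper. This is only a minimal sketch: `page_url` and `BASE` are hypothetical names, and it assumes the number in the URL simply counts up from 0, as the two example pages suggest.

```python
# URL template inferred from the two example pages (an assumption, not a documented API)
BASE = "http://www.58pic.com/haibaomoban/0/id-{}.html"


def page_url(index):
    """Return the listing URL for a zero-based page index."""
    return BASE.format(index)


# Print the URLs of the first three listing pages
for n in range(3):
    print(page_url(n))
```

Keeping the template in one place makes it easy to adjust if the site numbers its pages differently.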
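Next, analyze the image thumbnails on each page:
Thumbnail: http://pic.qiantucdn.com/58pic/23/35/02/53n58PICCrZ.jpg!gtwebp226
HD image: http://pic.qiantucdn.com/58pic/23/35/02/53n58PICCrZ_1024.jpg
A few images do not follow this pattern, so they are handled with exception handling.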
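The thumbnail-to-HD mapping used by the script (drop everything from `.jpg!` onward, then append `_1024.jpg`) can be tried in isolation. This is a sketch under that assumption: `hd_url` and `THUMB_RE` are hypothetical names, and returning `None` for a non-matching URL is just one possible way to flag the irregular images.

```python
import re

# Capture everything before ".jpg!" in a thumbnail URL (pattern assumed from the example URL)
THUMB_RE = re.compile(r"^(.*?)\.jpg!")


def hd_url(thumb_url):
    """Return the guessed HD URL, or None if the thumbnail does not fit the pattern."""
    match = THUMB_RE.match(thumb_url)
    if match is None:
        return None  # irregular URL: the caller can skip or log it
    return match.group(1) + "_1024.jpg"


thumb = "http://pic.qiantucdn.com/58pic/23/35/02/53n58PICCrZ.jpg!gtwebp226"
print(hd_url(thumb))                           # URL ending in 53n58PICCrZ_1024.jpg
print(hd_url("http://example.com/other.png"))  # None
```

A URL that matches the pattern may still fail to download, so the network call itself still needs its own error handling.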
The complete script:
import re
import urllib.error
import urllib.request


def main():
    # range(1, 10) requests id-1.html through id-9.html
    for i in range(1, 10):
        pageurl = "http://www.58pic.com/haibaomoban/0/id-" + str(i) + ".html"
        data = urllib.request.urlopen(pageurl).read().decode("utf-8", "ignore")
        # Capture each thumbnail URL up to (but not including) ".jpg!"
        pat = r'<a class="thumb-box".*?src="(.*?)\.jpg!'
        imglist = re.compile(pat).findall(data)
        for j in range(0, len(imglist)):  # download the HD version of each image
            try:
                thisimg = imglist[j]  # current image URL (non-HD, extension stripped)
                thisimgurl = thisimg + "_1024.jpg"  # HD image URL
                # Output path
                file = "E:/Python数据分析与挖掘课程/result/32/" + str(i) + str(j) + ".jpg"
                urllib.request.urlretrieve(thisimgurl, filename=file)
                print("Page " + str(i) + ", image " + str(j) + " downloaded successfully")
            except urllib.error.URLError as e:
                # HTTPError carries a status code; a plain URLError only has a reason
                if hasattr(e, "code"):
                    print(e.code)
                if hasattr(e, "reason"):
                    print(e.reason)
            except Exception as e:
                print(e)


if __name__ == '__main__':
    main()
The results are as follows: