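Recently I needed to download images from Baidu for testing. I didn't want to download them by hand, so I looked into a script that does it automatically.
Here is the result: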
# -*- coding: utf-8 -*-
import os
import urllib2
import json

tags = ['运动服']  # search keywords; '运动服' means "sportswear"
urls = []
savePath = './'

for tag2 in tags:
    print 'start download theme :', tag2
    startNum = 0      # index of the first result to request
    resultNum = 60    # results per JSON request; 60 is the upper bound
    endnum = 3000     # stop once this many results have been requested
    totalNum = -1     # total number of results for the theme (-1 = not known yet)
    downloadNum = 0
    path = unicode(savePath + '/' + tag2 + '/', 'utf8')
    if not os.path.exists(path):
        os.makedirs(path)
    # Page through the results until the total is exhausted or endnum is reached.
    while (totalNum == -1 or startNum < totalNum) and startNum < endnum:
        oneRequeseNum = 0
        try:
            url = 'http://image.baidu.com/i?tn=baiduimagejson&width=&height=&ie=utf8&oe=utf-8&word=' + tag2 + '&pn=' + str(startNum) + '&rn=' + str(resultNum)
            user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
            headers = {"User-Agent": user_agent}
            req = urllib2.Request(url, headers=headers)
            html = urllib2.urlopen(req, timeout=100)
            jsonData = json.loads(html.read())
            # print jsonData
            if totalNum == -1:
                totalNum = jsonData['displayNum']
                print 'total number :', totalNum
            data = jsonData['data']
            for index, item in enumerate(data):
                oneRequeseNum += 1
                if item.has_key("objURL"):
                    # separate name so the request URL stays available for the error message
                    urls.append(item['objURL'])
        except Exception, e:
            print "Exception : ", str(e)
            print url
            oneRequeseNum = oneRequeseNum + 100  # skip ahead past the failing page
        finally:
            startNum = startNum + oneRequeseNum
        if oneRequeseNum == 0:
            break  # an empty page would otherwise loop forever
    print 'Finish download theme : ', tag2
    print 'Download images number :', startNum

# Write the collected image URLs to a text file, one per line.
ff = open('urls.txt', 'w')
for url in urls:
    ff.write('%s\n' % url.encode('utf-8'))  # json returns unicode strings
ff.close()
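The script above only collects image URLs into urls.txt; it does not fetch the images themselves. Below is a minimal sketch of that last step, written for Python 3 rather than the Python 2 used above. The helper names `filename_for`, `fetch_bytes` and `download_all`, the `fetch` parameter, and the numbered file naming scheme are assumptions made for illustration, not part of the original code.

```python
import os
import urllib.request
from urllib.parse import urlparse


def filename_for(url, index):
    """Build a local file name from the list position and the URL's extension."""
    ext = os.path.splitext(urlparse(url).path)[1].lower()
    if ext not in ('.jpg', '.jpeg', '.png', '.gif', '.bmp'):
        ext = '.jpg'  # assumption: fall back to .jpg for missing or unusual extensions
    return '%05d%s' % (index, ext)


def fetch_bytes(url, timeout=30):
    """Fetch one URL and return the response body as bytes."""
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()


def download_all(urls, dest_dir, fetch=fetch_bytes):
    """Save every URL into dest_dir and return the number of files written."""
    os.makedirs(dest_dir, exist_ok=True)
    saved = 0
    for index, url in enumerate(urls):
        try:
            data = fetch(url)
        except Exception as e:  # a dead link should not stop the whole run
            print('skip', url, ':', e)
            continue
        with open(os.path.join(dest_dir, filename_for(url, index)), 'wb') as f:
            f.write(data)
        saved += 1
    return saved
```

To use it, read urls.txt into a list of stripped, non-empty lines and pass that list to `download_all` together with a target folder. The `fetch` argument exists so the network call can be swapped out, for example when trying the function without internet access.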
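One thing to watch out for: the encoding parameters in the URL (ie=utf8, oe=utf-8 and so on) have to come before the search string. When I placed them after it, my program raised an error.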
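The parameter order can be made explicit by building the query string from an ordered list of pairs. This is only a sketch in Python 3 using `urllib.parse.urlencode`. The function name `build_query_url` is hypothetical, the endpoint and parameter names are copied from the script above, and the example does not check whether that endpoint still responds.

```python
from urllib.parse import urlencode

BASE_URL = 'http://image.baidu.com/i'  # endpoint taken from the script above


def build_query_url(word, start, count=60):
    """Return the JSON query URL with the encoding parameters ahead of the keyword."""
    params = [
        ('tn', 'baiduimagejson'),
        ('width', ''),
        ('height', ''),
        ('ie', 'utf8'),    # encoding parameters first ...
        ('oe', 'utf-8'),
        ('word', word),    # ... then the search keyword
        ('pn', start),     # index of the first result
        ('rn', count),     # results per request
    ]
    # A list of tuples keeps its order; urlencode percent-encodes the keyword as UTF-8.
    return BASE_URL + '?' + urlencode(params)
```

Unlike the plain string concatenation in the script, `urlencode` percent-encodes the keyword, so a non-ASCII search term ends up in the URL as escaped UTF-8 bytes.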
References:
http://blog.youkuaiyun.com/yuanwofei/article/details/16343743
http://www.devba.com/index.php/archives/3321.html
http://blog.youkuaiyun.com/viomag/article/details/38340993
The original code is from https://github.com/busz/BaiduImageDownloader