I needed to crawl a domestic website, but the site bans IPs, so there was no way around using proxies; I built my own proxy pool, maintained by 20 processes.
The connection is a 20 Mbps line, which works out to about 2.5 MB/s of real throughput, and for various other reasons the actual speed doesn't always get even that high. A crawler like this is quite demanding on bandwidth.
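The pool itself is outside the scope of this post. As a minimal sketch of the idea, assuming a hypothetical provider endpoint for fresh proxies, a refresher process could keep ten "ip:port" strings under the Redis keys "1" through "10" that the download code below reads:

# Minimal sketch of a proxy-pool refresher. The provider URL is a
# hypothetical placeholder; the Redis keys "1".."10" match the reader below.
import time
import redis
import requests

r = redis.StrictRedis(host="127.0.0.1", port=6379, db=0)

def fetch_proxies():
    # Hypothetical provider API returning one "ip:port" per line.
    resp = requests.get("http://proxy-provider.example/api?count=10", timeout=10)
    return resp.text.strip().split("\n")

def refresh_pool():
    while True:
        for i, proxy in enumerate(fetch_proxies()[:10], start=1):
            r.set(str(i), proxy)
        time.sleep(60)  # swap in fresh proxies once a minute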
The first step is to scrape the image URLs and save them to the database; the second is to download the images with a multiprocessing pool.
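That first harvesting pass isn't reproduced in this post; a rough sketch of what it boils down to (connection details and the save_img_urls helper are assumptions, but the table matches the SELECT in the main block below):

# Rough sketch of the first step: store harvested image URLs in MySQL
# (connection details are placeholders; the table matches the SELECT below).
import pymysql

conn = pymysql.connect(host="127.0.0.1", user="root", passwd="root",
                       db="spider", charset="utf8")

def save_img_urls(item_id, img_urls):
    cursor = conn.cursor()
    for img_url in img_urls:
        cursor.execute(
            "INSERT INTO 2017_xia_erci_pic (item_id, item_imgurl) VALUES (%s, %s)",
            (item_id, img_url))
    conn.commit()
    cursor.close()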
In testing, 40 processes downloaded about 200 images a minute, but with 60 processes the rate actually dropped to about 120 a minute, presumably because the extra workers just contend for the same bandwidth.
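Numbers like these suggest measuring rather than guessing the pool size. A small harness for that, reusing the getUpdataImage worker and resultList from the main block below:

# Small sketch for picking the pool size: time the same batch of rows
# at several worker counts (getUpdataImage/resultList are defined below).
import time
import multiprocessing

def benchmark(worker_counts, items):
    for n in worker_counts:
        pool = multiprocessing.Pool(n)
        start = time.time()
        for item in items:
            pool.apply_async(getUpdataImage, (item,))
        pool.close()
        pool.join()
        print "%d workers: %.1f s for %d images" % (n, time.time() - start, len(items))

# e.g. benchmark([20, 40, 60], resultList[:200])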
Note: when fetching images or video, always send the request headers, in particular Accept-Encoding, so the server can compress the transfer.
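A quick sanity check that compression is actually negotiated (a sketch; pass in any image URL from the database):

# Sanity check: did the server honor Accept-Encoding?
import requests

def check_compression(img_url):
    resp = requests.get(img_url,
                        headers={"Accept-Encoding": "gzip, deflate"},
                        timeout=15)
    # 'gzip' here means the response body was compressed on the wire.
    print resp.headers.get("Content-Encoding")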
Here is the code:
# coding:utf-8
# common.contest is assumed to pull in requests, os, random, time,
# multiprocessing, the Redis client r, and the select_data helper.
from common.contest import *


def save_img(source_url, dir_path, file_name, maxQuests=11):
    headers = {
        "Host": "img5.artron.net",
        "Connection": "keep-alive",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.91 Safari/537.36",
        "Accept": "image/webp,image/apng,image/*,*/*;q=0.8",
        "Referer": "http://auction.artron.net/paimai-art5113610001/",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "zh-CN,zh;q=0.8",
    }

    # Pick a random proxy from the Redis-backed pool (keys "1".."10").
    proxies = r.get(str(random.randint(1, 10)))
    proxies = {"http": "http://" + str(proxies)}
    print "Using proxy:", proxies

    try:
        response = requests.get(url=source_url, headers=headers, verify=False,
                                proxies=proxies, timeout=15)
        if response.status_code == 200:
            if not os.path.exists(dir_path):
                os.makedirs(dir_path)
            total_path = dir_path + '/' + file_name
            # Stream the body to disk in 1 KB chunks.
            with open(total_path, 'wb') as f:
                for chunk in response.iter_content(1024):
                    f.write(chunk)
            print "Image saved locally"
            return "1"
        else:
            print "Image was not saved"
            return "0"
    except Exception as e:
        print e
        # On failure, retry with a fresh random proxy until the budget runs out.
        if maxQuests > 0:
            return save_img(source_url, dir_path, file_name, maxQuests - 1)
        return "0"


def getUpdataImage(item):
    url = item['item_imgurl']
    print "Crawling url:", url

    # Build the file name from the last four path segments of the URL.
    filenamelist = url.split('/')
    filename = "_".join(filenamelist[-4:])

    # Strip the image extension to get the directory name.
    filenamestr = filename
    for ext in ('.jpg', '.JPG', '.JPEG', '.jpeg',
                '.png', '.bmp', '.tif', '.gif'):
        filenamestr = filenamestr.replace(ext, '')

    localpath = 'G:/helloworld/' + filenamestr
    save_localpath = localpath + "/" + filename
    print "Image save path:", save_localpath

    try:
        result = save_img(url, localpath, filename)
        if result == "1":
            print "Image fetched successfully"
        else:
            print "Image fetch failed"
    except IOError:
        pass


if __name__ == "__main__":
    time1 = time.time()

    sql = """SELECT item_id, item_imgurl FROM 2017_xia_erci_pic """
    resultList = select_data(sql)
    print len(resultList)

    # Queue every row, then close() and join() to wait for all workers.
    pool = multiprocessing.Pool(60)
    for item in resultList:
        pool.apply_async(getUpdataImage, (item,))
    pool.close()
    pool.join()
    print "Elapsed:", time.time() - time1
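For reference, select_data arrives via the star import from common.contest and isn't shown here. A minimal stand-in, assuming pymysql and placeholder connection details, would return each row as a dict so the workers can index item['item_imgurl']:

# Minimal stand-in for the select_data helper from common.contest
# (assumption: MySQL via pymysql; connection details are placeholders).
import pymysql

def select_data(sql):
    conn = pymysql.connect(host="127.0.0.1", user="root", passwd="root",
                           db="spider", charset="utf8",
                           cursorclass=pymysql.cursors.DictCursor)
    cursor = conn.cursor()
    cursor.execute(sql)  # each row comes back as a dict keyed by column name
    rows = cursor.fetchall()
    cursor.close()
    conn.close()
    return rows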