Batch-downloading images from a web page in Python with re, os, httplib, and urllib

潇垚

License: CC 4.0 BY-SA

Category: Python learning. Tags: HTTP protocol, regular expressions, python, batch images, batch download

Article link: https://blog.youkuaiyun.com/u010872995/article/details/46271505

# Python 2 code: in Python 3, httplib became http.client and urlretrieve moved to urllib.request.
import re, httplib, urllib, os

conn = httplib.HTTPConnection("www.njupt.edu.cn")
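"""
   The conn.request() call further below could also be built from the
   lower-level methods that httplib provides, but that requires some
   familiarity with the HTTP protocol. For example:
   dataBody = urllib.urlencode({'@number': 12524, '@type': 'issue', '@action': 'show'})
   # putrequest() sends a Host header by itself unless skip_host is true
   conn.putrequest("POST", "/", skip_host=True)
   conn.putheader("Host", "www.njupt.edu.cn")
   conn.putheader("Content-Type", "application/x-www-form-urlencoded")
   conn.putheader("Content-Length", str(len(dataBody)))
   conn.putheader("Accept", "text/plain")
   conn.endheaders()
   conn.send(dataBody)
"""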

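"""
request() can also take header fields; a header passed in this way replaces
the one with the same name that httplib would otherwise generate.
For POST, the Content-Length field is computed and added by the method
itself, although we can also set it ourselves.
For example:
params = urllib.urlencode({'@number': 12524, '@type': 'issue', '@action': 'show'})
headers = {"Content-type": "application/x-www-form-urlencoded",
           "Content-Length": str(len(params)),
           "Accept": "text/plain"}
conn.request("POST", "/", params, headers)
"""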

"""
   There is no significant difference between the two approaches above.
"""

conn.request("GET", "/")
response = conn.getresponse()
htmlPage = response.read()
conn.close()
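"""
   Use a regular expression from the re module to extract the img addresses
   from the HTML page (I am not very familiar with HTML, so this part may
   have problems).
   Recommended book: Mastering Regular Expressions
"""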
reg = re.compile("<img *src=\"([^\"]+)\"")
"""
   Extract the image links from the page; the list may contain duplicates.
"""
pics = reg.findall(htmlPage)

dirname = "/home/myhome/pic"
os.chdir(dirname)
# Matches absolute URLs, both http:// and https://
reg1 = re.compile("https?://")

"""
   Used to extract the file name (the last path segment, e.g. logo.png)
"""
reg2 = re.compile(r"/([^/]+\.[a-zA-Z]{3,4})$")
"""
   tuple(set(pics)) removes duplicates
"""
for pic in tuple(set(pics)):
    # Addresses that reg1 does not match are treated as site-relative paths
    # (assumed to start with "/") and get the host prepended.
    if not reg1.match(pic):
        pic = "http://www.njupt.edu.cn" + pic
    names = reg2.findall(pic)
    if not names:
        # No trailing "name.ext" segment (e.g. a URL with a query string): skip it
        continue
    filename = names[0]
    urllib.urlretrieve(pic, filename)
    print(filename + " download completed....")
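
The regular expression in the script only matches tags in which src directly follows <img, and the loop assumes that every non-absolute address begins with "/". As one possible alternative, here is a minimal sketch written for Python 3, where html.parser and urllib.parse take the place of the Python 2 modules used in this post. The names ImgSrcParser, extract_image_urls, and filename_from_url are hypothetical helpers introduced for illustration; they are not part of the original script.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class ImgSrcParser(HTMLParser):
    """Collects the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; src may appear in any position
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def extract_image_urls(html, base_url):
    """Return absolute image URLs without duplicates, in first-seen order."""
    parser = ImgSrcParser()
    parser.feed(html)
    urls = []
    for src in parser.sources:
        # urljoin resolves "/x", "x" and full URLs against the page address
        absolute = urljoin(base_url, src)
        if absolute not in urls:
            urls.append(absolute)
    return urls


def filename_from_url(url):
    """Use the last path segment as the file name, ignoring any query string."""
    return posixpath.basename(urlsplit(url).path)
```

With these helpers, each returned URL could then be passed to urllib.request.urlretrieve(url, filename_from_url(url)), which is the Python 3 counterpart of the urllib.urlretrieve call used in the script.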

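A few additional notes:

(1) When downloading the image files later, if we construct the HTTP headers ourselves, we must remember to set the Host header to the host that each image actually lives on.

(2) We can call httplib's request() and then use length = int(HTTPResponse.getheader("Content-Length")) and HTTPResponse.read(length) to read the image data ourselves and write it to disk. This lets us reuse a single TCP connection to download several image files, but before doing so we must group the image files by domain name.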
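
The two notes above can be sketched roughly as follows. This is only an illustration for Python 3 (http.client is the renamed httplib), and the function names group_by_host and fetch_all are hypothetical. It assumes plain http:// URLs and a server that keeps the connection open between requests.

```python
import http.client
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_host(urls):
    """Map each host to the request targets (path plus query) found on it."""
    groups = defaultdict(list)
    for url in urls:
        parts = urlsplit(url)
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query
        groups[parts.netloc].append(target)
    return dict(groups)


def fetch_all(host, targets):
    """Fetch every target on one host over a single HTTPConnection."""
    conn = http.client.HTTPConnection(host, timeout=10)
    bodies = {}
    try:
        for target in targets:
            # The Host header is derived from `host`, so it always matches
            # the server that the image belongs to.
            conn.request("GET", target)
            response = conn.getresponse()
            length = response.getheader("Content-Length")
            # The body must be read completely before the next request
            # can be sent on the same connection.
            if length is not None:
                bodies[target] = response.read(int(length))
            else:
                bodies[target] = response.read()
    finally:
        conn.close()
    return bodies
```

A caller would loop over group_by_host(urls).items(), call fetch_all(host, targets) once per host, and write each returned body to disk in binary mode. For https:// URLs, http.client.HTTPSConnection would be needed instead.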