I. Background
The previous post, "Downloading Geek Comics — A Practical Beautiful Soup Example", downloaded every comic on the site (more than 3,000 images) with a single thread, which was slow and inefficient. This post optimizes that code: the downloads are split across multiple threads, and you can specify a range of comic numbers to fetch (for example, comics 1 through 100), which makes the program run more efficiently.
II. What Changed
1. Define the downloadXkcd() function:
def downloadXkcd(startComic, endComic):
    for urlNumber in range(startComic, endComic + 1):  # endComic is inclusive
        # Download the page.
        print('Downloading page https://xkcd.com/%s...' % urlNumber)
        res = requests.get('https://xkcd.com/%s' % urlNumber)
        res.raise_for_status()
        soup = bs4.BeautifulSoup(res.text, 'html.parser')

        # Find the URL of the comic image.
        comicElem = soup.select('#comic img')
        if not comicElem:
            print('Could not find comic image.')
        else:
            comicUrl = 'https:' + comicElem[0].get('src')

            # Download the image.
            print('Downloading image %s...' % comicUrl)
            res = requests.get(comicUrl)
            res.raise_for_status()

            # Save the image to ./xkcd.
            with open(os.path.join('xkcd', os.path.basename(comicUrl)), 'wb') as imageFile:
                for chunk in res.iter_content(100000):
                    imageFile.write(chunk)
Explanation:
After importing the required modules and creating a directory to hold the comics, we define downloadXkcd(). The function loops over every comic number in the specified range and downloads each page. Beautiful Soup parses each page's HTML to find the comic image. If the page has no comic image, a message is printed; otherwise, the function takes the image's URL and downloads the image. Finally, the image is saved to the directory we created.
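If you want to see what that selector step actually returns, the short standalone sketch below (not part of the final script) fetches a single page and prints the image URL it finds; the comic number 353 is just an arbitrary example.

import requests, bs4

# Fetch one page and see what the '#comic img' selector matches.
res = requests.get('https://xkcd.com/353')  # any existing comic number works here
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')

comicElem = soup.select('#comic img')
if not comicElem:
    print('Could not find comic image.')
else:
    # The src attribute is protocol-relative (it starts with //),
    # which is why the downloader prepends 'https:'.
    print('Image URL:', 'https:' + comicElem[0].get('src'))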
2. Create multiple threads
downloadThreads = []
for i in range(0, 100, 10):
    start = i
    end = i + 9
    if start == 0:
        start = 1  # There is no comic 0, so set it to 1.
    downloadThread = threading.Thread(target=downloadXkcd, args=(start, end))
    downloadThreads.append(downloadThread)
    downloadThread.start()
Explanation:
First we create an empty list, downloadThreads, which helps us keep track of the many Thread objects we create. Then the for loop begins. On each iteration, we create a Thread object with threading.Thread(), append it to the list, and call start() to begin running downloadXkcd() in the new thread. Because the for loop sets the variable i from 0 up to (but not including) 100 in steps of 10, i is 0 on the first iteration, 10 on the second, 20 on the third, and so on. Because we pass args=(start, end) to threading.Thread(), the two arguments passed to downloadXkcd() are 1 and 9 on the first iteration, 10 and 19 on the second, 20 and 29 on the third, and so on. As each Thread object's start() method is called, the new thread begins running the code in downloadXkcd(), while the main thread continues to the next iteration of the for loop and creates the next thread.
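To make that partitioning concrete, the following sketch reproduces only the loop's bookkeeping and prints the (start, end) pair each thread would receive; no downloads happen here.

# Print the comic range handed to each of the 10 threads.
for i in range(0, 100, 10):
    start = i
    end = i + 9
    if start == 0:
        start = 1  # There is no comic 0, so set it to 1.
    print('thread range: %d-%d' % (start, end))
# Prints 1-9, 10-19, 20-29, ... up to 90-99.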
3. Wait for all threads to finish
for downloadThread in downloadThreads:
    downloadThread.join()
print('Done.')
Explanation:
The main thread continues to run normally while the threads we created download the comics. But suppose there is some code in the main thread that you want to run only after all the download threads have finished. Calling a Thread object's join() method blocks until that thread has finished. By using a for loop to iterate over every Thread object in the downloadThreads list, the main thread can call join() on each of the other threads. The string 'Done.' is printed only after all of the join() calls have returned. If a Thread object has already finished, its join() method returns immediately when called.
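As a toy illustration of these join() semantics, separate from the comic downloader, the sketch below starts three short-lived worker threads and shows that 'Done.' appears only after every worker has returned.

import threading, time

def worker(n):
    time.sleep(1)  # stand-in for a one-second download
    print('worker %d finished' % n)

threads = []
for n in range(3):
    t = threading.Thread(target=worker, args=(n,))
    threads.append(t)
    t.start()

for t in threads:
    t.join()  # blocks until this particular thread has finished
print('Done.')  # printed only after all three workers have returned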
III. Complete Code
import requests, os, bs4, threading

os.makedirs('xkcd', exist_ok=True)  # store comics in ./xkcd


def downloadXkcd(startComic, endComic):
    for urlNumber in range(startComic, endComic + 1):  # endComic is inclusive
        # Download the page.
        print('Downloading page https://xkcd.com/%s...' % urlNumber)
        res = requests.get('https://xkcd.com/%s' % urlNumber)
        res.raise_for_status()
        soup = bs4.BeautifulSoup(res.text, 'html.parser')

        # Find the URL of the comic image.
        comicElem = soup.select('#comic img')
        if not comicElem:
            print('Could not find comic image.')
        else:
            comicUrl = 'https:' + comicElem[0].get('src')

            # Download the image.
            print('Downloading image %s...' % comicUrl)
            res = requests.get(comicUrl)
            res.raise_for_status()

            # Save the image to ./xkcd.
            with open(os.path.join('xkcd', os.path.basename(comicUrl)), 'wb') as imageFile:
                for chunk in res.iter_content(100000):
                    imageFile.write(chunk)


def main():
    # Create and start the Thread objects.
    downloadThreads = []  # a list of all the Thread objects
    for i in range(0, 100, 10):  # loops 10 times, creating 10 threads
        start = i
        end = i + 9
        if start == 0:
            start = 1  # There is no comic 0, so set it to 1.
        downloadThread = threading.Thread(target=downloadXkcd, args=(start, end))
        downloadThreads.append(downloadThread)
        downloadThread.start()

    # Wait for all threads to end.
    for downloadThread in downloadThreads:
        downloadThread.join()
    print('Done.')


if __name__ == '__main__':
    main()
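For comparison only (this is not the article's code), the same fan-out-and-wait pattern can be expressed with the standard library's concurrent.futures module, which performs the join step automatically when the with block exits. A minimal sketch, assuming the downloadXkcd() defined above is available:

from concurrent.futures import ThreadPoolExecutor

# Submit the same ten (start, end) ranges to a pool of worker threads.
with ThreadPoolExecutor(max_workers=10) as executor:
    for i in range(0, 100, 10):
        start = 1 if i == 0 else i  # there is no comic 0
        executor.submit(downloadXkcd, start, i + 9)
# Leaving the with block waits for every submitted task to finish.
print('Done.')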