I. Background
The previous post, "Downloading Geek Comics — A Practical Beautiful Soup Example", downloaded every comic on the site (more than 3,000 images) with a single thread, which was slow and inefficient. This post optimizes that code: the downloads are split across multiple threads, and you can specify a range of comic numbers to fetch (for example, comics 1 through 100), which makes the program run more efficiently.
II. What Changed
1. Define the downloadXkcd() function:
def downloadXkcd(startComic, endComic):
    for urlNumber in range(startComic, endComic + 1):  # endComic is inclusive
        # Download the page.
        print('Downloading page https://xkcd.com/%s...' % urlNumber)
        res = requests.get('https://xkcd.com/%s' % urlNumber)
        res.raise_for_status()
        soup = bs4.BeautifulSoup(res.text, 'html.parser')

        # Find the URL of the comic image.
        comicElem = soup.select('#comic img')
        if not comicElem:
            print('Could not find comic image.')
        else:
            comicUrl = 'https:' + comicElem[0].get('src')

            # Download the image.
            print('Downloading image %s...' % comicUrl)
            res = requests.get(comicUrl)
            res.raise_for_status()

            # Save the image to ./xkcd.
            with open(os.path.join('xkcd', os.path.basename(comicUrl)), 'wb') as imageFile:
                for chunk in res.iter_content(100000):
                    imageFile.write(chunk)
Explanation:
After importing the required modules and creating a directory to hold the comics, we define downloadXkcd(). The function loops over every comic number in the specified range and downloads each page. Beautiful Soup parses each page's HTML to find the comic image. If the page has no comic image, a message is printed; otherwise, the function takes the image's URL and downloads the image. Finally, the image is saved to the directory we created.
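If you want to see what that selector step actually returns, the short standalone sketch below (not part of the final script) fetches a single page and prints the image URL it finds; the comic number 353 is just an arbitrary example.

import requests, bs4

# Fetch one page and see what the '#comic img' selector matches.
res = requests.get('https://xkcd.com/353')  # any existing comic number works here
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')

comicElem = soup.select('#comic img')
if not comicElem:
    print('Could not find comic image.')
else:
    # The src attribute is protocol-relative (it starts with //),
    # which is why the downloader prepends 'https:'.
    print('Image URL:', 'https:' + comicElem[0].get('src'))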
2. Create multiple threads
downloadThreads = []
for i in range(0, 100, 10):
    start = i
    end = i + 9
    if start == 0:
        start = 1  # There is no comic 0, so set it to 1.
    downloadThread = threading.Thread(target=downloadXkcd, args=(start, end))
    downloadThreads.append(downloadThread)
    downloadThread.start()
Explanation:
First we create an empty list, downloadThreads, which helps us keep track of the many Thread objects we create. Then the for loop begins. On each iteration, we create a Thread object with threading.Thread(), append it to the list, and call start() to begin running downloadXkcd() in the new thread. Because the for loop sets the variable i from 0 up to (but not including) 100 in steps of 10, i is 0 on the first iteration, 10 on the second, 20 on the third, and so on. Because we pass args=(start, end) to threading.Thread(), the two arguments passed to downloadXkcd() are 1 and 9 on the first iteration, 10 and 19 on the second, 20 and 29 on the third, and so on. As each Thread object's start() method is called, the new thread begins running the code in downloadXkcd(), while the main thread continues to the next iteration of the for loop and creates the next thread.
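To make that partitioning concrete, the following sketch reproduces only the loop's bookkeeping and prints the (start, end) pair each thread would receive; no downloads happen here.

# Print the comic range handed to each of the 10 threads.
for i in range(0, 100, 10):
    start = i
    end = i + 9
    if start == 0:
        start = 1  # There is no comic 0, so set it to 1.
    print('thread range: %d-%d' % (start, end))
# Prints 1-9, 10-19, 20-29, ... up to 90-99.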
3. Wait for all threads to finish
for downloadThread in downloadThreads:
    downloadThread.join()
print('Done.')
Explanation:
The main thread continues to run normally while the threads we created download the comics. But suppose there is some code in the main thread that you want to run only after all the download threads have finished. Calling a Thread object's join() method blocks until that thread has finished. By using a for loop to iterate over every Thread object in the downloadThreads list, the main thread can call join() on each of the other threads. The string 'Done.' is printed only after all of the join() calls have returned. If a Thread object has already finished, its join() method returns immediately when called.
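As a toy illustration of these join() semantics, separate from the comic downloader, the sketch below starts three short-lived worker threads and shows that 'Done.' appears only after every worker has returned.

import threading, time

def worker(n):
    time.sleep(1)  # stand-in for a one-second download
    print('worker %d finished' % n)

threads = []
for n in range(3):
    t = threading.Thread(target=worker, args=(n,))
    threads.append(t)
    t.start()

for t in threads:
    t.join()  # blocks until this particular thread has finished
print('Done.')  # printed only after all three workers have returned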
III. Complete Code
import requests, os, bs4, threading

os.makedirs('xkcd', exist_ok=True)  # store comics in ./xkcd


def downloadXkcd(startComic, endComic):
    for urlNumber in range(startComic, endComic + 1):  # endComic is inclusive
        # Download the page.
        print('Downloading page https://xkcd.com/%s...' % urlNumber)
        res = requests.get('https://xkcd.com/%s' % urlNumber)
        res.raise_for_status()
        soup = bs4.BeautifulSoup(res.text, 'html.parser')

        # Find the URL of the comic image.
        comicElem = soup.select('#comic img')
        if not comicElem:
            print('Could not find comic image.')
        else:
            comicUrl = 'https:' + comicElem[0].get('src')

            # Download the image.
            print('Downloading image %s...' % comicUrl)
            res = requests.get(comicUrl)
            res.raise_for_status()

            # Save the image to ./xkcd.
            with open(os.path.join('xkcd', os.path.basename(comicUrl)), 'wb') as imageFile:
                for chunk in res.iter_content(100000):
                    imageFile.write(chunk)


def main():
    # Create and start the Thread objects.
    downloadThreads = []  # a list of all the Thread objects
    for i in range(0, 100, 10):  # loops 10 times, creating 10 threads
        start = i
        end = i + 9
        if start == 0:
            start = 1  # There is no comic 0, so set it to 1.
        downloadThread = threading.Thread(target=downloadXkcd, args=(start, end))
        downloadThreads.append(downloadThread)
        downloadThread.start()

    # Wait for all threads to end.
    for downloadThread in downloadThreads:
        downloadThread.join()
    print('Done.')


if __name__ == '__main__':
    main()
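For comparison only (this is not the article's code), the same fan-out-and-wait pattern can be expressed with the standard library's concurrent.futures module, which performs the join step automatically when the with block exits. A minimal sketch, assuming the downloadXkcd() defined above is available:

from concurrent.futures import ThreadPoolExecutor

# Submit the same ten (start, end) ranges to a pool of worker threads.
with ThreadPoolExecutor(max_workers=10) as executor:
    for i in range(0, 100, 10):
        start = 1 if i == 0 else i  # there is no comic 0
        executor.submit(downloadXkcd, start, i + 9)
# Leaving the with block waits for every submitted task to finish.
print('Done.')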