Multithreaded scraping of the Qiushi site (糗事网) in Python 3

Latest recommended article published 2021-08-19 21:38:22

浩瀚云海

License: CC 4.0 BY-SA

Categories: python, web crawlers. Tags: python, threading, web crawlers, locks

Article link: https://blog.youkuaiyun.com/qq_35723619/article/details/83348315

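This article presents a multithreaded crawler design that uses queue to pass data between threads. One class crawls pages and another processes the data, so collection and parsing run side by side. It shows how a fixed set of worker threads, two queues, and a lock are coordinated so that results are handled correctly.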

1. Import the modules:

The crawler is multithreaded, and queue is used to pass data between the threads.
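The import code for this step is not present in this text. As a minimal sketch, assuming only the standard library, the following shows how `queue.Queue` hands items from one thread to another without explicit locking. The names `producer`, `consumer`, and `run_demo` are made up for illustration and are not from the original post.

```python
import threading
from queue import Queue, Empty


def producer(q, items):
    # Queue does its own locking, so put() is safe from any thread.
    for item in items:
        q.put(item)


def consumer(q, results):
    # Drain the queue without blocking; stop once it is empty.
    while True:
        try:
            item = q.get(block=False)
        except Empty:
            break
        results.append(item * 2)


def run_demo():
    q = Queue()
    results = []
    p = threading.Thread(target=producer, args=(q, [1, 2, 3]))
    p.start()
    p.join()  # everything is queued before the consumer starts
    c = threading.Thread(target=consumer, args=(q, results))
    c.start()
    c.join()
    return results
```

With a single consumer the items come out in the order they were queued, so `run_demo()` returns the doubled values in input order.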

2. Create the page-crawling class
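The class body for this step is missing from this text, so what follows is only a sketch of what a compatible `ThreadCrawl` might look like, not the author's code. It assumes the constructor arguments used in `main()` (`threadName`, `pageQueue`, `dataQueue`) and the global `CRAM_EXIT` flag. The `fetch` parameter and the `crawl_demo` helper are hypothetical; the default `fetch` returns placeholder text instead of making a real HTTP request.

```python
import threading
import time
from queue import Queue, Empty

CRAM_EXIT = False  # same flag name that main() sets once pageQueue is empty


class ThreadCrawl(threading.Thread):
    def __init__(self, threadName, pageQueue, dataQueue, fetch=None):
        super().__init__()
        self.threadName = threadName
        self.pageQueue = pageQueue
        self.dataQueue = dataQueue
        # Hypothetical hook: page number -> page source (placeholder by default).
        self.fetch = fetch or (lambda page: "<html>page %d</html>" % page)

    def run(self):
        while not CRAM_EXIT:
            try:
                # Non-blocking get, so the loop can re-check CRAM_EXIT.
                page = self.pageQueue.get(block=False)
            except Empty:
                time.sleep(0.01)
                continue
            self.dataQueue.put(self.fetch(page))


def crawl_demo():
    # Hypothetical driver mirroring the shutdown order used in main().
    global CRAM_EXIT
    CRAM_EXIT = False
    pageQueue, dataQueue = Queue(), Queue()
    for i in range(1, 4):
        pageQueue.put(i)
    t = ThreadCrawl("Crawl thread 1", pageQueue, dataQueue)
    t.start()
    while not pageQueue.empty():
        time.sleep(0.01)
    CRAM_EXIT = True
    t.join()  # after join, every fetched page is in dataQueue
    return dataQueue.qsize()
```

Joining the crawl thread before looking at `dataQueue` matters: the page queue can be empty while the last page is still being fetched.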

3. Create the data-processing class
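The class body for this step is also missing here. The sketch below is an assumption about its shape, based only on how `main()` calls it: `ThreadParse(threadName, dataQueue, filename, lock)` plus the global `PARSE_EXIT` flag. The `parse` method is a placeholder that wraps the raw text in a dict rather than extracting real fields, and `parse_demo` is a hypothetical driver that writes to an in-memory buffer.

```python
import io
import json
import threading
import time
from queue import Queue, Empty

PARSE_EXIT = False  # same flag name that main() sets once dataQueue is empty


class ThreadParse(threading.Thread):
    def __init__(self, threadName, dataQueue, filename, lock):
        super().__init__()
        self.threadName = threadName
        self.dataQueue = dataQueue
        self.filename = filename  # an open file object, as in main()
        self.lock = lock

    def run(self):
        while not PARSE_EXIT:
            try:
                html = self.dataQueue.get(block=False)
            except Empty:
                time.sleep(0.01)
                continue
            self.parse(html)

    def parse(self, html):
        # Placeholder for real extraction logic.
        item = {"content": html}
        # The lock keeps lines from different threads from interleaving.
        with self.lock:
            self.filename.write(json.dumps(item, ensure_ascii=False) + "\n")


def parse_demo():
    global PARSE_EXIT
    PARSE_EXIT = False
    dataQueue = Queue()
    for text in ("a", "bb", "ccc"):
        dataQueue.put(text)
    out = io.StringIO()
    lock = threading.Lock()
    workers = [ThreadParse("Parse thread %d" % n, dataQueue, out, lock)
               for n in (1, 2)]
    for w in workers:
        w.start()
    while not dataQueue.empty():
        time.sleep(0.01)
    PARSE_EXIT = True
    for w in workers:
        w.join()
    return out.getvalue().splitlines()
```

Each record is written as one JSON line; with more than one parse thread the order of the lines is not guaranteed.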

4. Create the driver function

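from queue import Queue
import threading
import time

# Exit flags polled by the crawl and parse threads.
# ThreadCrawl and ThreadParse are the classes from steps 2 and 3.
CRAM_EXIT = False
PARSE_EXIT = False


def main():
    global CRAM_EXIT, PARSE_EXIT

    # Page numbers to crawl: 1 through 20
    pageQueue = Queue(20)
    for i in range(1, 21):
        pageQueue.put(i)
    # Crawl results, waiting to be parsed
    dataQueue = Queue()
    # Output file (a file object, despite the name), opened in append mode
    filename = open("E://file/qiushi2.json", "a", encoding="utf-8")
    # Create the lock that is passed to the parse threads
    lock = threading.Lock()

    # Names of the three crawl threads
    crawList = ['Crawl thread 1', 'Crawl thread 2', 'Crawl thread 3']
    # Holds the three crawl threads
    threadcrawl = []
    for threadName in crawList:
        thread = ThreadCrawl(threadName, pageQueue, dataQueue)
        thread.start()
        threadcrawl.append(thread)

    # Names of the three parse threads
    parseList = ["Parse thread 1", "Parse thread 2", "Parse thread 3"]
    # Holds the three parse threads
    threadparse = []
    for threadName in parseList:
        thread = ThreadParse(threadName, dataQueue, filename, lock)
        thread.start()
        threadparse.append(thread)

    # Wait until every page number has been taken; sleep instead of spinning
    while not pageQueue.empty():
        time.sleep(0.1)

    # Tell the crawl threads to exit after their current page
    CRAM_EXIT = True
    print('pageQueue is empty')

    for thread in threadcrawl:
        thread.join()
        print('1')

    # The crawl threads have finished; wait for the parse threads to drain dataQueue
    while not dataQueue.empty():
        time.sleep(0.1)

    PARSE_EXIT = True

    for thread in threadparse:
        thread.join()
        print('2')

    with lock:
        # Close the output file
        filename.close()
    print("Thank you for using this program!")


if __name__ == "__main__":
    main()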
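The driver above shuts its workers down by polling `empty()` and flipping global flags. For comparison only, here is a sketch of a different standard-library technique: `Queue.join()` with `task_done()` and a sentinel value. This is not the method used in the post; `STOP`, `worker`, and `run_pool` are hypothetical names, and squaring a number stands in for real work.

```python
import threading
from queue import Queue

STOP = object()  # hypothetical sentinel that tells a worker to exit


def worker(q, results, lock):
    while True:
        item = q.get()  # blocks until an item is available
        if item is STOP:
            q.task_done()
            break
        with lock:
            results.append(item * item)
        q.task_done()  # marks this item as fully processed


def run_pool(items, n_workers=3):
    q = Queue()
    results = []
    lock = threading.Lock()
    threads = [threading.Thread(target=worker, args=(q, results, lock))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for item in items:
        q.put(item)
    for _ in threads:
        q.put(STOP)  # one sentinel per worker
    q.join()  # returns once task_done() was called for every item
    for t in threads:
        t.join()
    return sorted(results)
```

Because `q.join()` waits for processing to finish rather than for the queue to look empty, no sleep loop or global flag is needed in this variant.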