＜居然讲爬虫＞ 7: Multithreaded Crawlers

Original article · first published 2023-05-05 20:52:24 · last modified 2023-05-05 21:00:04 · licensed under CC 4.0 BY-SA

Tags: #crawler #python #information-visualization


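This article shows how to implement a multithreaded crawler in Python: capping the number of threads to limit concurrency, using a task queue and a cache to improve efficiency, and handling exceptions and errors in HTTP requests so that the crawler keeps running reliably.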

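Multithreaded crawling is a common web-scraping technique in Python. Fetching pages is I/O-bound work, so several threads can wait on network responses at the same time, which raises crawl throughput. This article shows how to build a simple web crawler with Python threads.

First, we write a simple multithreaded crawler that fetches the HTML of the Baidu home page and prints the page title along with the ID of the thread that fetched it.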

import threading
import requests
from lxml import etree

def get_page():
    # Fetch the HTML of the Baidu home page
    url = 'https://www.baidu.com'
    # A timeout keeps a stalled connection from blocking the thread forever
    response = requests.get(url, timeout=10)
    return response.text

def parse_html(html):
    # Parse the HTML and print the text of the <title> tag
    root = etree.HTML(html)
    title = root.xpath('//title/text()')[0]
    print('title: ', title)

def crawl():
    # Crawl task run by each thread
    tid = threading.current_thread().ident
    print('Thread {} is crawling...'.format(tid))
    html = get_page()
    parse_html(html)

if __name__ == '__main__':
    # Create 5 threads and start a crawl task in each
    threads = []
    for _ in range(5):
        t = threading.Thread(target=crawl)
        threads.append(t)
        t.start()

    # Wait for all threads to finish
    for t in threads:
        t.join()

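The code above first defines three functions:

get_page: fetches the HTML of the Baidu home page.
parse_html: parses the HTML and extracts the text of the title tag.
crawl: runs one crawl task, which sends the request and then parses the page.

We then create 5 threads and start the crawl tasks. Each thread calls crawl, which prints the current thread's ID and the page title.

Finally, we call join on each thread to wait for all of them to finish.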
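
To make the efficiency claim concrete, here is a minimal sketch that compares running five simulated requests one after another with running them in five threads. The names fake_fetch, run_sequential, and run_threaded are illustrative and not part of the crawler above, and time.sleep stands in for the network wait so that the example runs without network access.

```python
import threading
import time

def fake_fetch(delay=0.1):
    # Stand-in for a network request: like real I/O, sleeping lets other threads run
    time.sleep(delay)

def run_sequential(n):
    # Run n simulated requests one after another and return the elapsed time
    start = time.perf_counter()
    for _ in range(n):
        fake_fetch()
    return time.perf_counter() - start

def run_threaded(n):
    # Run n simulated requests in n threads and return the elapsed time
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_fetch) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

if __name__ == '__main__':
    print('sequential: {:.2f}s'.format(run_sequential(5)))
    print('threaded:   {:.2f}s'.format(run_threaded(5)))
```

The sequential run has to take at least the sum of the delays, while the threaded run should take roughly as long as a single delay, because the waits overlap. Real requests vary far more than a fixed sleep, so treat the numbers as an illustration only.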

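We can improve this multithreaded crawler in several ways.

Controlling concurrency
Starting too many threads at once can exhaust system resources, so we cap the number of threads. Python's built-in queue module provides a task queue: a fixed pool of worker threads pulls URLs from it, so at most max_threads requests are in flight at any time.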

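import threading
from queue import Queue

# Create the task queue and the worker threads; cap the thread count at 10
task_queue = Queue()
threads = []
max_threads = 10
urls = ['https://www.baidu.com', 'https://www.qq.com', 'https://www.sina.com.cn']

def worker():
    # Pull URLs from the queue until a None sentinel arrives
    while True:
        url = task_queue.get()
        if url is None:
            task_queue.task_done()
            break
        crawl(url)  # the task function crawl(url) defined later in this article
        task_queue.task_done()

for _ in range(max_threads):
    t = threading.Thread(target=worker)
    threads.append(t)
    t.start()

# Add the tasks to the queue
for url in urls:
    task_queue.put(url)

# Wait until every task has been processed
task_queue.join()

# Stop the worker threads. threading.Thread has no stop() method,
# so send one None sentinel per worker and then join the threads.
for _ in threads:
    task_queue.put(None)
for t in threads:
    t.join()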
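
As an alternative to managing the queue and threads by hand, the standard library's concurrent.futures.ThreadPoolExecutor also caps the number of concurrent threads. The sketch below is one possible shape, not the approach this article uses: crawl_all and fake_fetch are hypothetical names, and fake_fetch replaces the real HTTP request so that the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def crawl_all(urls, fetch, max_workers=10):
    """Run fetch(url) for every URL with at most max_workers threads at once."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each submitted future back to the URL it is fetching
        future_to_url = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as e:
                # Keep the exception so that one bad URL does not stop the rest
                results[url] = e
    return results

def fake_fetch(url):
    # Hypothetical stand-in for a real HTTP request
    return 'page for ' + url

if __name__ == '__main__':
    pages = crawl_all(['https://www.baidu.com', 'https://www.qq.com'],
                      fake_fetch, max_workers=2)
    for url, page in pages.items():
        print(url, '->', page)
```

The with block waits for all submitted tasks and shuts the pool down on exit, so no sentinel values or explicit join calls are needed. The hand-written queue version remains useful when workers need to add new URLs to the queue while the crawl is running.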

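Caching request results
To avoid requesting the same URL more than once, we can cache the content of pages that have already been fetched. A Python dictionary is enough for this: the URL is the key and the page content is the value.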

import requests

cache = {}

def get_page(url):
    # If the URL is already cached, return the cached page
    if url in cache:
        return cache[url]

    # Otherwise send the HTTP request, then add the page to the cache
    response = requests.get(url, timeout=10)
    html = response.text
    cache[url] = html
    return html
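
One caveat with a plain dictionary: when several threads ask for the same uncached URL at the same moment, each of them can miss the cache and send its own request. The following sketch shows one way to make sure each URL is fetched only once, using a lock per URL. The PageCache class and the fetch parameter are assumptions made for illustration; fetch could be any function that takes a URL and returns the page content.

```python
import threading

class PageCache:
    """Cache that fetches each URL at most once, even across threads."""

    def __init__(self, fetch):
        self._fetch = fetch                # function: url -> page content
        self._pages = {}                   # url -> cached content
        self._lock = threading.Lock()      # guards the two dictionaries
        self._url_locks = {}               # url -> lock held while fetching it

    def get(self, url):
        with self._lock:
            if url in self._pages:
                return self._pages[url]
            url_lock = self._url_locks.setdefault(url, threading.Lock())
        # Only one thread per URL gets past this point at a time;
        # requests for different URLs still run in parallel.
        with url_lock:
            with self._lock:
                if url in self._pages:
                    return self._pages[url]
            page = self._fetch(url)
            with self._lock:
                self._pages[url] = page
            return page
```

A usage example would be cache = PageCache(fetch) followed by cache.get(url) inside the crawl function, with fetch wrapping the HTTP request. A single global lock held during the request would be simpler, but it would serialize every request and give up the benefit of multithreading.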

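Handling exceptions and errors
A real crawler runs into all kinds of failures, such as DNS resolution errors and HTTP request timeouts. To keep the program robust, we catch these exceptions and handle them instead of letting them kill a thread.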

def crawl(url):
    try:
        # Send the HTTP request
        html = get_page(url)

        # Parse the page content
        parse_html(html)

    except Exception as e:
        # Report the failure and carry on with the next task
        print('Error:', e)
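
Transient failures such as timeouts are often worth retrying rather than just logging. Below is a minimal sketch of a retry helper with exponential backoff. The name fetch_with_retry and its parameters are hypothetical, and fetch is any function that takes a URL and returns the page. The default retry_on=(OSError,) is an assumption chosen because Python's built-in TimeoutError and ConnectionError are subclasses of OSError, as is, to my knowledge, the base exception class of requests.

```python
import time

def fetch_with_retry(fetch, url, attempts=3, backoff=0.5, retry_on=(OSError,)):
    """Call fetch(url); on a retryable error, wait and retry up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except retry_on as e:
            if attempt == attempts:
                # Out of attempts: let the caller see the last error
                raise
            # Wait backoff, 2 * backoff, 4 * backoff, ... between attempts
            wait = backoff * 2 ** (attempt - 1)
            print('Attempt {} for {} failed ({}); retrying in {:.2f}s'.format(
                attempt, url, e, wait))
            time.sleep(wait)
```

Inside crawl, a call such as fetch_with_retry(get_page, url) could replace get_page(url), while the surrounding try/except still catches whatever error remains after the last attempt.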

Together, these additions make the crawler more robust, flexible, and practical. The complete multithreaded crawler looks like this:

import threading
import requests
from lxml import etree
from queue import Queue

# Maximum number of worker threads and the request headers
max_threads = 10
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

# Task queue and list of worker threads
task_queue = Queue()
threads = []

# Cache of pages that have already been fetched, keyed by URL
cache = {}

def get_page(url):
    # If the URL is already cached, return the cached page
    if url in cache:
        return cache[url]

    # Otherwise send the HTTP request, then add the page to the cache
    response = requests.get(url, headers=headers, timeout=10)
    html = response.text
    cache[url] = html
    return html

def parse_html(html):
    # Parse the HTML and print the text of the <title> tag
    root = etree.HTML(html)
    title = root.xpath('//title/text()')[0]
    print('title: ', title)

def crawl(url):
    try:
        # Send the HTTP request
        html = get_page(url)
        # Parse the page content
        parse_html(html)
    except Exception as e:
        print('Error:', e)

def worker():
    while True:
        # Take a task from the queue; a None sentinel means "stop"
        url = task_queue.get()
        if url is None:
            task_queue.task_done()
            break
        crawl(url)
        task_queue.task_done()

if __name__ == '__main__':
    # Add the tasks to the queue
    urls = ['https://www.baidu.com', 'https://www.qq.com', 'https://www.sina.com.cn']
    for url in urls:
        task_queue.put(url)

    # Create and start the worker threads
    for _ in range(max_threads):
        t = threading.Thread(target=worker)
        threads.append(t)
        t.start()

    # Wait until every task has been processed
    task_queue.join()

    # Stop the worker threads. threading.Thread has no stop() method,
    # so send one None sentinel per worker and then join the threads.
    for _ in threads:
        task_queue.put(None)
    for t in threads:
        t.join()

The code above implements the following features: