C++ Lock-Free Programming: Lock-Free Queue

Original article · CC 4.0 BY-SA license
First published 2023-07-16 17:00:33; last modified 2023-09-16 17:52:15
Tags: #data-structures #c++ #multithreading #lock-free-programming

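This article shows how to implement a lock-free queue in C++ for both the single-producer/single-consumer (SPSC) and the multi-producer/multi-consumer (MPMC) models. The key is to use atomic variables and memory ordering to avoid data races. It walks through the implementation in detail, including atomic handling of node pointers, reference counting, and relaxed memory ordering for better concurrent performance. Its test code exercises the queue under different threading scenarios.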


贺志国
2023.7.11

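The previous post presented several lock-free implementations of the simplest C++ data structure, the stack. A queue poses a slightly different challenge: Push() and Pop() operate on opposite ends of the structure, so the synchronization requirements differ. A modification made at one end must be correct and must also be visible to the thread working on the other end. The queue therefore needs two Node pointers, head_ and tail_. Both are atomic variables, so multiple threads can access them concurrently without locks.
[Figure: queue diagram]
In this implementation, the queue is considered empty when head_ and tail_ point to the same node, which is called the dummy node.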

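We start with the single-producer/single-consumer case.

1. Lock-free queue for the single-producer/single-consumer model

In the single-producer/single-consumer model, at any given moment at most one thread calls Push() and at most one thread calls Pop(). The code for this case (saved as lock_free_queue.h) is as follows: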

#pragma once

#include <atomic>
#include <memory>

template <typename T>
class LockFreeQueue {
 public:
  // The queue always holds one dummy node; head_ == tail_ means "empty".
  LockFreeQueue() : head_(new Node), tail_(head_.load()) {}
  ~LockFreeQueue() {
    // Free every remaining node, including the dummy node.
    while (Node* old_head = head_.load()) {
      head_.store(old_head->next);
      delete old_head;
    }
  }

  LockFreeQueue(const LockFreeQueue& other) = delete;
  LockFreeQueue& operator=(const LockFreeQueue& other) = delete;

  bool IsEmpty() const { return head_.load() == tail_.load(); }

  void Push(const T& data) {
    auto new_data = std::make_shared<T>(data);
    Node* p = new Node;             // 3
    Node* old_tail = tail_.load();  // 4
    old_tail->data.swap(new_data);  // 5
    old_tail->next = p;             // 6
    tail_.store(p);                 // 7
  }

  std::shared_ptr<T> Pop() {
    Node* old_head = PopHead();
    if (old_head == nullptr) {
      return std::shared_ptr<T>();
    }

    const std::shared_ptr<T> res(old_head->data);  // 2
    delete old_head;
    return res;
  }

 private:
  // Node must be defined before PopHead(), whose return type names it.
  // If the definition is moved down next to 'head_', compilation fails with:
  //
  // error: 'Node' has not been declared ...
  //
  // This is a language rule, not a compiler bug: unlike a member function
  // body, a member declaration can only use names already declared in the
  // class.
  struct Node {
    Node() : data(nullptr), next(nullptr) {}

    std::shared_ptr<T> data;
    Node* next;
  };

 private:
  Node* PopHead() {
    Node* old_head = head_.load();
    if (old_head == tail_.load()) {  // 1
      return nullptr;
    }
    head_.store(old_head->next);
    return old_head;
  }

 private:
  std::atomic<Node*> head_;
  std::atomic<Node*> tail_;
};

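At first glance this implementation looks fine, and it is indeed correct as long as only one thread calls Push() and only one thread calls Pop(). The happens-before relationship between Push() and Pop() is what makes it safe to retrieve data from the queue. The store to tail_ at ⑦ (the line marked // 7 in the listing above; the other circled numbers follow the same convention) synchronizes with the load of tail_ at ①. The store to the old tail node's data pointer at ⑤ is sequenced before the store to tail_, and the load of tail_ is sequenced before the load of the data pointer at ②. The store to data therefore happens-before the load of data, and everything is fine. This makes it a perfectly good single-producer/single-consumer (SPSC) queue.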
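As a rough check of the SPSC reasoning above, the sketch below runs one producer thread and one consumer thread against a trimmed-down copy of the queue and verifies that the values arrive in FIFO order. SpscQueue and RunSpscDemo are names made up for this illustration; they are not part of the original listing, and the default (sequentially consistent) memory ordering is assumed throughout.

```cpp
#include <atomic>
#include <memory>
#include <thread>

// Hypothetical, trimmed-down copy of the SPSC queue shown above.
template <typename T>
class SpscQueue {
 public:
  SpscQueue() : head_(new Node), tail_(head_.load()) {}
  ~SpscQueue() {
    while (Node* n = head_.load()) {
      head_.store(n->next);
      delete n;
    }
  }
  SpscQueue(const SpscQueue&) = delete;
  SpscQueue& operator=(const SpscQueue&) = delete;

  void Push(const T& value) {
    auto data = std::make_shared<T>(value);
    Node* dummy = new Node;
    Node* old_tail = tail_.load();
    old_tail->data.swap(data);  // fill the current dummy node
    old_tail->next = dummy;
    tail_.store(dummy);         // publish: the consumer may now read the node
  }

  std::shared_ptr<T> Pop() {
    Node* old_head = head_.load();
    if (old_head == tail_.load()) return nullptr;  // queue is empty
    head_.store(old_head->next);
    std::shared_ptr<T> res(old_head->data);
    delete old_head;
    return res;
  }

 private:
  struct Node {
    std::shared_ptr<T> data;
    Node* next = nullptr;
  };
  std::atomic<Node*> head_;
  std::atomic<Node*> tail_;
};

// Pushes 0..count-1 from one thread and pops from another.
// Returns true if the consumer saw every value in FIFO order.
bool RunSpscDemo(int count) {
  SpscQueue<int> queue;
  bool in_order = true;  // written only by the consumer, read after join()
  std::thread producer([&] {
    for (int i = 0; i < count; ++i) queue.Push(i);
  });
  std::thread consumer([&] {
    for (int expected = 0; expected < count;) {
      std::shared_ptr<int> item = queue.Pop();
      if (!item) {  // nothing published yet, try again
        std::this_thread::yield();
        continue;
      }
      if (*item != expected) in_order = false;
      ++expected;
    }
  });
  producer.join();
  consumer.join();
  return in_order;
}
```

Calling RunSpscDemo(10000) from main() is one way to try it. The harness deliberately uses exactly one producer and one consumer; adding a second thread on either side would step outside the guarantees of this queue.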
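The trouble starts when several threads call Push() or Pop() concurrently. Consider Push() first: if two threads call it at the same time, each allocates a new node to serve as the dummy node ③, both read the same tail_ value ④, and both then modify the same node by setting its data and next members ⑤⑥. That is an obvious data race!
PopHead() has a similar problem. When two threads call it concurrently, both read the same head_ and both then overwrite it with that node's next pointer. Both threads end up holding the same node, which is a disaster! Not only must exactly one Pop() thread get access to a given item, but other threads that have read head_ must also be able to access that node's next member safely. This is the same problem that Pop() had in the lock-free stack.
Assuming the Pop() problem is solved, what about Push()? To obtain the required happens-before relationship between Push() and Pop(), the data item must be set on the dummy node before tail_ is updated. But concurrent calls to Push() all read the same tail_, so they race over that same data item.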

Note:
Happens-before and synchronizes-with are the two key relationships that govern how atomic variables make memory writes visible across threads.
Happens-before
Regardless of threads, evaluation A happens-before evaluation B if any of the following is true: 1) A is sequenced-before B; 2) A inter-thread happens before B. The implementation is required to ensure that the happens-before relation is acyclic, by introducing additional synchronization if necessary (it can only be necessary if a consume operation is involved). If one evaluation modifies a memory location, and the other reads or modifies the same memory location, and if at least one of the evaluations is not an atomic operation, the behavior of the program is undefined (the program has a data race) unless there exists a happens-before relationship between these two evaluations.
Synchronizes with
If an atomic store in thread A is a release operation, an atomic load in thread B from the same variable is an acquire operation, and the load in thread B reads a value written by the store in thread A, then the store in thread A synchronizes-with the load in thread B. Also, some library calls may be defined to synchronize-with other library calls on other threads.
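The two definitions can be illustrated with a minimal release/acquire sketch. The names ReadAfterPublish, payload, ready and observed are invented for this example. The non-atomic write to payload is sequenced before the release store, the release store synchronizes with the acquire load that reads its value, and the read of payload is sequenced after that load, so the write happens-before the read and there is no data race.

```cpp
#include <atomic>
#include <thread>

// Hypothetical demo: 'payload' is plain data, 'ready' is the atomic flag
// that carries the synchronizes-with edge between the two threads.
int ReadAfterPublish(int value) {
  int payload = 0;  // non-atomic data
  std::atomic<bool> ready{false};
  int observed = -1;

  std::thread writer([&] {
    payload = value;                               // A: sequenced before B
    ready.store(true, std::memory_order_release);  // B: release store
  });
  std::thread reader([&] {
    // C: acquire load; once it reads the value written by B,
    // B synchronizes with C.
    while (!ready.load(std::memory_order_acquire)) {
      std::this_thread::yield();
    }
    observed = payload;  // D: sequenced after C, so A happens-before D
  });

  writer.join();
  reader.join();
  return observed;
}
```

If both operations on ready used std::memory_order_relaxed instead, there would be no synchronizes-with edge and the read of payload would be a data race.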

2. Lock-free queue for the multi-producer/multi-consumer model

2.1 Without relaxing the memory ordering

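To eliminate the data races caused by concurrent access, the data pointer in Node can be made atomic and set with a compare/exchange operation. If the compare/exchange succeeds, the calling thread owns that tail node: it can safely set the node's next pointer to the new node and then update tail_. If the compare/exchange fails because another thread has already stored its data, the thread re-reads tail_ and loops again. If atomic operations on std::shared_ptr<> were lock-free, that would be almost the whole story. On most platforms, however, they are not, so an alternative is needed: let Pop() return std::unique_ptr<> and store the data in the queue as a plain pointer. The node can then hold a std::atomic<T*>, which supports the compare_exchange_strong() call that this scheme requires.

Concurrent access to Pop() and Push() is handled with a reference-counting scheme similar to the one used in the lock-free stack. Each node has two kinds of reference count, an internal count and an external count, and their sum is the total number of references to the node. The external count is stored together with the node pointer and is incremented by 1 every time a thread reads that pointer. When a thread finishes accessing the node, it decrements the internal count by 1. Once the pairing of node pointer and external count is no longer reachable by other threads, the external count minus 2 is added to the internal count and the external count is discarded. When the internal count reaches 0, no thread is accessing the node any more and it can be safely deleted.

The difference from the lock-free stack is that the queue has both head_ and tail_, so a node can be reached through two counted pointers at once (for example, tail_ and the next pointer of the previous node). The node must therefore track how many such external counters still refer to it, alongside its internal count. That is why std::atomic<NodeCounter> counter replaces std::atomic<int> internal_count (the NodeCounter struct is defined and explained later). The sample code follows (saved as lock_free_queue.h; adapted from C++ Concurrency in Action, 2nd ed., 2019, with its bugs fixed):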

#pragma once

#include <atomic>
#include <cstdint>
#include <memory>

template <typename T>
class LockFreeQueue {
 public:
  LockFreeQueue() : head_(CountedNodePtr(new Node, 1)), tail_(head_.load()) {}
  ~LockFreeQueue();

  // Copy construct and assignment, move construct and assignment are
  // prohibited.
  LockFreeQueue(const LockFreeQueue& other) = delete;
  LockFreeQueue& operator=(const LockFreeQueue& other) = delete;
  LockFreeQueue(LockFreeQueue&& other) = delete;
  LockFreeQueue& operator=(LockFreeQueue&& other) = delete;

  bool IsEmpty() const { return head_.load().ptr == tail_.load().ptr; }
  // Requires C++17 (std::atomic<T>::is_always_lock_free).
  bool IsLockFree() const {
    return std::atomic<CountedNodePtr>::is_always_lock_free;
  }

  void Push(const T& data);
  std::unique_ptr<T> Pop();

 private:
  // Forward class declaration
  struct Node;

  struct CountedNodePtr {
    explicit CountedNodePtr(Node* input_ptr = nullptr,
                            uint16_t input_external_count = 0)
        : ptr(reinterpret_cast<uint64_t>(input_ptr)),
          external_count(input_external_count) {}

    // If we know that the platform has spare bits in a pointer (for example,
    // because the address space is only 48 bits but a pointer is 64 bits), we
    // can store the count inside the spare bits of the pointer to fit it all
    // back in a single machine word. Keeping the structure within a machine
    // word makes it more likely that the atomic operations can be lock-free on
    // many platforms. Note that how bit-fields are packed is
    // implementation-defined, so the size should be checked on each compiler.
    uint64_t ptr : 48;
    uint16_t external_count : 16;
  };

  struct NodeCounter {
    NodeCounter() : internal_count(0), external_counters(0) {}
    NodeCounter(const uint32_t input_internal_count,
                const uint8_t input_external_counters)
        : internal_count(input_internal_count),
          external_counters(input_external_counters) {}

    // external_counters occupies only 2 bits, where the maximum value stored
    // is 3. Note that we need only 2 bits for the external_counters because
    // there are at most two such counters. By using a bit field for this and
    // specifying internal_count as a 30-bit value, we keep the total counter
    // size to 32 bits. This gives us plenty of scope for large internal count
    // values while ensuring that the whole structure fits inside a machine word
    // on 32-bit and 64-bit machines. It is important to update these counts
    // together as a single entity in order to avoid race conditions. Keeping
    // the structure within a machine word makes it more likely that the atomic
    // operations can be lock-free on many platforms.
    uint32_t internal_count : 30;
    uint8_t external_counters : 2;
  };

  struct Node {
    // A node starts out referenced from two counted pointers: tail_ and the
    // next pointer of the previous node (head_ for the initial dummy node),
    // so the initial value of external_counters is 2.
    Node()
        : data(nullptr), counter(NodeCounter(0, 2)), next(CountedNodePtr()) {}
    ~Node();
    void ReleaseRef();

    std::atomic<T*> data;
    std::atomic<NodeCounter> counter;
    std::atomic<CountedNodePtr> next;
  };

 private:
  static void IncreaseExternalCount(std::atomic<CountedNodePtr>* atomic_node,
                                    CountedNodePtr* old_node);
  static void FreeExternalCounter(CountedNodePtr* old_node);
  void SetNewTail(const CountedNodePtr& new_tail, CountedNodePtr* old_tail);

 private:
  std::atomic<CountedNodePtr> head_;
  std::atomic<CountedNodePtr> tail_;
};

template <typename T>
LockFreeQueue<T>::Node::~Node() {
  // Free any payload that was never handed out by Pop().
  if (data.load() != nullptr) {
    T* old_data = data.exchange(nullptr);
    if (old_data != nullptr) {
      delete old_data;
    }
  }
}

template <typename T>
void LockFreeQueue<T>