CUDA C++ Programming Guide 7.26. Asynchronous Barrier

图波列夫

已于 2024-09-11 09:31:57 修改

阅读量624

点赞数 2

于 2024-09-11 09:30:16 首次发布

原文链接：https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#asynchronous-barrier

版权

GPU 同时被 3 个专栏收录

20 篇文章

订阅专栏

NVIDIA

12 篇文章

订阅专栏

CUDA

2 篇文章

订阅专栏

以下内容翻译自：CUDA C++ Programming Guide 中的 7.26. Asynchronous Barrier

7.26. Asynchronous Barrier

NVIDIA C++ 标准库引入了 std::barrier 的 GPU 实现。除了std::barrier的实现之外，该库还提供了扩展功能，允许用户指定屏障对象的作用范围。屏障 API 的作用范围在 Thread Scopes 中有详细说明。

计算能力为8.0或更高的设备为屏障操作提供了硬件加速，并将其与memcpy_async 功能集成。
在计算能力低于8.0但不低于7.0的设备上，这些屏障可以使用但没有硬件加速。

nvcuda::experimental::awbarrier已被弃用，推荐使用cuda::barrier来替代。

7.26.1. Simple Synchronization Pattern

在没有到达（等待）屏障的情况下，同步可以通过使用__syncthreads()（同步一个块中的所有线程），或在 Cooperative Groups 中使用group.sync()来实现。

#include  <cooperative_groups.h>

__global__  void  simple_sync(int  iteration_count)  {
  auto  block  =  cooperative_groups::this_thread_block();

  for  (int  i  =  0;  i  <  iteration_count;  ++i)  {
  /* code before arrive */
  block.sync();  /* wait for all threads to arrive here */
  /* code after wait */
  }
}

线程在同步点（block.sync()）处被阻塞，直到所有线程都到达同步点。此外，在同步点之前发生的内存更新会在同步点之后对块中的所有线程可见，这相当于 atomic_thread_fence(memory_order_seq_cst, thread_scope_block) 以及sync操作。

这个模式包含三个阶段：

同步前的代码执行内存更新，这些更新将在同步之后被读取。
同步点
同步点之后的代码，可看到同步点之前发生的内存更新。

7.26.2. Temporal Splitting and Five Stages of Synchronization

使用std::barrier的时间分割同步模式如下：

#include  <cuda/barrier>
#include  <cooperative_groups.h>

__device__  void  compute(float*  data,  int  curr_iteration);

__global__  void  split_arrive_wait(int  iteration_count,  float  *data)  {
  using  barrier  =  cuda::barrier<cuda::thread_scope_block>;
  __shared__  barrier  bar;
  auto  block  =  cooperative_groups::this_thread_block();

  if  (block.thread_rank()  ==  0)  {
  init(&bar,  block.size());  // Initialize the barrier with expected arrival count
  }
  block.sync();

  for  (int  curr_iter  =  0;  curr_iter  <  iteration_count;  ++curr_iter)  {
  /* code before arrive */
  barrier::arrival_token  token  =  bar.arrive();  /* this thread arrives. Arrival does not block a thread */
  compute(data,  curr_iter);
  bar.wait(std::move(token));  /* wait for all threads participating in the barrier to complete bar.arrive()*/
  /* code after wait */
  }
}

该模式将同步点( block.sync() )拆分为到达点( bar.arrive())和等待点( bar.wait(std::move(token)))。

线程在第一次调用bar.arrive()时开始参与cuda::barrier。
当线程调用bar.wait(std::move(token))时，它将被阻塞，直到参与的线程完成了预期数量的bar.arrive()调用，这个数量是通过传递给init()的预期到达计数参数来指定的。

每个参与线程在调用bar.arrive()之前的内存更新保证在它们调用bar.wait(std::move(token))之后对所有参与线程可见。

注意：调用bar.arrive()不会阻塞线程，它可以继续执行其他工作，这些工作不依赖于在其他参与线程调用bar.arrive()之前发生的内存更新。

“到达然后等待”模式有五个阶段，这些阶段可能会迭代重复：

到达点前的代码执行内存更新，这些更新将在等待之后被读取。
到达点具有隐式内存栅栏（即，相当于 atomic_thread_fence(memory_order_seq_cst, thread_scope_block)）。
到达和等待之间的代码。
等待点。
等待之后的代码，可以看到在到达之前执行的更新。

7.26.3. Bootstrap Initialization, Expected Arrival Count, and Participation

在任何线程开始参与cuda::barrier之前，必须进行初始化。

#include  <cuda/barrier>
#include  <cooperative_groups.h>

__global__  void  init_barrier()  {
  __shared__  cuda::barrier<cuda::thread_scope_block>  bar;
  auto  block  =  cooperative_groups::this_thread_block();

  if  (block.thread_rank()  ==  0)  {
  init(&bar,  block.size());  // Single thread initializes the total expected arrival count.
  }
  block.sync();
}

在任何线程参与cuda::barrier之前，必须使用init()对屏障进行初始化，并设置预期到达次数（本例中为block.size()）。初始化必须在任何线程调用bar.arrive()之前完成。这带来了自举的挑战，因为线程在参与cuda::barrier之前必须同步，但线程创建cuda::barrier是为了进行同步。在本例中，参与的线程属于一个协作组，并使用block.sync()来引导初始化。在这个例子中，整个线程块都参与了初始化，因此也可以使用__syncthreads()。

init()的第二个参数是预期到达次数，即在参与线程调用bar.wait(std::move(token))解除阻塞之前，参与线程调用bar.arrive()的次数。

在前面的示例中，cuda::barrier使用线程块中的线程数（即 cooperative_groups::this_thread_block().size()）进行初始化，线程块中的所有线程都参与了这个屏障。

cuda::barrier可以灵活地指定线程的参与方式（拆分为到达和等待）以及哪些线程参与。相比之下，__syncthreads()或协作组的this_thread_block.sync()适用于整个线程块，而__syncwarp(mask)适用于指定的 warp 子集。如果用户希望同步整个线程块或整个 warp，出于性能考虑，我们建议分别使用 __syncthreads()和__syncwarp(mask)。

7.26.4. A Barrier’s Phase: Arrival, Countdown, Completion, and Reset

当参与线程调用bar.arrive()时，cuda::barrier从预期到达计数倒计时到零。当倒计时为零时，当前阶段的cuda::barrier完成。当最后一次调用bar.arrive()使倒计时达到零时，倒计时会自动且原子地重置。重置将倒计时设置为预期的到达次数，并将cuda::barrier移动到下一阶段。

从token=bar.arrive()返回的cuda::barrier::arrival_token类令牌对象token与屏障的当前阶段相关联：

当调用bar.wait(std::move(token))时，如果cuda::barrier处于当前阶段，即令牌关联的阶段与cuda::barrier的阶段相匹配，则阻塞调用线程；
如果在调用bar.wait(std::move(token))之前阶段已经前进（因为倒计时已达到零），则线程不会阻塞；
如果线程在bar.wait(std::move(token))中阻塞时阶段前进，则线程将解除阻塞。

了解重置何时可能发生、何时不可能发生至关重要，尤其是在复杂的到达（等待）同步模式中。

线程对token=bar.arrive()和bar.wait(std::move(token))的调用必须按顺序进行，以确保token=bar.arrive()发生在cuda::barrier的当前阶段，而bar.wait(std::move(token))发生在相同阶段或下一个阶段。
线程对bar.arrive()的调用必须在屏障计数器非零时进行。屏障初始化后，如果线程的bar.arrive()调用使倒计时归零，则必须先调用一次bar.wait(std::move(token))，然后才能重新使用该屏障，用于后续bar.arrive()调用。
调用bar.wait()时，必须使用当前阶段或前一阶段的token对象。对于任何其他值的token对象，其行为是未定义的。

对于简单的到达（等待）同步模式，遵守这些使用规则非常简单。

7.26.5. Spatial Partitioning (also known as Warp Specialization)

线程块可以在空间上进行划分，使得线程束专门用于执行独立的计算。这种空间分区通常用于生产者-消费者模式，其中一个线程子集生成数据，而另一个（不相交的）线程子集并发地消耗这些数据。

生产者（消费者）空间分区模式需要两个单边同步，以管理生产者和消费者之间的数据缓冲区：

生产者线程等待消费者线程发出缓冲区已准备就绪的信号；然而，消费者线程不等待这一信号。
消费者线程等待生产者线程发出缓冲区已填充完成的信号；然而，生产者线程不等待该信号。
为了实现完全的生产者（消费者）并发，这种模式（至少）需要双缓冲，每个缓冲区需要两个cuda::barrier。

Producer	Consumer
等待缓冲区准备好以供填充	发出缓冲区已准备就绪的信号
生成数据并填充缓冲区
发出信号表示缓冲区已填充	等待缓冲区填充完成
	消费已填充缓冲区中的数据

#include  <cuda/barrier>
#include  <cooperative_groups.h>

using  barrier  =  cuda::barrier<cuda::thread_scope_block>;

__device__  void  producer(barrier  ready[],  barrier  filled[],  float*  buffer,  float*  in,  int  N,  int  buffer_len)
{
  for  (int  i  =  0;  i  <  (N/buffer_len);  ++i)  {
  ready[i%2].arrive_and_wait();  /* wait for buffer_(i%2) to be ready to be filled */
  /* produce, i.e., fill in, buffer_(i%2)  */
  barrier::arrival_token  token  =  filled[i%2].arrive();  /* buffer_(i%2) is filled */
  }
}

__device__  void  consumer(barrier  ready[],  barrier  filled[],  float*  buffer,  float*  out,  int  N,  int  buffer_len)
{
  barrier::arrival_token  token1  =  ready[0].arrive();  /* buffer_0 is ready for initial fill */
  barrier::arrival_token  token2  =  ready[1].arrive();  /* buffer_1 is ready for initial fill */
  for  (int  i  =  0;  i  <  (N/buffer_len);  ++i)  {
  filled[i%2].arrive_and_wait();  /* wait for buffer_(i%2) to be filled */
  /* consume buffer_(i%2) */
  barrier::arrival_token  token  =  ready[i%2].arrive();  /* buffer_(i%2) is ready to be re-filled */
  }
}

//N is the total number of float elements in arrays in and out
__global__  void  producer_consumer_pattern(int  N,  int  buffer_len,  float*  in,  float*  out)  {

  // Shared memory buffer declared below is of size 2 * buffer_len
  // so that we can alternatively work between two buffers.
  // buffer_0 = buffer and buffer_1 = buffer + buffer_len
  __shared__  extern  float  buffer[];

  // bar[0] and bar[1] track if buffers buffer_0 and buffer_1 are ready to be filled,
  // while bar[2] and bar[3] track if buffers buffer_0 and buffer_1 are filled-in respectively
  __shared__  barrier  bar[4];

  auto  block  =  cooperative_groups::this_thread_block();
  if  (block.thread_rank()  <  4)
  init(bar  +  block.thread_rank(),  block.size());
  block.sync();

  if  (block.thread_rank()  <  warpSize)
  producer(bar,  bar+2,  buffer,  in,  N,  buffer_len);
  else
  consumer(bar,  bar+2,  buffer,  out,  N,  buffer_len);
}

在本例中，第一个线程束专用于生产者，其余的线程束专用于消费者。
所有生产者和消费者线程都参与（调用 bar.arrive() 或 bar.arrive_and_wait()）四个 cuda::barrier，因此预期到达计数等于block.size()。

生产者线程等待消费者线程发出可以填充共享内存缓冲区的信号。为了等待 cuda::barrier，生产者线程必须首先到达ready[i%2].arrive()以获取一个令牌，然后使用该令牌在ready[i%2].wait(token)上等待。为简单起见，ready[i%2].arrive_and_wait()将这些操作合并在一起。

bar.arrive_and_wait();
/* is equivalent to */
bar.wait(bar.arrive());

生产者线程计算并填充准备好的缓冲区，然后通过到达填充屏障filled[i%2].arrive()来发出缓冲区已填充的信号。生产者线程不会在此等待，而是去等待下一次迭代的缓冲区（双缓冲）准备好可以被填充。

消费者线程首先发出信号，表示两个缓冲区都已准备就绪。消费者线程在这一点上并不等待，而是等待这个迭代的缓冲区填充好，即filled[i%2].arrive_and_wait()。消费者线程使用缓冲区后，它们会发出信号，表示缓冲区已准备好再次填充（ready[i%2].arrive()），然后等待下一个迭代的缓冲区被填充。

7.26.6. Early Exit (Dropping out of Participation)

当参与同步序列的线程必须提前退出该序列时，该线程必须在退出前显式地退出参与。其余参与的线程可以正常进行后续的cuda::barrier到达和等待操作。

#include  <cuda/barrier>
#include  <cooperative_groups.h>

__device__  bool  condition_check();

__global__  void  early_exit_kernel(int  N)  {
  using  barrier  =  cuda::barrier<cuda::thread_scope_block>;
  __shared__  barrier  bar;
  auto  block  =  cooperative_groups::this_thread_block();

  if  (block.thread_rank()  ==  0)
  init(&bar  ,  block.size());
  block.sync();

  for  (int  i  =  0;  i  <  N;  ++i)  {
  if  (condition_check())  {
  bar.arrive_and_drop();
  return;
  }
  /* other threads can proceed normally */
  barrier::arrival_token  token  =  bar.arrive();
  /* code between arrive and wait */
  bar.wait(std::move(token));  /* wait for all threads to arrive */
  /* code after wait */
  }
}

该操作使线程在cuda::barrier上到达，以履行线程在当前阶段的到达义务，然后减少下一阶段的期望到达计数，使得该线程不再预计到达cuda::barrier。

7.26.7. Completion Function

cuda::barrier<Scope, CompletionFunction>的CompletionFunction 在每个阶段执行一次，时间点为最后一个线程到达后并在任何线程从等待中解锁之前。

执行CompletionFunction的线程可以看到在该阶段到达barrier的线程所执行的内存操作，而且一旦从等待中解锁，所有在barrier处等待的线程都可以看到在CompletionFunction中执行的所有内存操作。

#include  <cuda/barrier>
#include  <cooperative_groups.h>
#include  <functional>
namespace  cg  =  cooperative_groups;

__device__  int  divergent_compute(int*,  int);
__device__  int  independent_computation(int*,  int);

__global__  void  psum(int*  data,  int  n,  int*  acc)  {
  auto  block  =  cg::this_thread_block();

  constexpr  int  BlockSize  =  128;
  __shared__  int  smem[BlockSize];
  assert(BlockSize  ==  block.size());
  assert(n  %  128  ==  0);

  auto  completion_fn  =  [&]  {
  int  sum  =  0;
  for  (int  i  =  0;  i  <  128;  ++i)  sum  +=  smem[i];
  *acc  +=  sum;
  };

  // Barrier storage
  // Note: the barrier is not default-constructible because
  //       completion_fn is not default-constructible due
  //       to the capture.
  using  completion_fn_t  =  decltype(completion_fn);
  using  barrier_t  =  cuda::barrier<cuda::thread_scope_block,
  completion_fn_t>;
  __shared__  std::aligned_storage<sizeof(barrier_t),
  alignof(barrier_t)>  bar_storage;

  // Initialize barrier:
  barrier_t*  bar  =  (barrier_t*)&bar_storage;
  if  (block.thread_rank()  ==  0)  {
  assert(*acc  ==  0);
  assert(blockDim.x  ==  blockDim.y  ==  blockDim.y  ==  1);
  new  (bar)  barrier_t{block.size(),  completion_fn};
  // equivalent to: init(bar, block.size(), completion_fn);
  }
  block.sync();

  // Main loop
  for  (int  i  =  0;  i  <  n;  i  +=  block.size())  {
  smem[block.thread_rank()]  =  data[i]  +  *acc;
  auto  t  =  bar->arrive();
  // We can do independent computation here
  bar->wait(std::move(t));
  // shared-memory is safe to re-use in the next iteration
  // since all threads are done with it, including the one
  // that did the reduction
  }
}

7.26.8. Memory Barrier Primitives Interface

内存屏障原语是cuda::barrier功能的类 C 接口。这些原语可通过包含<cuda_awbarrier_primitives.h>头文件获得。

7.26.8.1. Data Types

typedef  /* implementation defined */  __mbarrier_t;
typedef  /* implementation defined */  __mbarrier_token_t;

7.26.8.2. Memory Barrier Primitives API

uint32_t  __mbarrier_maximum_count();
void  __mbarrier_init(__mbarrier_t*  bar,  uint32_t  expected_count);

bar必须是指向__shared__内存的指针。
expected_count <= __mbarrier_maximum_count()。
将*bar的当前阶段和下一阶段的预期到达计数初始化为expected_count。

void  __mbarrier_inval(__mbarrier_t*  bar);

bar必须是指向驻留在共享内存中的 mbarrier 对象的指针。
在重新使用相应的共享内存之前，必须先使*bar无效。

__mbarrier_token_t  __mbarrier_arrive(__mbarrier_t*  bar);

*bar的初始化必须在此调用之前完成。
挂起计数不能为零。
原子地减少屏障当前阶段的挂起计数。
返回一个与递减之前的屏障状态相关联的到达令牌。

__mbarrier_token_t  __mbarrier_arrive_and_drop(__mbarrier_t*  bar);

*bar的初始化必须在此调用之前完成。
挂起计数不能为零。
原子地减少屏障当前阶段的挂起计数和下一阶段的预期计数。
返回一个与递减之前的屏障状态相关联的到达令牌。

bool  __mbarrier_test_wait(__mbarrier_t*  bar,  __mbarrier_token_t  token);

token必须与*this的前一阶段或当前阶段相关联。
如果token与*bar的前一个阶段相关联，则返回true，否则返回false。

//Note: This API has been deprecated in CUDA 11.1
uint32_t  __mbarrier_pending_count(__mbarrier_token_t  token);