Appendix A of CUDA by Example describes how to build a mutex from the APIs CUDA provides. The C++ code:
struct Lock {
    int *mutex;
    Lock( void ) {
        int state = 0;
        HANDLE_ERROR( cudaMalloc( (void**)&mutex, sizeof(int) ) );
        HANDLE_ERROR( cudaMemcpy( mutex, &state, sizeof(int),
                                  cudaMemcpyHostToDevice ) );
    }
    ~Lock( void ) {
        cudaFree( mutex );
    }
    __device__ void lock( void ) {
        while( atomicCAS( mutex, 0, 1 ) != 0 );
    }
    __device__ void unlock( void ) {
        atomicExch( mutex, 1 );
    }
};
Strange: lock() sets mutex to 1, but unlock() seems to set it to 1 as well?
The book explains atomicCAS and atomicExch as follows:
The call to atomicCAS() returns the value that it found at the address mutex. As a result, the while() loop will continue to run until atomicCAS() sees a 0 at mutex. When it sees a 0, the comparison is successful, and the thread writes a 1 to mutex. Essentially, the thread will spin in the while() loop until it has successfully locked the data structure.
The function atomicExch() reads the value that is located at mutex, exchanges it with the second argument (a 1 in this case), and returns the original value it read. Why would we use an atomic function for this rather than the more obvious method to reset the value at mutex?
*mutex = 1;
If you’re expecting some subtle, hidden reason why this method fails, we hate to disappoint you, but this would work as well. So, why not use this more obvious…
Roughly: atomicCAS() compares the int pointed to by the first argument with the second argument, and if they are equal, writes the third argument to that location. Either way, it returns the value that was originally stored there.
atomicExch() writes the second argument to the int pointed to by the first argument and returns the old value.
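For reference, here is a rough sketch of what the two intrinsics do. The real ones are single hardware instructions that perform the whole read-compare-write atomically; these plain functions (the helper names are mine) only illustrate the semantics and return values:

// Conceptual (NOT actually atomic) equivalents, for illustration only:
__device__ int atomicCAS_like( int *address, int compare, int val ) {
    int old = *address;
    if( old == compare )
        *address = val;     // store only when the comparison succeeds
    return old;             // always returns the value found at address
}
__device__ int atomicExch_like( int *address, int val ) {
    int old = *address;
    *address = val;         // unconditional store
    return old;             // returns the value that was overwritten
}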
Googling did not answer my question, but it did turn up someone else's puzzling problem:
http://stackoverflow.com/questions/2021019/how-to-implement-a-critical-section-in-cuda
The gist: someone wrote a piece of mutex code, but it appeared to deadlock:
#include <cuda_runtime.h>
#include <cutil_inline.h>
#include <stdio.h>

__global__ void k_testLocking(unsigned int* locks, int n) {
    int id = threadIdx.x % n;
    while (atomicExch(&(locks[id]), 1u) != 0u) {} //lock
    //critical section would go here
    atomicExch(&(locks[id]), 0u); //unlock
}

int main(int argc, char** argv) {
    //initialize the locks array on the GPU to (0...0)
    unsigned int* locks;
    unsigned int zeros[10];
    for (int i = 0; i < 10; i++) { zeros[i] = 0u; }
    cutilSafeCall(cudaMalloc((void**)&locks, sizeof(unsigned int)*10));
    cutilSafeCall(cudaMemcpy(locks, zeros, sizeof(unsigned int)*10, cudaMemcpyHostToDevice));
    //Run the kernel:
    k_testLocking<<<dim3(1), dim3(256)>>>(locks, 10);
    //Check the error messages:
    cudaError_t error = cudaGetLastError();
    cutilSafeCall(cudaFree(locks));
    if (cudaSuccess != error) {
        printf("error 1: CUDA ERROR (%d) {%s}\n", error, cudaGetErrorString(error));
        exit(-1);
    }
    return 0;
}
I stared at it for quite a while and it looked fine to me; this is basically how textbooks write it.
The answer was to rewrite the kernel in this form:
__global__ void k_testLocking(unsigned int* locks, int n) {
    int id = threadIdx.x % n;
    bool leaveLoop = false;
    while (!leaveLoop) {
        if (atomicExch(&(locks[id]), 1u) == 0u) {
            //critical section
            leaveLoop = true;
            atomicExch(&(locks[id]), 0u);
        }
    }
}
The answerer's explanation:
Okay, I figured it out, and this is yet-another-one-of-the-cuda-paradigm-pains.
As any good cuda programmer knows (notice that I did not remember this which makes me a bad cuda programmer, I think) all threads in a warp must execute the same code. The code I wrote would work perfectly if not for this fact. As it is, however, there are likely to be two threads in the same warp accessing the same lock. If one of them acquires the lock, it just forgets about executing the loop, but it cannot continue past the loop until all other threads in its warp have completed the loop. Unfortunately the other thread will never complete because it is waiting for the first one to unlock.
I used to think that rule ("all threads in a warp execute the same code") was useless trivia; it turns out to be exactly what matters here. In the asker's code, threads contending for the same lock can also sit in the same warp: one thread acquires the lock, the others do not, so none of them can move forward, and they deadlock forever.
In the improved code, every thread proceeds to the next statement whether or not it got the lock; the thread that did get it runs the critical section and then leaves the loop.
I still don't fully understand how warps work internally, but at least I now know not to implement a mutex in CUDA the first way. In my tests the first version does indeed hang, and the second one works.
So even for the classic lock/unlock pattern, an implementation that works on a CPU does not necessarily work on a GPU; each case has to be analyzed on its own terms. When a program is wrong there is always a reason, and usually the reason is a gap in the programmer's understanding.
OK, back to the struct Lock we started with.
Has anyone tried the code from Appendix A of CUDA by Example? I haven't tried the whole program.
But when I dropped the author's struct Lock into a simple test of my own, it deadlocked, no matter whether unlock sets mutex to 1 or to 0. I suspect the warp is again the reason.
An implementation along the lines of the one the answerer gave does work. Here it is:
#include <cuda_runtime.h>
#include <cutil_inline.h>
#include <stdio.h>

const int N = 10;

__global__ void testLock(unsigned int* mutex, unsigned int* dev_sum)
{
    int tid = threadIdx.x;
    bool finished = false;
    // ++(*dev_sum);  // If I comment out the locking code below and just use
    // ++(*dev_sum), the result is always 1. My guess: the threads of the warp
    // all read dev_sum (0) at the same time, each adds 1, and each writes its
    // own result back to memory.
    if (tid < N)
    {
        while (!finished)
        {
            if (atomicExch(mutex, 1) == 0)
            {
                ++(*dev_sum);
                finished = true;
                atomicExch(mutex, 0);
            }
        }
    }
}

int main()
{
    unsigned int *dev_sum, sum;
    unsigned int *mutex;
    unsigned int zero = 0;
    sum = 10;
    cudaMalloc( (void**)&dev_sum, sizeof(unsigned int) );
    cudaMalloc( (void**)&mutex, sizeof(unsigned int) );
    cudaMemcpy( dev_sum, &zero, sizeof(unsigned int), cudaMemcpyHostToDevice );
    cudaMemcpy( mutex, &zero, sizeof(unsigned int), cudaMemcpyHostToDevice );
    testLock<<<1, N>>>(mutex, dev_sum);
    cudaMemcpy( &sum, dev_sum, sizeof(unsigned int), cudaMemcpyDeviceToHost );
    printf("%u\n", sum);
    cudaFree(dev_sum);
    cudaFree(mutex);
    return 0;
}
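(An aside of my own, not part of the original test: for this particular counter the lock is more machinery than the job needs; a single atomicAdd, which the hardware serializes for you, would also avoid the lost update described in the kernel comment above.)

// Sketch under the same setup (dev_sum allocated and zeroed on the host, N as above):
__global__ void testAtomicAdd(unsigned int* dev_sum)
{
    if (threadIdx.x < N)
        atomicAdd(dev_sum, 1u);    // one atomic read-modify-write per thread
}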
When I implemented this same example with the book's struct Lock instead, it just hung and hung, and then the screen flickered because of the deadlock (what happens if I put a deliberate infinite loop in GPU code? I'll try that later).
It seems a GPU warp is something that normally boosts your efficiency but every now and then makes trouble for you.
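Finally, a sketch of my own (not from the book; the names are made up): if you want to keep the struct Lock packaging, one option is to give up separate lock()/unlock() calls and have the lock run the critical section inside the branch that acquired it, exactly like the pattern above. Assuming an nvcc new enough to allow C++11 lambdas in device code:

struct WarpSafeLock {
    int *mutex;    // device pointer, assumed allocated and zeroed on the host, as before

    // One acquire attempt per iteration; only the thread whose attempt
    // succeeds runs the critical section, releases, and leaves the loop,
    // so the acquire and the release never end up in divergent branches.
    template <typename CriticalSection>
    __device__ void run(CriticalSection body) {
        bool done = false;
        while (!done) {
            if (atomicCAS(mutex, 0, 1) == 0) {
                body();
                done = true;
                atomicExch(mutex, 0);
            }
        }
    }
};

// Usage inside a kernel, with the critical section written as a device-side lambda:
__global__ void addOne(WarpSafeLock lock, unsigned int* dev_sum)
{
    lock.run([&] { ++(*dev_sum); });
}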