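Basic idea (from Introduction to Algorithms, 3rd edition, p. 108):
For each input element x, determine the number of elements less than x. With this information, x can be placed directly into its position in the output array.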
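The idea above can be sketched directly with a prefix sum over the counts. This is a minimal illustration, not code from the book or from the listings in this post: the function name `stableCountingSort` and its parameter `k` (exclusive upper bound of the values) are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stable counting sort for values in [0, k).
// Hypothetical helper for illustration only.
std::vector<unsigned> stableCountingSort(const std::vector<unsigned>& in, unsigned k)
{
    // After this loop, count[v + 1] holds the number of occurrences of v.
    std::vector<std::size_t> count(k + 1, 0);
    for (unsigned v : in) {
        assert(v < k);  // precondition: every value lies in [0, k)
        ++count[v + 1];
    }

    // Prefix sum: count[v] becomes the number of elements less than v,
    // which is the output index of the first element equal to v.
    for (unsigned v = 0; v < k; ++v)
        count[v + 1] += count[v];

    // Place each element at its position. Scanning left to right and
    // post-incrementing keeps equal elements in input order (stable).
    std::vector<unsigned> out(in.size());
    for (unsigned v : in)
        out[count[v]++] = v;
    return out;
}
```

Calling `stableCountingSort({3, 0, 2, 3, 1, 0}, 4)` returns the values in ascending order. Unlike the in-place fill used in the listings of this post, this form copies whole elements, so it would also work for records that carry data besides the key.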
serial code for sorting an array of 8-bit unsigned numbers:
void CountingSort( unsigned char* a, unsigned long a_size )
{
    const unsigned long numberOfCounts = 256;
    // one count for each possible value of an 8-bit element (0-255)
    unsigned long count[ numberOfCounts ] = { 0 }; // count array is initialized to zero by the compiler
    // Scan the array and count the number of times each value appears
    for( unsigned long i = 0; i < a_size; i++ )
        count[ a[ i ] ]++;
    // Fill the array with the number of 0's that were counted, followed by the number of 1's, and then 2's and so on
    unsigned long n = 0;
    for( unsigned long i = 0; i < numberOfCounts; i++ )
        for( unsigned long j = 0; j < count[ i ]; j++ )
            a[ n++ ] = (unsigned char)i;
}
test code:
#include <vector>
#include <iostream>
using namespace std;
typedef unsigned int uint;
const int numData = 100;
const uint k = 256;
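void countSort(vector<int>& inputVec)
{
    const size_t n = inputVec.size();
    // initialize count array: one counter per possible value in [0, k)
    vector<uint> countVec(k, 0);
    // count the occurrences of each value (every value must lie in [0, k))
    for (size_t i = 0; i < n; ++i)
        ++countVec[inputVec[i]];
    // overwrite the input with each value repeated as often as it was counted, in ascending order
    size_t z = 0;
    for (uint i = 0; i < k; ++i)
        for (uint j = 0; j < countVec[i]; ++j)
            inputVec[z++] = static_cast<int>(i);
}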
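void printData(const vector<int>& testVec)
{
    // use an unsigned index to match size(); the local name no longer shadows the global numData
    const size_t count = testVec.size();
    for (size_t i = 0; i < count; ++i)
        cout << testVec[i] << " ";
    cout << endl;
}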
int main()
{
    // test data
    vector<int> testVec;
    testVec.reserve(numData);
    for (int i = 0; i < numData; ++i)
    {
        uint tempVal = (i * i) % k;
        testVec.push_back(tempVal);
    }
    cout << "test data before sorting : " << endl;
    printData(testVec);
    // counting sort
    countSort(testVec);
    cout << "test data after sorting : " << endl;
    printData(testVec);
    return 0;
}
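Summary:
Counting sort is best suited to cases where the value range k is small and the amount of data n is large (k << n): it runs in O(n + k) time and needs O(k) extra space for the counts.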
Parallel Counting Sort reference:
http://www.drdobbs.com/architecture-and-design/parallel-counting-sort/224700144
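CUDA Counting Sort:
1. Copy the data to be sorted from the CPU to the GPU (if the data already lives on the GPU, this transfer cost can be skipped; recommendation: use host pinned memory)
2. Initialize the count array (for a GPU-only scenario, allocate it once, then use cudaMemset to reset it before each sort)
3. Count the occurrences of each value
4. Fill the output array
5. Copy the sorted data from the GPU back to the CPU (if the result stays on the GPU, this transfer cost can be skipped; recommendation: use host pinned memory)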
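Steps 2 to 4 can be tried out on the CPU before writing any kernel. The sketch below is plain C++, not CUDA, and it is only one possible decomposition: the function name `blockedCountingSort` and the parameter `numBlocks` are assumptions made for this example, with each iteration of the outer loop standing in for one GPU thread block. On a real GPU the per-block counts would typically be merged with atomic adds, and the transfers in steps 1 and 5 are left out entirely.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// CPU emulation of steps 2 to 4 for 8-bit keys.
// numBlocks is a hypothetical stand-in for the number of GPU thread blocks.
void blockedCountingSort(std::vector<unsigned char>& data, std::size_t numBlocks)
{
    assert(numBlocks > 0);  // precondition: at least one block
    const std::size_t k = 256;  // one counter per 8-bit value
    const std::size_t n = data.size();
    const std::size_t chunk = (n + numBlocks - 1) / numBlocks;  // elements per block

    // Step 2: zero the global count array (cudaMemset on a GPU).
    std::vector<std::size_t> count(k, 0);

    // Step 3: each block counts its own chunk into a private histogram,
    // then adds it to the global counts.
    for (std::size_t b = 0; b < numBlocks; ++b) {
        const std::size_t begin = std::min(n, b * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        std::vector<std::size_t> local(k, 0);
        for (std::size_t i = begin; i < end; ++i)
            ++local[data[i]];
        for (std::size_t v = 0; v < k; ++v)
            count[v] += local[v];
    }

    // Step 4: a running offset (exclusive prefix sum) gives the start of each
    // value's run, so every run could be filled independently of the others.
    std::size_t offset = 0;
    for (std::size_t v = 0; v < k; ++v) {
        std::fill_n(data.begin() + offset, count[v], static_cast<unsigned char>(v));
        offset += count[v];
    }
}
```

Private histograms are used here because many elements usually share the same value when k << n, so having every thread update a single global counter array would cause heavy contention.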
to be continued...