float double在内核转换为int

探讨了在没有浮点协处理器的低端嵌入式处理器上,如何通过软件模拟实现float和double类型到int类型的转换。文章深入解析了IEEE754标准下的浮点数存储结构,以及如何在精度损失的情况下进行比较和转换。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

float double在kernel转换为int

目前大多数CPU都支持浮点运算单元,但是对于低端的嵌入式处理器可能就会去掉浮点协处理器。
在调试一个sensor就遇到这个问题,由于这个处理器是32位arm,比较低端,没有浮点运算,而在开发过程中又需要将float转为int,因此就需要软件模拟浮点运算来实现。

存储结构

存储方式是用科学计数法来存储数据的。

Tip

科学记数法 是一种以记下极大或极小数字的方法。 在科学记数法中,一个数被写成一个1与10之间的实数(尾数)与一个10的幂的积, 为了得到统一的表达方式,该尾数并不包括10:

782300=7.823×10^5

0.00012=1.2×10^-4

10000=1×10^4

在电脑或计算器中一般用E或e(英语Exponential)来表示10的幂:

7.823E5=782300

1.2e-4=0.00012

采用二进制浮点算法的 IEC 60559:1989 (IEEE 754) 标准。

TypeSignExponentMantissa
float1 bit8 bit23 bit
double1 bit11 bit52 bit
  • Sign(符号位) :

    0代表正,1代表负。

  • Exponent(指数):

    指数偏移值(exponent bias),是指浮点数表示法中的指数域的编码值为指数的实际值加上某个固定的值, IEEE 754标准规定该固定值为 2^e-1 - 1,其中的e为存储指数的位元的长度。 如:float指数是8 bit,固定偏移值是2^8-1 - 1 = 128−1 = 127。 单精度浮点数的指数部分实际取值是从128到-127。 例如指数实际值为17,在单精度浮点数中的指数域编码值为144, 即144 = 17 + 127。

  • Mantissa(尾数):

    被存储为 1.XXX… 形式的二进制分数。此分数有一个大于或等于 1 且小于 2 的值。 注意实数总是以规范化形式存储;即尾数左移以使尾数的高序位总是 1。 因为该位总是 1,所以存储尾数时,只存储 XXX… 。

如,十进制浮点数120.5的二进制形式为1111000.1,转换为科学计数法形式为(1.1110001)*(2^6), 指数偏移值为6+127,尾数为1110001。尾数则直接填入,如果空间多余则以0补齐,如果空间不够则0舍1入。 所以十进制浮点数120.5的float类型存储如下(二进制):

SignExponentMantissa
01000 0101111 0001 0000 0000 0000 0000
06+127=133110 1101

取值范围

在Microsoft Visual C++ 中取值范围如下:

Type NameBytesRange of Values
__int324–2,147,483,648 to 2,147,483,647
unsigned __int3240 to 4,294,967,295
__int648–9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
unsigned __int6480 to 18,446,744,073,709,551,615
float43.4E +/- 38 (7 digits)
double81.7E +/- 308 (15 digits) <—这个指15 个十进制位

比较浮点数相等

浮点十进制值通常没有完全相同的二进制表示形式。 这是 CPU 所采用的浮点数据表示形式的副作用。为此,可能会经历一些精度丢失,并且一些浮点运算可能会产生意外的结果。 导致此行为的原因是下面之一:

十进制数的二进制表示形式可能不精确。

使用的数字之间类型不匹配(例如,混合使用浮点型和双精度型)。

为解决此行为,大多数程序员或是确保值比需要的大或者小,或是获取并使用可以维护精度的二进制编码的十进制 (BCD) 库。 浮点值的二进制表示形式影响浮点计算的精度和准确性。

忽略精度

建议采用的一种方法是定义两个值之间可接受的差值幅度(例如.000001),而不是比较其是否相等。 如果两个值之间的绝对差值小于或等于该幅度,则差值可能是因精度差异而产生的,因此这两个值可能相等。 下面的示例使用此方法比较 .33333 和 1/3。

#define EPSILON         .000001   // Define your own tolerance
#define FLOAT_EQ(x,v)   (((v - EPSILON) < x) && (x <( v + EPSILON)))

double double1 = .3333333;
double double2 = (double) 1/3;

if (FLOAT_EQ(double1, double2))
     printf("double1 and double2 are equal.");
else
     printf("double1 and double2 are unequal.");

if (fabs(double1 - double2) < .000001)
     printf("double1 and double2 are equal.");
else
     printf("double1 and double2 are unequal.");

对于 EPSILON,可以使用常数 FLT_EPSILON(为浮点型定义为 1.192092896e-07F)或者 DBL_EPSILON(为双精度型定义为 2.2204460492503131e-016)。

直接降低精度

  • 比如想对 double d 降低精度,只需要做

*((__int64*)&d)&=~0xf;0xf 是可调整的,这一句就是把尾数最后几位清为 0 而已。PS: 感觉这种做法的精度有问题.对于指数大于53的应当存在很大的误差。

浮点型转整型

Tip

当float转为整型值时,小数部分被舍弃。如果浮点数的值过于庞大,无法容纳于整型值中,那么其结果将是未定义的。《C与指针》

从尾数位数来看,double可是表示54位二进制,所以__int32–>double–>__int32,是没有误差的; 但__int64超出这个范围,其互转就不精确了。

Lua中number是用double存储的,在脚本和C API的交互时,经常出现int、double互转。 PS:在lua使用lua_pushnumber(DWORD)写入,则对应的要使用(DWORD)lua_tonumber()读取。 当lua_pushnumber(DWORD)写入但采用(DWORD)lua_tointeger读取的时候,当数值大于INT_MAX时,会出现数据错误。

ftol的一种实现方式

_ftol 的整数指令版本

int ftol(float f)
{
     int a         = *(int*)(&f);
     int sign      = (a>>31);
     int mantissa  = (a&((1<<23)-1))|(1<<23);
     int exponent  = ((a&0x7fffffff)>>23)-127;
     int r         = ((unsigned int)(mantissa)<<8)>>(31-exponent);
     return ((r ^ (sign)) - sign ) &~ (exponent>>31);
}
  • 更多ftol的实现和比较可参考

代码优化-之-优化浮点数取整

lua中double to int 的实现

union luai_Cast { double l_d; long l_l; };
#define lua_number2int(i,d) \
{ volatile union luai_Cast u; u.l_d = (d) + 6755399441055744.0; (i) = u.l_l; }
  • 分析

这个宏神奇的在正数和负数的双精度浮点数时都可以正确工作,以四舍五入方式转换为 32 位整数。 这个数字是 1.5*2^52, 小于 2^31 的数字在和这个magic number 相加的时候, 按浮点加法的规则(以科学计数法记数),和一定按幂大的一个对齐。 而 1.5 是2进制的 1.1 在浮点标准中,小数点前的 1 是不需要记录的。 这样,double 的前四字节就被空出来。而需要转换的整数将因为加法恰当的被置入对应的位置。

#include <stdio.h> #include <stdlib.h> #include "dpc_t.cuh" // Number of blocks #define NUM_BLOCKS 1024 // Number of threads per block #define NUM_COUNT 512 // Dimension of the dataset #define DIMENSION 4 using namespace std; /*---------------------- 核函数 ----------------------*/ static __global__ void find_cell_contents(const int N, int* cell_count, const int* cell_count_sum, int* cell_contents, float* d_data, const int nx, const int ny, const int nz, const int nw, const double rc_inv, int M) { const int n1 = blockIdx.x * blockDim.x + threadIdx.x; if (n1 < N) { float4 temp_n1; temp_n1.x = d_data[n1 * M + 0]; temp_n1.y = d_data[n1 * M + 1]; temp_n1.z = d_data[n1 * M + 2]; temp_n1.w = d_data[n1 * M + 3]; int cell_id; find_cell_id(temp_n1.x, temp_n1.y, temp_n1.z, temp_n1.w, rc_inv, nx, ny, nz, nw, cell_id); const int ind = atomicAdd(&cell_count[cell_id], 1); cell_contents[cell_count_sum[cell_id] + ind] = n1; } } void __global__ DPC_computeRho_kernel(float* d_data, int* d_NL, int N, int M, float dc) { int n1 = blockIdx.x * blockDim.x + threadIdx.x; if (n1 < N) { float4 temp_n1; temp_n1.x = d_data[n1 * M + 0]; temp_n1.y = d_data[n1 * M + 1]; temp_n1.z = d_data[n1 * M + 2]; temp_n1.w = d_data[n1 * M + 3]; for (int n2 = n1 + 1; n2 < N; n2++) { float distance_square = 0; distance_square = distance_square + (temp_n1.x - d_data[n2 * M + 0]) * (temp_n1.x - d_data[n2 * M + 0]); distance_square = distance_square + (temp_n1.y - d_data[n2 * M + 1]) * (temp_n1.y - d_data[n2 * M + 1]); distance_square = distance_square + (temp_n1.z - d_data[n2 * M + 2]) * (temp_n1.z - d_data[n2 * M + 2]); distance_square = distance_square + (temp_n1.w - d_data[n2 * M + 3]) * (temp_n1.w - d_data[n2 * M + 3]); if (distance_square < dc) { //d_NL[n1]++; //d_NL[n2]++; atomicAdd(&d_NL[n1], 1); atomicAdd(&d_NL[n2], 1); } } } } void __global__ DPC_computeRho_kernel_s(float* d_data, int* d_NL, int N, int M, float dc) { // 当前块的起始索引 int block_start = blockIdx.x * blockDim.x; int thread_id = threadIdx.x; // 共享内存声明 __shared__ float4 shared_data[NUM_COUNT]; // 将当前块的数据加载到共享内存 if (block_start + thread_id < N) { shared_data[thread_id] = *((float4*)(d_data + (block_start + thread_id) * M)); } __syncthreads(); // 每个线程处理一个n1 int n1 = block_start + thread_id; if (n1 >= N) return; // 局部计数器减少全局原子操作 int local_count = 0; float4 temp_n1 = *((float4*)(d_data + n1 * M)); // 处理块内数据点 for (int n2_in_block = 0; n2_in_block < blockDim.x; ++n2_in_block) { int n2_global = block_start + n2_in_block; if (n2_global <= n1 || n2_global >= N) continue; float4 temp_n2 = shared_data[n2_in_block]; float dx = temp_n1.x - temp_n2.x; float dy = temp_n1.y - temp_n2.y; float dz = temp_n1.z - temp_n2.z; float dw = temp_n1.w - temp_n2.w; float dist_sq = dx * dx + dy * dy + dz * dz + dw * dw; if (dist_sq < dc) { local_count++; } } // 处理后续数据块 for (int other_block = blockIdx.x + 1; other_block < gridDim.x; ++other_block) { int other_start = other_block * blockDim.x; // 加载其他块到共享内存 __shared__ float4 other_shared[NUM_COUNT]; if (other_start + thread_id < N) { other_shared[thread_id] = *((float4*)(d_data + (other_start + thread_id) * M)); } __syncthreads(); // 与当前块数据比较 for (int n2_in_other = 0; n2_in_other < blockDim.x; ++n2_in_other) { int n2_global = other_start + n2_in_other; if (n2_global >= N) continue; float4 temp_n2 = other_shared[n2_in_other]; float dx = temp_n1.x - temp_n2.x; float dy = temp_n1.y - temp_n2.y; float dz = temp_n1.z - temp_n2.z; float dw = temp_n1.w - temp_n2.w; float dist_sq = dx * dx + dy * dy + dz * dz + dw * dw; if (dist_sq < dc) { local_count++; } } __syncthreads(); } // 最终原子操作 atomicAdd(&d_NL[n1], local_count); } void __global__ DPC_computeRho_kernel_s_2(float* d_data, int* d_NL, int N, int M, float dc) { // 修改点1:共享内存大小扩大一倍以存储两个点的数据 __shared__ float4 shared_data[NUM_COUNT * 2]; // 使用动态共享内存 // 修改点2:计算块起始索引时考虑双倍处理量 int block_start = blockIdx.x * (blockDim.x * 2); int thread_id = threadIdx.x; // 修改点3:每个线程处理两个相邻的n1点 int n1_1 = block_start + 2 * thread_id; int n1_2 = n1_1 + 1; // 修改点4:双点数据加载到共享内存 if (n1_1 < N) { shared_data[2 * thread_id] = *((float4*)(d_data + n1_1 * M)); } if (n1_2 < N) { shared_data[2 * thread_id + 1] = *((float4*)(d_data + n1_2 * M)); } __syncthreads(); // 修改点5:双点局部计数器 int local_count_1 = 0, local_count_2 = 0; float4 temp_n1_1, temp_n1_2; // 加载当前线程处理的两个点数据 if (n1_1 < N) temp_n1_1 = shared_data[2 * thread_id]; if (n1_2 < N) temp_n1_2 = shared_data[2 * thread_id + 1]; // 处理当前块内数据点(范围扩大为双倍) for (int n2_in_block = 0; n2_in_block < 2 * blockDim.x; ++n2_in_block) { int n2_global = block_start + n2_in_block; if (n2_global >= N) continue; float4 temp_n2 = shared_data[n2_in_block]; // 处理第一个点n1_1 if (n1_1 < N && n2_global > n1_1) { float dx = temp_n1_1.x - temp_n2.x; float dy = temp_n1_1.y - temp_n2.y; float dz = temp_n1_1.z - temp_n2.z; float dw = temp_n1_1.w - temp_n2.w; float dist_sq = dx * dx + dy * dy + dz * dz + dw * dw; if (dist_sq < dc) local_count_1++; } // 处理第二个点n1_2 if (n1_2 < N && n2_global > n1_2) { float dx = temp_n1_2.x - temp_n2.x; float dy = temp_n1_2.y - temp_n2.y; float dz = temp_n1_2.z - temp_n2.z; float dw = temp_n1_2.w - temp_n2.w; float dist_sq = dx * dx + dy * dy + dz * dz + dw * dw; if (dist_sq < dc) local_count_2++; } } // 处理后续数据块(需要调整other_block计算) for (int other_block = blockIdx.x + 1; other_block < gridDim.x; ++other_block) { int other_start = other_block * (blockDim.x * 2); // 修改点6:加载其他块的双点数据 //__shared__ float4 other_shared[NUM_COUNT * 2]; int load_pos = 2 * thread_id; if (other_start + load_pos < N) { shared_data[load_pos] = *((float4*)(d_data + (other_start + load_pos) * M)); } if (other_start + load_pos + 1 < N) { shared_data[load_pos + 1] = *((float4*)(d_data + (other_start + load_pos + 1) * M)); } __syncthreads(); // 比较其他块中的点 for (int n2_in_other = 0; n2_in_other < 2 * blockDim.x; ++n2_in_other) { int n2_global = other_start + n2_in_other; if (n2_global >= N) continue; float4 temp_n2 = shared_data[n2_in_other]; // 处理第一个点n1_1 if (n1_1 < N && n2_global > n1_1) { float dx = temp_n1_1.x - temp_n2.x; float dy = temp_n1_1.y - temp_n2.y; float dz = temp_n1_1.z - temp_n2.z; float dw = temp_n1_1.w - temp_n2.w; float dist_sq = dx * dx + dy * dy + dz * dz + dw * dw; if (dist_sq < dc) local_count_1++; } // 处理第二个点n1_2 if (n1_2 < N && n2_global > n1_2) { float dx = temp_n1_2.x - temp_n2.x; float dy = temp_n1_2.y - temp_n2.y; float dz = temp_n1_2.z - temp_n2.z; float dw = temp_n1_2.w - temp_n2.w; float dist_sq = dx * dx + dy * dy + dz * dz + dw * dw; if (dist_sq < dc) local_count_2++; } } __syncthreads(); } // 修改点7:双点原子操作 if (n1_1 < N) atomicAdd(&d_NL[n1_1], local_count_1); if (n1_2 < N) atomicAdd(&d_NL[n1_2], local_count_2); } void __global__ DPC_computeRhoM_kernel(float* d_data, int* d_NL, int N, int M, float dc, int nx, int ny, int nz, int nw, float rc_inv, const int* cell_counts, const int* cell_count_sum, const int* cell_contents) { const int n1 = blockIdx.x * blockDim.x + threadIdx.x; int count = 0; if (n1 < N) { float4 temp_n1; temp_n1.x = d_data[n1 * M + 0]; temp_n1.y = d_data[n1 * M + 1]; temp_n1.z = d_data[n1 * M + 2]; temp_n1.w = d_data[n1 * M + 3]; int cell_id; int cell_id_x; int cell_id_y; int cell_id_z; int cell_id_w; find_cell_id(temp_n1.x, temp_n1.y, temp_n1.z, temp_n1.w, rc_inv, nx, ny, nz, nw, cell_id_x, cell_id_y, cell_id_z, cell_id_w, cell_id); for (int dw = -1; dw <= 1; ++dw) { for (int dz = -1; dz <= 1; ++dz) { for (int dy = -1; dy <= 1; ++dy) { for (int dx = -1; dx <= 1; ++dx) { int neighbor_cell = cell_id + dw * nx * ny * nz + dz * nx * ny + dy * nx + dx; if (cell_id_x + dx < 0) neighbor_cell += nx; if (cell_id_x + dx >= nx) neighbor_cell -= nx; if (cell_id_y + dy < 0) neighbor_cell += ny * nx; if (cell_id_y + dy >= ny) neighbor_cell -= ny * nx; if (cell_id_z + dz < 0) neighbor_cell += nz * ny * nx; if (cell_id_z + dz >= nz) neighbor_cell -= nz * ny * nx; if (cell_id_z + dw < 0) neighbor_cell += nz * ny * nx * nw; if (cell_id_z + dw >= nz) neighbor_cell -= nz * ny * nx * nw; const int num_atoms_neighbor_cell = cell_counts[neighbor_cell]; const int num_atoms_previous_cells = cell_count_sum[neighbor_cell]; for (int m = 0; m < num_atoms_neighbor_cell; ++m) { const int n2 = cell_contents[num_atoms_previous_cells + m]; if (n1 != n2) { float x12 = temp_n1.x - d_data[n2 * M + 0]; float y12 = temp_n1.y - d_data[n2 * M + 1]; float z12 = temp_n1.z - d_data[n2 * M + 2]; float w12 = temp_n1.w - d_data[n2 * M + 3]; const float d2 = x12 * x12 + y12 * y12 + z12 * z12 + w12 * w12; if (d2 < dc) { atomicAdd(&d_NL[n1], 1); atomicAdd(&d_NL[n2], 1); } } } } } } } } } void __global__ DPC_computeRhoD_kernel(float* d_data, int* d_NL, int N, int M, float dc, float* distance, float* min_dist, float* max_dist,int* min_idx) { int n1 = blockIdx.x * blockDim.x + threadIdx.x; if (n1 < N) { float4 temp_n1; temp_n1.x = d_data[n1 * M + 0]; temp_n1.y = d_data[n1 * M + 1]; temp_n1.z = d_data[n1 * M + 2]; temp_n1.w = d_data[n1 * M + 3]; float min = 99999999.9, max = 0; int ti = 0, tj = 0; for (int n2 = 0; n2 < N; n2++) { float distance_square = 0; distance_square = distance_square + (temp_n1.x - d_data[n2 * M + 0]) * (temp_n1.x - d_data[n2 * M + 0]); distance_square = distance_square + (temp_n1.y - d_data[n2 * M + 1]) * (temp_n1.y - d_data[n2 * M + 1]); distance_square = distance_square + (temp_n1.z - d_data[n2 * M + 2]) * (temp_n1.z - d_data[n2 * M + 2]); distance_square = distance_square + (temp_n1.w - d_data[n2 * M + 3]) * (temp_n1.w - d_data[n2 * M + 3]); distance[n2] = distance_square; if (distance_square < dc && n2>n1) { //d_NL[n1]++; //d_NL[n2]++; atomicAdd(&d_NL[n1], 1); atomicAdd(&d_NL[n2], 1); } if (n2 == n1)continue; if (distance[n2] > max) { max = distance[n2]; ti = n2; } if (distance[n2] < min) { min = distance[n2]; tj = n2; } } min_dist[n1] = min; min_idx[n1] = tj; max_dist[n1] = max; } } void __global__ DPC_computeDeltaAndNearestHigherD_kernel(float* d_delta, float* d_nearest_higher, int N, int M, int* d_indices, float* d_data) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < N) { int current_idx = d_indices[i]; float4 temp_n1; temp_n1.x = d_data[current_idx * M + 0]; temp_n1.y = d_data[current_idx * M + 1]; temp_n1.z = d_data[current_idx * M + 2]; temp_n1.w = d_data[current_idx * M + 3]; float min_dist = FLT_MAX; float max_dist = 0.0; int nearest = -1; for (int j = 0; j < i; ++j) { int higher_idx = d_indices[j]; float distance_square = 0; distance_square = distance_square + (temp_n1.x - d_data[higher_idx * M + 0]) * (temp_n1.x - d_data[higher_idx * M + 0]); distance_square = distance_square + (temp_n1.y - d_data[higher_idx * M + 1]) * (temp_n1.y - d_data[higher_idx * M + 1]); distance_square = distance_square + (temp_n1.z - d_data[higher_idx * M + 2]) * (temp_n1.z - d_data[higher_idx * M + 2]); distance_square = distance_square + (temp_n1.w - d_data[higher_idx * M + 3]) * (temp_n1.w - d_data[higher_idx * M + 3]); double dist = distance_square; if (dist < min_dist) { min_dist = dist; nearest = higher_idx; } } if (nearest == -1) { // 处理密度最高的点 float max_dist = 0.0; for (int k = 0; k < N; ++k) { float distance_square = 0; distance_square = distance_square + (temp_n1.x - d_data[k * M + 0]) * (temp_n1.x - d_data[k * M + 0]); distance_square = distance_square + (temp_n1.y - d_data[k * M + 1]) * (temp_n1.y - d_data[k * M + 1]); distance_square = distance_square + (temp_n1.z - d_data[k * M + 2]) * (temp_n1.z - d_data[k * M + 2]); distance_square = distance_square + (temp_n1.w - d_data[k * M + 3]) * (temp_n1.w - d_data[k * M + 3]); if (k != current_idx && distance_square > max_dist) { max_dist = distance_square; } } d_delta[current_idx] = max_dist; d_nearest_higher[current_idx] = -1; } else { d_delta[current_idx] = min_dist; d_nearest_higher[current_idx] = nearest; } } } void __global__ DPC_computeDeltaAndNearestHigherR_kernel(float* d_delta, float* d_nearest_higher, int N, int M, int* d_indices, float* d_data, float* min_dist, float* max_dist, int* min_idx) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < N) { int current_idx = d_indices[i]; float min_dist_r = FLT_MAX; int nearest = min_idx[i]; min_dist_r = min_dist[i]; if (nearest == -1) { // 处理密度最高的点 float max_dist_r = max_dist[i]; d_delta[current_idx] = max_dist_r; d_nearest_higher[current_idx] = -1; } else { d_delta[current_idx] = min_dist_r; d_nearest_higher[current_idx] = nearest; } } } /*---------------------- 封装函数 ----------------------*/ extern "C" void launch_kernel_DPC_computeRho(float* h_data, int* h_NL, int N, int M, float dc) { float dc2 = 1.0 * 1.0; float* d_data; int* d_NL; cudaMalloc(&d_data, M * N * sizeof(float)); cudaMalloc(&d_NL, N * sizeof(int)); cudaMemcpy(d_data, h_data, M * N * sizeof(float), cudaMemcpyHostToDevice); cudaDeviceSynchronize(); //DPC_computeRho_kernel << <dim3(NUM_BLOCKS, 1), dim3(NUM_COUNT, 1) >> > (d_data, d_NL, N, M, dc2); DPC_computeRho_kernel_s << <dim3(NUM_BLOCKS, 1), dim3(NUM_COUNT, 1) >> > (d_data, d_NL, N, M, dc2); //DPC_computeRho_kernel_s_2 << <dim3(NUM_BLOCKS / 2, 1), dim3(NUM_COUNT, 1) >> > (d_data, d_NL, N, M, dc2); cudaMemcpy(h_NL, d_NL, N * sizeof(int), cudaMemcpyDeviceToHost); cudaFree(d_NL); cudaFree(d_data); } extern "C" void launch_kernel_DPC_computeDeltaAndNearestHigherD(float* h_delta, float* h_nearest_higher, int N, int M, int* h_indices, float* h_data) { float* d_delta; float* d_nearest_higher; float* d_data; int* d_indices; cudaMalloc(&d_delta, N * sizeof(float)); cudaMalloc(&d_nearest_higher, N * sizeof(float)); cudaMalloc(&d_data, M * N * sizeof(float)); cudaMalloc(&d_indices, N * sizeof(int)); cudaMemcpy(d_data, h_data, M * N * sizeof(float), cudaMemcpyHostToDevice); cudaMemcpy(d_indices, h_indices, N * sizeof(int), cudaMemcpyHostToDevice); cudaDeviceSynchronize(); //DPC_computeDeltaAndNearestHigherR_kernel << <dim3(NUM_BLOCKS, 1), dim3(NUM_COUNT, 1) >> > (d_delta, d_nearest_higher, N, M, d_indices, d_data); DPC_computeDeltaAndNearestHigherD_kernel << <dim3(NUM_BLOCKS, 1), dim3(NUM_COUNT, 1) >> > (d_delta, d_nearest_higher, N, M, d_indices, d_data); cudaMemcpy(h_delta, d_delta, N * sizeof(float), cudaMemcpyDeviceToHost); cudaMemcpy(h_nearest_higher, d_nearest_higher, N * sizeof(int), cudaMemcpyDeviceToHost); cudaFree(d_data); cudaFree(d_delta); cudaFree(d_nearest_higher); cudaFree(d_indices); } extern "C" void launch_thrust_findClusterCenters(float* h_rho_delta, int* ord_clusters, int N) { thrust::host_vector<MyStruct> data_host(N); thrust::device_vector<MyStruct> data_device(N); for (int i = 0; i < N; ++i) { data_host[i] = { h_rho_delta[i], i}; } data_device = data_host; // 将host端的数据拷贝到device端 thrust::sort(data_device.rbegin(), data_device.rend()); // 进行排序 data_host = data_device; for (int i = 0; i < N; ++i) { ord_clusters[i] = data_host[i].id; } } extern "C" void launch_thrust_assignClusters(float* h_ord_rho, int* h_ord_indices, int N) { thrust::host_vector<MyStruct> data_host(N); thrust::device_vector<MyStruct> data_device(N); for (int i = 0; i < N; ++i) { data_host[i] = { h_ord_rho[i], i }; } data_device = data_host; // 将host端的数据拷贝到device端 thrust::sort(data_device.rbegin(), data_device.rend()); // 进行排序 data_host = data_device; for (int i = 0; i < N; ++i) { h_ord_indices[i] = data_host[i].id; } } extern "C" void launch_kernel_DPC_computeRhoM(float* h_data, int* h_NL, int N, int M, int dc, int* h_max_min) { float dc2 = 1.0 * 1.0; float rc_inv = 1.0 / 1.0; int N_cells = h_max_min[0] * h_max_min[1] * h_max_min[2] * h_max_min[3]; float* d_data; int* d_NL; cudaMalloc(&d_data, M * N * sizeof(float)); cudaMalloc(&d_NL, N * sizeof(int)); cudaMemcpy(d_data, h_data, M * N * sizeof(float), cudaMemcpyHostToDevice); int* d_cell_counts; int* d_cell_count_sum; int* d_cell_contents; cudaMalloc(&d_cell_counts, N_cells * sizeof(int)); cudaMalloc(&d_cell_count_sum, N_cells * sizeof(int)); cudaMalloc(&d_cell_contents, N * sizeof(int)); cudaDeviceSynchronize(); find_cell_contents << <dim3(NUM_BLOCKS, 1), dim3(NUM_COUNT, 1) >> > (N, d_cell_counts, d_cell_count_sum, d_cell_contents, d_data, h_max_min[0], h_max_min[1], h_max_min[2], h_max_min[3], rc_inv, M); cudaDeviceSynchronize(); DPC_computeRhoM_kernel << <dim3(NUM_BLOCKS, 1), dim3(NUM_COUNT, 1) >> > (d_data, d_NL, N, M, dc2, h_max_min[0], h_max_min[1], h_max_min[2], h_max_min[3], rc_inv, d_cell_counts, d_cell_count_sum, d_cell_contents); cudaMemcpy(h_NL, d_NL, N * sizeof(int), cudaMemcpyDeviceToHost); cudaFree(d_NL); cudaFree(d_data); }如何调整线程大小
最新发布
06-13
#include <iostream> #include <vector> #include <cmath> #include <algorithm> #include <numeric> #include <limits> #include <time.h> #include <fstream> #include <sstream> #include <string> #include <cuda_runtime.h> #define USE_GPU 1 using namespace std; extern "C" void launch_kernel_DPC_computeRho(float* h_data, int* h_NL, int N, int M, int dc); extern "C" void launch_kernel_DPC_computeDeltaAndNearestHigherD(float* h_delta, float* h_nearest_higher, int N, int M, int* h_indices, float* h_data); extern "C" void launch_thrust_findClusterCenters(float* h_rho_delta, int* ord_clusters, int N); extern "C" void launch_thrust_assignClusters(float* h_ord_rho, int* h_ord_indices, int N); extern "C" void launch_kernel_DPC_computeRhoR(float* h_data, int* h_NL, int N, int M, int dc); extern "C" void launch_kernel_DPC_computeRhoM(float* h_data, int* h_NL, int N, int M, int dc, int* h_max_min); struct DataPoint { vector<double> coordinates; double rho = 0.0; // 局部密度 double delta = 0.0; // 到更高密度点的最小距离 int nearest_higher = -1; // 最近更高密度点的索引 int cluster = -1; // 聚类标签,-1表示未分类 }; float* h_NN; int* h_NL; int* h_row; int* h_col; float* dc_2; float* h_data; float DC; float* h_min_dist; int* h_min_idx; // 计算欧氏距离 double euclideanDistance(const DataPoint& a, const DataPoint& b) { double dist = 0.0; for (size_t i = 0; i < a.coordinates.size(); ++i) { dist += pow(a.coordinates[i] - b.coordinates[i], 2); } return sqrt(dist); } // 计算距离矩阵 float* computeDistanceMatrix(const vector<DataPoint>& points) { size_t n = points.size(); float* distMatrix; cudaMallocHost((void**)&distMatrix, n * n * sizeof(float)); for (size_t i = 0; i < n; ++i) { for (size_t j = i + 1; j < n; ++j) { double dist = euclideanDistance(points[i], points[j]); distMatrix[i * n + j] = dist; distMatrix[j * n + i] = dist; } } return distMatrix; } // 计算距离矩阵 float* computeDistanceMatrix_gpu(const vector<DataPoint>& points) { size_t n = points.size(); size_t m = points[0].coordinates.size(); float max[4] = { -9999.9,-9999.9 ,-9999.9 ,-9999.9 }; float min[4] = { 9999.9 ,9999.9 ,9999.9 ,9999.9 }; int max_min[4]; if (0) { for (size_t i = 0; i < n; ++i) { for (size_t j = 0; j < m; ++j) { h_data[i * m + j] = points[i].coordinates[j]; if (max[j] < points[i].coordinates[j]) { max[j] = points[i].coordinates[j]; } if (min[j] > points[i].coordinates[j]) { min[j] = points[i].coordinates[j]; } } } max_min[0] = int((max[0] - min[0]) / DC) + 1; max_min[1] = int((max[1] - min[1]) / DC) + 1; max_min[2] = int((max[2] - min[2]) / DC) + 1; max_min[3] = int((max[3] - min[3]) / DC) + 1; } launch_kernel_DPC_computeRho(h_data, h_NL, n, m, DC); //launch_kernel_DPC_computeRhoR(h_data, h_NL, n, m, DC); //launch_kernel_DPC_computeRhoM(h_data, h_NL, n, m, DC, max_min); return h_NN; } // 使用截断距离法计算局部密度 void computeRho(vector<DataPoint>& points, float* distMatrix, double dc) { size_t n = points.size(); for (size_t i = 0; i < n; ++i) { double rho = 0.0; for (size_t j = 0; j < n; ++j) { if (i != j && distMatrix[i * n + j] < dc) { rho += 1.0; } } points[i].rho = rho; } } // 使用截断距离法计算局部密度 void computeRho_gpu(vector<DataPoint>& points, float* distMatrix, double dc) { size_t n = points.size(); for (size_t i = 0; i < n; ++i) { points[i].rho = h_NL[i]; } } // 计算delta和最近更高密度点 void computeDeltaAndNearestHigher(vector<DataPoint>& points, float* distMatrix) { size_t n = points.size(); vector<size_t> indices(n); iota(indices.begin(), indices.end(), 0); // 按密度降序排序索引 sort(indices.begin(), indices.end(), [&points](size_t a, size_t b) { return points[a].rho > points[b].rho; }); // 找到每个点的最小距离和最近更高密度点 for (size_t i = 0; i < n; ++i) { size_t current_idx = indices[i]; double min_dist = numeric_limits<double>::max(); int nearest = -1; // 检查所有密度更高的点 for (size_t j = 0; j < i; ++j) { size_t higher_idx = indices[j]; double dist = distMatrix[current_idx * n + higher_idx]; if (dist < min_dist) { min_dist = dist; nearest = higher_idx; } } if (nearest == -1) { // 处理密度最高的点 double max_dist = 0.0; for (size_t k = 0; k < n; ++k) { if (k != current_idx && distMatrix[current_idx * n + k] > max_dist) { max_dist = distMatrix[current_idx * n + k]; } } points[current_idx].delta = max_dist; points[current_idx].nearest_higher = -1; } else { points[current_idx].delta = min_dist; points[current_idx].nearest_higher = nearest; } } } // 计算delta和最近更高密度点 void computeDeltaAndNearestHigher_gpu(vector<DataPoint>& points, float* distMatrix) { int m = points[0].coordinates.size(); //vector<size_t> indices(n); //iota(indices.begin(), indices.end(), 0); //// 按密度降序排序索引 //sort(indices.begin(), indices.end(), [&points](size_t a, size_t b) { // return points[a].rho > points[b].rho; // }); int n = points.size(); int* ord_indices; cudaMallocHost((void**)&ord_indices, n * sizeof(int)); for (int i = 0; i < n; i++) { ord_indices[i] = i; } float* ord_rho; cudaMallocHost((void**)&ord_rho, n * sizeof(float)); vector<float> ordrho; for (size_t i = 0; i < points.size(); ++i) { ord_rho[i] = points[i].rho; } launch_thrust_assignClusters(ord_rho, ord_indices, n); int* h_indices; float* h_delta; float* h_nearest_higher; cudaMallocHost((void**)&h_indices, n * sizeof(int)); cudaMallocHost((void**)&h_delta, n * sizeof(float)); cudaMallocHost((void**)&h_nearest_higher, n * sizeof(float)); for (int i = 0; i < n; i++) { h_indices[i] = ord_indices[i];; h_nearest_higher[i] = -1; h_delta[i] = 0.0; } launch_kernel_DPC_computeDeltaAndNearestHigherD(h_delta, h_nearest_higher, n, m, h_indices, h_data); //launch_kernel_DPC_computeDeltaAndNearestHigherR(h_delta, h_nearest_higher, n, m, h_indices, h_data); for (int i = 0; i < n; i++) { points[i].delta = h_delta[i]; points[i].nearest_higher = h_nearest_higher[i]; } cudaFreeHost(h_indices); cudaFreeHost(h_delta); cudaFreeHost(h_nearest_higher); } // 查找聚类中心(按rho*delta乘积排序) vector<int> findClusterCenters(const vector<DataPoint>& points, int n_clusters) { vector<pair<double, int>> products; for (size_t i = 0; i < points.size(); ++i) { products.emplace_back(points[i].rho * points[i].delta, i); } sort(products.rbegin(), products.rend()); vector<int> centers; for (int i = 0; i < n_clusters; ++i) { centers.push_back(products[i].second); } return centers; } // 查找聚类中心(按rho*delta乘积排序) vector<int> findClusterCenters_gpu(const vector<DataPoint>& points, int n_clusters) { int n = points.size(); float* rho_delta; int* ord_clusters; cudaMallocHost((void**)&rho_delta, n * sizeof(float)); cudaMallocHost((void**)&ord_clusters, n * sizeof(int)); for (size_t i = 0; i < points.size(); ++i) { rho_delta[i] = points[i].rho * points[i].delta; } launch_thrust_findClusterCenters(rho_delta, ord_clusters, n); vector<int> centers; for (int i = 0; i < n_clusters; ++i) { centers.push_back(ord_clusters[i]); } return centers; cudaFreeHost(rho_delta); cudaFreeHost(ord_clusters); } // 分配聚类标签 void assignClusters(vector<DataPoint>& points, const vector<int>& centers) { // 标记聚类中心 for (size_t i = 0; i < centers.size(); ++i) { points[centers[i]].cluster = i + 1; } // 按密度降序处理所有点 vector<size_t> indices(points.size()); iota(indices.begin(), indices.end(), 0); sort(indices.begin(), indices.end(), [&points](size_t a, size_t b) { return points[a].rho > points[b].rho; }); // 分配聚类标签 for (size_t idx : indices) { if (points[idx].cluster != -1) continue; if (points[idx].nearest_higher == -1) { points[idx].cluster = 0; // 异常点 } else { points[idx].cluster = points[points[idx].nearest_higher].cluster; } } } // 分配聚类标签 void assignClusters_gpu(vector<DataPoint>& points, const vector<int>& centers) { // 标记聚类中心 for (size_t i = 0; i < centers.size(); ++i) { points[centers[i]].cluster = i + 1; } int n = points.size(); int* ord_indices; cudaMallocHost((void**)&ord_indices, n * sizeof(int)); for (int i = 0; i < n; i++) { ord_indices[i] = i; } float* ord_rho; cudaMallocHost((void**)&ord_rho, n * sizeof(float)); for (size_t i = 0; i < points.size(); ++i) { ord_rho[i] = points[i].rho; } launch_thrust_assignClusters(ord_rho, ord_indices, n); vector<size_t> indices(points.size()); for (int i = 0; i < n; i++) { indices[i] = ord_indices[i]; } // 分配聚类标签 for (size_t idx : indices) { if (points[idx].cluster != -1) continue; if (points[idx].nearest_higher == -1) { points[idx].cluster = 0; // 异常点 } else { points[idx].cluster = points[points[idx].nearest_higher].cluster; } } } vector<DataPoint> GetData(const string& file_path) { vector<DataPoint> dataPoints; ifstream file(file_path); if (!file.is_open()) { throw runtime_error("无法打开文件: " + file_path); } string line; while (getline(file, line)) { // 跳过空行 if (line.find_first_not_of(" \t\n") == string::npos) continue; // 替换所有逗号为空格(兼容逗号分隔) replace(line.begin(), line.end(), ',', ' '); vector<double> row; DataPoint p; stringstream ss(line); double value; // 解析每行数据 while (ss >> value) { row.push_back(value); } if (!row.empty()) { row.pop_back(); p.coordinates = row; dataPoints.push_back(p); } } return dataPoints; } int main() { clock_t start, end; start = clock(); end = clock(); //cout << "Run time computeDistanceMatrix: " << (double)(end - start) << "(ms)" << endl; vector<DataPoint> points = GetData("C:/Users/admin/AppData/Roaming/feiq/Recv Files/dataset/data_500000_4.txt"); int n = points.size(); int m = points[0].coordinates.size(); cout << "数据量: " << n << endl; cout << "数据维度: " << m << endl; cudaMallocHost((void**)&h_NN, n * n * sizeof(float)); cudaMallocHost((void**)&h_NL, n * sizeof(int)); cudaMallocHost((void**)&h_data, m * n * sizeof(float)); cudaMallocHost((void**)&h_min_dist, n * sizeof(float)); cudaMallocHost((void**)&h_min_idx, n * sizeof(int)); // 参数设置 double dc = 0.1; // 截断距离 int n_clusters = 3; // 聚类数量 DC = dc; #if USE_GPU == 1 cout << "使用GPU"<< endl; #else cout << "没用GPU" << endl; #endif clock_t start_all, end_all; start_all = clock(); start = clock(); // 计算距离矩阵 #if USE_GPU == 1 auto distMatrix = computeDistanceMatrix_gpu(points); #else auto distMatrix = computeDistanceMatrix(points); #endif end = clock(); cout << "Run time computeDistanceMatrix: " << (double)(end - start) << "(ms)" << endl; start = clock(); // 计算局部密度 #if USE_GPU == 1 computeRho_gpu(points, distMatrix, dc); #else computeRho(points, distMatrix, dc); #endif end = clock(); cout << "Run time computeRho: " << (double)(end - start) << "(ms)" << endl; start = clock(); // 计算delta和最近更高密度点 #if USE_GPU == 1 computeDeltaAndNearestHigher_gpu(points, distMatrix); #else computeDeltaAndNearestHigher(points, distMatrix); #endif end = clock(); cout << "Run time computeDeltaAndNearestHigher: " << (double)(end - start) << "(ms)" << endl; start = clock(); // 查找聚类中心 #if USE_GPU == 1 auto centers = findClusterCenters_gpu(points, n_clusters); #else auto centers = findClusterCenters(points, n_clusters); #endif end = clock(); cout << "Run time findClusterCenters: " << (double)(end - start) << "(ms)" << endl; start = clock(); // 分配聚类标签 #if USE_GPU == 1 assignClusters_gpu(points, centers); #else assignClusters(points, centers); #endif end = clock(); cout << "Run time assignClusters: " << (double)(end - start)<< "(ms)" << endl; end_all = clock(); cout << "Run time all: " << (double)(end_all - start_all) << "(ms)" << endl; // 输出结果 cout << "聚类结果(显示前10个点):" << endl; for (size_t i = 0; i < 10; ++i) { cout << "点" << i << " ("; for (auto coord : points[i].coordinates) cout << coord << " "; cout << "): 聚类" << points[i].cluster << endl; } cudaFreeHost(h_NN); cudaFreeHost(h_data); return 0; }逐行分析代码
06-12
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值