leetgpu

License: CC 4.0 BY-SA

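Vector Addition
1.1 Problem
Implement a program on the GPU that performs element-wise addition of two vectors of 32-bit floating-point numbers. The program takes two input vectors of equal length and produces one output vector containing their element-wise sum.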

https://leetgpu.com/challenges/vector-addition
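
For comparison, the same computation on the CPU is a single loop. The sketch below is a host-side reference only and is not part of the challenge; the name `vector_add_reference` is hypothetical, and its only purpose is to give a known-good result to check a GPU output against.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference (not part of the LeetGPU challenge):
// returns the element-wise sum of two equal-length float vectors.
std::vector<float> vector_add_reference(const std::vector<float>& a,
                                        const std::vector<float>& b) {
    assert(a.size() == b.size());  // the problem guarantees equal lengths
    std::vector<float> c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        c[i] = a[i] + b[i];  // the per-element work one GPU thread would do
    }
    return c;
}
```

On the GPU, the loop disappears: each index `i` is handled by its own thread.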
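1.2 Implementation Requirements

External libraries are not permitted
The signature of the solve function must remain unchanged
The final result must be stored in vector C

1.3 Example
[image]

1.4 Reference Code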
#include <cuda_runtime.h>

__global__ void vector_add(const float* A, const float* B, float* C, int N) {
    // Global index of the current thread
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Make sure the index is in range to avoid out-of-bounds access
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}

// A, B, C are device pointers (i.e. pointers to memory on the GPU)
extern "C" void solve(const float* A, const float* B, float* C, int N) {
    int threadsPerBlock = 256;
    // Number of blocks needed so that every element is processed (ceiling division)
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

    // Launch the kernel with the computed grid and block configuration
    vector_add<<<blocksPerGrid, threadsPerBlock>>>(A, B, C, N);
    // Wait for the kernel to finish (the returned error code is not checked here)
    cudaDeviceSynchronize();
}
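
The block count in solve is a rounded-up integer division, so the last block is usually only partly used; that is why the kernel needs the `i < N` check. A minimal host-side sketch of that arithmetic follows. The helper names `blocks_for` and `idle_threads` are hypothetical and do not appear in the original code.

```cpp
#include <cassert>

// Hypothetical helper: number of blocks needed to cover n elements
// when each block has threads_per_block threads (ceiling division).
int blocks_for(int n, int threads_per_block) {
    assert(n >= 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

// Hypothetical helper: number of launched threads whose index falls
// outside [0, n) and is therefore skipped by the kernel's bounds check.
int idle_threads(int n, int threads_per_block) {
    return blocks_for(n, threads_per_block) * threads_per_block - n;
}
```

For N = 1000 and 256 threads per block, this gives 4 blocks, so 1024 threads are launched and 24 of them do no work.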

Matrix Multiplication
2.1 Problem
Write a program that multiplies two matrices of 32-bit floating-point numbers on the GPU.
[image]
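
A CPU reference is useful for checking a GPU result. The sketch below takes A as M x N, B as N x K and C as M x K, which matches the bounds used in the kernel in this article (row < M, col < K, inner loop over N). Row-major storage is an assumption, and the name `matmul_reference` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference: C (M x K) = A (M x N) * B (N x K).
// Row-major storage is an assumption, not something the article states.
std::vector<float> matmul_reference(const std::vector<float>& A,
                                    const std::vector<float>& B,
                                    int M, int N, int K) {
    assert(A.size() == static_cast<std::size_t>(M) * N);
    assert(B.size() == static_cast<std::size_t>(N) * K);
    std::vector<float> C(static_cast<std::size_t>(M) * K, 0.0f);
    for (int row = 0; row < M; ++row) {
        for (int col = 0; col < K; ++col) {
            float sum = 0.0f;
            // Dot product of row `row` of A with column `col` of B
            for (int i = 0; i < N; ++i) {
                sum += A[row * N + i] * B[i * K + col];
            }
            C[row * K + col] = sum;
        }
    }
    return C;
}
```

In a GPU version, the two outer loops are typically replaced by a 2D grid of threads, one thread per element of C.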

2.2 Example
[image]

2.3 Code

#include <cuda_runtime.h>

__global__ void matrix_multiplication_kernel(const float* A, const float* B, float* C, int M, int N, int K) {
    // Row and column of the element of C handled by this thread
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    // Make sure both the row and the column are in range
    if (row < M && col < K) {
        float sum = 0.0f;
        // Compute the value of C[row][col]
        for (int i = 0; i < N; ++i) {
            sum += A[r
