🎯 Key Points
- Computational tests and analysis run on an NVIDIA V100 GPU.
- Evaluates the Stratton-Chu vector diffraction integral and uses the Monte Carlo method to analyze the interaction between charged particles and the electromagnetic field of a focused laser.
- Expresses the vector diffraction integral of a parabolic mirror in surface-integral form.
- Solves the oscillatory integrals with the Chebyshev differentiation-matrix method.
- Solves the trajectory of a charged particle colliding with a laser pulse using a Lorentz-force leapfrog algorithm.
🍁 CUDA Monte Carlo
- Multi-GPU parallel computation and analysis of magnetic-state Monte Carlo and transfer-matrix methods in CUDA
🍇 CUDA Tensor Computation
NVIDIA Tensor Cores are specialized execution units for mixed-precision general matrix multiplication (GEMM): the GEMM input matrices are in a lower precision, while the GEMM output matrix is in a higher precision. Mixed-precision training and inference are key techniques for accelerating neural networks. For example, a Tensor Core performs the fused multiply-add $D = AB + C$ on small tiles:
$$
D=\left(\begin{array}{cccc}
A_{0,0} & A_{0,1} & A_{0,2} & A_{0,3} \\
A_{1,0} & A_{1,1} & A_{1,2} & A_{1,3} \\
A_{2,0} & A_{2,1} & A_{2,2} & A_{2,3} \\
A_{3,0} & A_{3,1} & A_{3,2} & A_{3,3}
\end{array}\right)\left(\begin{array}{cccc}
B_{0,0} & B_{0,1} & B_{0,2} & B_{0,3} \\
B_{1,0} & B_{1,1} & B_{1,2} & B_{1,3} \\
B_{2,0} & B_{2,1} & B_{2,2} & B_{2,3} \\
B_{3,0} & B_{3,1} & B_{3,2} & B_{3,3}
\end{array}\right)+\left(\begin{array}{cccc}
C_{0,0} & C_{0,1} & C_{0,2} & C_{0,3} \\
C_{1,0} & C_{1,1} & C_{1,2} & C_{1,3} \\
C_{2,0} & C_{2,1} & C_{2,2} & C_{2,3} \\
C_{3,0} & C_{3,1} & C_{3,2} & C_{3,3}
\end{array}\right)
$$
Because NVIDIA Tensor Cores are designed specifically for GEMM, the GEMM throughput achievable with Tensor Cores is much higher than what can be achieved with NVIDIA CUDA Cores, which are better suited to general-purpose parallel programming.
NVIDIA CUDA allows users to program Tensor Core GEMM computations at the warp level. Although each Tensor Core can only execute matrix multiplications of a few specific small shapes for specific data types, a large GEMM can be decomposed into many small GEMMs and accumulated. Partition $A$ ($m \times k$), $B$ ($k \times n$), and $C$, $D$ ($m \times n$) into blocks of shape $d_{bm} \times d_{bk}$, $d_{bk} \times d_{bn}$, and $d_{bm} \times d_{bn}$, respectively:
$$
A=\left[\begin{array}{cccc}
A_{1,1}^{d_{bm} \times d_{bk}} & A_{1,2}^{d_{bm} \times d_{bk}} & \cdots & A_{1, k/d_{bk}}^{d_{bm} \times d_{bk}} \\
A_{2,1}^{d_{bm} \times d_{bk}} & A_{2,2}^{d_{bm} \times d_{bk}} & \cdots & A_{2, k/d_{bk}}^{d_{bm} \times d_{bk}} \\
\vdots & \vdots & \ddots & \vdots \\
A_{m/d_{bm}, 1}^{d_{bm} \times d_{bk}} & A_{m/d_{bm}, 2}^{d_{bm} \times d_{bk}} & \cdots & A_{m/d_{bm}, k/d_{bk}}^{d_{bm} \times d_{bk}}
\end{array}\right]
$$

$$
B=\left[\begin{array}{cccc}
B_{1,1}^{d_{bk} \times d_{bn}} & B_{1,2}^{d_{bk} \times d_{bn}} & \cdots & B_{1, n/d_{bn}}^{d_{bk} \times d_{bn}} \\
B_{2,1}^{d_{bk} \times d_{bn}} & B_{2,2}^{d_{bk} \times d_{bn}} & \cdots & B_{2, n/d_{bn}}^{d_{bk} \times d_{bn}} \\
\vdots & \vdots & \ddots & \vdots \\
B_{k/d_{bk}, 1}^{d_{bk} \times d_{bn}} & B_{k/d_{bk}, 2}^{d_{bk} \times d_{bn}} & \cdots & B_{k/d_{bk}, n/d_{bn}}^{d_{bk} \times d_{bn}}
\end{array}\right]
$$

$$
C=\left[\begin{array}{cccc}
C_{1,1}^{d_{bm} \times d_{bn}} & C_{1,2}^{d_{bm} \times d_{bn}} & \cdots & C_{1, n/d_{bn}}^{d_{bm} \times d_{bn}} \\
C_{2,1}^{d_{bm} \times d_{bn}} & C_{2,2}^{d_{bm} \times d_{bn}} & \cdots & C_{2, n/d_{bn}}^{d_{bm} \times d_{bn}} \\
\vdots & \vdots & \ddots & \vdots \\
C_{m/d_{bm}, 1}^{d_{bm} \times d_{bn}} & C_{m/d_{bm}, 2}^{d_{bm} \times d_{bn}} & \cdots & C_{m/d_{bm}, n/d_{bn}}^{d_{bm} \times d_{bn}}
\end{array}\right]
$$

$$
D=\left[\begin{array}{cccc}
D_{1,1}^{d_{bm} \times d_{bn}} & D_{1,2}^{d_{bm} \times d_{bn}} & \cdots & D_{1, n/d_{bn}}^{d_{bm} \times d_{bn}} \\
D_{2,1}^{d_{bm} \times d_{bn}} & D_{2,2}^{d_{bm} \times d_{bn}} & \cdots & D_{2, n/d_{bn}}^{d_{bm} \times d_{bn}} \\
\vdots & \vdots & \ddots & \vdots \\
D_{m/d_{bm}, 1}^{d_{bm} \times d_{bn}} & D_{m/d_{bm}, 2}^{d_{bm} \times d_{bn}} & \cdots & D_{m/d_{bm}, n/d_{bn}}^{d_{bm} \times d_{bn}}
\end{array}\right]
$$
Each block of $D$ is then computed by accumulating multiple small GEMMs:
$$
D_{i_m, i_n}^{d_{bm} \times d_{bn}}=\sum_{i_k=1}^{k / d_{bk}} A_{i_m, i_k}^{d_{bm} \times d_{bk}} B_{i_k, i_n}^{d_{bk} \times d_{bn}}
$$
In what follows, we focus on the matrix-multiplication part of the GEMM and set $C = 0$.
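Before the full CUDA program, a host-side sketch may make the blocking concrete. This is a minimal sketch under the assumptions above: column-major storage, square $d \times d$ tiles, $C = 0$, and all dimensions divisible by $d$; the function name `blocked_mm` is hypothetical and is not part of the program below.

#include <cstddef>

// Minimal host-side sketch of the blocked accumulation above (hypothetical
// helper, not the WMMA kernel): column-major storage, square d x d tiles,
// C = 0, and m, n, k all divisible by d. D must be zero-initialized.
void blocked_mm(float const* A, float const* B, float* D, size_t m, size_t n,
                size_t k, size_t d)
{
    for (size_t im{0}; im < m / d; ++im)
    {
        for (size_t in{0}; in < n / d; ++in)
        {
            // D_{im,in} += A_{im,ik} * B_{ik,in}, accumulated over all
            // k / d tiles along the shared dimension.
            for (size_t ik{0}; ik < k / d; ++ik)
            {
                for (size_t j{0}; j < d; ++j) // column within the tile
                {
                    for (size_t i{0}; i < d; ++i) // row within the tile
                    {
                        for (size_t l{0}; l < d; ++l)
                        {
                            D[(in * d + j) * m + (im * d + i)] +=
                                A[(ik * d + l) * m + (im * d + i)] *
                                B[(in * d + j) * k + (ik * d + l)];
                        }
                    }
                }
            }
        }
    }
}

The CUDA program below follows the same decomposition, except that each tile product is executed by Tensor Cores via the warp-level WMMA API.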
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <iomanip>
#include <iostream>
#include <random>
#include <utility>
#include <vector>
#include <cuda_runtime.h>
#include <mma.h>
#define CHECK_CUDA_ERROR(val) check((val), #val, __FILE__, __LINE__)
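// Check the return code of a CUDA runtime call; on failure, report the
// failing call, file, and line, then abort.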
template <typename T>
void check(T err, const char* const func, const char* const file,
int const line)
{
if (err != cudaSuccess)
{
std::cerr << "CUDA Runtime Error at: " << file << ":" << line
<< std::endl;
std::cerr << cudaGetErrorString(err) << " " << func << std::endl;
std::exit(EXIT_FAILURE);
}
}
#define CHECK_LAST_CUDA_ERROR() checkLast(__FILE__, __LINE__)
void checkLast(const char* const file, int const line)
{
cudaError_t const err{cudaGetLastError()};
if (err != cudaSuccess)
{
std::cerr << "CUDA Runtime Error at: " << file << ":" << line
<< std::endl;
std::cerr << cudaGetErrorString(err) << std::endl;
std::exit(EXIT_FAILURE);
}
}
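// Measure the mean latency in milliseconds of a stream-bound callable using
// CUDA events, after a number of warmup runs.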
template <class T>
float measure_performance(std::function<T(cudaStream_t)> bound_function,
cudaStream_t stream, int num_repeats = 100,
int num_warmups = 100)
{
cudaEvent_t start, stop;
float time;
CHECK_CUDA_ERROR(cudaEventCreate(&start));
CHECK_CUDA_ERROR(cudaEventCreate(&stop));
for (int i{0}; i < num_warmups; ++i)
{
bound_function(stream);
}
CHECK_CUDA_ERROR(cudaStreamSynchronize(stream));
CHECK_CUDA_ERROR(cudaEventRecord(start, stream));
for (int i{0}; i < num_repeats; ++i)
{
bound_function(stream);
}
CHECK_CUDA_ERROR(cudaEventRecord(stop, stream));
CHECK_CUDA_ERROR(cudaEventSynchronize(stop));
CHECK_LAST_CUDA_ERROR();
CHECK_CUDA_ERROR(cudaEventElapsedTime(&time, start, stop));
CHECK_CUDA_ERROR(cudaEventDestroy(start));
CHECK_CUDA_ERROR(cudaEventDestroy(stop));
float const latency{time / num_repeats};
return latency;
}
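// WMMA GEMM kernel: each warp computes one WMMA_M x WMMA_N tile of
// D = alpha * A * B + beta * C. All matrices are stored column-major;
// transposed operands are handled via the fragment layout template arguments.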
template <typename T1, typename T2, int WMMA_M, int WMMA_N, int WMMA_K,
typename WMMA_FRAG_LAYOUT_A, typename WMMA_FRAG_LAYOUT_B>
__global__ void wmma_gemm_a_col_major_b_col_major(
T1 const* A, T1 const* B, T2* C, uint32_t m, uint32_t n, uint32_t k,
uint32_t lda, uint32_t ldb, uint32_t ldc, bool is_A_transpose,
bool is_B_transpose, float alpha, float beta)
{
uint32_t const warpM{(blockIdx.x * blockDim.x + threadIdx.x) / warpSize};
uint32_t const warpN{blockIdx.y * blockDim.y + threadIdx.y};
nvcuda::wmma::fragment<nvcuda::wmma::matrix_a, WMMA_M, WMMA_N, WMMA_K, T1,
WMMA_FRAG_LAYOUT_A>
a_frag{};
nvcuda::wmma::fragment<nvcuda::wmma::matrix_b, WMMA_M, WMMA_N, WMMA_K, T1,
WMMA_FRAG_LAYOUT_B>
b_frag{};
nvcuda::wmma::fragment<nvcuda::wmma::accumulator, WMMA_M, WMMA_N, WMMA_K,
T2>
acc_frag{};
nvcuda::wmma::fragment<nvcuda::wmma::accumulator, WMMA_M, WMMA_N, WMMA_K,
T2>
c_frag{};
nvcuda::wmma::fill_fragment(acc_frag, static_cast<T2>(0));
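    // Walk along the shared dimension in WMMA_K-sized steps, accumulating
    // the tile products into acc_frag.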
for (uint32_t ki{0}; ki < k; ki += WMMA_K)
{
uint32_t const matrix_mma_a_row_idx{is_A_transpose ? ki
: warpM * WMMA_M};
uint32_t const matrix_mma_a_col_idx{is_A_transpose ? warpM * WMMA_M
: ki};
uint32_t const matrix_mma_b_row_idx{is_B_transpose ? warpN * WMMA_N
: ki};
uint32_t const matrix_mma_b_col_idx{is_B_transpose ? ki
: warpN * WMMA_N};
if (matrix_mma_a_row_idx < (is_A_transpose ? k : m) &&
matrix_mma_a_col_idx < (is_A_transpose ? m : k) &&
matrix_mma_b_row_idx < (is_B_transpose ? n : k) &&
matrix_mma_b_col_idx < (is_B_transpose ? k : n))
{
T1 const* matrix_mma_a_mptr{A + matrix_mma_a_row_idx +
matrix_mma_a_col_idx * lda};
T1 const* matrix_mma_b_mptr{B + matrix_mma_b_row_idx +
matrix_mma_b_col_idx * ldb};
nvcuda::wmma::load_matrix_sync(a_frag, matrix_mma_a_mptr, lda);
nvcuda::wmma::load_matrix_sync(b_frag, matrix_mma_b_mptr, ldb);
nvcuda::wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
}
}
uint32_t const matrix_mma_c_row_idx{warpM * WMMA_M};
uint32_t const matrix_mma_c_col_idx{warpN * WMMA_N};
if (matrix_mma_c_row_idx < m && matrix_mma_c_col_idx < n)
{
T2* matrix_mma_c_mptr{C + matrix_mma_c_row_idx +
matrix_mma_c_col_idx * ldc};
nvcuda::wmma::load_matrix_sync(c_frag, matrix_mma_c_mptr, ldc,
nvcuda::wmma::mem_col_major);
for (uint32_t i = 0; i < c_frag.num_elements; i++)
{
c_frag.x[i] = alpha * acc_frag.x[i] + beta * c_frag.x[i];
}
nvcuda::wmma::store_matrix_sync(matrix_mma_c_mptr, c_frag, ldc,
nvcuda::wmma::mem_col_major);
}
}
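// Launch helper: each block holds a 4 x 4 grid of warps, each warp owning one
// 16 x 16 output tile; the kernel instantiation is selected to match the
// requested operand layouts.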
template <typename T1, typename T2>
void launch_wmma_mm(T1 const* A, T1 const* B, T2* C, uint32_t m, uint32_t n,
uint32_t k, bool is_A_transpose, bool is_B_transpose,
cudaStream_t stream)
{
uint32_t const lda{is_A_transpose ? k : m};
uint32_t const ldb{is_B_transpose ? n : k};
uint32_t const ldc{m};
float const alpha{1.0f};
float const beta{0.0f};
constexpr int WMMA_M{16};
constexpr int WMMA_N{16};
constexpr int WMMA_K{16};
constexpr int WARP_SIZE{32};
dim3 gridDim;
dim3 blockDim;
int const num_warps_x = 4;
int const num_warps_y = 4;
blockDim.x = num_warps_x * WARP_SIZE;
blockDim.y = num_warps_y;
gridDim.x = (m + (WMMA_M * num_warps_x - 1)) / (WMMA_M * num_warps_x);
gridDim.y = (n + WMMA_N * num_warps_y - 1) / (WMMA_N * num_warps_y);
if ((!is_A_transpose) && (!is_B_transpose))
{
wmma_gemm_a_col_major_b_col_major<T1, T2, WMMA_M, WMMA_N, WMMA_K,
nvcuda::wmma::col_major,
nvcuda::wmma::col_major>
<<<gridDim, blockDim, 0, stream>>>(A, B, C, m, n, k, lda, ldb, ldc,
is_A_transpose, is_B_transpose,
alpha, beta);
}
else if ((is_A_transpose) && (!is_B_transpose))
{
wmma_gemm_a_col_major_b_col_major<T1, T2, WMMA_M, WMMA_N, WMMA_K,
nvcuda::wmma::row_major,
nvcuda::wmma::col_major>
<<<gridDim, blockDim, 0, stream>>>(A, B, C, m, n, k, lda, ldb, ldc,
is_A_transpose, is_B_transpose,
alpha, beta);
}
else if ((!is_A_transpose) && (is_B_transpose))
{
wmma_gemm_a_col_major_b_col_major<T1, T2, WMMA_M, WMMA_N, WMMA_K,
nvcuda::wmma::col_major,
nvcuda::wmma::row_major>
<<<gridDim, blockDim, 0, stream>>>(A, B, C, m, n, k, lda, ldb, ldc,
is_A_transpose, is_B_transpose,
alpha, beta);
}
else
{
wmma_gemm_a_col_major_b_col_major<T1, T2, WMMA_M, WMMA_N, WMMA_K,
nvcuda::wmma::row_major,
nvcuda::wmma::row_major>
<<<gridDim, blockDim, 0, stream>>>(A, B, C, m, n, k, lda, ldb, ldc,
is_A_transpose, is_B_transpose,
alpha, beta);
}
CHECK_LAST_CUDA_ERROR();
}
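// CPU reference GEMM on column-major matrices, used to validate the GPU
// results.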
template <typename T1, typename T2>
void mm_a_col_major_b_col_major(T1 const* A, T1 const* B, T2* C, uint32_t m,
uint32_t n, uint32_t k, uint32_t lda,
uint32_t ldb, uint32_t ldc, bool is_A_transpose,
bool is_B_transpose)
{
for (uint32_t ni{0}; ni < n; ++ni)
{
for (uint32_t mi{0}; mi < m; ++mi)
{
T2 accum{0};
if ((!is_A_transpose) && (!is_B_transpose))
{
for (uint32_t ki{0}; ki < k; ++ki)
{
accum += A[ki * lda + mi] * B[ni * ldb + ki];
}
}
else if ((is_A_transpose) && (!is_B_transpose))
{
for (uint32_t ki{0}; ki < k; ++ki)
{
accum += A[mi * lda + ki] * B[ni * ldb + ki];
}
}
else if ((!is_A_transpose) && (is_B_transpose))
{
for (uint32_t ki{0}; ki < k; ++ki)
{
accum += A[ki * lda + mi] * B[ki * ldb + ni];
}
}
else
{
for (uint32_t ki{0}; ki < k; ++ki)
{
accum += A[mi * lda + ki] * B[ki * ldb + ni];
}
}
C[ni * ldc + mi] = accum;
}
}
}
template <typename T1, typename T2>
void launch_mm(T1 const* A, T1 const* B, T2* C, uint32_t m, uint32_t n,
uint32_t k, bool is_A_transpose, bool is_B_transpose)
{
uint32_t const lda{is_A_transpose ? k : m};
uint32_t const ldb{is_B_transpose ? n : k};
uint32_t const ldc{m};
mm_a_col_major_b_col_major(A, B, C, m, n, k, lda, ldb, ldc, is_A_transpose,
is_B_transpose);
}
void fill_random_float_values(float* arr, size_t n,
std::default_random_engine& e)
{
std::uniform_real_distribution<float> uniform_dist(-256, 256);
for (size_t i{0}; i < n; ++i)
{
arr[i] = uniform_dist(e);
}
}
void fill_random_int8_values(int8_t* arr, size_t n,
                             std::default_random_engine& e)
{
    // std::uniform_int_distribution does not support int8_t, so draw a wider
    // integer and narrow it.
    std::uniform_int_distribution<int32_t> uniform_dist(-128, 127);
    for (size_t i{0}; i < n; ++i)
    {
        arr[i] = static_cast<int8_t>(uniform_dist(e));
    }
}
void fill_random_int32_values(int32_t* arr, size_t n,
std::default_random_engine& e)
{
std::uniform_int_distribution<int32_t> uniform_dist(-128, 127);
for (size_t i{0}; i < n; ++i)
{
arr[i] = uniform_dist(e);
}
}
void float2half(__half* half_arr, float const* float_arr, size_t n)
{
for (size_t i{0}; i < n; ++i)
{
half_arr[i] = __float2half(float_arr[i]);
}
}
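// Mean of |x - y| / |x + y| over all elements; used to compare the FP16
// Tensor Core results against the FP32 CPU reference.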
template <typename T>
float get_avg_abs_diff_ratio(T const* arr_1, T const* arr_2, size_t n)
{
float sum_abs_diff_ratio{0};
for (size_t i{0}; i < n; ++i)
{
sum_abs_diff_ratio += std::abs(static_cast<float>(arr_1[i]) -
static_cast<float>(arr_2[i])) /
std::abs(static_cast<float>(arr_1[i]) +
static_cast<float>(arr_2[i]));
}
return sum_abs_diff_ratio / n;
}
template <typename T>
bool array_equal(T const* arr_1, T const* arr_2, size_t n)
{
for (size_t i{0}; i < n; ++i)
{
if (arr_1[i] != arr_2[i])
{
return false;
}
}
return true;
}
void print_test_header(bool is_A_transpose, bool is_B_transpose)
{
if ((!is_A_transpose) && (!is_B_transpose))
{
std::cout << "C = A * B" << std::endl;
}
else if ((is_A_transpose) && (!is_B_transpose))
{
std::cout << "C = A^T * B" << std::endl;
}
else if ((!is_A_transpose) && (is_B_transpose))
{
std::cout << "C = A * B^T" << std::endl;
}
else
{
std::cout << "C = A^T * B^T" << std::endl;
}
}
int main()
{
constexpr int num_repeats{10};
constexpr int num_warmups{10};
uint32_t const matrix_size_m{1024};
uint32_t const matrix_size_n{1024};
uint32_t const matrix_size_k{1024};
std::cout << "Matrix Sizes" << std::endl;
std::cout << "M: " << matrix_size_m << std::endl;
std::cout << "N: " << matrix_size_n << std::endl;
std::cout << "K: " << matrix_size_k << std::endl;
std::default_random_engine random_engine(0);
cudaStream_t stream;
CHECK_CUDA_ERROR(cudaStreamCreate(&stream));
std::cout << "FP16 HMMA" << std::endl;
std::vector<float> matrix_a_float(matrix_size_m * matrix_size_k);
std::vector<float> matrix_b_float(matrix_size_k * matrix_size_n);
std::vector<__half> matrix_a_half(matrix_size_m * matrix_size_k);
std::vector<__half> matrix_b_half(matrix_size_k * matrix_size_n);
std::vector<float> matrix_c_float(matrix_size_m * matrix_size_n);
std::vector<float> matrix_c_float_reference(matrix_size_m * matrix_size_n);
float* h_matrix_a_float{matrix_a_float.data()};
float* h_matrix_b_float{matrix_b_float.data()};
__half* h_matrix_a_half{matrix_a_half.data()};
__half* h_matrix_b_half{matrix_b_half.data()};
float* h_matrix_c_float{matrix_c_float.data()};
float* h_matrix_c_float_reference{matrix_c_float_reference.data()};
fill_random_float_values(h_matrix_a_float, matrix_a_float.size(),
random_engine);
fill_random_float_values(h_matrix_b_float, matrix_b_float.size(),
random_engine);
fill_random_float_values(h_matrix_c_float, matrix_c_float.size(),
random_engine);
fill_random_float_values(h_matrix_c_float_reference,
matrix_c_float_reference.size(), random_engine);
float2half(h_matrix_a_half, h_matrix_a_float, matrix_a_float.size());
float2half(h_matrix_b_half, h_matrix_b_float, matrix_b_float.size());
half *d_matrix_a_half, *d_matrix_b_half;
float* d_matrix_c_float;
CHECK_CUDA_ERROR(cudaMalloc(&d_matrix_a_half,
matrix_size_m * matrix_size_k * sizeof(half)));
CHECK_CUDA_ERROR(cudaMalloc(&d_matrix_b_half,
matrix_size_k * matrix_size_n * sizeof(half)));
CHECK_CUDA_ERROR(cudaMalloc(&d_matrix_c_float,
matrix_size_m * matrix_size_n * sizeof(float)));
CHECK_CUDA_ERROR(cudaMemcpy(d_matrix_a_half, h_matrix_a_half,
matrix_a_float.size() * sizeof(__half),
cudaMemcpyHostToDevice));
CHECK_CUDA_ERROR(cudaMemcpy(d_matrix_b_half, h_matrix_b_half,
matrix_b_float.size() * sizeof(__half),
cudaMemcpyHostToDevice));
for (bool is_A_transpose : {true, false})
{
for (bool is_B_transpose : {true, false})
{
print_test_header(is_A_transpose, is_B_transpose);
launch_mm(h_matrix_a_float, h_matrix_b_float,
h_matrix_c_float_reference, matrix_size_m, matrix_size_n,
matrix_size_k, is_A_transpose, is_B_transpose);
launch_wmma_mm(d_matrix_a_half, d_matrix_b_half, d_matrix_c_float,
matrix_size_m, matrix_size_n, matrix_size_k,
is_A_transpose, is_B_transpose, stream);
CHECK_CUDA_ERROR(cudaStreamSynchronize(stream));
CHECK_CUDA_ERROR(cudaMemcpy(h_matrix_c_float, d_matrix_c_float,
matrix_c_float.size() * sizeof(float),
cudaMemcpyDeviceToHost));
float const avg_abs_diff_ratio{get_avg_abs_diff_ratio(
h_matrix_c_float, h_matrix_c_float_reference,
matrix_c_float.size())};
if (avg_abs_diff_ratio > 0.01)
{
std::cout << "Got high average absolute diff ratio: "
<< avg_abs_diff_ratio << std::endl;
}
std::function<void(cudaStream_t)> const function_hmma{std::bind(
launch_wmma_mm<__half, float>, d_matrix_a_half, d_matrix_b_half,
d_matrix_c_float, matrix_size_m, matrix_size_n, matrix_size_k,
is_A_transpose, is_B_transpose, std::placeholders::_1)};
float const latency_hmma{measure_performance(
function_hmma, stream, num_repeats, num_warmups)};
std::cout << std::fixed << std::setprecision(3)
<< "HMMA Latency: " << latency_hmma << " ms" << std::endl;
}
}
CHECK_CUDA_ERROR(cudaFree(d_matrix_a_half));
CHECK_CUDA_ERROR(cudaFree(d_matrix_b_half));
CHECK_CUDA_ERROR(cudaFree(d_matrix_c_float));
std::cout << "INT8 IMMA" << std::endl;
std::vector<int8_t> matrix_a_int8(matrix_size_m * matrix_size_k);
std::vector<int8_t> matrix_b_int8(matrix_size_k * matrix_size_n);
std::vector<int32_t> matrix_c_int32(matrix_size_m * matrix_size_n);
std::vector<int32_t> matrix_c_int32_reference(matrix_size_m *
matrix_size_n);
int8_t* h_matrix_a_int8{matrix_a_int8.data()};
int8_t* h_matrix_b_int8{matrix_b_int8.data()};
int32_t* h_matrix_c_int32{matrix_c_int32.data()};
int32_t* h_matrix_c_int32_reference{matrix_c_int32_reference.data()};
fill_random_int8_values(h_matrix_a_int8, matrix_a_int8.size(),
random_engine);
fill_random_int8_values(h_matrix_b_int8, matrix_b_int8.size(),
random_engine);
fill_random_int32_values(h_matrix_c_int32, matrix_c_int32.size(),
random_engine);
fill_random_int32_values(h_matrix_c_int32_reference,
matrix_c_int32_reference.size(), random_engine);
int8_t *d_matrix_a_int8, *d_matrix_b_int8;
int32_t* d_matrix_c_int32;
CHECK_CUDA_ERROR(cudaMalloc(
&d_matrix_a_int8, matrix_size_m * matrix_size_k * sizeof(int8_t)));
CHECK_CUDA_ERROR(cudaMalloc(
&d_matrix_b_int8, matrix_size_k * matrix_size_n * sizeof(int8_t)));
CHECK_CUDA_ERROR(cudaMalloc(
&d_matrix_c_int32, matrix_size_m * matrix_size_n * sizeof(int32_t)));
CHECK_CUDA_ERROR(cudaMemcpy(d_matrix_a_int8, h_matrix_a_int8,
matrix_a_int8.size() * sizeof(int8_t),
cudaMemcpyHostToDevice));
CHECK_CUDA_ERROR(cudaMemcpy(d_matrix_b_int8, h_matrix_b_int8,
matrix_b_int8.size() * sizeof(int8_t),
cudaMemcpyHostToDevice));
for (bool is_A_transpose : {true, false})
{
for (bool is_B_transpose : {true, false})
{
print_test_header(is_A_transpose, is_B_transpose);
launch_mm(h_matrix_a_int8, h_matrix_b_int8,
h_matrix_c_int32_reference, matrix_size_m, matrix_size_n,
matrix_size_k, is_A_transpose, is_B_transpose);
launch_wmma_mm(d_matrix_a_int8, d_matrix_b_int8, d_matrix_c_int32,
matrix_size_m, matrix_size_n, matrix_size_k,
is_A_transpose, is_B_transpose, stream);
CHECK_CUDA_ERROR(cudaStreamSynchronize(stream));
CHECK_CUDA_ERROR(cudaMemcpy(h_matrix_c_int32, d_matrix_c_int32,
matrix_c_int32.size() * sizeof(int32_t),
cudaMemcpyDeviceToHost));
assert(array_equal(h_matrix_c_int32, h_matrix_c_int32_reference,
matrix_c_int32.size()));
std::function<void(cudaStream_t)> const function_imma{
std::bind(launch_wmma_mm<int8_t, int32_t>, d_matrix_a_int8,
d_matrix_b_int8, d_matrix_c_int32, matrix_size_m,
matrix_size_n, matrix_size_k, is_A_transpose,
is_B_transpose, std::placeholders::_1)};
float const latency_imma{measure_performance(
function_imma, stream, num_repeats, num_warmups)};
std::cout << std::fixed << std::setprecision(3)
<< "IMMA Latency: " << latency_imma << " ms" << std::endl;
}
}
CHECK_CUDA_ERROR(cudaFree(d_matrix_a_int8));
CHECK_CUDA_ERROR(cudaFree(d_matrix_b_int8));
CHECK_CUDA_ERROR(cudaFree(d_matrix_c_int32));
CHECK_CUDA_ERROR(cudaStreamDestroy(stream));
}
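The program can be compiled and run as follows (a minimal sketch: the file name wmma_gemm.cu is arbitrary, and sm_70 matches the V100 mentioned above; WMMA requires compute capability 7.0 or higher):

nvcc -arch=sm_70 -o wmma_gemm wmma_gemm.cu
./wmma_gemm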