1. NVIDIA GPU Architecture
1.1 GPU Architecture
Figure 1 shows a full GA100 GPU with 128 SMs. The A100 is based on GA100 and has 108 SMs.
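SM stands for streaming multiprocessor. Each SM is made up of four processing blocks, as shown in Figure 2.
Each SM has 64 INT32, 64 FP32, and 32 FP64 CUDA cores, plus 4 Tensor Cores. The L1 cache is shared within the SM.
CUDA cores handle general-purpose computation, while Tensor Cores are specialized for deep learning and handle matrix operations and mixed-precision arithmetic.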
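As a quick worked example, the per-SM counts above can be multiplied by the SM counts from Figure 1 to get chip-wide totals. This is only arithmetic on the numbers quoted in the text; the names `PER_SM` and `totals` are made up for illustration.

```python
# Per-SM unit counts quoted in the text for the GA100 SM.
PER_SM = {"INT32": 64, "FP32": 64, "FP64": 32, "TensorCore": 4}

def totals(num_sms):
    """Multiply each per-SM count by the number of SMs on the chip."""
    return {unit: count * num_sms for unit, count in PER_SM.items()}

full_ga100 = totals(128)  # full GA100 die: 128 SMs
a100 = totals(108)        # A100 product: 108 SMs enabled

for name, chip in (("GA100", full_ga100), ("A100", a100)):
    print(name, chip)
# A100 -> 6912 FP32 cores, 3456 FP64 cores, 432 Tensor Cores
```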

Figure 1 GA100 Full GPU with 128 SMs. The A100 Tensor Core GPU has 108 SMs.

Figure 2. The GA100 streaming multiprocessor (SM)
Figure 3 shows the features of different generations of NVIDIA GPUs.

Figure 3. GPU architectures by generation
1.2 Tensor Core
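Tensor Cores, introduced with the Volta architecture in 2017 and refined in each generation since, are specialized processing units for the large volume of multiply-add operations in AI models.
In the A100, the third-generation Tensor Core supports a wide range of data types, including FP16, BF16, TF32, FP64, INT8, and INT4. Its FP64 support provides strong double-precision throughput for HPC workloads.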
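To make the floating-point formats in that list more concrete, the sketch below derives the largest finite value and the relative spacing (machine epsilon) of each format from its exponent and mantissa bit widths. The bit widths come from the public definitions of these formats, not from the text above, and the helper names are hypothetical.

```python
# (exponent bits, mantissa bits) per format; each also has one sign bit.
# Assumption: widths taken from the formats' public definitions.
FORMATS = {
    "FP16": (5, 10),
    "BF16": (8, 7),
    "TF32": (8, 10),
    "FP32": (8, 23),
    "FP64": (11, 52),
}

def max_finite(exp_bits, man_bits):
    """Largest finite value: (2 - 2^-m) * 2^bias, with bias = 2^(e-1) - 1."""
    bias = 2 ** (exp_bits - 1) - 1
    return (2 - 2.0 ** -man_bits) * 2.0 ** bias

def epsilon(man_bits):
    """Gap between 1.0 and the next representable value."""
    return 2.0 ** -man_bits

for name, (e, m) in FORMATS.items():
    print(f"{name}: max={max_finite(e, m):.4g}, eps={epsilon(m):.3g}")
```

The output illustrates the trade-off: BF16 and TF32 keep the FP32 exponent range, while FP16 has a much smaller range (maximum 65504) and TF32 has the same mantissa width as FP16.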
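Mixed precision is a technique applied at the level of the underlying hardware operators: half precision (FP16) is used for inputs and outputs, and full precision (FP32) is used for intermediate results so that little accuracy is lost. The "underlying hardware" here is the Tensor Core itself, so having Tensor Cores on the GPU is a prerequisite for accelerating training with mixed precision.
In short, Tensor Cores are the hardware foundation of mixed-precision training.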
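A small CPU-side experiment shows why the intermediate results are kept in FP32. The sketch below sums 10,000 FP16 values of about 0.01, once with an FP16 accumulator and once with an FP32 accumulator. It only imitates the rounding behaviour with Python's `struct` module (format codes `"e"` and `"f"`); it is not how a Tensor Core is programmed, and the helper names are invented for this example.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def accumulate(values, round_fn):
    """Running sum in which every partial sum is rounded by round_fn."""
    total = 0.0
    for v in values:
        total = round_fn(total + v)
    return total

values = [to_fp16(0.01)] * 10000        # FP16 inputs
sum_fp16 = accumulate(values, to_fp16)  # FP16 accumulator
sum_fp32 = accumulate(values, to_fp32)  # FP32 accumulator
exact = sum(values)                     # FP64 reference

print("FP16 accumulator:", sum_fp16)
print("FP32 accumulator:", sum_fp32)
print("FP64 reference:  ", exact)
```

With the FP16 accumulator the running sum stops growing once it is large enough that adding 0.01 rounds back to the same value, so the result falls far short of the reference. The FP32 accumulator stays close to it.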
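The basic operation of a Tensor Core is D = A*B + C, where A, B, C, and D are all matrices. In the original Volta design, each Tensor Core completes one 4*4 MMA (matrix multiply-accumulate) operation per clock cycle, that is, one matrix multiplication and one matrix addition. (D = A*B + C is the most important operator in deep learning: output = weight * input + bias.)
Figure 4. Multiplication in FP16, accumulation in FP32 (so that small values are not swallowed by large ones)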
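The D = A*B + C operation with FP16 inputs and FP32 accumulation can be sketched on the CPU as follows. This is an illustrative emulation only: `mma_4x4` is a hypothetical name, rounding is imitated with `struct`, and real Tensor Cores are driven through CUDA libraries or intrinsics rather than code like this.

```python
import struct

def to_fp16(x):
    """Round to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_fp32(x):
    """Round to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def mma_4x4(A, B, C):
    """D = A*B + C on 4x4 matrices: A and B rounded to FP16, sums kept in FP32."""
    n = 4
    A16 = [[to_fp16(v) for v in row] for row in A]
    B16 = [[to_fp16(v) for v in row] for row in B]
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = to_fp32(C[i][j])            # accumulator starts from C
            for k in range(n):
                # FP16 x FP16 product, added into the FP32 accumulator
                acc = to_fp32(acc + A16[i][k] * B16[k][j])
            D[i][j] = acc
    return D

# Usage: weight = identity, input = 0..15, bias = 0.5 everywhere.
weight = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
inputs = [[float(4 * i + j) for j in range(4)] for i in range(4)]
bias = [[0.5] * 4 for _ in range(4)]
D = mma_4x4(weight, inputs, bias)
print(D)
```

With an identity weight matrix the result is simply the input plus the bias, which makes it easy to check that the loop order and accumulation are right.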
2. The MAGMA Library
2.1 Introduction to MAGMA
MAGMA is intended for CUDA enabled NVIDIA GPUs and HIP enabled AMD GPUs.
It supports NVIDIA's Kepler, Maxwell, Pascal, Volta, Turing, Ampere, and Hopper
GPUs, and AMD's HIP GPUs.
Included are routines for the following algorithms:
* LU, QR, and Cholesky factorizations
* Hessenberg, bidiagonal, and tridiagonal reductions
* Linear solvers based on LU, QR, and Cholesky
* Eigenvalue and singular value (SVD) problem solvers
* Generalized Hermitian-definite eigenproblem solver
* Mixed-precision iterative refinement solvers based on LU, QR, and Cholesky
* MAGMA BLAS including gemm, gemv, symv, and trsm
* Batched MAGMA BLAS including gemm, gemv, herk, and trsm
* Batched MAGMA LAPACK including LU, inverse (getri), QR, and Cholesky factorizations
* MAGMA Sparse including CG, GMRES, BiCGSTAB, LOBPCG, iterative refinement,
preconditioners, sparse kernels (SpMV, SpMM), and support for CSR, ELL, and
SELL-P data formats
MAGMA's dense linear algebra library includes mixed-precision algorithms (specifically iterative refinement, released in MAGMA 2.5).
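The idea behind mixed-precision iterative refinement can be sketched in plain Python: factorize and solve in low precision, then compute the residual in high precision and correct the solution until it is accurate. The code below is a textbook-style illustration under stated assumptions, not MAGMA code and not MAGMA's API. FP32 is imitated by rounding with `struct`, FP64 is the native Python float, and all function names are hypothetical.

```python
import struct

def f32(x):
    """Round to IEEE single precision (stands in for the low-precision solver)."""
    return struct.unpack("f", struct.pack("f", x))[0]

def lu_factor_f32(A):
    """In-place LU with partial pivoting; every stored value is rounded to FP32."""
    n = len(A)
    LU = [[f32(v) for v in row] for row in A]
    perm = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(LU[r][k]))  # pivot row
        LU[k], LU[p] = LU[p], LU[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            LU[i][k] = f32(LU[i][k] / LU[k][k])            # multiplier (L part)
            for j in range(k + 1, n):
                LU[i][j] = f32(LU[i][j] - LU[i][k] * LU[k][j])
    return LU, perm

def lu_solve_f32(LU, perm, b):
    """Solve with the FP32 factors: forward then back substitution."""
    n = len(b)
    y = [f32(b[perm[i]]) for i in range(n)]
    for i in range(n):                 # L has a unit diagonal
        for j in range(i):
            y[i] = f32(y[i] - LU[i][j] * y[j])
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            y[i] = f32(y[i] - LU[i][j] * y[j])
        y[i] = f32(y[i] / LU[i][i])
    return y

def residual(A, x, b):
    """r = b - A x, computed in FP64."""
    return [bi - sum(a * xj for a, xj in zip(row, x)) for row, bi in zip(A, b)]

def solve_iterative_refinement(A, b, tol=1e-12, max_iter=20):
    LU, perm = lu_factor_f32(A)        # expensive step, low precision
    x = lu_solve_f32(LU, perm, b)      # initial low-precision solution
    for _ in range(max_iter):
        r = residual(A, x, b)          # cheap step, high precision
        if max(abs(v) for v in r) <= tol:
            break
        d = lu_solve_f32(LU, perm, r)  # correction reuses the same factors
        x = [xi + di for xi, di in zip(x, d)]
    return x

A = [[4.0, 1.0, 2.0], [1.0, 3.0, 0.7], [2.0, 0.7, 5.0]]
x_true = [1.0, -2.0, 3.0]
b = [sum(a * x for a, x in zip(row, x_true)) for row in A]
x = solve_iterative_refinement(A, b)
print(x)
```

The design point is that the O(n^3) factorization runs once in the fast low precision, while each refinement step only needs O(n^2) work, so the final answer can approach FP64 accuracy at close to low-precision cost for well-conditioned systems.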
Two of the MAGMA interfaces for this are magma_dsgesv_iteref_gpu and magma_dhgesv_iteref_gpu:
1- The FP32 to FP64 API magma_dsgesv_iteref_gpu, which is similar to the L




