Mixed Precision and Heterogeneous Computing: Miscellaneous Notes

Original article

Last modified 2024-08-07 17:01:56

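Copyright notice: This is an original article by the author, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.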

Tags:

#heterogeneous computing #mixed precision #artificial computing #HPC

First published 2024-08-04 10:34:25

1. NVIDIA GPU Architecture

1.1 GPU Architecture

Figure 1 shows a full GA100 GPU with 128 SMs. The A100 is based on GA100 and has 108 SMs.

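SM stands for streaming multiprocessor. Each SM is made up of four processing blocks, as shown in Figure 2.

Each SM has 64 INT32, 64 FP32, and 32 FP64 CUDA cores, plus 4 Tensor Cores. The L1 cache is shared within the SM.

CUDA Cores handle general-purpose computation. Tensor Cores are specialized units optimized for deep learning; they handle matrix operations and mixed-precision arithmetic.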
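The per-SM counts above scale directly to whole-GPU totals. The short sketch below does that arithmetic for the full GA100 die (128 SMs) and the A100 (108 SMs). The dictionary and function names are illustrative, not part of any NVIDIA tool.

```python
# Per-SM unit counts quoted in the text above.
UNITS_PER_SM = {"INT32": 64, "FP32": 64, "FP64": 32, "TensorCore": 4}


def total_units(num_sms: int) -> dict:
    """Scale the per-SM counts to a whole GPU with `num_sms` SMs."""
    return {name: count * num_sms for name, count in UNITS_PER_SM.items()}


full_ga100 = total_units(128)  # full GA100 die
a100 = total_units(108)        # A100 product

print("GA100:", full_ga100)
print("A100: ", a100)
```

With 108 SMs this gives 6912 FP32 CUDA cores and 432 Tensor Cores for the A100.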

Figure 1 GA100 Full GPU with 128 SMs. The A100 Tensor Core GPU has 108 SMs.

Figure 2. The GA100 streaming multiprocessor (SM)

Figure 3 compares the features of different generations of NVIDIA GPUs.

Figure 3. GPU architectures by generation

1.2 Tensor Core

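Tensor Cores are specialized processing units, introduced with the Volta architecture in 2017 and refined in every generation since, that target the large volume of multiply-add operations in AI models.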

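In the A100, the third-generation Tensor Cores support a wide range of data types, including FP16, BF16, TF32, FP64, INT8, and INT4. FP64 support gives them strong double-precision throughput for HPC workloads.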
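To make the trade-offs between the floating-point formats listed above concrete, here is a small sketch that derives machine epsilon and the largest finite value from each format's bit layout. The (exponent bits, mantissa bits) pairs are the standard layouts for these formats and are not taken from the original post; the integer types INT8 and INT4 are left out.

```python
# (exponent bits, explicit mantissa bits) for each floating-point format.
FORMATS = {
    "FP16": (5, 10),
    "BF16": (8, 7),
    "TF32": (8, 10),
    "FP32": (8, 23),
    "FP64": (11, 52),
}


def describe(exp_bits: int, man_bits: int):
    """Return (machine epsilon, largest finite value) for an IEEE-style format."""
    bias = 2 ** (exp_bits - 1) - 1
    eps = 2.0 ** (-man_bits)  # gap between 1.0 and the next representable value
    max_val = (2.0 - 2.0 ** (-man_bits)) * 2.0 ** bias
    return eps, max_val


for name, (e, m) in FORMATS.items():
    eps, max_val = describe(e, m)
    print(f"{name}: eps = {eps:.3e}, max = {max_val:.3e}")
```

The output shows why BF16 and TF32 are attractive for training: they keep the FP32 exponent range while giving up mantissa bits, whereas FP16 has a much smaller range (largest value 65504).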

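Mixed precision, at the level of the hardware operator, is a technique that takes half-precision (FP16) values as inputs and outputs while computing intermediate results in full precision (FP32), so that not too much accuracy is lost. The hardware in question is the Tensor Core, so having Tensor Cores on the GPU is a prerequisite for accelerating training with mixed precision.

In other words, Tensor Cores are the underlying hardware support for mixed-precision training.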

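The basic operation of a Tensor Core is D = A*B + C, where A, B, C, and D are all matrices. In the first-generation (Volta) design, each Tensor Core completes one 4*4 MMA (matrix multiply-accumulate) operation per clock cycle, that is, one matrix multiplication and one matrix addition. (D = A*B + C is the most important operator in deep learning: output = weight * input + bias.)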
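The D = A*B + C operation with FP16 inputs and FP32 accumulation can be emulated on the CPU with NumPy. This is only a numerical sketch under that assumption; it does not reproduce how the hardware schedules the work, and the function name `mma_fp16_fp32` is made up for illustration.

```python
import numpy as np


def mma_fp16_fp32(a, b, c):
    """Emulate D = A*B + C with FP16 inputs and FP32 accumulation."""
    a16 = np.asarray(a, dtype=np.float16)  # round inputs to FP16
    b16 = np.asarray(b, dtype=np.float16)
    # Widen before multiplying: the product of two FP16 values is exactly
    # representable in FP32, so rounding only happens in the FP32 additions.
    prod = a16.astype(np.float32) @ b16.astype(np.float32)
    return prod + np.asarray(c, dtype=np.float32)


rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))
C = rng.standard_normal((4, 4))

D = mma_fp16_fp32(A, B, C)
D_ref = A @ B + C  # FP64 reference

print(D.dtype)
print(np.max(np.abs(D - D_ref)))  # error mostly comes from rounding A and B to FP16
```

The remaining error against the FP64 reference is dominated by rounding the inputs to FP16, not by the accumulation, which is the point of accumulating in FP32.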

Figure 4. Multiplication in FP16, accumulation in FP32 (so that small addends are not absorbed by a large running sum)
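The absorption problem that FP32 accumulation avoids is easy to reproduce. The sketch below adds 1.0 a total of 4096 times, once with an FP16 running sum and once with an FP32 running sum. The helper `accumulate` is illustrative only.

```python
import numpy as np


def accumulate(values, dtype):
    """Sum `values` one at a time, rounding the running total to `dtype` each step."""
    total = dtype(0)
    for v in values:
        total = dtype(total + dtype(v))
    return total


ones = [1.0] * 4096

sum_fp16 = accumulate(ones, np.float16)
sum_fp32 = accumulate(ones, np.float32)

print(sum_fp16)  # stalls at 2048: 2049 is not representable in FP16
print(sum_fp32)  # reaches 4096
```

Once the FP16 total reaches 2048, the spacing between adjacent FP16 values is 2, so adding 1 rounds back to 2048 and the sum stops growing. The FP32 accumulator has enough mantissa bits to keep counting.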

2. The MAGMA Library

2.1 Introduction to MAGMA

MAGMA is intended for CUDA enabled NVIDIA GPUs and HIP enabled AMD GPUs. It supports NVIDIA's Kepler, Maxwell, Pascal, Volta, Turing, Ampere, and Hopper GPUs, and AMD's HIP GPUs.

Included are routines for the following algorithms:

* LU, QR, and Cholesky factorizations

* Hessenberg, bidiagonal, and tridiagonal reductions

* Linear solvers based on LU, QR, and Cholesky

* Eigenvalue and singular value (SVD) problem solvers

* Generalized Hermitian-definite eigenproblem solver

* Mixed-precision iterative refinement solvers based on LU, QR, and Cholesky

* MAGMA BLAS including gemm, gemv, symv, and trsm

* Batched MAGMA BLAS including gemm, gemv, herk, and trsm

* Batched MAGMA LAPACK including LU, inverse (getri), QR, and Cholesky factorizations

* MAGMA Sparse including CG, GMRES, BiCGSTAB, LOBPCG, iterative refinement, preconditioners, sparse kernels (SpMV, SpMM), and support for CSR, ELL, and SELL-P data formats

MAGMA's dense linear algebra library includes mixed-precision algorithms (specifically iterative refinement, released in MAGMA 2.5).

Two examples are the interfaces magma_dsgesv_iteref_gpu and magma_dhgesv_iteref_gpu:

1- The FP32 to FP64 API magma_dsgesv_iteref_gpu, which is similar to the L
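The idea behind mixed-precision iterative refinement can be sketched in NumPy: solve in FP32, then compute residuals and apply corrections in FP64 until the solution reaches FP64-level accuracy. This is a minimal illustration of the general technique, not MAGMA's implementation or API; the function name, tolerance, and test matrix are assumptions chosen for the example.

```python
import numpy as np


def solve_mixed_precision(A, b, tol=1e-12, max_iter=20):
    """Solve A x = b with FP32 solves and FP64 residual correction."""
    A64 = np.asarray(A, dtype=np.float64)
    b64 = np.asarray(b, dtype=np.float64)
    A32 = A64.astype(np.float32)

    # Initial solve entirely in FP32. A production solver would factor A32
    # once (LU) and reuse the factors; np.linalg.solve refactors on each call.
    x = np.linalg.solve(A32, b64.astype(np.float32)).astype(np.float64)

    for it in range(max_iter):
        r = b64 - A64 @ x  # residual in FP64
        if np.linalg.norm(r) <= tol * np.linalg.norm(b64):
            return x, it
        d = np.linalg.solve(A32, r.astype(np.float32))  # correction in FP32
        x = x + d.astype(np.float64)                    # update in FP64
    return x, max_iter


rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned test matrix
x_true = rng.standard_normal(n)
b = A @ x_true

x, iters = solve_mixed_precision(A, b)
rel_err = np.linalg.norm(x - x_true) / np.linalg.norm(x_true)
print(iters, rel_err)
```

The expensive step, the factorization and solve, runs in the cheaper precision, while the inexpensive residual computation runs in FP64. For a well-conditioned matrix, a few correction steps are typically enough; for ill-conditioned systems the refinement may converge slowly or not at all.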
