How to make your own deep learning accelerator chip!
Currently, there are more than 100 companies all over the world building ASICs (Application-Specific Integrated Circuits) or SoCs (Systems on Chip) targeted at deep learning applications. There is a long list of companies here. In addition to these startups, big companies like Google (TPU), Facebook, Amazon (Inferentia), and Tesla are all developing custom ASICs for deep learning training and inference. These can be categorized into two types:
- Training and Inference — These ASICs are designed to handle both training a deep neural network and performing inference. Training a large neural network like ResNet-50 is a much more compute-intensive task, involving gradient descent and back-propagation. Compared to training, inference is much simpler and requires far less computation; a minimal sketch contrasting the two follows this list. NVIDIA GPUs, which are the most popular choice for deep learning today, can do both training and inference. Some other examples are the Graphcore IPU, Google TPU v3, and Cerebras. OpenAI has a great analysis showing the recent increase in compute required for training large networks.
- Inference — These ASICs are designed to run DNNs (deep neural networks) that have already been trained on a GPU or another ASIC; the trained network is then modified (quantized, pruned, etc.) to run on a different chip (such as the Google Coral Edge TPU or NVIDIA Jetson Nano). A minimal quantization sketch also follows this list. The market for deep learning inference is widely considered to be much bigger than the market for training. Even very small microcontrollers (MCUs) based on the ARM Cortex-M0, M3, and M4 can do inference, as shown by the TensorFlow Lite team.
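
To make the training-versus-inference contrast concrete, here is a minimal NumPy sketch of a tiny two-layer network. The layer sizes, batch size, and learning rate are arbitrary choices for illustration, not anything from a real accelerator workload. Inference is a single forward pass, while one training step adds back-propagation and a gradient-descent weight update on top of that same forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny fully connected net: 256 -> 128 -> 10. Sizes are illustrative only.
W1 = rng.standard_normal((256, 128)) * 0.01
W2 = rng.standard_normal((128, 10)) * 0.01

def inference(x):
    """Forward pass only: this is all an inference accelerator has to run."""
    h = np.maximum(x @ W1, 0.0)      # hidden layer with ReLU
    return h @ W2                    # output logits

def train_step(x, y, lr=0.01):
    """Forward + backward + update: roughly 3x the math, plus stored activations."""
    global W1, W2
    # Forward pass (activations must be kept around for the backward pass).
    z1 = x @ W1
    h = np.maximum(z1, 0.0)
    logits = h @ W2
    # Softmax cross-entropy gradient with respect to the logits.
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    dlogits = (p - y) / len(x)
    # Backward pass (back-propagation through both layers).
    dW2 = h.T @ dlogits
    dh = dlogits @ W2.T
    dW1 = x.T @ (dh * (z1 > 0))
    # Gradient-descent update.
    W1 -= lr * dW1
    W2 -= lr * dW2

x = rng.standard_normal((32, 256))               # a batch of 32 inputs
y = np.eye(10)[rng.integers(0, 10, size=32)]     # one-hot labels
train_step(x, y)                                 # training: forward + backward + update
preds = inference(x).argmax(axis=1)              # inference: forward pass only
```
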
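And here is a minimal sketch of the kind of post-training modification mentioned above: affine int8 quantization of a trained weight tensor. The scale/zero-point scheme below is a simplified version of what toolchains like TensorFlow Lite do; the exact recipe (per-tensor vs. per-channel scales, calibration data, handling of constant tensors, etc.) varies by toolchain, so treat this as illustrative only.

```python
import numpy as np

def quantize_int8(w):
    """Map float32 weights to int8 plus a scale and zero point (affine quantization)."""
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / 255.0            # one float step per int8 step
    zero_point = int(round(-128 - w_min / scale))
    q = np.clip(np.round(w / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Approximate recovery of the float weights (used here only to check error)."""
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((128, 64)).astype(np.float32)   # stand-in for trained weights
q, scale, zp = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, scale, zp)).max())
# int8 weights take a quarter of the memory of float32 and let the
# inference ASIC use cheap integer MAC units instead of floating-point ones.
```
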

Making any chip (ASIC, SoC, etc.) is a costly, difficult, and lengthy process, typically done by teams of ten to thousands of people depending on the size and complexity of the chip.
Here I am only providing a brief overview specific to deep learning inference accelerators. If you have already designed chips, you may find this too simple. If you are still interested, read on! If you like it, share and 👏.
Architecture of Existing ASICs
Let's first look at the high-level architecture of some of the accelerators currently being developed.
Habana Goya — Habana Labs is a start-up that is developing separate chips for training (Gaudi) and inference (Goya).

GEMM Engine — General Matrix Multiply engine. Matrix multiplication is the core operation in all DNNs: convolution can be lowered to a matrix multiplication, and fully connected layers are straightforward matrix multiplication (a small sketch after this component list shows a convolution lowered to a matrix multiply built from MAC operations).
TPC — Tensor Processor Core, the block that actually performs the multiplication, or multiply-and-accumulate (MAC), operations.
Local Memory and Shared Memory — These are both forms of on-chip cache, commonly implemented using SRAM (Static Random-Access Memory) and register files (also a type of static volatile memory, just less dense than SRAM).
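
To see why a GEMM engine plus an array of MAC units covers most of a DNN workload, here is a minimal NumPy sketch. The `im2col` and `gemm` helpers are illustrative stand-ins written for this post, not Habana APIs: `im2col` lowers a convolution to a matrix multiplication, and the naive `gemm` loop shows that every output element is just a chain of multiply-and-accumulate operations.

```python
import numpy as np

def im2col(x, kh, kw):
    """Unroll (C, H, W) input patches into columns so convolution becomes a matmul."""
    c, h, w = x.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = np.empty((c * kh * kw, out_h * out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = x[:, i:i + kh, j:j + kw].ravel()
    return cols, out_h, out_w

def gemm(a, b):
    """Naive GEMM: every output element is a chain of MAC operations."""
    m, k = a.shape
    _, n = b.shape
    out = np.zeros((m, n), dtype=a.dtype)
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i, p] * b[p, j]   # one multiply-and-accumulate (MAC)
            out[i, j] = acc
    return out

# Convolution of a 3x8x8 input with four 3x3x3 filters, executed as a GEMM.
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))
filters = rng.standard_normal((4, 3, 3, 3))           # (out_ch, in_ch, kh, kw)
cols, out_h, out_w = im2col(x, 3, 3)
conv_as_matmul = gemm(filters.reshape(4, -1), cols)   # (4, out_h * out_w)
conv_out = conv_as_matmul.reshape(4, out_h, out_w)
# A fully connected layer is the same GEMM with no im2col step at all.
```

A real accelerator never runs this triple loop serially: it tiles the matrices across many parallel MAC units and keeps the operands in local SRAM, which is why the memory blocks above matter as much as the math units.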
Eyeriss — The Eyeriss accelerator from MIT is another well-studied inference design, built around an array of MAC processing elements and a "row-stationary" dataflow that keeps data in local memory to minimize costly data movement.
