How to design a deep learning accelerator?

More than a hundred companies worldwide are developing ASICs or SoCs targeted at deep learning applications. This article outlines the key elements of designing a deep learning inference accelerator chip, including core components such as the matrix multiplication unit, SRAM, interconnect, interfaces, and controllers. It also introduces available IP such as the NVIDIA DLA, SRAM IP, and Arteris FlexNoC, along with the ASIC design flow, cost estimation, and choice of manufacturing process.

How to make your own deep learning accelerator chip!

Currently, there are more than 100 companies all over the world building ASICs (Application Specific Integrated Circuits) or SoCs (Systems on Chip) targeted towards deep learning applications. There is a long list of such companies here. In addition to these startups, big companies like Google (TPU), Facebook, Amazon (Inferentia), Tesla, etc. are all developing custom ASICs for deep learning training and inference. These can be categorized into two types —

  • Training and Inference — These ASICs are designed to handle both training a deep neural network and performing inference. Training a large neural network like ResNet-50 is a much more compute-intensive task, involving gradient descent and back-propagation. Compared to training, inference is much simpler and requires far less computation. NVIDIA GPUs, the most popular hardware for deep learning today, can do both training and inference. Some other examples are the Graphcore IPU, Google TPU v3, Cerebras, etc. OpenAI has a great analysis showing the recent increase in compute required for training large networks.
  • Inference — These ASICs are designed to run DNNs (deep neural networks) that have been trained on a GPU or another ASIC; the trained network is then modified (quantized, pruned, etc.) to run on a different ASIC (like the Google Coral Edge TPU or NVIDIA Jetson Nano); a quantization sketch follows this list. The market for deep learning inference is widely considered to be much bigger than the market for training. Even very small microcontrollers (MCUs) based on the ARM Cortex-M0, M3, M4, etc. can do inference, as the TensorFlow Lite team has shown.
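
To make the quantization step concrete, here is a minimal NumPy sketch of symmetric post-training int8 quantization (my own illustration, not any vendor's toolchain):

```python
import numpy as np

def quantize_int8(weights):
    # Symmetric quantization: map the largest-magnitude weight to +/-127.
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximation of the original fp32 weights.
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, s = quantize_int8(w)
print("max abs error:", np.abs(dequantize(q, s) - w).max())  # bounded by ~scale/2
```

Int8 weights quarter the memory footprint relative to fp32 and let inference chips use much smaller, cheaper integer multipliers.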

Making any chip (ASIC, SoC, etc.) is a costly, difficult, and lengthy process, typically done by teams of anywhere from ten to thousands of people, depending on the size and complexity of the chip.

Here I am only providing a brief overview specific to deep learning inference accelerators. If you have already designed chips, you may find this too simple. If you are still interested, read on! If you like it, share and 👏.

Architecture of Existing ASICs

Let's first look at the high-level architecture of some of the accelerators currently being developed.

Habana Goya — Habana Labs is a start-up that is developing separate chips for training (Gaudi) and inference (Goya).
[Figure: high-level architecture of the Habana Goya]
GEMM Engine — General Matrix Multiply engine. Matrix multiplication is the core operation in all DNNs: convolution can be expressed as matrix multiplication, and fully connected layers are straightforward matrix multiplications.
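
To see why a GEMM engine covers convolutions too, here is a minimal NumPy sketch of the standard "im2col + GEMM" lowering (stride 1 and no padding assumed; this is an illustration, not Goya's actual implementation):

```python
import numpy as np

def conv2d_as_gemm(x, w):
    """x: input of shape (C, H, W); w: filters of shape (K, C, R, S)."""
    C, H, W = x.shape
    K, _, R, S = w.shape
    out_h, out_w = H - R + 1, W - S + 1          # stride 1, no padding

    # im2col: unroll every receptive field into one column.
    cols = np.empty((C * R * S, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = x[:, i:i+R, j:j+S].ravel()

    # A single GEMM computes all K output channels at once.
    y = w.reshape(K, -1) @ cols                  # (K, out_h*out_w)
    return y.reshape(K, out_h, out_w)

x = np.random.randn(3, 8, 8)
w = np.random.randn(4, 3, 3, 3)
print(conv2d_as_gemm(x, w).shape)                # (4, 6, 6)
```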

TPC — Tensor Processor Core — the block that actually performs the multiply and multiply-accumulate (MAC) operations.
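
As a quick illustration of the primitive such a core executes, the plain-Python sketch below shows that every element of a matrix product is a chain of MACs; an (n, k) by (k, m) product costs n*m*k of them, which is why accelerator datasheets quote performance in MACs (or twice that, in OPS):

```python
def matmul_macs(A, B):
    n, k = len(A), len(A[0])
    m = len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += A[i][p] * B[p][j]   # one MAC: multiply, then accumulate
            C[i][j] = acc
    return C

print(matmul_macs([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19.0, 22.0], [43.0, 50.0]]
```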

Local Memory and Shared Memory — These are both forms of cache, commonly implemented using SRAM (Static Random Access Memory) and register files (a register file is also a type of static volatile memory, just less dense than SRAM).
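
On-chip capacity is a first-order constraint, since a trip out to DRAM costs far more energy than an SRAM access. Here is a back-of-the-envelope sketch (my own illustrative numbers, not Goya's) of the storage a single layer's weights need:

```python
params = 1024 * 1024           # weights in a 1024x1024 fully connected layer
bytes_fp32 = params * 4        # 4 bytes per fp32 weight
bytes_int8 = params * 1        # 1 byte per int8 weight after quantization
print(bytes_fp32 / 2**20, "MiB fp32 vs", bytes_int8 / 2**20, "MiB int8")
# 4.0 MiB fp32 vs 1.0 MiB int8
```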

Eyeriss — The Eyeriss accelerator from MIT is an energy-efficient inference chip built around a spatial array of processing elements; its "row-stationary" dataflow reuses data locally to minimize costly off-chip memory accesses.
