Preamble
Networking is hard, but "AI" demands it
Networking is hard. I did rather poorly in this subject back in college, and in my defense, even at the time of writing there are very few network health providers capable of delivering thorough and accurate checkups, which says a lot about the complexity of these systems when even the experts don't fully understand what they are working with.
Still, as someone interested in deep learning and its acceleration, it is impossible not to look deeper into data communication systems when working with >100B-parameter models. For large computation workloads we pull the oldest trick in the CS book and cut them into parallel or pipelined pieces; for large DNNs the usual flavours are expert parallelism (EP), data parallelism (DP), pipeline parallelism (PP), and tensor parallelism (TP). See
Paradigms of Parallelism | Colossal-AI
and
Expert parallelism - Amazon SageMaker
for briefings. These DNN parallelisation schemes generate different synchronization patterns and memory access workloads. For example, EP and DP run largely independent parallel workloads and have perhaps the highest compute-to-communication ratio, so they need the least bandwidth and the least stringent latency; PP and TP, on the other hand, cut the model itself and require frequent, fast synchronization to avoid stalling the tensor/vector processors. A rough sketch of the communication primitive each one maps to follows the NeMo reference below.
(an even better summary of common parallelization strategies for LLM deployment, by NVIDIA: Parallelisms — NVIDIA NeMo Framework User Guide latest documentation)
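To make that mapping concrete, here is a minimal sketch (my own, not from the sources above) of the collective each scheme tends to reduce to, written with torch.distributed. The script name, tensor shapes, launch command, and the gloo backend are illustrative assumptions; a real GPU cluster would use NCCL.

```python
# Minimal sketch, assuming a torch.distributed job launched with e.g.
# `torchrun --nproc_per_node=4 parallelism_primitives.py`.
import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="gloo")   # swap for "nccl" on GPUs
    rank, world = dist.get_rank(), dist.get_world_size()

    # DP (and EP at the data level): each rank computes on its own shard, so the
    # only synchronization is an occasional all-reduce of gradients.
    # Infrequent, bandwidth-bound, latency-tolerant.
    grad = torch.ones(4) * rank
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)

    # TP: activations of a single layer are sharded across ranks, so a collective
    # sits on the critical path of every layer. Frequent and latency-bound.
    shard = torch.full((2,), float(rank))
    gathered = [torch.empty(2) for _ in range(world)]
    dist.all_gather(gathered, shard)

    # PP: stages pass activations point-to-point to their neighbour; a slow link
    # here stalls the whole pipeline (bubbles).
    act = torch.empty(4)
    if rank > 0:
        dist.recv(act, src=rank - 1)           # wait for the previous stage
    if rank + 1 < world:
        dist.send(torch.ones(4) * rank, dst=rank + 1)

    # EP routing (MoE) additionally scatters tokens to experts with all-to-all;
    # omitted here since all_to_all is typically only available on NCCL/MPI.

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The point of the toy is only to show why the traffic profiles differ: the DP all-reduce fires once per step, while the TP all-gather and PP send/recv sit inside every layer or stage boundary.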
Layered Communication and Main Schools
The different parallelisms, and the memory access workloads and synchronization requirements they create, are hard to serve with a single network design, so in practice we use a mixture of systems at different scales of distribution. A sketch of how that layering typically shows up in the process-group setup follows.
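As one concrete example of that layering in software (an assumed, Megatron-style layout, not taken from this post): frameworks carve the global job into process groups so that the chatty parallelism stays on the fast intra-node fabric (NVLink/NVSwitch) while the tolerant one crosses the inter-node network (InfiniBand/Ethernet). The eight-GPUs-per-node constant below is an assumption.

```python
# Sketch of a hierarchical process-group layout: TP groups stay inside a node,
# DP groups span nodes. Launch with torchrun so rank/world size are set.
import torch.distributed as dist

GPUS_PER_NODE = 8   # assumed node size, purely illustrative


def build_groups():
    dist.init_process_group(backend="gloo")   # "nccl" in practice
    rank, world = dist.get_rank(), dist.get_world_size()
    num_nodes = world // GPUS_PER_NODE

    # TP groups: consecutive ranks on one node -> traffic rides NVLink/NVSwitch.
    tp_group = None
    for node in range(num_nodes):
        ranks = list(range(node * GPUS_PER_NODE, (node + 1) * GPUS_PER_NODE))
        group = dist.new_group(ranks)          # every rank must join every call
        if rank in ranks:
            tp_group = group

    # DP groups: the same local GPU index across nodes -> traffic crosses the
    # slower scale-out network, which is fine because DP syncs rarely.
    dp_group = None
    for local in range(GPUS_PER_NODE):
        ranks = list(range(local, world, GPUS_PER_NODE))
        group = dist.new_group(ranks)
        if rank in ranks:
            dp_group = group

    return tp_group, dp_group
```

Gradient all-reduces would then be issued on dp_group and per-layer activation collectives on tp_group, so each class of traffic lands on the network tier that can actually carry it.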
