Burn: An Introduction to the Next-Generation Deep Learning Framework

1. Software Overview

Download links for the program and its source code are provided at the end of the article.

Burn is an open-source, next-generation deep learning framework that does not compromise on flexibility, efficiency, or portability.

2. Performance

Because we believe the goal of a deep learning framework is to convert computation into useful intelligence, we have made performance one of Burn's core pillars. We strive for maximum efficiency by leveraging the optimization techniques described below.

Automatic kernel fusion 💥

Using Burn means having your models optimized on any backend. When possible, we provide a way to automatically and dynamically create custom kernels that minimize data relocation between different memory spaces, which is extremely useful when memory movement is the bottleneck.

As an example, you could write your own GELU activation function with the high-level tensor API (see the Rust code snippet below).

use core::f64::consts::SQRT_2;
use burn::tensor::{backend::Backend, Tensor};

// GELU(x) = x / 2 * (1 + erf(x / sqrt(2))), built from high-level tensor ops.
fn gelu_custom<B: Backend, const D: usize>(x: Tensor<B, D>) -> Tensor<B, D> {
    let x = x.clone() * ((x / SQRT_2).erf() + 1);
    x / 2
}

Then, at runtime, a custom low-level kernel will be automatically created for your specific implementation, and it will rival a handcrafted GPU implementation. The kernel consists of about 60 lines of WGSL (WebGPU Shading Language), an extremely verbose lower-level shader language you probably don't want to use to program your deep learning models!

Thread-safe building blocks 🦞

Burn emphasizes thread safety by leveraging the ownership system of Rust. With Burn, each module is the owner of its weights. It is therefore possible to send a module to another thread to compute the gradients, then send the gradients back to the main thread, which can aggregate them, and voilà, you get multi-device training.

This is a very different approach from what PyTorch does, where backpropagation actually mutates the grad attribute of each tensor parameter. This is not a thread-safe operation and therefore requires lower-level synchronization primitives; see distributed training for reference. Note that this approach is still very fast, but it is not compatible across different backends and quite hard to implement.
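
To make this concrete, here is a minimal sketch of the pattern in plain Rust, deliberately independent of Burn's actual API: FakeModule and FakeGradients are hypothetical placeholder types standing in for a real module and its gradients. Each worker thread takes ownership of its own copy of the module, computes gradients locally, and sends them back over a channel for aggregation, with no locks or shared mutable state.

use std::sync::mpsc;
use std::thread;

// Hypothetical stand-ins for a Burn module and its gradients; not Burn's API.
#[derive(Clone)]
struct FakeModule {
    weights: Vec<f32>,
}
struct FakeGradients(Vec<f32>);

impl FakeModule {
    // Pretend backward pass: this thread owns `self`, so no locks are needed.
    fn compute_gradients(&self) -> FakeGradients {
        FakeGradients(self.weights.iter().map(|w| w * 0.1).collect())
    }
}

fn main() {
    let model = FakeModule { weights: vec![1.0, 2.0, 3.0] };
    let (tx, rx) = mpsc::channel();

    // One worker per "device": each takes ownership of its own copy of the module.
    for _ in 0..2 {
        let local = model.clone();
        let tx = tx.clone();
        thread::spawn(move || {
            tx.send(local.compute_gradients()).unwrap();
        });
    }
    drop(tx); // close our sender so the receive loop below can finish

    // The main thread aggregates the gradients coming back from all workers.
    let mut sum = vec![0.0f32; 3];
    for FakeGradients(g) in rx {
        for (s, gi) in sum.iter_mut().zip(g) {
            *s += gi;
        }
    }
    println!("aggregated gradients: {sum:?}");
}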

Intelligent memory management 🦀

One of the main roles of a deep learning framework is to reduce the amount of memory necessary to run models. The naive way of handling memory is for each tensor to have its own memory space, allocated when the tensor is created and deallocated when the tensor goes out of scope. However, allocating and deallocating data is very costly, so a memory pool is often required to achieve good throughput. Burn offers an infrastructure that allows for easily creating and selecting memory management strategies for backends. For more details on memory management in Burn, see this blog post.
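
As a rough illustration of the pooling idea described above (a toy sketch, not Burn's actual memory manager), the following reuses freed buffers of a matching size instead of going back to the allocator on every request:

use std::collections::HashMap;

// Toy buffer pool: keyed by size, hands out recycled buffers when available.
// This illustrates the concept only; it is not Burn's memory manager.
struct BufferPool {
    free: HashMap<usize, Vec<Vec<f32>>>,
}

impl BufferPool {
    fn new() -> Self {
        Self { free: HashMap::new() }
    }

    // Return a recycled buffer of the right size if one exists,
    // otherwise fall back to a fresh allocation.
    fn alloc(&mut self, len: usize) -> Vec<f32> {
        match self.free.get_mut(&len).and_then(|bufs| bufs.pop()) {
            Some(buf) => buf,
            None => vec![0.0; len],
        }
    }

    // "Deallocation" just returns the buffer to the pool for reuse.
    fn release(&mut self, buf: Vec<f32>) {
        self.free.entry(buf.len()).or_default().push(buf);
    }
}

fn main() {
    let mut pool = BufferPool::new();
    let a = pool.alloc(1024);
    pool.release(a); // the buffer goes back to the pool...
    let b = pool.alloc(1024); // ...and is reused here, with no new allocation
    assert_eq!(b.len(), 1024);
}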

Another very important memory optimization in Burn is that we keep track of when a tensor can be mutated in place, simply by using the ownership system well. Even though it is a rather small memory optimization on its own, it adds up considerably when training or running inference with larger models and contributes to reducing memory usage even further. For more information, see this blog post about tensor handling.
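
The following toy example shows, in plain Rust rather than Burn's internals, why ownership enables this: a method that takes `self` by value is guaranteed to hold the only handle to the buffer, so it can overwrite the data in place instead of allocating a copy.

// Toy tensor: a method that consumes `self` owns the only handle to the
// buffer, so it may safely mutate it in place.
struct ToyTensor {
    data: Vec<f32>,
}

impl ToyTensor {
    // Takes ownership: the type system guarantees no other reference exists,
    // so the allocation can be reused for the result.
    fn add_scalar(mut self, s: f32) -> ToyTensor {
        for x in self.data.iter_mut() {
            *x += s;
        }
        self // same buffer, mutated in place, zero extra memory
    }
}

fn main() {
    let t = ToyTensor { data: vec![1.0, 2.0, 3.0] };
    let t = t.add_scalar(10.0); // no copy: the original buffer is reused
    assert_eq!(t.data, vec![11.0, 12.0, 13.0]);
}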

Automatic kernel selection 🎯

A good deep learning framework should ensure that models run smoothly on all hardware. However, not all hardware shares the same behavior in terms of execution speed. For instance, a matrix multiplication kernel can be launched with many different parameters, which are highly sensitive to the size of the matrices and to the hardware. Using the wrong configuration could reduce execution speed by a large factor (10 times or even more in extreme cases), so choosing the right kernels becomes a priority.

With our home-made backends, we run benchmarks automatically and choose the best configuration with a reasonable caching strategy.
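
A stripped-down sketch of this benchmark-then-cache pattern is shown below. The kernel variants here are hypothetical stand-ins, and the real autotuning machinery is considerably more involved; the point is only the shape of the strategy: measure each candidate once per problem size, then reuse the cached winner.

use std::collections::HashMap;
use std::time::{Duration, Instant};

// Two hypothetical kernel variants standing in for real GPU configurations.
type Kernel = fn(usize) -> f32;

fn kernel_a(n: usize) -> f32 {
    (0..n).map(|i| i as f32).sum()
}
fn kernel_b(n: usize) -> f32 {
    (0..n).rev().map(|i| i as f32).sum()
}

struct Autotuner {
    kernels: Vec<Kernel>,
    // Cache: problem size -> index of the fastest kernel, measured once.
    cache: HashMap<usize, usize>,
}

impl Autotuner {
    fn run(&mut self, n: usize) -> f32 {
        let best = match self.cache.get(&n) {
            Some(&i) => i,
            None => {
                // First time we see this size: benchmark every candidate once.
                let mut best = (Duration::MAX, 0);
                for (i, kernel) in self.kernels.iter().enumerate() {
                    let start = Instant::now();
                    std::hint::black_box(kernel(n));
                    let elapsed = start.elapsed();
                    if elapsed < best.0 {
                        best = (elapsed, i);
                    }
                }
                self.cache.insert(n, best.1);
                best.1
            }
        };
        (self.kernels[best])(n)
    }
}

fn main() {
    let mut tuner = Autotuner {
        kernels: vec![kernel_a, kernel_b],
        cache: HashMap::new(),
    };
    tuner.run(1_000_000); // first call benchmarks and caches the winner
    tuner.run(1_000_000); // subsequent calls reuse the cached choice
}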
