KANN: a standalone, lightweight C library for building and training small to medium artificial neural networks

Article link: https://blog.youkuaiyun.com/struggle2025/article/details/148932141

1. Software introduction

The program and source code downloads are provided at the end of the article.

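KANN is a standalone, lightweight library in C for constructing and training small to medium artificial neural networks such as multi-layer perceptrons, convolutional neural networks and recurrent neural networks (including LSTM and GRU). It implements graph-based reverse-mode automatic differentiation and allows building topologically complex neural networks with recurrence, shared weights and multiple inputs/outputs/costs. Compared to mainstream deep learning frameworks such as TensorFlow, KANN is less scalable, but it is close in flexibility, has a much smaller code base and depends only on the standard C library. Compared to other lightweight frameworks such as tiny-dnn, KANN is still smaller, faster and more versatile, supporting RNNs, VAEs and non-standard neural networks that those lightweight frameworks may not be able to handle.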

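If you would like to experiment with small to medium neural networks in C/C++, to deploy not-so-large models without worrying about dependency hell, or to learn the internals of deep learning libraries, KANN may be useful to you.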
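
To make "graph-based reverse-mode automatic differentiation" more concrete, here is a minimal sketch in plain C. It is not KANN's implementation: the `node_t` type, the `OP_*` constants and the two functions are hypothetical names chosen for illustration, and only addition and multiplication are supported. The idea is that the forward pass evaluates nodes in topological order, and the backward pass walks the same array in reverse, pushing each node's gradient to its operands.

```c
#include <assert.h>

/* Hypothetical node types for this sketch; KANN's own kad_node_t is richer. */
enum { OP_LEAF, OP_ADD, OP_MUL };

typedef struct {
	int op;     /* OP_LEAF, OP_ADD or OP_MUL */
	int in[2];  /* indices of the two operands (unused for leaves) */
	float val;  /* value computed by the forward pass */
	float grad; /* d(output)/d(this node), filled by the backward pass */
} node_t;

/* Forward pass: nodes are stored in topological order, so operands come first. */
void graph_forward(node_t *g, int n)
{
	int i;
	for (i = 0; i < n; ++i) {
		if (g[i].op == OP_ADD) g[i].val = g[g[i].in[0]].val + g[g[i].in[1]].val;
		else if (g[i].op == OP_MUL) g[i].val = g[g[i].in[0]].val * g[g[i].in[1]].val;
	}
}

/* Backward pass: seed the last node with 1 and push gradients to the operands. */
void graph_backward(node_t *g, int n)
{
	int i;
	assert(n > 0);
	for (i = 0; i < n; ++i) g[i].grad = 0.0f;
	g[n - 1].grad = 1.0f;
	for (i = n - 1; i >= 0; --i) {
		node_t *a, *b;
		if (g[i].op == OP_LEAF) continue;
		a = &g[g[i].in[0]];
		b = &g[g[i].in[1]];
		if (g[i].op == OP_ADD) { /* d(a+b)/da = d(a+b)/db = 1 */
			a->grad += g[i].grad;
			b->grad += g[i].grad;
		} else { /* OP_MUL: d(a*b)/da = b, d(a*b)/db = a */
			a->grad += g[i].grad * b->val;
			b->grad += g[i].grad * a->val;
		}
	}
}
```

For f = x*y + x with x = 3 and y = 4, building the four-node graph (two leaves, one multiplication, one addition) and calling both functions yields f = 15, df/dx = 5 and df/dy = 3. Because gradients are accumulated with `+=`, a node that feeds several consumers (here x) receives the sum of all contributions, which is also what makes shared weights work in this scheme.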

2. Features

Flexible. Model construction by building a computational graph with operators. Supports RNNs, weight sharing and multiple inputs/outputs.
Efficient. Reasonably optimized matrix product and convolution. Supports mini-batching and effective multi-threading. Sometimes faster than mainstream frameworks in their CPU-only mode.
Small and portable. As of now, KANN has fewer than 4000 lines of code in four source code files, with no non-standard dependencies by default. Compatible with ANSI C compilers.

3. Limitations

CPU only. As such, KANN is not intended for training huge neural networks.
Lacks some common operators and architectures such as batch normalization.
Verbose APIs for training RNNs.

4. Installation

The KANN library is composed of four files: kautodiff.{h,c} and kann.{h,c}. You are encouraged to include these files in your source code tree. No installation is needed. To compile examples:

make

This generates a few executables in the examples directory.

A tour of basic KANN APIs

Working with neural networks usually involves three steps: model construction, training and prediction. We can use layer APIs to build a simple model:

kann_t *ann;
kad_node_t *t;
t = kann_layer_input(784); // for MNIST
t = kad_relu(kann_layer_dense(t, 64)); // a 64-neuron hidden layer with ReLU activation
t = kann_layer_cost(t, 10, KANN_C_CEM); // softmax output + multi-class cross-entropy cost
ann = kann_new(t, 0);                   // compile the network and collate variables
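
As a rough picture of what the hidden layer in this model computes, the sketch below implements the forward pass of one dense layer followed by ReLU in plain C. It is an illustration only: `dense_relu_forward` is a hypothetical helper, not a KANN function, and the row-major weight layout is an assumption made for this example rather than a statement about how KANN stores its variables.

```c
#include <assert.h>

/* Hypothetical helper, not part of KANN: y = relu(W x + b).
 * W is row-major with n_out rows and n_in columns. */
void dense_relu_forward(int n_in, int n_out, const float *W, const float *b,
                        const float *x, float *y)
{
	int i, j;
	assert(n_in > 0 && n_out > 0);
	for (i = 0; i < n_out; ++i) {
		float s = b[i]; /* start from the bias of output neuron i */
		for (j = 0; j < n_in; ++j) s += W[i * n_in + j] * x[j];
		y[i] = s > 0.0f ? s : 0.0f; /* ReLU clamps negative sums to zero */
	}
}
```

With n_in = 784 and n_out = 64 this corresponds in shape to the hidden layer above; the library additionally tracks the weights as trainable variables in the graph so that gradients can be computed for them.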

For this simple feedforward model with one input and one output, we can train it with:

int n;     // number of training samples
float **x; // model input, of size n * 784
float **y; // model output, of size n * 10
// fill in x and y here and then call:
kann_train_fnn1(ann, 0.001f, 64, 25, 10, 0.1f, n, x, y);
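
The snippet above leaves "fill in x and y" to the reader. Since both are arrays of n row pointers, one possible way to allocate and release them is sketched below. `matrix_alloc` and `matrix_free` are hypothetical helpers written for this article, not KANN functions.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: allocate n_rows pointers, each to n_cols zeroed floats.
 * Returns NULL if any allocation fails, after releasing what was allocated. */
float **matrix_alloc(int n_rows, int n_cols)
{
	int i;
	float **m;
	assert(n_rows > 0 && n_cols > 0);
	m = (float**)calloc(n_rows, sizeof(float*));
	if (m == NULL) return NULL;
	for (i = 0; i < n_rows; ++i) {
		m[i] = (float*)calloc(n_cols, sizeof(float));
		if (m[i] == NULL) { /* roll back the rows allocated so far */
			while (--i >= 0) free(m[i]);
			free(m);
			return NULL;
		}
	}
	return m;
}

/* Hypothetical helper: release a matrix created by matrix_alloc(). */
void matrix_free(float **m, int n_rows)
{
	int i;
	if (m == NULL) return;
	for (i = 0; i < n_rows; ++i) free(m[i]);
	free(m);
}
```

For the MNIST model, `x = matrix_alloc(n, 784)` and `y = matrix_alloc(n, 10)` would give buffers of the sizes stated in the comments above; each row of y would then get a single 1.0f at the index of the true class.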

We can save the model to a file with kann_save() or use it to classify a MNIST image:

float *x;       // of size 784
const float *y; // this will point to an array of size 10
// fill in x here and then call:
y = kann_apply1(ann, x);
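
The array returned for a classifier holds one score per class, so turning it into a predicted digit means picking the index of the largest entry. A minimal sketch follows; `predict_class` is a hypothetical helper name, not a KANN function.

```c
#include <assert.h>

/* Hypothetical helper: index of the largest of n scores.
 * On ties, the lowest index wins. */
int predict_class(const float *y, int n)
{
	int k, best = 0;
	assert(n > 0);
	for (k = 1; k < n; ++k)
		if (y[k] > y[best]) best = k;
	return best;
}
```

For the MNIST model this would be called as `predict_class(y, 10)` right after the prediction step.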

Working with complex models requires the low-level APIs. Please see 01user.md for details.

A complete example

This example learns to count the number of "1" bits in an integer (i.e. popcount):

// to compile and run: gcc -O2 this-prog.c kann.c kautodiff.c -lm && ./a.out
#include <stdlib.h>
#include <stdio.h>
#include "kann.h"

int main(void)
{
	int i, k, max_bit = 20, n_samples = 30000, mask = (1<<max_bit)-1, n_err, max_k;
	float **x, **y, max, *x1;
	kad_node_t *t;
	kann_t *ann;
	// construct an MLP with one hidden layer
	t = kann_layer_input(max_bit);
	t = kad_relu(kann_layer_dense(t, 64));
	t = kann_layer_cost(t, max_bit + 1, KANN_C_CEM); // output uses 1-hot encoding
	ann = kann_new(t, 0);
	// generate training data
	x = (float**)calloc(n_samples, sizeof(float*));
	y = (float**)calloc(n_samples, sizeof(float*));
	for (i = 0; i < n_samples; ++i) {
		int c, a = kad_rand(0) & (mask>>1);
		x[i] = (float*)calloc(max_bit, sizeof(float));
		y[i] = (float*)calloc(max_bit + 1, sizeof(float));
		for (k = c = 0; k < max_bit; ++k)
			x[i][k] = (float)(a>>k&1), c += (a>>k&1);
		y[i][c] = 1.0f; // c is ranged from 0 to max_bit inclusive
	}
	// train
	kann_train_fnn1(ann, 0.001f, 64, 50, 10, 0.1f, n_samples, x, y);
	// predict
	x1 = (float*)calloc(max_bit, sizeof(float));
	for (i = n_err = 0; i < n_samples; ++i) {
		int c, a = kad_rand(0) & (mask>>1); // generating a new number
		const float *y1;
		for (k = c = 0; k < max_bit; ++k)
			x1[k] = (float)(a>>k&1), c += (a>>k&1);
		y1 = kann_apply1(ann, x1);
		for (k = 0, max_k = -1, max = -1.0f; k <= max_bit; ++k) // find the max
			if (max < y1[k]) max = y1[k], max_k = k;
		if (max_k != c) ++n_err;
	}
	fprintf(stderr, "Test error rate: %.2f%%\n", 100.0 * n_err / n_samples);
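	// release the network and all training/test buffers
	kann_delete(ann);
	for (i = 0; i < n_samples; ++i) { free(x[i]); free(y[i]); }
	free(x); free(y); free(x1);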
	return 0;
}
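
The data encoding used in the program above can be isolated into a small function, which makes it easier to check by hand. The sketch below is a hypothetical refactoring (`encode_sample` does not appear in the original program): the input vector holds one float per bit, and the label is a one-hot vector of length max_bit + 1 with a 1 at the popcount.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: write the low max_bit bits of a into x (one float per
 * bit) and a one-hot popcount label into y (max_bit + 1 floats).
 * Returns the popcount. */
int encode_sample(unsigned a, int max_bit, float *x, float *y)
{
	int k, c = 0;
	assert(max_bit > 0 && max_bit < 32);
	memset(y, 0, (size_t)(max_bit + 1) * sizeof(float));
	for (k = 0; k < max_bit; ++k) {
		int bit = (int)(a >> k & 1u);
		x[k] = (float)bit; /* network input: 0.0 or 1.0 per bit */
		c += bit;          /* running count of set bits */
	}
	y[c] = 1.0f; /* one-hot label at index c */
	return c;
}
```

For a = 11 (binary 1011) and max_bit = 4, x becomes {1, 1, 0, 1} and y has its single 1 at index 3.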

Benchmarks

First of all, this benchmark only evaluates relatively small networks, but in practice, it is huge networks on GPUs that really demonstrate the true power of mainstream deep learning frameworks. Please don't read too much into the table.
"Linux" has 48 cores on two Xeon E5-2697 CPUs at 2.7GHz. MKL, NumPy-1.12.0 and Theano-0.8.2 were installed with Conda; Keras-1.2.2 was installed with pip. The official TensorFlow-1.0.0 wheel does not work with CentOS 6 on this machine, due to glibc. This machine has one Tesla K40c GPU installed. We are using CUDA-7.0 and cuDNN-4.0 for training on GPU.
"Mac" has 4 cores on a Core i7-3667U CPU at 2GHz. MKL, NumPy and Theano came with Conda, too. Keras-1.2.2 and TensorFlow-1.0.0 were installed with pip. On both machines, Tiny-DNN was acquired from GitHub on March 1st, 2017.
mnist-mlp implements a simple MLP with one layer of 64 hidden neurons. mnist-cnn applies two convolutional layers with 32 3-by-3 kernels and ReLU activation, followed by 2-by-2 max pooling and one 128-neuron dense layer. mul100-rnn uses two GRUs of size 160. Both input and output are 2-D binary arrays of shape (14,2) -- 28 GRU operations for each of the 30000 training samples.

Task	Framework	Machine	Device	Real time	CPU time	Command line
mnist-mlp	KANN+SSE	Linux	1 CPU	31.3s	31.2s	mlp -m20 -v0
mnist-mlp	KANN+SSE	Mac	1 CPU	27.1s	27.1s	
mnist-mlp	KANN+BLAS	Linux	1 CPU	18.8s	18.8s	
mnist-mlp	Theano+Keras	Linux	1 CPU	33.7s	33.2s	keras/mlp.py -m20 -v0
mnist-mlp	Theano+Keras	Linux	4 CPUs	32.0s	121.3s	
mnist-mlp	Theano+Keras	Mac	1 CPU	37.2s	35.2s	
mnist-mlp	Theano+Keras	Mac	2 CPUs	32.9s	62.0s	
mnist-mlp	TensorFlow	Mac	1 CPU	33.4s	33.4s	tensorflow/mlp.py -m20
mnist-mlp	TensorFlow	Mac	2 CPUs	29.2s	50.6s	tensorflow/mlp.py -m20 -t2
mnist-mlp	Tiny-dnn	Linux	1 CPU	2m19s	2m18s	tiny-dnn/mlp -m20
mnist-mlp	Tiny-dnn+AVX	Linux	1 CPU	1m34s	1m33s	
mnist-mlp	Tiny-dnn+AVX	Mac	1 CPU	2m17s	2m16s	
mnist-cnn	KANN+SSE	Linux	1 CPU	57m57s	57m53s	mnist-cnn -v0 -m15
mnist-cnn	KANN+SSE	Linux	4 CPUs	19m09s	68m17s	mnist-cnn -v0 -t4 -m15
mnist-cnn	Theano+Keras	Linux	1 CPU	37m12s	37m09s	keras/mlp.py -Cm15 -v0
mnist-cnn	Theano+Keras	Linux	4 CPUs	24m24s	97m22s	
mnist-cnn	Theano+Keras	Linux	1 GPU	2m57s		keras/mlp.py -Cm15 -v0
mnist-cnn	Tiny-dnn+AVX	Linux	1 CPU	300m40s	300m23s	tiny-dnn/mlp -Cm15
mul100-rnn	KANN+SSE	Linux	1 CPU	40m05s	40m02s	rnn-bit -l2 -n160 -m25 -Nd0
mul100-rnn	KANN+SSE	Linux	4 CPUs	12m13s	44m40s	rnn-bit -l2 -n160 -t4 -m25 -Nd0
mul100-rnn	KANN+BLAS	Linux	1 CPU	22m58s	22m56s	rnn-bit -l2 -n160 -m25 -Nd0
mul100-rnn	KANN+BLAS	Linux	4 CPUs	8m18s	31m26s	rnn-bit -l2 -n160 -t4 -m25 -Nd0
mul100-rnn	Theano+Keras	Linux	1 CPU	27m30s	27m27s	rnn-bit.py -l2 -n160 -m25
mul100-rnn	Theano+Keras	Linux	4 CPUs	19m52s	77m45s	

In single-thread mode, Theano is about 50% faster than KANN, probably due to the efficient matrix multiplication (a.k.a. sgemm) implemented in MKL. As shown in a previous micro-benchmark, MKL/OpenBLAS can be twice as fast as the implementation in KANN.
KANN can optionally use the sgemm routine from a BLAS library (enabled by the macro HAVE_CBLAS). Linked against OpenBLAS-0.2.19, KANN matches the single-thread performance of Theano on mul100-rnn. KANN doesn't reduce convolution to matrix multiplication, so mnist-cnn won't benefit from OpenBLAS. We observed that OpenBLAS is slower than the native KANN implementation when we use a mini-batch of size 1. The cause is unknown.
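
How a compile-time macro can switch between a built-in matrix product and a BLAS routine is sketched below. Only the macro name HAVE_CBLAS and the standard CBLAS call `cblas_sgemm` are real; the wrapper `matmul_acc` and its plain triple loop are assumptions made for this illustration and do not reproduce KANN's own, more optimized routine.

```c
#include <assert.h>
#ifdef HAVE_CBLAS
#include <cblas.h>
#endif

/* Hypothetical wrapper: C += A * B, all row-major; A is M x K, B is K x N. */
void matmul_acc(int M, int N, int K, const float *A, const float *B, float *C)
{
	assert(M > 0 && N > 0 && K > 0);
#ifdef HAVE_CBLAS
	/* alpha = 1 and beta = 1 give C = A*B + C */
	cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, M, N, K,
	            1.0f, A, K, B, N, 1.0f, C, N);
#else
	{
		int i, j, k;
		for (i = 0; i < M; ++i)
			for (k = 0; k < K; ++k) {
				float a = A[i * K + k];
				/* innermost loop walks rows of B and C contiguously */
				for (j = 0; j < N; ++j) C[i * N + j] += a * B[k * N + j];
			}
	}
#endif
}
```

Compiling with `-DHAVE_CBLAS` and linking a BLAS library selects the first branch; without the macro the file builds with no dependency beyond the standard C library, which matches the "no non-standard dependencies by default" property described earlier.
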
KANN's intra-batch multi-threading model is better than that of Theano+Keras. However, in its current form, this model probably won't get along well with GPUs.
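
The multi-threading comparison can be checked against the table by dividing the 1-CPU wall-clock time by the 4-CPU wall-clock time. The helpers below are a small sketch for doing that arithmetic; `parse_time` and `speedup` are hypothetical names introduced here, and they only understand the two time formats that appear in the table.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: parse "57m57s" or "31.3s" into seconds; -1 on error. */
double parse_time(const char *s)
{
	int min;
	double sec;
	assert(s != NULL);
	if (sscanf(s, "%dm%lfs", &min, &sec) == 2) return min * 60.0 + sec;
	if (sscanf(s, "%lfs", &sec) == 1) return sec;
	return -1.0;
}

/* Hypothetical helper: wall-clock speedup of a multi-thread run over a
 * single-thread run, both given as table strings. */
double speedup(const char *t_single, const char *t_multi)
{
	return parse_time(t_single) / parse_time(t_multi);
}
```

Using the Linux rows, `speedup("57m57s", "19m09s")` gives about 3.0 for KANN+SSE on mnist-cnn, while `speedup("37m12s", "24m24s")` gives about 1.5 for Theano+Keras. On mul100-rnn the same calculation gives about 3.3 for KANN+SSE and about 1.4 for Theano+Keras, all with 4 CPUs.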