U-Net: Convolutional Networks for Biomedical Image Segmentation
Olaf Ronneberger, Philipp Fischer, and Thomas Brox
Computer Science Department and BIOSS Centre for Biological Signalling Studies,
University of Freiburg, Germany
ronneber@informatik.uni-freiburg.de,
WWW home page: http://lmb.informatik.uni-freiburg.de/
Abstract. There is broad agreement that successful training of deep networks requires many thousands of annotated training samples. In this paper, we present a network and training strategy that relies heavily on data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopy stacks. Using the same network trained on transmitted-light microscopy images (phase contrast and DIC), we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast: segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net.
Fig. 1. U-net architecture (example for 32x32 pixels in the lowest resolution). Each blue box corresponds to a multi-channel feature map. The number of channels is denoted on top of the box. The x-y-size is provided at the lower left edge of the box. White boxes represent copied feature maps. The arrows denote the different operations.
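The sizes in the figure follow from simple arithmetic: each unpadded ("valid") 3x3 convolution shrinks the map by 2 pixels per side-pair, each 2x2 max pooling halves it, and each 2x2 up-convolution doubles it. The following sketch (a hypothetical helper, not part of the released Caffe implementation) traces an input tile through a 4-level U-net with two valid 3x3 convolutions per level, reproducing the standard 572 → 388 input/output relation.

```python
def unet_output_size(n, depth=4):
    """Trace the x-y size of a square input tile through a U-net with
    `depth` pooling levels, two valid 3x3 convs per level."""
    for _ in range(depth):
        n -= 4                     # two valid 3x3 convs: -2 each
        assert n % 2 == 0, "size must be even before 2x2 max pooling"
        n //= 2                    # 2x2 max pooling halves the map
    n -= 4                         # two convs at the lowest resolution
    for _ in range(depth):
        n *= 2                     # 2x2 up-convolution doubles the map
        n -= 4                     # two valid 3x3 convs after concatenation
    return n
```

For a 572x572 input tile this yields a 388x388 output map, and the map reaches 32x32 just after the fourth pooling step, matching the caption.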
1 Introduction
In the last two years, deep convolutional networks have outperformed the state of the art in many visual recognition tasks, e.g. [7]. While convolutional networks have already existed for a long time [8], their success was limited due to the size of the available training sets and the size of the considered networks. The breakthrough by Krizhevsky et al. [7] was due to supervised training of a large network with 8 layers and millions of parameters on the ImageNet dataset with 1 million training images. Since then, even larger and deeper networks have been trained [12].
The typical use of convolutional networks is on classification tasks, where the output for an image is a single class label. However, in many visual tasks, especially in biomedical image processing, the desired output should include localization, i.e., a class label is supposed to be assigned to each pixel. Moreover, thousands of training images are usually beyond reach in biomedical tasks. Hence, Ciresan et al. [2] trained a network in a sliding-window setup to predict the class label of each pixel by providing a local region (patch) around that pixel as input. First, this network can localize. Second, the training data in terms of patches is much larger than the number of training images. The resulting network won the EM segmentation challenge at ISBI 2012 by a large margin.
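The sliding-window setup can be sketched roughly as follows: pad the image so every pixel has a full surrounding window, then feed one patch per pixel to the classifier. The patch size and padding mode here are illustrative assumptions, not values taken from [2].

```python
import numpy as np

def sliding_window_patches(image, patch=5):
    """Yield ((y, x), patch) for every pixel of a 2-D image, where the
    patch is the local region centered on that pixel. Mirror padding
    gives border pixels a full-size context window."""
    r = patch // 2
    padded = np.pad(image, r, mode="reflect")
    h, w = image.shape
    for y in range(h):
        for x in range(w):
            # patch centered on (y, x): padded[y + r, x + r] == image[y, x]
            yield (y, x), padded[y:y + patch, x:x + patch]
```

Note the redundancy this setup implies: neighboring patches overlap almost entirely, so the network recomputes nearly identical features for every pixel, which is one of the drawbacks the U-net architecture avoids.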