Original paper: https://cave.cs.columbia.edu/Statics/publications/pdfs/Klotz_ECCV24.pdf

Minimalist Vision with Freeform Pixels

Jeremy Klotz and Shree K. Nayar

Computer Science Department, Columbia University, New York NY, USA {jklotz,nayar}@cs.columbia.edu

Abstract. A minimalist vision system uses the smallest number of pixels needed to solve a vision task. While traditional cameras use a large grid of square pixels, a minimalist camera uses freeform pixels that can take on arbitrary shapes to increase their information content. We show that the hardware of a minimalist camera can be modeled as the first layer of a neural network, where the subsequent layers are used for inference. Training the network for any given task yields the shapes of the camera’s freeform pixels, each of which is implemented using a photodetector and an optical mask. We have designed minimalist cameras for monitoring indoor spaces (with 8 pixels), measuring room lighting (with 8 pixels), and estimating traffic flow (with 8 pixels). The performance demonstrated by these systems is on par with a traditional camera with orders of magnitude more pixels. Minimalist vision has two major advantages. First, it naturally tends to preserve the privacy of individuals in the scene since the captured information is inadequate for extracting visual details. Second, since the number of measurements made by a minimalist camera is very small, we show that it can be fully self-powered, i.e., function without an external power supply or a battery.

Keywords: Freeform Pixels · Minimalist Camera · Lightweight Vision · Self-Powered Camera · Privacy Preservation · Deep Optics · Computational Imaging

1 Why Minimalist Vision?

Today, computer vision plays an indispensable role in our everyday lives. It serves as the backbone in a wide gamut of applications ranging from video surveillance and monitoring to autonomous driving and robotics. Broadly speaking, vision applications can be divided into two categories. In one category, the system seeks to infer detailed information about objects and activities in a scene. Examples include object detection and recognition, optical flow estimation and tracking, and 3D reconstruction. The second category of applications involves high-level inferences about the statistics of objects in a scene and the states of an environment. Examples in this realm include monitoring the occupancy of workspaces, the flow of traffic on highways, and the lighting in an urban environment.

In our work, we are interested in the second category, which we refer to as “lightweight vision.” We claim that lightweight tasks can be solved not with traditional images, but rather with a very small number of measurements, as long as the measurements are rich in information.

Fig. 1: Monitoring a workspace with minimalist vision. (a) The task is to count the number of people, track the occupancy of specified zones, and detect when the door is open. A minimalist vision system, composed of a camera and inference network, can perform such lightweight tasks using just a handful of freeform pixels. (b) The entire system can be modeled as a single network. Once this network is trained, the first layer specifies the design of a camera, a prototype of which is shown in (c). This system can count the people in the room (with 2 pixels), track the occupancy of each zone (with 2 pixels, each), and detect when the door is open (with 2 pixels). Given the small number of measurements it makes, a minimalist camera can be completely self-powered.

We introduce minimalist vision as an approach to solve lightweight tasks. In the arts, minimalism is a technique that is used to pare down a piece of work to its essential elements. The goal is to ensure that each element used has a purpose. In our context, traditional cameras that are used in virtually all vision systems today capture far more information than needed to solve a lightweight task. Our work seeks to answer two key questions: (a) Given a task, what is the smallest number of visual measurements needed to achieve a desired performance? (b) How do we construct a camera that produces these measurements? If we are successful in designing such a minimalist camera, it would have the following two major benefits:

Towards Privacy Preservation: When a traditional camera captures an image, it typically reveals far more information about the scene than necessary for the task. For instance, a single image could reveal a person’s identity, location, or even intentions. This is a well-known problem that has made the widespread deployment of cameras highly controversial [38]. Since minimalist vision captures the smallest number of measurements for a given task, it is difficult to extract visual details about the scene such as the biometrics of an individual. Although we cannot guarantee that privacy will be preserved in all applications, we claim that an inherent feature of our approach is that it tends to preserve privacy.

Towards Self-Sustainability: The imaging pipeline of a typical camera involves pixel readout, analog-to-digital (A/D) conversion, signal processing, and transmission. The power consumed by each of these steps, and hence the complete pipeline, is approximately linear in the number of pixels. Since a minimalist system uses an extremely small number of pixels, it consumes orders of magnitude less power than a typical camera. As a result, a minimalist camera can be designed to function using power harvested from just the light falling upon it, without using an external power supply or a battery. In other words, minimalist cameras can be completely self-sustaining and hence more widely deployed.
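
The scaling argument above can be illustrated with a back-of-the-envelope calculation. This is only a sketch: it assumes pipeline power is exactly proportional to pixel count (the text says approximately), and the 640 × 480 reference camera and the helper name `relative_power` are hypothetical choices made for illustration, not values from the paper.

```python
import math

def relative_power(num_pixels, reference_pixels):
    """Power relative to a reference camera, ASSUMING power scales linearly with pixel count."""
    return num_pixels / reference_pixels

reference_pixels = 640 * 480   # hypothetical traditional camera (illustrative assumption)
minimalist_pixels = 8          # pixel count quoted in the text

ratio = relative_power(minimalist_pixels, reference_pixels)
orders_of_magnitude = -math.log10(ratio)

print(f"relative power: {ratio:.2e}")
print(f"orders of magnitude saved: {orders_of_magnitude:.1f}")
```

Under these assumptions the saving falls between four and five orders of magnitude, which is consistent with the "orders of magnitude" claim in the paragraph above.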

To achieve minimalist vision, our key insight is to allow each pixel to have an arbitrary shape. We refer to such a pixel as a “freeform pixel.” We show that a freeform pixel performs a linear projection of the scene, allowing us to model a collection of such pixels as a single layer in a neural network. Thus, a minimalist vision system, comprising both the camera and inference network, can be modeled as one network. For a given task, such as monitoring the indoor workspace in Fig. 1(a), we use a video captured from an auxiliary camera to train the network in Fig. 1(b). The trained network reveals both the shapes of the freeform pixels and the weights of the inference network. Then, a camera (Fig. 1(c)) is fabricated, where each freeform pixel is implemented using an optical mask and a photodetector. In Fig. 1(a), the results (people count, door status, and zone occupancy) produced by a camera with only 8 freeform pixels are overlaid on the scene image.
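
The idea that a set of freeform pixels acts as one linear layer followed by inference layers can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the 32 × 32 scene size, the restriction of mask values to [0, 1], and the two-layer ReLU network with random weights are all assumptions made here.

```python
import numpy as np

def freeform_measurements(scene, masks):
    """Return one measurement per freeform pixel.

    scene: (H, W) array of scene radiance seen by the camera.
    masks: (K, H, W) array of mask transmittances, one mask per pixel.
    Each measurement is the mask-weighted sum of the scene (a linear projection).
    """
    k = masks.shape[0]
    return masks.reshape(k, -1) @ scene.reshape(-1)

def inference_network(measurements, w1, b1, w2, b2):
    """Hypothetical two-layer MLP standing in for the inference layers."""
    hidden = np.maximum(0.0, w1 @ measurements + b1)  # ReLU activation
    return w2 @ hidden + b2

rng = np.random.default_rng(0)
H, W, K = 32, 32, 8                                   # illustrative sizes
masks = rng.uniform(0.0, 1.0, size=(K, H, W))         # ASSUMPTION: transmittance in [0, 1]
scene = rng.uniform(0.0, 1.0, size=(H, W))

m = freeform_measurements(scene, masks)               # 8 measurements
w1, b1 = rng.normal(size=(16, K)), np.zeros(16)       # untrained, random weights
w2, b2 = rng.normal(size=(3, 16)), np.zeros(3)
outputs = inference_network(m, w1, b1, w2, b2)        # 3 task outputs
print(m.shape, outputs.shape)
```

Because `freeform_measurements` is a matrix-vector product, the mask values play the role of first-layer weights, which is what lets the camera and the inference network be trained together as one model.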

We have conducted extensive synthetic and real experiments that show freeform pixels can solve lightweight tasks using orders of magnitude fewer measurements than a traditional camera. We have used our prototype minimalist camera to demonstrate a variety of tasks: monitoring an indoor space (with 8 pixels), measuring room lighting (with 8 pixels), and estimating traffic flow (with 8 pixels). Finally, we show that our prototype can be powered using just the light falling on it. Under indoor lighting, it can read out and wirelessly transmit the measurements made by 24 freeform pixels at 30 frames per second without the use of an external power supply or a battery.

2 Related Work

Our work is inspired by Pooj et al. [28], who introduced the concept of a minimalist camera, where each pixel is a combination of an optical mask and a photodetector. In their work, each mask was handcrafted to solve simple vision tasks such as intrusion detection and object speed estimation. Our work introduces the idea of a freeform pixel that can be automatically designed using training data for any given task. Our key observation is that a camera with freeform pixels can be modeled as the first layer of a neural network. Once the network has been trained, the first layer is used to fabricate the camera, and the rest of the network is used for inference. Furthermore, we show that minimalist cameras can be fully self-powered, making them more easily deployable than traditional cameras.

Our work is closely related to deep optics, an emerging field that jointly designs optics and software using deep learning [40]. Sitzmann et al. [30] used this approach to design an optical element for improved image quality. Subsequently, multiple works have used the approach to design optics for image enhancement [13, 24, 31] and depth estimation [7, 15, 41]. A similar approach has been taken to design imaging lenses using differentiable ray tracing [9, 20, 32] and differentiable proxy functions [37]. Tseng et al. [36] used this technique to design a metasurface lens with improved image quality. In each of these works, a differentiable model for the camera’s optics is incorporated into a neural network, and the optics is designed by training the network for the specific goal. While we follow a similar approach, our motivation is different. Rather than design optics to enhance image quality or improve task performance, we design cameras that seek to preserve privacy and be self-sustaining by taking the smallest number of measurements.
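
The general recipe described above, placing a differentiable camera model inside a network and designing the optics by training, can be illustrated with a toy example. Everything in this sketch is an assumption made for illustration: a single mask parameterized through a sigmoid so that its values stay in (0, 1), a one-parameter "inference network" (`gain`), random toy scenes, a made-up task (estimating the brightness of the first half of the scene), and hand-written gradients with plain gradient descent. It is not the training procedure used in any of the cited works.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, d = 1024, 16
scenes = rng.uniform(0.0, 1.0, size=(n, d))    # flattened toy scenes
targets = scenes[:, : d // 2].sum(axis=1)      # toy task: brightness of the first half

theta = np.zeros(d)   # mask logits; mask = sigmoid(theta) stays in (0, 1)
gain = 1.0            # one-parameter stand-in for the inference network
lr = 0.01

def loss_and_grads(theta, gain):
    """Mean squared error and its gradients with respect to theta and gain."""
    mask = sigmoid(theta)
    y = scenes @ mask                          # camera layer: linear projection
    err = gain * y - targets
    loss = np.mean(err ** 2)
    d_pred = 2.0 * err / n                     # dLoss/dPrediction
    g_gain = np.sum(d_pred * y)
    g_mask = scenes.T @ (d_pred * gain)
    g_theta = g_mask * mask * (1.0 - mask)     # chain rule through the sigmoid
    return loss, g_theta, g_gain

initial_loss, _, _ = loss_and_grads(theta, gain)
for _ in range(3000):
    _, g_theta, g_gain = loss_and_grads(theta, gain)
    theta -= lr * g_theta
    gain -= lr * g_gain
final_loss, _, _ = loss_and_grads(theta, gain)

learned_mask = sigmoid(theta)
print(initial_loss, final_loss)
print(np.round(learned_mask, 2))
```

In this toy setting the learned mask should become more transparent over the half of the scene that matters for the task, which mirrors how training can shape the optics and the inference weights at the same time.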

Prior work has also demonstrated the use of optics in existing network architectures to reduce the computations required during inference. Lin et al. [22] fabricated an entire image classification network using layers of diffractive optics. Others have explored hybrid approaches that implement just the first layer of a convolutional network in optics. In [8], angle-sensitive pixels were used to convolve the image with a set of commonly used filters, while in [6], optical phase masks were used to implement learned filters. The goal of our work is different; it is to minimize the number of visual measurements needed for a task, not to reduce the computations in a trained network.

Duarte et al. [12] proposed a single pixel camera that captures compressive measurements of a scene. While both the single pixel camera and our minimalist camera capture linear projections of the scene, the single pixel camera uses thousands of measurements, acquired in series using a single detector, to reconstruct an image of the scene. Image and scene reconstruction has also been demonstrated using an image sensor that views the scene through an amplitude mask [2], a phase mask [4], and a diffuser [1]. While all of the above approaches aim to reconstruct an image or 3D scene, the minimalist camera circumvents image reconstruction and seeks to directly solve the task using the smallest number of measurements.
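
The contrast drawn above between the two acquisition schemes can be sketched as follows. Both cameras compute linear projections of the scene; the difference is how many are taken and whether they are acquired in series or at once. The function names, the random binary patterns, the 64 × 64 scene, and the count of 1000 serial measurements are illustrative assumptions, and no image reconstruction is attempted here.

```python
import numpy as np

rng = np.random.default_rng(0)
num_scene_pixels = 64 * 64                               # illustrative scene size
scene = rng.uniform(0.0, 1.0, size=num_scene_pixels)     # flattened toy scene

def single_pixel_capture(scene, patterns):
    """Acquire measurements one at a time with a single detector."""
    readings = []
    for pattern in patterns:                 # one exposure per pattern, in series
        readings.append(float(pattern @ scene))
    return np.array(readings)

def minimalist_capture(scene, masks):
    """Acquire all measurements in one exposure, one detector per mask."""
    return masks @ scene

# ASSUMPTION: random binary patterns and 1000 exposures, chosen only for illustration.
patterns = rng.integers(0, 2, size=(1000, num_scene_pixels)).astype(float)
masks = rng.uniform(0.0, 1.0, size=(8, num_scene_pixels))

serial = single_pixel_capture(scene, patterns)    # many measurements, meant for reconstruction
parallel = minimalist_capture(scene, masks)       # a handful, fed directly to the task
print(serial.shape, parallel.shape)
```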

Several works have explored imaging architectures that preserve an individual’s privacy while still capturing enough information to perform a task. Some of them attempt to eliminate visual details related to biometrics in captured images by using low-resolution image sensors [10] and time-of-flight sensors [17].${}^{1}$ Others have approached the problem by introducing optical blur [26, 27], by performing image processing in the analog domain before readout [33], or by designing optical elements that preserve the visual feature of interest while eliminating privacy-related details [16, 34]. Since our approach is to minimize the number of visual measurements, we claim that we implicitly preserve privacy. We demonstrate this via simulations by showing that our freeform pixels are unable to perform face recognition with meaningful accuracy.

It is known that cameras are power-hungry: the image sensor alone can consume hundreds of milliwatts [21]. Nayar et al. [25] demonstrated a self-powered camera with $30 \times 40$ pixels that harvests energy from the light falling on its sensor to read out full images. The harvested energy, however, was insufficient

${}^{1}$ Unrelated to privacy preservation, Torralba et al. [35] demonstrated image classification using a large dataset of very low resolution ($32 \times 32$) images. In our experiments, we compare the performance of our minimalist cameras with low-resolution traditional cameras of different resolutions.
