Fundamentals of Embedded Video Processing (Part 4 of a 5-part series)
By David Katz and Rick Gentile, Analog Devices, Inc.
In this article, the fourth installment in a five-part series, we’ll look at video flows from a processor standpoint, focusing on structures and features that enable efficient data manipulation and movement.
Video Port Features
To handle video streams, processors must have a suitable interface that can maintain a high data transfer rate into and out of the part. Some processors accomplish this through an FPGA and/or FIFO connected to the processor’s external memory interface. Typically, this device will negotiate between the constant, relatively slow stream of video (~27 MB/s for NTSC video) into/out of the processor and the sporadic but speedy bursting nature of the external memory controller (~133 MWords/sec, or 266 MB/s, capability).
However, there are problems with this arrangement. For example, FPGAs and FIFOs are expensive, often costing as much as the video processor itself. Additionally, using the external memory interface for video transfer steals bandwidth from its other prime use in these systems – moving video buffers back and forth between the processor core and external memory.
Therefore, a dedicated video interface is highly preferable for media processing systems. For example, on Blackfin processors, this is the Parallel Peripheral Interface (PPI). The PPI is a multifunction parallel interface that can be configured between 8 and 16 bits in width. It supports bi-directional data flow and includes three synchronization lines and a clock pin for connection to an externally supplied clock. The PPI can gluelessly decode ITU-R BT.656 data and can also interface to ITU-R BT.601 video sources and displays, as well as TFT LCD panels. It can serve as a conduit for high-speed analog-to-digital converters (ADCs) and digital-to-analog converters (DACs). It can also emulate a host interface for an external processor.
The PPI has some built-in features that can reduce system costs and improve data flow. For instance, in BT.656 mode the PPI can decode an input video stream and automatically ignore everything except active video, effectively reducing an NTSC input video stream rate from 27 MB/s to 20 MB/s, and markedly reducing the amount of off-chip memory needed to handle the video. Alternately, it can ignore active video regions and only read in ancillary data that’s embedded in vertical blanking intervals. These modes are shown pictorially in Figure 1.
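To make the masking behavior concrete, here is a minimal software model of the PPI's "active video only" mode. It scans a BT.656 byte stream for the FF 00 00 XY timing codes and forwards only the pixel bytes of active-video lines (V bit clear in the SAV code). The function name and stream contents are illustrative, not part of any Blackfin API; the real PPI does this filtering in hardware before the data ever reaches the DMA controller.

```c
#include <stdint.h>
#include <stddef.h>

/* Software sketch of the PPI's "active video only" masking: scan a
   BT.656 stream for SAV/EAV codes (FF 00 00 XY) and copy only the
   pixel bytes of active-video lines to 'out'.  In the XY byte,
   bit 5 is V (1 = vertical blanking) and bit 4 is H (0 = SAV,
   1 = EAV).  Returns the number of bytes written. */
size_t extract_active_video(const uint8_t *in, size_t n, uint8_t *out)
{
    size_t w = 0;
    int in_active = 0;
    for (size_t i = 0; i < n; ) {
        if (i + 3 < n && in[i] == 0xFF && in[i+1] == 0x00 && in[i+2] == 0x00) {
            uint8_t xy = in[i+3];
            int v = (xy >> 5) & 1;          /* vertical blanking flag */
            int h = (xy >> 4) & 1;          /* 0 = SAV, 1 = EAV       */
            in_active = (h == 0 && v == 0); /* copy after an active SAV */
            i += 4;
        } else {
            if (in_active)
                out[w++] = in[i];
            i++;
        }
    }
    return w;
}
```

Masking blanking-only (ancillary) data would be the mirror image: copy when the V bit is set instead of clear.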
Figure 1: Selective Masking of BT.656 regions in PPI
Likewise, the PPI can “ignore” every other field of an interlaced stream; in other words, it will not forward this data to the DMA controller. While this instantly decimates input bandwidth requirements by 50%, it also eliminates 50% of the source content, so sometimes this tradeoff might not be acceptable. Nevertheless, this can be a useful feature when the input video resolution is much greater than the required output resolution.
On a similar note, the PPI allows “skipping” of odd- or even-numbered elements, again saving DMA bandwidth for the skipped pixel elements. For example, in a 4:2:2 YCbCr stream, this feature allows only luma or chroma elements to be read in, providing convenient partitioning of an algorithm between different processors; one can read in the luma, and the other can read the chroma. Also, it provides a simple way to convert an image or video stream to grayscale (luma-only). Finally, in high-speed converter applications with interleaved I/Q data, this feature allows partitioning between these in-phase and quadrature components.
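As a software model of the element-skipping feature, the sketch below pulls only the luma samples out of a 4:2:2 YCbCr stream. In BT.656 byte order (Cb Y Cr Y ...), luma occupies the odd-numbered positions, so keeping those bytes yields a grayscale image at half the input bandwidth. This is an illustration of the data pattern, not the hardware mechanism itself.

```c
#include <stdint.h>
#include <stddef.h>

/* Model of the PPI's odd-element skipping on a 4:2:2 YCbCr stream
   ordered Cb Y Cr Y ...: keep only the odd-indexed (luma) bytes,
   producing a grayscale image.  Returns the number of bytes written. */
size_t extract_luma(const uint8_t *ycbcr, size_t n, uint8_t *luma)
{
    size_t w = 0;
    for (size_t i = 1; i < n; i += 2)   /* Y samples sit at odd offsets */
        luma[w++] = ycbcr[i];
    return w;
}
```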
Importantly, the PPI is format-agnostic, in that it is not hardwired to a specific video standard. It allows for programmable row lengths and frame lengths. This aids applications that need, say, CIF or QCIF video instead of standard NTSC/PAL formats. In general, as long as the incoming video has the proper EAV/SAV codes (for BT.656 video) or hardware synchronization signals (for BT.601 video), the PPI can process it.
Packing
Although the BT.656 and BT.601 recommendations allow for 10-bit pixel elements, this is not a very friendly word length for processing. The problem is, most processors are very efficient at handling data in 8-bit, 16-bit or 32-bit chunks, but anything in-between results in data movement inefficiencies. For example, even though a 10-bit pixel value is only 2 bits wider than an 8-bit value, most processors will treat it as a 16-bit entity with the 6 most significant bits (MSBs) set to 0. Not only does this waste bandwidth on the internal data transfer (DMA) buses, but it also wastes a lot of memory – a disadvantage in video applications, where several entire frame buffers are usually stored in external memory.
A related inefficiency associated with data sizes larger than 8 bits is non-optimal packing. Usually, a high-performance media processor will imbue its peripherals with a data packing mechanism that sits between the outside world and the internal data movement buses of the processor, and its goal is to minimize the overall bandwidth burden that the data entering or exiting the peripheral places on these buses. Therefore, an 8-bit video stream clocking into a peripheral at 27 MB/s might be packed onto a 32-bit internal data movement bus, thereby requesting service from this bus at a rate of only 27/4, or 6.75 MHz. Note that the overall data transfer rate remains the same (6.75 MHz * 32 bits = 27 MB/s). In contrast, a 10-bit video stream running at 27 MB/s would only be packed onto the 32-bit internal bus as two 16-bit chunks, reducing the overall transfer rate to 27/2, or 13.5 MHz. In this case, since only 10 data bits out of every 16 are relevant, 37.5% of the internal bus bandwidth is wasted.
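The packing arithmetic above reduces to one line of math, worked through here as a small helper (the function name is ours, purely for illustration): the bus request rate is the stream rate divided by the number of container-sized elements that fit in one internal bus word.

```c
/* Bus service rate (in MHz, for byte-per-cycle streams) when packing a
   stream of 'stream_mbps' MB/s onto a 'bus_bits'-wide internal bus,
   with each sample occupying a 'container_bits'-wide slot. */
static double bus_request_mhz(double stream_mbps, int container_bits,
                              int bus_bits)
{
    int elems_per_word = bus_bits / container_bits; /* samples per bus word */
    return stream_mbps / elems_per_word;
}
```

For the 8-bit case this gives 27/4 = 6.75 MHz; for 10-bit samples widened to 16-bit containers it gives 27/2 = 13.5 MHz, with 1 - 10/16 = 37.5% of the bus bandwidth carrying padding.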
Possible Data Flows
It is instructive to examine some ways in which a video port connects in multimedia systems, to show how the system as a whole is interdependent on each component flow. In Figure 2a, an image source sends data to the PPI, at which point the DMA engine then dispositions it to L1 memory, where the data is processed to its final form before being sent out through a high-speed serial port. This model works very well for low-resolution video processing and for image compression algorithms like JPEG, where small blocks of video (several lines worth) can be processed and are subsequently never needed again. This flow also can work well for some data converter applications.
In Figure 2b, the video data is not routed to L1 memory, but instead is directed to L3 memory. This configuration supports algorithms such as MPEG-2 and MPEG-4, which require storage of intermediate video frames in memory in order to perform temporal compression. In such a scenario, a bidirectional DMA stream between L1 and L3 memories allows for transfers of pixel macroblocks and other intermediate data.
Figure 2: Possible video port data transfer scenarios
Video ALUs
Most video applications need to deal with 8-bit data, since individual pixel components (whether RGB or YCbCr) are usually byte quantities. Therefore, 8-bit video ALUs and byte-based address generation can make a huge difference in pixel manipulation. This is a nontrivial point, because embedded processors typically operate on 16-bit or 32-bit boundaries.
Embedded media processors sometimes have instructions that are geared to processing 8-bit video data efficiently. For instance, Table 1 shows a summary of the specific Blackfin instructions that can be used together to handle a variety of video operations.
Table 1: Native Blackfin video instructions
Instruction | Description | Algorithm Use
Byte Align | Copies a contiguous four-byte unaligned word from a combination of two data registers. | Aligns data bytes for subsequent SIMD instructions.
Dual 16-bit Add/Clip | Adds two 8-bit unsigned values to two 16-bit signed values, then clips the results to an 8-bit unsigned range. | Used primarily for video motion compensation algorithms.
Dual 16-bit Accumulator Extraction with Addition | Adds together the upper and lower halves of each accumulator and loads the results into a destination register. | Used with the Quad SAA instruction for motion estimation.
Quad 8-bit Add | Adds two unsigned quad-byte arrays. | Useful for providing the packed-data arithmetic typical of video processing applications.
Quad 8-bit Average – Byte | Computes the arithmetic average of two quad-byte arrays, byte by byte. | Supports the binary interpolation used in fractional motion search and motion estimation.
Quad 8-bit Average – Half-word | Computes the byte-wise arithmetic average of two quad-byte arrays, placing the results on half-word boundaries. | Supports the binary interpolation used in fractional motion search and motion estimation.
Quad 8-bit Pack | Packs four 8-bit values into a 32-bit register. | Prepares data for ALU operations.
Quad 8-bit Subtract | Subtracts two quad-byte arrays, byte by byte. | Provides packed-data arithmetic in video applications.
Quad 8-bit Subtract-Absolute-Accumulate | Subtracts four pairs of values, takes the absolute value of each difference, and accumulates the results. | Highly efficient for block-based video motion estimation.
Quad 8-bit Unpack | Copies four contiguous bytes from a pair of source registers. |
Let’s look at a few examples of how these instructions can be used.
The Quad 8-bit Subtract-Absolute-Accumulate (SAA) instruction is well-suited for block-based video motion estimation. The instruction subtracts four pairs of bytes, takes the absolute value of each difference, and accumulates the results. This all happens within a single cycle. The actual formula is shown below:

Accumulator += |a0 - b0| + |a1 - b1| + |a2 - b2| + |a3 - b3|
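In C, one SAA step can be modeled as below. The hardware performs the four subtractions, absolute values, and accumulations in a single cycle; this sketch only shows the arithmetic, not the parallelism.

```c
#include <stdint.h>
#include <stdlib.h>

/* C model of one Quad 8-bit Subtract-Absolute-Accumulate (SAA) step:
   four byte-wise absolute differences summed into an accumulator.
   On Blackfin, all four lanes execute in one cycle on four 8-bit ALUs. */
static uint32_t saa_step(const uint8_t a[4], const uint8_t b[4],
                         uint32_t acc)
{
    for (int i = 0; i < 4; i++)
        acc += (uint32_t)abs((int)a[i] - (int)b[i]);
    return acc;
}
```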
Consider the macroblocks shown in Figure 3a. The reference frame of 16 pixels x 16 pixels can be further divided into 4 groups. A very reasonable assumption is that neighboring video frames are correlated to each other. That is, if there is motion, then pieces of each frame will move in relation to macroblocks in previous frames. It takes less information to encode the movement of macroblocks than it does to encode each video frame as a separate entity -- MPEG compression uses this technique.
This motion detection of macroblocks decomposes into two basic steps. Given a reference macroblock in one frame, we can search all surrounding macroblocks (target macroblocks) in a subsequent frame to determine the closest match. The offset in location between the reference macroblock (in Frame n) and the best-matching target macroblock (in Frame n+1) is the motion vector.
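The two steps can be sketched in C as an exhaustive ("full") search: compute the sum of absolute differences (SAD) for every candidate macroblock in the search window, and report the offset of the best match as the motion vector. This is a reference model of the algorithm, with names of our own choosing; a production encoder would use the SAA instruction (and usually a faster search pattern) for the inner loop.

```c
#include <stdint.h>
#include <stdlib.h>

#define MB 16  /* macroblock dimension in pixels */

/* SAD between the reference macroblock at (rx, ry) in 'ref' and the
   candidate macroblock at (cx, cy) in 'tgt'; 'stride' is the frame
   width in pixels. */
static unsigned sad16x16(const uint8_t *ref, int rx, int ry,
                         const uint8_t *tgt, int cx, int cy, int stride)
{
    unsigned s = 0;
    for (int y = 0; y < MB; y++)
        for (int x = 0; x < MB; x++)
            s += (unsigned)abs((int)ref[(ry + y) * stride + rx + x] -
                               (int)tgt[(cy + y) * stride + cx + x]);
    return s;
}

/* Exhaustive search over a +/-range window: writes the motion vector of
   the best-matching target macroblock into (*mvx, *mvy). */
static void full_search(const uint8_t *ref, const uint8_t *tgt,
                        int stride, int height, int rx, int ry,
                        int range, int *mvx, int *mvy)
{
    unsigned best = ~0u;
    *mvx = *mvy = 0;
    for (int dy = -range; dy <= range; dy++)
        for (int dx = -range; dx <= range; dx++) {
            int cx = rx + dx, cy = ry + dy;
            if (cx < 0 || cy < 0 || cx + MB > stride || cy + MB > height)
                continue;  /* candidate must lie fully inside the frame */
            unsigned s = sad16x16(ref, rx, ry, tgt, cx, cy, stride);
            if (s < best) {
                best = s;
                *mvx = dx;
                *mvy = dy;
            }
        }
}
```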
Figure 3b shows how this can be visualized in a system.
• Circle = some object in a video frame
• Solid square = reference macroblock
• Dashed square = search area for possible macroblocks
• Dotted square = best-matching target macroblock (i.e., the one representing the motion vector of the circle object)
Figure 3: Illustration of Subtract-Absolute-Accumulate (SAA) instruction
The SAA instruction on a Blackfin processor is fast because it utilizes four 8-bit ALUs in each clock cycle. We can implement the following loop to iterate over each of the four entities shown in Figure 3b.
/* used in a loop that iterates over an image block */
SAA (R1:0, R3:2) || R1 = [I0++] || R2 = [I1++];      /* compute absolute difference and accumulate */
SAA (R1:0, R3:2) (R) || R0 = [I0++] || R3 = [I1++];
SAA (R1:0, R3:2) || R1 = [I0++M3] || R2 = [I1++M1];  /* after fetch of 4th word of target block, pointer is made to point to the next row */
SAA (R1:0, R3:2) (R) || R0 = [I0++] || R2 = [I1++];
Let’s now consider another example, the 4-Neighborhood Average computation whose basic kernel is shown in Figure 4a. Normally, four additions and one division (or multiplication or shift) are necessary to compute the average. The BYTEOP2P instruction can accelerate the implementation of this filter.
The value of the center pixel of Figure 4b is defined as:
x = Average(xN, xS, xE, xW)
The BYTEOP2P can perform this kind of average on two pixels (Figures 4c and 4d) in one cycle. So, if x1 = Average(x1N, x1S, x1E, x1W), and x2 = Average(x2N, x2S, x2E, x2W), then

R3 = BYTEOP2P(R1:0, R3:2);

will compute both pixel averages in a single cycle, assuming the x1 (N, S, E, W) information is stored in registers R1 and R0, and the x2 (N, S, E, W) data is sourced from R3 and R2.
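As a reference model, the whole-image version of this filter looks like the C sketch below: each interior output pixel is the truncated mean of its four neighbors. The border handling (pass-through) and function name are our own simplifications; BYTEOP2P itself just supplies the two-pixel averaging primitive inside such a loop.

```c
#include <stdint.h>

/* 4-neighborhood average filter: each interior output pixel is the
   truncated mean of its north, south, east, and west neighbors.
   Border pixels are copied unchanged in this sketch. */
static void neighborhood_average(const uint8_t *src, uint8_t *dst,
                                 int w, int h)
{
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            if (x == 0 || y == 0 || x == w - 1 || y == h - 1) {
                dst[y * w + x] = src[y * w + x];  /* pass border through */
                continue;
            }
            unsigned sum = src[(y - 1) * w + x] + src[(y + 1) * w + x]
                         + src[y * w + x - 1] + src[y * w + x + 1];
            dst[y * w + x] = (uint8_t)(sum / 4);
        }
}
```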
Figure 4: Neighborhood Average Computation
DMA Considerations
An embedded media processor with two-dimensional DMA (2D DMA) capability offers several system-level benefits. For starters, 2D DMA can facilitate transfers of macroblocks to and from external memory, allowing data manipulation as part of the actual transfer. This eliminates the overhead typically associated with transferring non-contiguous data. It can also allow the system to minimize data bandwidth by selectively transferring, say, only the desired region of an input image, instead of the entire image.
As another example, 2D DMA allows data to be placed into memory in a sequence more natural to processing. For example, as shown in Figure 5, RGB data may enter a processor’s L2 memory from a CCD sensor in interleaved RGB444 format, but using 2D DMA, it can be transferred to L3 memory in separate R, G and B planes. Interleaving/deinterleaving color space components for video and image data saves additional data moves prior to processing.
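The deinterleaving trick can be modeled with a generic 2D-transfer routine. The sketch below mimics the usual 2D DMA descriptor parameters (an inner count and stride, an outer count and stride; names and units are ours, simplified to byte elements) and then uses three such transfers to split interleaved RGB into planes. In hardware, the DMA controller performs these strided fetches with no core involvement.

```c
#include <stdint.h>

/* Software model of a 2D DMA transfer: copy 'xcount' elements per row,
   stepping 'xmod' bytes between them; after each row, apply 'ymod'
   instead of the final 'xmod', for 'ycount' rows.  Destination is
   written contiguously. */
static void dma_2d(const uint8_t *src, uint8_t *dst,
                   int xcount, int xmod, int ycount, int ymod)
{
    const uint8_t *p = src;
    for (int y = 0; y < ycount; y++) {
        for (int x = 0; x < xcount; x++) {
            *dst++ = *p;
            p += xmod;
        }
        p += ymod - xmod;  /* row end: ymod replaces the last xmod step */
    }
}

/* Deinterleave an RGB image into planar buffers: one 2D transfer per
   component, stepping 3 bytes between samples of the same component. */
static void rgb_to_planar(const uint8_t *rgb, uint8_t *r, uint8_t *g,
                          uint8_t *b, int w, int h)
{
    dma_2d(rgb + 0, r, w, 3, h, 3);
    dma_2d(rgb + 1, g, w, 3, h, 3);
    dma_2d(rgb + 2, b, w, 3, h, 3);
}
```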
Figure 5: Deinterleaving data with 2D DMA
Planar vs. Interleaved Buffer Formats
How do you decide whether to structure your memory buffers as interleaved or planar? The advantage to interleaved data is that it’s the natural output format of image sensors, and the natural input format for video encoders. However, planar buffers (that is, separate memory regions for each pixel component) are more effective structures for many video algorithms, since many of them (JPEG and MPEG included) work on luma and chroma channels separately. What’s more, accessing planar buffers in L3 is more efficient than striding through interleaved data, because the latency penalty for SDRAM page misses is spread out over a much larger sample size when the buffers are structured in a planar manner.
Double-Buffering
We have previously discussed the need for double-buffering as a means of ensuring that current data is not overwritten by new data until you’re ready for this to happen. Managing a video display buffer serves as a perfect example of this scheme. Normally, in systems involving different rates between source video and the final displayed content, it’s necessary to have a smooth switchover between the old content and the new video frame. This is accomplished using a double-buffer arrangement. One buffer points to the present video frame, which is sent to the display at a certain refresh rate. The second buffer fills with the newest output frame. When this latter buffer is full, a DMA interrupt signals that it’s time to output the new frame to the display. At this point, the first buffer starts filling with processed video for display, while the second buffer outputs the current display frame. The two buffers keep switching back and forth in a “ping-pong” arrangement.
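The ping-pong arrangement reduces to a single index flip. A minimal sketch (buffer sizes and helper names are illustrative; in a real system swap_buffers() would run in the DMA-complete interrupt handler):

```c
#include <stdint.h>

#define NBUF 2
#define FRAME_SIZE 4   /* tiny "frames" purely for illustration */

/* Ping-pong display buffering: the display DMA always reads the front
   buffer while processing fills the back buffer; on the frame-complete
   interrupt, the two exchange roles. */
static uint8_t frames[NBUF][FRAME_SIZE];
static int front = 0;   /* index of the buffer currently being displayed */

static uint8_t *front_buffer(void) { return frames[front]; }
static uint8_t *back_buffer(void)  { return frames[front ^ 1]; }
static void swap_buffers(void)     { front ^= 1; }
```

Extending NBUF beyond 2 (with front advancing modulo NBUF) gives the multi-buffer variant mentioned below, trading memory for synchronization margin.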
It should be noted that multiple buffers can be used, instead of just two, in order to provide more margin for synchronization, and to reduce the frequency of interrupts and their associated latencies.
So now we’ve covered some basic issues and features associated with efficient video data movement in embedded applications. In the final part of this series, we’ll extend these ideas into a “walkthrough” of a sample embedded video application.