Repost: the role of data alignment

This article explores the concept of memory access granularity, explains how processors access memory in chunks of different sizes, and discusses the importance of memory alignment. Through examples, it shows how unaligned memory access can degrade performance and even cause software failures.

Why I looked into this: one aspect of studying DMA support is the role of byte alignment. My boss specifically brought it up last time, so it was worth a proper look, and this material turned out to be quite interesting.

Data alignment: Straighten up and fly right

Align your data for speed and correctness

Memory access granularity

Programmers are conditioned to think of memory as a simple array of bytes. Among C and its descendants, char* is ubiquitous as meaning "a block of memory", and even Java™ has its byte[] type to represent raw memory.


Figure 1. How programmers see memory
How Programmers See Memory

However, your computer's processor does not read from and write to memory in byte-sized chunks. Instead, it accesses memory in two-, four-, eight-, 16-, or even 32-byte chunks. We'll call the size in which a processor accesses memory its memory access granularity.


Figure 2. How processors see memory
How Some Processors See Memory

The difference between how high-level programmers think of memory and how modern processors actually work with memory raises interesting issues that this article explores.

If you don't understand and address alignment issues in your software, the following scenarios, in increasing order of severity, are all possible:

  • Your software will run slower.
  • Your application will lock up.
  • Your operating system will crash.
  • Your software will silently fail, yielding incorrect results.

Alignment fundamentals

To illustrate the principles behind alignment, examine a constant task, and how it's affected by a processor's memory access granularity. The task is simple: first read four bytes from address 0 into the processor's register. Then read four bytes from address 1 into the same register.

First examine what would happen on a processor with a one-byte memory access granularity:


Figure 3. Single-byte memory access granularity
Single-byte memory access granularity

This fits in with the naive programmer's model of how memory works: it takes the same four memory accesses to read from address 0 as it does from address 1. Now see what would happen on a processor with two-byte granularity, like the original 68000:


Figure 4. Double-byte memory access granularity
Double-byte memory access granularity

When reading from address 0, a processor with two-byte granularity takes half the number of memory accesses as a processor with one-byte granularity. Because each memory access entails a fixed amount of overhead, minimizing the number of accesses can really help performance.

However, notice what happens when reading from address 1. Because the address doesn't fall evenly on the processor's memory access boundary, the processor has extra work to do. Such an address is known as an unaligned address. Because address 1 is unaligned, a processor with two-byte granularity must perform an extra memory access, slowing down the operation.
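For a power-of-two granularity, the aligned/unaligned distinction boils down to a single mask test: an address is aligned exactly when its low bits are zero. A minimal sketch (the helper name is ours, not from the article):

```c
#include <stdint.h>

/* Nonzero if addr falls on a granularity-byte boundary.
   granularity must be a power of two. */
static int is_aligned( uintptr_t addr, uintptr_t granularity ) {
    return ( addr & ( granularity - 1 ) ) == 0;
}
```

With two-byte granularity, address 0 passes this test and address 1 fails it, which is precisely why the second read costs more.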

Finally, examine what would happen on a processor with four-byte memory access granularity, like the 68030 or PowerPC® 601:


Figure 5. Quad-byte memory access granularity
Quad-byte memory access granularity

A processor with four-byte granularity can slurp up four bytes from an aligned address with one read. Also note that reading from an unaligned address doubles the access count.

Now that you understand the fundamentals behind aligned data access, you can explore some of the issues related to alignment.


Lazy processors

A processor has to perform some tricks when instructed to access an unaligned address. Going back to the example of reading four bytes from address 1 on a processor with four-byte granularity, you can work out exactly what needs to be done:


Figure 6. How processors handle unaligned memory access
How processors handle unaligned memory access

The processor needs to read the first chunk of the unaligned address and shift out the "unwanted" bytes from the first chunk. Then it needs to read the second chunk of the unaligned address and shift out some of its information. Finally, the two are merged together for placement in the register. It's a lot of work.
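Expressed in software, the same dance looks roughly like the sketch below. It assumes a little-endian byte order within each word and a four-byte access granularity, and the function name is ours; it is an illustration of the shift-and-merge steps, not how any particular processor microcodes them.

```c
#include <stdint.h>

/* Read a 32-bit value from a possibly unaligned address using only
   aligned four-byte loads: read both covering chunks, shift out the
   unwanted bytes, and merge. Assumes the second chunk is readable. */
static uint32_t load32_unaligned( const uint8_t *p ) {
    uintptr_t addr    = (uintptr_t) p;
    uintptr_t aligned = addr & ~(uintptr_t)3;      /* round down to 4-byte boundary */
    uint32_t  shift   = (uint32_t)(addr - aligned) * 8;

    const uint32_t *chunk = (const uint32_t *) aligned;
    if( shift == 0 )
        return chunk[0];                           /* aligned: one read suffices */

    uint32_t lo = chunk[0] >> shift;               /* wanted bytes of first chunk */
    uint32_t hi = chunk[1] << (32 - shift);        /* wanted bytes of second chunk */
    return hi | lo;                                /* merge into one register */
}
```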

Some processors just aren't willing to do all of that work for you.

The original 68000 was a processor with two-byte granularity and lacked the circuitry to cope with unaligned addresses. When presented with such an address, the processor would throw an exception. The original Mac OS didn't take very kindly to this exception, and would usually demand the user restart the machine. Ouch.

Later processors in the 680x0 series, such as the 68020, lifted this restriction and performed the necessary work for you. This explains why some old software that works on the 68020 crashes on the 68000. It also explains why, way back when, some old Mac coders initialized pointers with odd addresses. On the original Mac, if the pointer was accessed without being reassigned to a valid address, the Mac would immediately drop into the debugger. Often they could then examine the calling chain stack and figure out where the mistake was.

All processors have a finite number of transistors to get work done. Adding unaligned address access support cuts into this "transistor budget." These transistors could otherwise be used to make other portions of the processor work faster, or add new functionality altogether.

An example of a processor that sacrifices unaligned address access support in the name of speed is MIPS. MIPS is a great example of a processor that does away with almost all frivolity in the name of getting real work done faster.

The PowerPC takes a hybrid approach. Every PowerPC processor to date has hardware support for unaligned 32-bit integer access. While you still pay a performance penalty for unaligned access, it tends to be small.

On the other hand, modern PowerPC processors lack hardware support for unaligned 64-bit floating-point access. When asked to load an unaligned floating-point number from memory, modern PowerPC processors will throw an exception and have the operating system perform the alignment chores in software. Performing alignment in software is much slower than performing it in hardware.


Speed

Writing some tests illustrates the performance penalties of unaligned memory access. The test is simple: you read, negate, and write back the numbers in a ten-megabyte buffer. These tests have two variables:

  1. The size, in bytes, in which you process the buffer. First you'll process the buffer one byte at a time. Then you'll move onto two-, four- and eight-bytes at a time.
  2. The alignment of the buffer. You'll stagger the alignment of the buffer by incrementing the pointer to the buffer and running each test again.

These tests were performed on a 800 MHz PowerBook G4. To help normalize performance fluctuations from interrupt processing, each test was run ten times, keeping the average of the runs. First up is the test that operates on a single byte at a time:
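The staggering described above can be sketched as follows. The harness names and the trivial byte-negating munge are ours, and the timing calls are omitted for brevity:

```c
#include <stdint.h>
#include <stdlib.h>

enum { kBufferSize = 10 * 1024 * 1024, kMaxOffset = 8 };

/* A trivial munge for demonstration: negate every byte in place. */
static void MungeBytes( void *data, uint32_t size ) {
    uint8_t *p = (uint8_t*) data, *end = p + size;
    while( p != end ) { *p = (uint8_t) -*p; ++p; }
}

/* Run `munge` at every offset from an aligned base, exercising each
   alignment in turn. Returns the number of runs (0 if allocation fails). */
static int RunStaggeredTests( void (*munge)( void *, uint32_t ) ) {
    uint8_t *base = calloc( kBufferSize + kMaxOffset, 1 );
    int runs = 0;
    if( base == NULL ) return 0;
    for( int offset = 0; offset < kMaxOffset; ++offset ) {
        munge( base + offset, kBufferSize );  /* time this call in practice */
        ++runs;
    }
    free( base );
    return runs;
}
```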


Listing 1. Munging data one byte at a time
#include <stdint.h> /* for uint8_t, uint32_t */

void Munge8( void *data, uint32_t size ) {
    uint8_t *data8 = (uint8_t*) data;
    uint8_t *data8End = data8 + size;
    
    while( data8 != data8End ) {
        *data8 = -*data8; /* Split from "*data8++ = -*data8", whose
                             unsequenced read and write are undefined in C. */
        ++data8;
    }
}

It took an average of 67,364 microseconds to execute this function. Now modify it to work on two bytes at a time instead of one byte at a time -- which will halve the number of memory accesses:


Listing 2. Munging data two bytes at a time
void Munge16( void *data, uint32_t size ) {
    uint16_t *data16 = (uint16_t*) data;
    uint16_t *data16End = data16 + (size >> 1); /* Divide size by 2. */
    uint8_t *data8 = (uint8_t*) data16End;
    uint8_t *data8End = data8 + (size & 0x00000001); /* Strip upper 31 bits. */
    
    while( data16 != data16End ) {
        *data16 = -*data16; /* Negate in place, then advance. */
        ++data16;
    }
    while( data8 != data8End ) {
        *data8 = -*data8;
        ++data8;
    }
}

This function took 48,765 microseconds to process the same ten-megabyte buffer -- 38% faster than Munge8. However, that buffer was aligned. If the buffer is unaligned, the time required increases to 66,385 microseconds -- about a 27% speed penalty. The following chart illustrates the performance pattern of aligned memory accesses versus unaligned accesses:


Figure 7. Single-byte access versus double-byte access
Single-byte access versus double-byte access

The first thing you notice is that accessing memory one byte at a time is uniformly slow. The second item of interest is that when accessing memory two bytes at a time, whenever the address is not evenly divisible by two, that 27% speed penalty rears its ugly head.

Now up the ante, and process the buffer four bytes at a time:


Listing 3. Munging data four bytes at a time
void Munge32( void *data, uint32_t size ) {
    uint32_t *data32 = (uint32_t*) data;
    uint32_t *data32End = data32 + (size >> 2); /* Divide size by 4. */
    uint8_t *data8 = (uint8_t*) data32End;
    uint8_t *data8End = data8 + (size & 0x00000003); /* Strip upper 30 bits. */
    
    while( data32 != data32End ) {
        *data32++ = -*data32;
    }
    while( data8 != data8End ) {
        *data8++ = -*data8;
    }
}

This function processes an aligned buffer in 43,043 microseconds and an unaligned buffer in 55,775 microseconds. Thus, on this test machine, accessing unaligned memory four bytes at a time is slower than accessing aligned memory two bytes at a time:


Figure 8. Single- versus double- versus quad-byte access
Single- versus double- versus quad-byte access

Now for the horror story: processing the buffer eight bytes at a time.


Listing 4. Munging data eight bytes at a time
void Munge64( void *data, uint32_t size ) {
    double *data64 = (double*) data;
    double *data64End = data64 + (size >> 3); /* Divide size by 8. */
    uint8_t *data8 = (uint8_t*) data64End;
    uint8_t *data8End = data8 + (size & 0x00000007); /* Strip upper 29 bits. */
    
    while( data64 != data64End ) {
        *data64++ = -*data64;
    }
    while( data8 != data8End ) {
        *data8++ = -*data8;
    }
}

Munge64 processes an aligned buffer in 39,085 microseconds -- about 10% faster than processing the buffer four bytes at a time. However, processing an unaligned buffer takes an amazing 1,841,155 microseconds -- two orders of magnitude slower than aligned access, an outstanding 4,610% performance penalty!

What happened? Because modern PowerPC processors lack hardware support for unaligned floating-point access, the processor throws an exception for each unaligned access. The operating system catches this exception and performs the alignment in software. Here's a chart illustrating the penalty, and when it occurs:


Figure 9. Multiple-byte access comparison
Multiple-byte access comparison

The penalties for one-, two-, and four-byte unaligned access are dwarfed by the horrendous unaligned eight-byte penalty. Perhaps this chart, with the top of the scale removed (and thus the tremendous gulf between the two numbers), will be clearer:


Figure 10. Multiple-byte access comparison #2
Multiple-byte access comparison #2

There's another subtle insight hidden in this data. Compare eight-byte access speeds on four-byte boundaries:


Figure 11. Multiple-byte access comparison #3
Multiple-byte access comparison #3

Notice that accessing memory eight bytes at a time on four- and twelve-byte boundaries is slower than reading the same memory four or even two bytes at a time. While PowerPCs have hardware support for four-byte-aligned eight-byte doubles, you still pay a performance penalty if you use that support. Granted, it's nowhere near the 4,610% penalty, but it's certainly noticeable. Moral of the story: accessing memory in large chunks can be slower than accessing memory in small chunks, if that access is not aligned.


Atomicity

All modern processors offer atomic instructions. These special instructions are crucial for synchronizing two or more concurrent tasks. As the name implies, atomic instructions must be indivisible -- that's why they're so handy for synchronization: they can't be preempted.

It turns out that in order for atomic instructions to perform correctly, the addresses you pass them must be at least four-byte aligned. This is because of a subtle interaction between atomic instructions and virtual memory.

If an address is unaligned, it requires at least two memory accesses. But what happens if the desired data spans two pages of virtual memory? This could lead to a situation where the first page is resident while the last page is not. Upon access, in the middle of the instruction, a page fault would be generated, executing the virtual memory management swap-in code, destroying the atomicity of the instruction. To keep things simple and correct, both the 68K and PowerPC require that atomically manipulated addresses always be at least four-byte aligned.

Unfortunately, the PowerPC does not throw an exception when atomically storing to an unaligned address. Instead, the store simply always fails. This is bad because most atomic functions are written to retry upon a failed store, under the assumption that they were preempted. These two circumstances combine so that your program goes into an infinite loop if you attempt to atomically store to an unaligned address. Oops.
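One defensive pattern (ours, expressed with portable C11 atomics rather than the PowerPC's reservation instructions) is to reject an unaligned address before ever entering the retry loop:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Refuse to store atomically through an unaligned pointer instead of
   risking the infinite-retry behavior described above.
   Returns 1 on success, 0 if the address is not four-byte aligned. */
static int atomic_store32_checked( _Atomic uint32_t *target, uint32_t value ) {
    if( ( (uintptr_t) target & 3 ) != 0 )
        return 0;
    atomic_store( target, value );
    return 1;
}
```

In practice a compiler will align an `_Atomic uint32_t` correctly on its own; the check matters when an address has been manufactured by casting or pointer arithmetic.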


Altivec

Altivec is all about speed. Unaligned memory access slows down the processor and costs precious transistors. Thus, the Altivec engineers took a page from the MIPS playbook and simply don't support unaligned memory access. Because Altivec works with sixteen-byte chunks at a time, all addresses passed to Altivec must be sixteen-byte aligned. What's scary is what happens if your address is not aligned.

Altivec won't throw an exception to warn you about the unaligned address. Instead, Altivec simply ignores the lower four bits of the address and charges ahead, operating on the wrong address. This means your program may silently corrupt memory or return incorrect results if you don't explicitly make sure all your data is aligned.

There is an advantage to Altivec's bit-stripping ways. Because you don't need to explicitly truncate (align-down) an address, this behavior can save you an instruction or two when handing addresses to the processor.
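Truncating an address to a sixteen-byte boundary is a single mask operation, which is exactly what Altivec's bit-stripping gives you for free. As a sketch (the helper name is ours):

```c
#include <stdint.h>

/* Align an address down to a sixteen-byte boundary by stripping the
   low four bits -- the truncation Altivec performs implicitly. */
static uintptr_t align_down_16( uintptr_t addr ) {
    return addr & ~(uintptr_t)15;
}
```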

This is not to say Altivec can't process unaligned memory. You can find detailed instructions on how to do so in the Altivec Programming Environments Manual (see Resources). It requires more work, but because memory is so slow compared to the processor, the overhead for such shenanigans is surprisingly low.


Structure alignment

Examine the following structure:


Listing 5. An innocent structure
typedef struct {
    char    a;
    long    b;
    char    c;
}   Struct;

What is the size of this structure in bytes? Many programmers will answer "6 bytes." It makes sense: one byte for a, four bytes for b and another byte for c. 1 + 4 + 1 equals 6. Here's how it would lay out in memory:

Field Type   Field Name   Field Offset   Field Size   Field End
char         a            0              1            1
long         b            1              4            5
char         c            5              1            6

Total Size in Bytes: 6

However, if you were to ask your compiler for sizeof( Struct ), chances are the answer you'd get back would be greater than six, perhaps eight or even twenty-four. There are two reasons for this: backwards compatibility and efficiency.

First, backwards compatibility. Remember the 68000 was a processor with two-byte memory access granularity, and would throw an exception upon encountering an odd address. If you were to read from or write to field b, you'd attempt to access an odd address. If a debugger weren't installed, the old Mac OS would throw up a System Error dialog box with one button: Restart. Yikes!

So, instead of laying out your fields just the way you wrote them, the compiler padded the structure so that b and c would reside at even addresses:

Field Type   Field Name   Field Offset   Field Size   Field End
char         a            0              1            1
(padding)                 1              1            2
long         b            2              4            6
char         c            6              1            7
(padding)                 7              1            8

Total Size in Bytes: 8

Padding is the act of adding otherwise unused space to a structure to make fields line up in a desired way. Now, when the 68020 came out with built-in hardware support for unaligned memory access, this padding was unnecessary. However, it didn't hurt anything, and it even helped a little in performance.

The second reason is efficiency. Nowadays, on PowerPC machines, two-byte alignment is nice, but four-byte or eight-byte is better. You probably don't care anymore that the original 68000 choked on unaligned structures, but you probably care about potential 4,610% performance penalties, which can happen if a double field doesn't sit aligned in a structure of your devising.
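You can check the padding story empirically with sizeof and offsetof. The exact numbers depend on the compiler and target ABI, so the sketch below (type names ours) asserts only what holds everywhere: putting the widest field first never makes the structure bigger, and the compiler always places the long at an address that satisfies its alignment.

```c
#include <stddef.h>

/* The layout from Listing 5, plus a variant with the widest field first.
   Sizes are ABI-dependent; reordering typically shrinks the padding. */
typedef struct { char a; long b; char c; } PaddedStruct;
typedef struct { long b; char a; char c; } ReorderedStruct;

size_t padded_size( void )    { return sizeof( PaddedStruct ); }
size_t reordered_size( void ) { return sizeof( ReorderedStruct ); }
```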


Conclusion

If you don't understand and explicitly code for data alignment:

  • Your software may hit performance-killing unaligned memory access exceptions, which invoke very expensive alignment exception handlers.
  • Your application may attempt to atomically store to an unaligned address, causing your application to lock up.
  • Your application may attempt to pass an unaligned address to Altivec, resulting in Altivec reading from and/or writing to the wrong part of memory, silently corrupting data or yielding incorrect results.

Credits

Thanks to Alex Rosenberg and Ian Ollmann for feedback, Matt Slot for his FastTimes timing library, and Duane Hayes for providing a bevy of testing machines.


Resources

About the author

Jonathan 'Wolf' Rentzsch runs Red Shed Software, a small Illinois software boutique. He also leads PSIG, a suburban Mac programmer group, and co-hosts CAWUG, a downtown Mac & WebObjects programmer group.

 

From: http://www.ibm.com/developerworks/library/pa-dalign/
