Using SIMD Instructions For Image Processing

本文介绍如何使用SIMD指令集(如SSE2、Altivec)提高图像处理性能,包括亮度调整及YUV422到RGB转换,并提供不同平台上的性能对比。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

转自:http://www.mikekohn.net/stuff/image_processing.php

 

Using SIMD Instructions For Image Processing

Related pages on www.mikekohn.net: SSE Image Processing, GIF, TIFF, BMP/RLE, JPEG, AVI, kunzip, gif2avi, Ringtone Tools, yuv2rgb, RTSP

Contents

Introduction
Explanation Of Brightness
Explanation Of YUV422 To RGB
How SIMD Helps Brightness
How SIMD Helps YUV422 to RGB
C Brightness Source
ASM Brightness Source
SIMD on XScale
Altivec SIMD on PPC
Performance Numbers
Download Full Source Code

Introduction

This page is an example of how to use SSE (actually more correctly SSE2) that exists in Pentium4 and AMD64 CPU's to improve performance of image processing functions that increase brightness and YUV422 to RGB conversion. I also have examples here of how to use Altivec on the PowerPC CPU and WMMX on Xscale.

SSE (basically a 128 bit version of the 64 bit MMX instruction set) is Intel's SIMD (Single Instruction, Multiple Data) instruction set included on Pentium III and extended to SSE2 on Pentium 4's. SSE2 adds integer math to SSE's floating point processor.

All assembly code posted on this page is written by me and anyone is free to cut and past it and use it in their own programs. The asm code here assembles with the NASM Assembler.

Btw, if you'd like to play around with YUV colorspace, I created a small Javascript program that can convert between RGB and YUV and display what the color values look like on this page: http://www.mikekohn.net/file_formats/yuv_rgb_converter.php.

Also, if you came here interested in graphics, you might like some of my other projects: FPGA VGA, SX VGA, and Atmel VGA.

Explanation Of Brightness

The brighter test program here reads in a BMP file (color or not color) and converts it to black and white. The image data is stored in a buffer of width*height bytes where each byte represents the brightness of each pixel (0 is black, 255 is white, and all numbers inbetween are shades of gray).

To increase the brightness of an image, the value of every byte in the image buffer is increased by some value. To make the image darker, every byte in the buffer is decreased in value. For a color image, to be technically correct, the image needs to be in YUV format and the Y portion can be treated like a black and white image using this function. If this function was used on a standard RGB buffer, I don't think it would work properly, especially at the saturation points of the buffer, but it's worth a try :). Converting to and from YUV from RGB is pretty computationally expensive.

Explanation Of YUV422 to RGB

YUV is another colorspace that can be used to represent image. An explanation of YUV can be found on Wikipedia here: http://en.wikipedia.org/wiki/YUV. YUV422 planer represents Y as single bytes in the top part of the buffer, while U is represented next at 1/2 the resolution of Y, and V is represented last at 1/2 the resolution of Y. For every 2 Y (brightness) bytes, there is 1 U (color) and 1 V (color).

I wrote 3 different versions of YUV422 to RGB. The first one follows the exact formula of YUV as described on Wikipedia, the second uses a integer/shifting trick to get rid of some of the multiplication and floating points, and the third is based on the floating point version but written in total assembly language using SSE. I was actually able to almost double the speed of the original integer/shift version by using some simple lookup tables to get rid of all the multiplication and saturation, but I haven't posted that version :). Maybe one of these days I'll try an SSE integer version too.

How SIMD helps Brightness

SSE adds eight 128 bit registers to the x86 instruction set. These registers can do load/store operations to and from memory 128 bit at a time (well, in one instruction at least), but when doing math operations the register gets divided up into either sixteen single bytes, eight 16 bit words, four 32 bit double-words, four single precision floats, or 2 double precision floats.

In the brightness example, every single byte of the xmm1 register is loaded with the same single byte value. In the brightness loop, xmm0 is loaded with the next 16 bytes in the buffer. Using paddusb (parallel add unsigned bytes with saturation), xmm1 is added to xmm0. Every byte of the xmm1 register is added to every byte of the xmm0 register. Because paddusb uses "saturation" if the resulting byte would overflow it's simply set to 255. The xmm0 is then written back to memory.

Example:

If the value passed to the function was 3, and the memory at the start of the image was 00 01 02 03 04 05 06 07 248 249 250 251 252 253 254 255 (a 16 pixel gradient from black to white):

xmm1 = 0x03030303030303030303030303030303

After the movdqa xmm0, [edi] instruction:

xmm0 = 0x0706050403020100fffefdfcfbfaf9f8

After paddusb xmm0, xmm1

xmm0 = 0x0a 09 08 07 06 05 04 03 ff ff ff ff fe fd fc fb

After movdqa [edi], xmm0  the memory at the address pointed to by edi would be:

03 04 05 06 07 08 09 10 251 252 253 254 255 255 255 255

Things to remember:

  • When using movdqa, memory addresses read/written must be a multiple of 16 (aka aligned on a 16 byte boundary). Otherwise the movdqu (which is a bit slower than movdqa) must be used.
  • SSE still follows little endian byte ordering. This means if you read in 16 bytes, the SSE register will hold the bytes in reverse order of what they were in memory. In this example it doesn't hurt anything, but in other programs it might make life very difficult. This is one of the things I hate most about the Intel CPU :(.
  • Memory load/store operations are quite slow. On a Pentium 4 with a 64 bit databus it should take 2 memory cycles to load an SSE register. Avoiding load/store operations will increase overall performance.

How SIMD helps YUV422 to RGB

For the YUV422, I use SSE process 4 pixels at one time. I set up "vectors" of 4 floating points. In my example I have the following vectors:

VecY=(Y0, Y1, Y2, Y3)
VecU=(U0, U0, U1, U1)
VecV=(V0, V0, V1, V1)
VecConst1=(1.13983, 1.13983, 1.13983, 1.13983)
VecConst2=(-0.39466, -0.39466, -0.39466, -0.39466) 
VecConst3=(-0.58060, -0.58060, -0.58060, -0.58060)
VecConst4=(2.03211, 2.03211, 2.03211, 2.03211)
VectN128=(-128, -128, -128, -128)
Vec255=(255, 255, 255, 255)
Vec0=(0, 0, 0, 0)

So using the YUV to RGB formulas as described on the Wikipedia page, it's pretty simple to do the math on all the vectors in assembly language get the RGB pixels. In the image_proc download at the bottom of the page, the image_proc_sse.asm has a pretty well commented example of the finished function. Btw, I haven't finished optimizing this function yet, so I might get some more speed out of it later :).

 

C Version Of Brightness

void brightness(unsigned char *buffer, int len, int v)
{
int t,r;

  if (v>0)
  {
    for (t=0; t<len; t++)
    {
      r=buffer[t]+v;
      if (r>255) r=255;
      buffer[t]=r;
    }
  }
    else
  {
    for (t=0; t<len; t++)
    {
      r=buffer[t]+v;
      if (r<0) r=0;
      buffer[t]=r;
    }
  }
}

SSE2 Version Of Brightness

global brightness_sse

section .code
bits 32

; void brightness_sse(unsigned char *image, int len, int v)

brightness_sse:
  push ebp
  push edi
  mov ebp, esp
  mov edi, [ebp+12]   ; unsigned char *image
  mov ecx, [ebp+16]   ; int len
  mov eax, [ebp+20]   ; int v

  jle bright_not_neg  ; check if v is negative
  neg al               ; make al abs(v)

bright_not_neg:

  shr ecx, 4          ; count = image_len / 16

  mov ah, al          ; make xmm1 =  (v,v,v,v ,v,v,v,v, ,v,v,v,v, v,v,v,v)
  pinsrw xmm1, ax, 0
  pinsrw xmm1, ax, 1
  pinsrw xmm1, ax, 2
  pinsrw xmm1, ax, 3
  pinsrw xmm1, ax, 4
  pinsrw xmm1, ax, 5
  pinsrw xmm1, ax, 6
  pinsrw xmm1, ax, 7

  test eax, 0xff000000    ; if v was negative, make it darker by abs(v)
  jnz dark_loop

bright_loop:
  movdqa xmm0, [edi]     ; for every 16 byte chunks, add v to all 16 bytes
  paddusb xmm0, xmm1     ; paddusb adds each 16 bytes of xmm0 by v but
  movdqa [edi], xmm0     ; if the byte overflows (more than 255) set to 255

  add edi, 16            ; ptr=ptr+16
  loop bright_loop       ; while (count>0)
  jmp bright_exit

dark_loop:
  movdqa xmm0, [edi]     ; same as above but subtract v from each of the
  psubusb xmm0, xmm1     ; 16 bytes that make up xmm0.  if a byte will
  movdqa [edi], xmm0     ; become negative, set it to 0 (saturation)

  add edi, 16            ; ptr=ptr+16
  loop dark_loop         ; while (count>0)

bright_exit:

  pop edi
  pop ebp
  ret                    ; return

XScale

The Gumstix Verdex embedded computer with an ARM XScale cpu has a wmmx instruction set. I actually had one of these at my last job so I was able to test a little code with it. If anyone would like to donate one of these platforms to me so I can continue testing on it, let me know :).

Atmel AVR32

I got one of these Atmel ATNGW100 embedded computer boards with an AT32AP7000 cpu on it. It's an AVR32 with some SIMD and DSP instructions. I ran the non-SIMD version of brightness on it and I'm trying to figure out how to get ahold of the compiler intrinsics so I can make an SIMD version to test.

Altivec on PowerPC

I've started translating the SSE/x86 code to Altivec/PowerPC for MacOSX and the Cell CPU found in the Playstation 3. After benchmarking this the C code on Playstation 3 Linux, I was kinda disappointed with the results, so I translated it to straight PPC assembly and PPC+Altivec. Unfortunately, I did all the development on MacOSX using the "as" assembler which doesn't appear to be compatible with the "as" assembler on Playstation 3 Linux, so I have to rewrite it. The benchmark on the Mac G4 looks pretty good tho, I'll post the results soon.

In the future i'm hoping to translate the code to one of the Cell's SPU's. I also plan on adding Altivec YUV422 to RGB.

Source Code: image_proc_altivec.asm

Performance

The following table shows the difference between the C and SSE2 version of the brighter() function. The time represents how long it took to read in the bmp, call the brighter routine 100,000 times, and then write out a modified bmp. Note: Performance differences could be due to memory bus speed and not to processor speed. I can't remember what speed of memory are in these two boxes, but the AMD64 box is a laptop which typically have slower memory.

Brightness Adjust (100,000 iterations on a 352x240 image)

PlatformC VersionASM/SIMDCompilerFlags
Linux/AMD64 1.8GHz20.4s2.8sgcc-3.4.6-march=k8 -m32
Linux/AMD64 1.8GHz20.6sn/agcc-3.4.6-march=k8 -m64
Windows/Penium4 3.02GHz16.7s0.9sgcc-3.4.2-march=pentium4
Sun Netra 400MHz UltraSparc IIi3m2s-gcc 4.1.1-m32 -mtune=ultrasparc
Sun Netra 400MHz UltraSparc IIi1m36s-Sun cc (ss8)-xO5 -xarch=v9 (64 bit)
Gumstix Verdex 600MHz XScale3m55s2m48sgcc-4.1.1-march=iwmmxt
AVR32 AP7000 120MHz10m44s-gcc-4.2.1-O3 -fno-common
Broadcom BCM3302 200MHz
Linksys OpenWRT
13m20s-gcc-3.4.4-O2
Dec Alpha 500MHz PWS 500au2m43s-gcc 3.4.6-O2 -mcpu=ev56
Dec Alpha 500MHz PWS 500au2m5s-cc-O2 -arch ev56

YUV422 to RGB (10,000 iterations on a 704x480 image)

PlatformC floatC integerASM/SSE2CompilerFlags
Linux/AMD64 1.8GHz2m4s43s56sgcc-3.4.6-O2 -march=k8 -m32
Linux/AMD64 1.8GHz1m2s34.8sn/agcc-3.4.6-O2 -march=k8 -m64
Windows/Penium4 3.02GHz2m15s59s50sgcc-3.4.2-O2 -march=pentium4
Sun Netra 400MHz UltraSparc IIi11m4s3m59s-gcc 4.1.1-O2 -m32 -mtune=ultrasparc
Sun Netra 400MHz UltraSparc IIi9m9s4m7s-Sun cc (ss8)-xO5 -xarch=v9 (64 bit)
Dec Alpha 500MHz PWS 500au8m54s3m12s-gcc 3.4.6-O2 -mcpu=ev56
Dec Alpha 500MHz PWS 500au5m34s2m55s-cc-O2 -arch ev56

Note: -m32 tells gcc to compile for a 32 bit cpu while -m64 says to compile for 64 bit

 

I made a multithreaded version of the yuv2rgb.c. It breaks up the 10,000 interations over multiple threads.

YUV422 to RGB (10,000 iterations on a 704x480 image) 64 bit compiled C code only

PlatformFloatIntegerThreadsFlags
intel core2quad Q6600 2.4GHz
1066MHz memory
Linux 2.6
17.4s12.5s2gcc-4.1.3 -march=nocona
intel core2quad Q6600 2.4GHz
1066MHz memory
Linux 2.6
8.7s6.2s4gcc-4.1.3 -march=nocona
MacMini intel core2 2GHz
667MHz memory
MacOSX 10.5
58.5s47.5s1gcc-4.0.1 -march=nocona
MacMini intel core2 2GHz
667MHz memory
MacOSX 10.5
29.4s23.9s2gcc-4.0.1 -march=nocona

Download

image_proc-2007-04-23.tar.gz
image_proc-2007-04-23.zip

### 关于SIMD指令在计算机架构和编程中的应用 #### SIMD指令简介 现代处理器通过引入单指令多数据流(SIMD)技术来提升处理效率。这种技术允许一条指令同时对多个数据项执行相同的操作,从而显著提高计算密集型任务的性能[^1]。 #### 编程支持 对于开发者而言,在C++这样的高级语言中利用SIMD并不意味着必须编写汇编代码。Clang以及GCC等主流编译器提供了内置函数(intrinsics),使得程序员能够更方便地调用底层硬件特性而无需深入了解具体实现细节。 ```cpp #include <immintrin.h> // 使用AVX intrinsic进行向量化加法运算 __m256 vec_a = _mm256_loadu_ps(a); __m256 vec_b = _mm256_loadu_ps(b); __m256 result_vec = _mm256_add_ps(vec_a, vec_b); _mm256_storeu_ps(result, result_vec); ``` 上述例子展示了如何借助Intel AVX扩展集提供的内建函数完成一次8个float类型的并行相加操作。这里`_mm256_*`系列函数即为针对特定平台优化过的SIMD指令封装形式之一。 #### 架构层面的支持与发展 从体系结构角度来看,为了更好地发挥SIMD的优势,一些新的尝试如VLIW(Very Long Instruction Word)和EPIC(Explicitly Parallel Instruction Computing)被提出。这些设计思路试图简化硬件逻辑的同时增加每周期内的并发度,不过这也给编译工具链带来了更大挑战——它们需要更加智能化地安排指令调度以充分利用资源[^3]。 尽管如此,随着摩尔定律逐渐失效,探索更多维度上的并行化成为必然趋势;而在这一过程中,SIMD作为基础构建模块将继续扮演重要角色。
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值