19. ARM NEON Coprocessor and Code Optimization in Detail

js777

Licensed under CC 4.0 BY-SA

Column: Exploring 64-bit ARM Assembly. Tags: NEON coprocessor, ARM processor, vector distance calculation

Article link: https://blog.youkuaiyun.com/js777/article/details/151852231

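1. Introduction to the NEON Coprocessor

The NEON coprocessor supports integer arithmetic, including logical operations such as AND, BIC, and ORR, along with a set of comparison operations. Its instruction set also contains many instructions aimed at specific algorithms. For example, it directly supports polynomials over the binary ring, which helps certain classes of cryptographic algorithms.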
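To make "polynomials over the binary ring" concrete, here is a minimal C sketch of the arithmetic involved. It is a portable model, not NEON code, and the function name `clmul8` is only illustrative. Each bit of a byte is treated as one polynomial coefficient, and coefficients are added with XOR, so no carries move between bit positions. NEON's polynomial multiply instructions (PMUL and PMULL) perform this kind of multiplication in hardware.

```c
#include <stdint.h>

/* Multiply two polynomials over GF(2), each packed into 8 bits
 * (bit i holds the coefficient of x^i). Coefficient addition is XOR,
 * so partial products are combined without any carry propagation. */
uint16_t clmul8(uint8_t a, uint8_t b)
{
    uint16_t result = 0;
    for (int i = 0; i < 8; i++) {
        if ((b >> i) & 1u) {
            /* add (XOR in) a * x^i */
            result ^= (uint16_t)((uint16_t)a << i);
        }
    }
    return result;
}
```

For instance, `clmul8(3, 3)` yields 5, because (x + 1)(x + 1) = x^2 + 1 once the two middle terms cancel under XOR, whereas ordinary integer multiplication would give 9.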

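2. Computing the Distance Between 4D Vectors

To compute the distance between two four-dimensional (4D) vectors, we extend the earlier distance example. The formula generalizes to any number of dimensions: add the squared difference of each extra dimension under the square root.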
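For reference, the general n-dimensional form can be written as follows. The symbols $x_i$ and $y_i$ are illustrative names for the components of the two points.

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

With $n = 4$ this expands to $\sqrt{(x_1-y_1)^2 + (x_2-y_2)^2 + (x_3-y_3)^2 + (x_4-y_4)^2}$.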

The following distance.s uses the NEON coprocessor:

//
// Example function to calculate the distance
// between two 4D points in single precision
// floating-point using the NEON Processor
//
// Inputs:
//    X0 - pointer to the 8 FP numbers
//           they are (x1, x2, x3, x4),
//                   (y1, y2, y3, y4)
// Outputs:
//    W0 - the length (as single precision FP)
.global distance // Allow function to be called by others
//
distance:
      // load both points (4 floats each) at once
      LDP   Q2, Q3, [X0]
      // calc V1 = V2 - V3
      FSUB  V1.4S, V2.4S, V3.4S
      // calc V1 = V1 * V1 = (xi-yi)^2
      FMUL  V1.4S, V1.4S, V1.4S
      // sum the four lanes of V1 into lane 0 of V0 (S0)
      FADDP V0.4S, V1.4S, V1.4S
      FADDP V0.4S, V0.4S, V0.4S
      // calc sqrt(S0)
      FSQRT S4, S0
      // move result to W0 to be returned
      FMOV  W0, S4
      RET

The main.s used to test this routine is as follows:

//
// Main program to test our distance function
//
// W19 - loop counter
// X20 - address to current set of points
.global main // Provide program starting address to linker
//
      .equ   N, 3   // Number of points.
main:
      STP    X19, X20, [SP, #-16]!
      STR    LR, [SP, #-16]!
      LDR    X20, =points // pointer to current points
      MOV    W19, #N     // number of loop iterations
loop:    MOV    X0, X20   // move pointer to parameter 1 (X0)
      BL     distance     // call distance function
// need to take the single precision return value
// and convert it to a double, because the C printf
// function can only print doubles.
      FMOV   S2, W0      // move back to fpu for conversion
      FCVT   D0, S2      // convert single to double
      FMOV   X1, D0      // also copy the bits to X1 (on Linux, printf reads the double from D0)
      LDR    X0, =prtstr // load print string
      BL     printf     // print the distance
      ADD    X20, X20, #(8*4) // 8 elements each 4 bytes
      SUBS   W19, W19, #1 // decrement loop counter
      B.NE   loop         // loop if more points
      MOV    X0, #0       // return code
      LDR    LR, [SP], #16
      LDP    X19, X20, [SP], #16
      RET
.data
points: .single    0.0, 0.0, 0.0, 0.0, 17.0, 4.0, 2.0, 1.0
      .single      1.3, 5.4, 3.1, -1.5, -2.4, 0.323, 3.4, -0.232
 .single 1.323e10, -1.2e-4, 34.55, 5454.234, 10.9, -3.6, 4.2, 1.3
prtstr:      .asciz "Distance = %f\n"

The makefile is as follows:

distance: distance.s main.s
       gcc -g -o distance distance.s main.s

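The routine works as follows:
1. Load one vector into V2 and the other into V3. Each vector consists of four 32-bit floating-point numbers, so it fits in one 128-bit V register and is treated as four lanes.
2. Subtract all four components at once with a single FSUB instruction, then square them all at once with FMUL. Both instructions process the four lanes in parallel.
3. Add up the squares in V1. Because the numbers sit in different lanes, an ordinary lane-wise add cannot combine them. The NEON instruction set helps with pairwise addition: FADDP V0.4S, V1.4S, V1.4S adds each adjacent pair of 32-bit floats in its two source operands and places the sums in V0.
4. A second FADDP performs the third and final addition. The total ends up in lane 0, which overlaps the regular floating-point register S0.
5. With the sum complete, use the FPU's square root instruction to compute the final distance.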
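The pairwise reduction in steps 3 and 4 can be sketched in portable C. This is a model of the lane behavior under the assumption that each V register is a plain array of four floats. The names `faddp_4s` and `squared_distance4` are illustrative, and the final square root is left to FSQRT in the assembly.

```c
#include <stddef.h>

/* Model of FADDP Vd.4S, Vn.4S, Vm.4S: adjacent pairs from Vn fill the
 * two low lanes of Vd, adjacent pairs from Vm fill the two high lanes.
 * A temporary is used because d may alias n or m. */
void faddp_4s(float d[4], const float n[4], const float m[4])
{
    float r[4] = { n[0] + n[1], n[2] + n[3], m[0] + m[1], m[2] + m[3] };
    for (size_t i = 0; i < 4; i++) {
        d[i] = r[i];
    }
}

/* p holds (x1..x4) followed by (y1..y4), as in the listing's input. */
float squared_distance4(const float p[8])
{
    float v0[4], v1[4];
    for (size_t i = 0; i < 4; i++) {
        v1[i] = p[i] - p[i + 4];   /* FSUB V1.4S, V2.4S, V3.4S */
    }
    for (size_t i = 0; i < 4; i++) {
        v1[i] = v1[i] * v1[i];     /* FMUL V1.4S, V1.4S, V1.4S */
    }
    faddp_4s(v0, v1, v1);          /* lanes: s0+s1, s2+s3, s0+s1, s2+s3 */
    faddp_4s(v0, v0, v0);          /* every lane now holds the full sum */
    return v0[0];                  /* lane 0 is what S0 refers to */
}
```

For the first test point in main.s, (0, 0, 0, 0) and (17, 4, 2, 1), the squares are 289, 16, 4, and 1. The first pairwise add produces 305 and 5, and the second produces 310, whose square root is approximately 17.607.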

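The following mermaid flowchart shows the calculation:

graph TD;
    A[Load vectors into V2 and V3] --> B[FSUB computes the differences];
    B --> C[FMUL computes the squares];
    C --> D[First pairwise FADDP];
    D --> E[Second pairwise FADDP];
    E --> F[FSQRT computes the square root];
    F --> G[FMOV moves the result to W0];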

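3. Optimizing 3x3 Matrix Multiplication

To optimize 3x3 matrix multiplication, we use the NEON coprocessor's parallel processing capability. NEON has a dot product instruction, SDOT, but it only works on integers and is not supported on all processors, so we do not use it.

The recommended approach is to invert two of the loops from the earlier program and turn the multiply-accumulate into single instructions that each operate on a whole three-element vector. This eliminates one loop and gains some parallelism.

A 3x3 matrix multiplication is really three matrix-vector products:
- Ccol1 = A * Bcol1
- Ccol2 = A * Bcol2
- Ccol3 = A * Bcol3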
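The column-at-a-time idea can be sketched in portable C. This is a model, not NEON code, and it assumes column-major storage with each three-element column padded to four 16-bit lanes. The name `mulcol` mirrors the macro used in the assembly, while `matmul3x3` and `LANES` are illustrative. Each inner loop over `lane` stands for one MUL or MLA instruction that handles all lanes at once.

```c
#include <stdint.h>

enum { LANES = 4 };  /* each 3-element column is padded to four 16-bit lanes */

/* One column of C: ccol = Acol1*bcol[0] + Acol2*bcol[1] + Acol3*bcol[2].
 * a points at the three padded columns of A stored back to back. */
void mulcol(int16_t *ccol, const int16_t *a, const int16_t *bcol)
{
    for (int lane = 0; lane < LANES; lane++) {        /* MUL by element 0 */
        ccol[lane] = (int16_t)(a[lane] * bcol[0]);
    }
    for (int k = 1; k < 3; k++) {                     /* MLA by elements 1, 2 */
        for (int lane = 0; lane < LANES; lane++) {
            ccol[lane] = (int16_t)(ccol[lane] + a[k * LANES + lane] * bcol[k]);
        }
    }
}

/* C = A * B, all three matrices column-major with padded columns. */
void matmul3x3(int16_t c[12], const int16_t a[12], const int16_t b[12])
{
    for (int col = 0; col < 3; col++) {
        mulcol(&c[col * LANES], a, &b[col * LANES]);
    }
}
```

With A = [1 2 3; 4 5 6; 7 8 9] and B = [9 8 7; 6 5 4; 3 2 1], the first column of C comes out as (30, 84, 138). The padding lane stays zero because the padding in A is zero.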

The following code performs the 3x3 matrix multiplication with the NEON coprocessor:

//
// Multiply 2 3x3 integer matrices
// Uses the NEON Coprocessor to do
// some operations in parallel.
//
// Registers:
//    D0 - first column of matrix A
//    D1 - second column of matrix A
//    D2 - third column of matrix A
//    D3 - first column of matrix B
//    D4 - second column of matrix B
//    D5 - third column of matrix B
//    D6 - first column of matrix C
//    D7 - second column of matrix C
//    D8 - third column of matrix C
.global main // Provide program starting address to linker
main:
      STP    X19, X20, [SP, #-16]!
      STR    LR, [SP, #-16]!
// load matrix A into Neon registers D0, D1, D2
      LDR    X0, =A        // Address of A
      LDP    D0, D1, [X0], #16
      LDR    D2, [X0]
// load matrix B into Neon registers D3, D4, D5
      LDR    X0, =B        // Address of B
      LDP    D3, D4, [X0], #16
      LDR    D5, [X0]
.macro mulcol ccol bcol
      MUL    \ccol\().4H, V0.4H, \bcol\().H[0]
      MLA    \ccol\().4H, V1.4H, \bcol\().H[1]
      MLA    \ccol\().4H, V2.4H, \bcol\().H[2]
.endm
      mulcol V6, V3        // process first column
      mulcol V7, V4        // process second column
      mulcol V8, V5        // process third column
      LDR    X1, =C        // Address of C
      STP    D6, D7, [X1], #16
      STR    D8, [X1]
// Print out matrix C
// Loop through 3 rows printing 3 cols each time.
      MOV    W19, #3             // Print 3 rows
      LDR    X20, =C             // Addr of results matrix
printloop:
      LDR    X0, =prtstr    // printf format string
// Print the transpose so the matrix appears in the usual row/column order.
// The first LDRH post-increments X20 by 2, moving on to the next row.
// The second LDRH then uses offset 6: 2+6 = 8 = one padded column.
// The third LDRH uses offset 14: 2+14 = 16 = two padded columns.
      LDRH   W1, [X20], #2  // first element in current row
      LDRH   W2, [X20,#6]   // second element in current row
      LDRH   W3, [X20,#14]  // third element in current row
      BL     printf        // Call printf
      SUBS   W19, W19, #1   // Dec loop counter
      B.NE   printloop      // If not zero loop
      MOV    X0, #0         // return code
      LDR    LR, [SP], #16
      LDP    X19, X20, [SP], #16
      RET
.data
// First matrix in column major order
A:    .short 1, 4, 7, 0
      .short 2, 5, 8, 0
      .short 3, 6, 9, 0
// Second matrix in column major order
B:    .short 9, 6, 3, 0
      .short 8, 5, 2, 0
      .short 7, 4, 1, 0
// Result matrix in column major order
C:    .fill  12, 2, 0
prtstr: .asciz  "%3d  %3d  %3d\n"

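The program works as follows:
1. Load matrix A into NEON registers D0, D1, and D2.
2. Load matrix B into NEON registers D3, D4, and D5.
3. Process each column with the mulcol macro.
4. Store the result matrix C.
5. Loop to print each row of matrix C.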
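The print loop in step 5 has to read rows out of storage that is laid out by column. The address arithmetic can be sketched in C as follows, assuming the same layout as the listing (16-bit elements, columns padded to 8 bytes). The helper names `element_offset` and `read_row` are illustrative.

```c
#include <stdint.h>
#include <stddef.h>

/* Byte offset of element (row, col) in a column-major matrix of 16-bit
 * values whose columns are padded to 8 bytes. */
size_t element_offset(size_t row, size_t col)
{
    return col * 8 + row * 2;
}

/* Copy one logical row out of the padded column-major storage. */
void read_row(const uint16_t *matrix, size_t row, uint16_t out[3])
{
    for (size_t col = 0; col < 3; col++) {
        out[col] = matrix[element_offset(row, col) / sizeof(uint16_t)];
    }
}
```

The assembly reaches the same three addresses with offsets 0, 6, and 14 because the first load has already advanced the pointer by 2 before the second and third loads run.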

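The following mermaid flowchart shows the matrix multiplication:

graph TD;
    A[Load matrix A into D0-D2] --> B[Load matrix B into D3-D5];
    B --> C[mulcol processes the first column];
    C --> D[mulcol processes the second column];
    D --> E[mulcol processes the third column];
    E --> F[Store result matrix C];
    F --> G[Loop to print matrix C];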

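4. Optimizing the Upper-Case Conversion Routine

4.1 The Original Upper-Case Routine

The original upper-case routine implements the following pseudocode: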

IF (W5 >= 'a') AND (W5 <= 'z') THEN
     W5 = W5 - ('a'-'A')
END IF

The corresponding assembly code is:

// If W5 > 'z' then goto cont
       CMP   W5, #'z'         // is letter > 'z'?
       B.GT  cont
// Else if W5 < 'a' then goto end if
       CMP   W5, #'a'
       B.LT  cont   // goto to end if, if < 'a'
// if we got here then the letter is lower case, so convert it.
       SUB   W5, W5, #('a'-'A')
cont:  // end if

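This code implements the reverse logic: it branches around the SUB instruction whenever the character is out of range. In this section, we will try to eliminate the branches entirely.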

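4.2 Simplifying the Range Comparison

A common way to simplify a range comparison is to shift the range so that the lower-bound comparison is no longer needed. If we subtract 'a' from every value, the pseudocode becomes:

W6 = W5 - 'a'
IF (W6 >= 0) AND (W6 <= ('z'-'a')) THEN
     W5 = W5 - ('a'-'A')
END IF

If we treat W6 as an unsigned integer, the first condition is always true. Any character below 'a' wraps around to a very large unsigned value, which then fails the second test. The range comparison therefore reduces to a single condition: W6 <= ('z' - 'a').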
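The same trick can be written in C. This is a minimal sketch assuming ASCII input, and the function name `is_lower_ascii` is illustrative. The cast to `unsigned` plays the role of treating W6 as an unsigned value.

```c
#include <stdbool.h>

/* One unsigned comparison replaces the two-sided range test:
 * characters below 'a' wrap around to values far above 25. */
bool is_lower_ascii(unsigned char c)
{
    return (unsigned)(c - 'a') <= (unsigned)('z' - 'a');
}
```

In the assembly, the unsigned interpretation comes from the choice of condition code (HI or LS) rather than from a cast.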

Here is the improved upper.s:

//
// Assembler program to convert a string to
// all upper case.
//
// X1 - address of output string
// X0 - address of input string
// X4 - original output string for length calc.
// W5 - current character being processed
// W6 - minus 'a' to compare < 26.
//
.global toupper      // Allow other files to call this routine
toupper: MOV   X4, X1
// Loop until the byte loaded from [X0] is zero
loop:  LDRB    W5, [X0], #1     // load char and increment pointer
// Want to know if 'a' <= W5 <= 'z'
// First subtract 'a'
       SUB     W6, W5, #'a'
// Now want to know if W6 <= 25
       CMP     W6, #25          // chars are 0-25 after shift
       B.HI   cont
// if we got here then the letter is lower case, so convert it.
       SUB     W5, W5, #('a'-'A')
cont:  // end if
       STRB    W5, [X1], #1     // store character to output str
       CMP     W5, #0           // stop on hitting a null character
       B.NE    loop             // loop if character isn't null
       SUB     X0, X1, X4       // get the len by sub'ing the pointers
       RET                      // Return to caller

The makefile is as follows:

UPPEROBJS = main.o upper.o
UPPER2OBJS = main.o upper2.o
UPPER3OBJS = upper3.o
UPPER4OBJS = main.o upper4.o
ifdef DEBUG
DEBUGFLGS = -g
else
DEBUGFLGS =
endif
LSTFLGS =
all: upper upper2 upper3 upper4
%.o : %.s
     as $(DEBUGFLGS) $(LSTFLGS) $< -o $@
upper: $(UPPEROBJS)
     ld -o upper $(UPPEROBJS)
upper2: $(UPPER2OBJS)
     ld -o upper2 $(UPPER2OBJS)
upper3: $(UPPER3OBJS)
     ld -o upper3 $(UPPER3OBJS)
upper4: $(UPPER4OBJS)
     ld -o upper4 $(UPPER4OBJS)

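4.3 Using Conditional Instructions

The ARM processor has instructions that help eliminate branches, such as the conditional select instruction CSEL: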

CSEL Xd, Xn, Xm, cond

The instruction implements the following logic:

IF cond is true then
     Xd = Xn
else
     Xd = Xm

This is similar to the conditional operator in C: Xd = cond ? Xn : Xm.
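A C sketch of the select-based approach is shown here for comparison. It assumes ASCII input, and the function name `to_upper_branchless` is illustrative. Whether a compiler turns the conditional operator into CSEL depends on the compiler and its optimization settings, so treat this as a model of the logic rather than a guarantee about the generated code.

```c
#include <stddef.h>

/* Copy in to out, upper-casing ASCII letters. Returns the number of
 * bytes written, including the terminating NUL, which matches the
 * pointer subtraction the assembly routines use for the length. */
size_t to_upper_branchless(const char *in, char *out)
{
    const char *start = out;
    unsigned char c;
    do {
        c = (unsigned char)*in++;
        /* shifted range test value: 0..25 for lower-case letters */
        unsigned shifted = (unsigned)c - 'a';
        /* candidate upper-case value, computed unconditionally */
        unsigned char upper = (unsigned char)(c - ('a' - 'A'));
        /* select one of the two values instead of branching */
        c = (shifted <= 25u) ? upper : c;
        *out++ = (char)c;
    } while (c != 0);
    return (size_t)(out - start);
}
```

Both candidate values are computed on every iteration, and only the selection depends on the comparison.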

Here is upper2.s, improved with the CSEL instruction:

//
// Assembler program to convert a string to
// all upper case.
//
// X1 - address of output string
// X0 - address of input string
// X4 - original output string for length calc.
// W5 - current character being processed
// W6 - first char minus 'a' for the range test (< 26),
//      then char minus 0x20, the potential upper-cased value
//
.global toupper          // Allow other files to call this routine
toupper:
       MOV   X4, X1
// Loop until the byte loaded from [X0] is zero
loop:  LDRB  W5, [X0], #1  // load char and increment pointer
// Want to know if 'a' <= W5 <= 'z'
// First subtract 'a'
       SUB   W6, W5, #'a'
// Now want to know if W6 <= 25
       CMP   W6, #25          // chars are 0-25 after shift
// perform lower case conversion to W6
       SUB   W6, W5, #('a'-'A')
// Use W6 if lower case, otherwise use original character in W5
       CSEL   W5, W6, W5, LS
       STRB  W5, [X1], #1      // store character to output str
       CMP   W5, #0            // stop on hitting a null character
       B.NE  loop              // loop if character isn't null
       SUB   X0, X1, X4        // get the len by sub'ing the pointers
       RET                    // Return to caller

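The steps are:
1. Load a character into W5.
2. Compute W6 = W5 - 'a'.
3. Compare W6 with 25.
4. Compute W6 = W5 - ('a'-'A').
5. Use CSEL to select the result.
6. Store the character to the output string.
7. Check for the null character; if it is not null, continue the loop.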

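The following mermaid flowchart shows the optimized upper-case conversion:

graph TD;
    A[Load character into W5] --> B[Compute W6 = W5 - 'a'];
    B --> C[Compare W6 <= 25];
    C --> D[Compute W6 = W5 - ('a'-'A')];
    D --> E[CSEL selects the result];
    E --> F[Store character to output string];
    F --> G[Check for null character];
    G -- no --> A;
    G -- yes --> H[Compute length and return];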

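With the NEON techniques and code optimizations above, we can improve a program's performance and efficiency. Whether the task is vector computation, matrix multiplication, or character conversion, making good use of the processor's features pays off. We hope this material helps you understand and apply these ARM techniques.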

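5. Summary and Outlook

The preceding sections covered what the NEON coprocessor offers, how to use it for the 4D vector distance calculation and the optimized 3x3 matrix multiplication, and how to optimize the upper-case routine. This section summarizes that material and gives related exercises to help consolidate it.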

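5.1 Summary

- NEON coprocessor: it supports integer arithmetic along with a range of logical and comparison operations, and has many instructions dedicated to specific algorithms. The 4D vector distance calculation and the 3x3 matrix multiplication both use its parallelism, arranging the instructions so that several lanes are processed at once.
- Code optimization: in the upper-case routine, simplifying the range comparison and then using a conditional instruction removes the branches step by step, making the code shorter and more efficient.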

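The following table compares the upper-case routine at each optimization stage. The counts cover the range test only; every version also has one comparison and one branch for the loop itself.

| Optimization stage | Comparisons | Conditional branches | Code structure |
| --- | --- | --- | --- |
| Original routine | 2 | 2 | Convoluted branch logic |
| Simplified range comparison | 1 | 1 | Somewhat clearer |
| Conditional instruction (CSEL) | 1 | 0 | Straight-line loop body, easier to follow |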

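5.2 Exercises

The following exercises help reinforce the material:
1. Magnitude of a 4D vector: the magnitude (absolute value) of a 4D vector v = (a, b, c, d) is $\sqrt{a^2 + b^2 + c^2 + d^2}$. Adapt the 4D distance code to compute it.
2. Vector normalization: a vector's length is its distance from the origin (the all-zero vector), and a normalized vector has length 1. Normalize a vector by dividing each component by the vector's length. Modify the distance program to compute the normalized form of a vector.
3. Dot product of 4D vectors: write a routine that computes the dot product of two 4D vectors, using the NEON coprocessor's parallelism to improve efficiency.
4. 4x4 matrix multiplication: modify the 3x3 matrix multiplication program to handle 4x4 matrices. Pay attention to how the matrices are stored and to the changes the loop structure needs, so that the results stay correct.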

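As an example, the steps for computing the magnitude of a 4D vector are:
1. Load the 4D vector into a register.
2. Square each component.
3. Add all the squares together.
4. Take the square root of the sum.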

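6. Final Summary

A close look at the NEON coprocessor and these optimization techniques shows that, across different computing tasks, making good use of the processor's features can noticeably improve performance and efficiency. In real development, choose the optimization method that fits the specific requirements and scenario.

The NEON coprocessor's parallelism works well for vector and matrix computation, but using it requires care in choosing instructions and organizing data. On the code optimization side, eliminating branches, simplifying logic, and using conditional instructions make code more concise and easier to maintain.

We hope this article helps you master these techniques and apply them flexibly in real projects.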

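The following mermaid flowchart shows how the topics relate to each other:

graph LR;
    A[NEON coprocessor] --> B[4D vector distance calculation];
    A --> C[3x3 matrix multiplication optimization];
    D[Code optimization] --> E[Upper-case routine optimization];
    B --> F[Exercise: 4D vector magnitude];
    B --> G[Exercise: vector normalization];
    B --> H[Exercise: 4D vector dot product];
    C --> I[Exercise: 4x4 matrix multiplication];
    E --> J[Simplified range comparison];
    E --> K[Conditional instructions];

The flowchart shows how the topics connect and how each exercise builds on the technique it follows.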