Manual Unrolling for Vectorized Execution
Evaluating the Effect of Unrolling
A common misconception in database implementation is that once we organize our data into vectors, we just need to write a clean for loop and the compiler will automatically take care of SIMD optimization for us.
Compilers are rarely that clever, because our for loops usually have variable bounds at both ends, which makes it hard for the compiler to apply loop unrolling. For example, here is how OceanBase iterates over vector elements to do a computation (say, int + int addition):
for (int64_t idx = bound.start(); OB_SUCC(ret) && idx < bound.end(); ++idx) {
  ret = ArithOp::vector_op(*res_vec, *left_vec, *right_vec, idx, args...);
}
This code gives the compiler little to work with: how many elements lie between bound.start() and bound.end()? If there is only one element but the loop was statically unrolled for several, the accesses would run out of bounds. If the compiler knew there were 1024 elements between bound.start() and bound.end(), it could unroll boldly.
Could the compiler insert dynamic branches to check? In theory, yes: take the unrolled path when there are many elements, and skip it when there are few. But remember, the compiler has no idea this is a vectorized computation, so why would it casually make such an optimization?
The more reliable approach, then, is to unroll by hand. One can also write unroll template helpers to simplify the manual unrolling logic; we will not go into that here.
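As a minimal sketch of the idea (the function name, element type, and signature here are my own assumptions, not OceanBase code), a 4x manual unroll of an element-wise addition can be structured as a wide main loop plus a scalar tail:

```cpp
#include <cstdint>

// Hypothetical sketch: manually unrolled element-wise addition over
// [start, end). The main loop handles 4 elements per iteration; the
// tail loop picks up the remaining (end - idx) < 4 elements, so no
// access can run past the end regardless of the element count.
static void add_vectors_unroll4(int32_t *res, const int32_t *left,
                                const int32_t *right,
                                int64_t start, int64_t end) {
  int64_t idx = start;
  // Main loop: 4 independent additions per iteration.
  for (; idx + 4 <= end; idx += 4) {
    res[idx]     = left[idx]     + right[idx];
    res[idx + 1] = left[idx + 1] + right[idx + 1];
    res[idx + 2] = left[idx + 2] + right[idx + 2];
    res[idx + 3] = left[idx + 3] + right[idx + 3];
  }
  // Tail loop: leftover elements, one at a time.
  for (; idx < end; ++idx) {
    res[idx] = left[idx] + right[idx];
  }
}
```

The four statements per iteration have no data dependencies between them, which is exactly what gives the compiler license to emit SIMD instructions for the main loop.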
下面,主要评测下手工 unroll 的效果:
代码:
[dev-rayu ~/tools] $cat test.cpp
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <memory>
#include <cstring>
#define ALWAYS_INLINE __attribute__((always_inline))
#define NEVER_INLINE __attribute__((noinline))
#define SF_COLD __attribute__((cold, noinline))
typedef uint8_t ub1;
typedef uint16_t ub2;
typedef uint32_t ub4;
typedef uint64_t ub8;
typedef __uint128_t ub16;
template <typename T>
static inline ALWAYS_INLINE ub4 loadUb4PostIncrement(const T* __restrict__ & in) {
  ub4 ret;
  const ub1* inUb1 = reinterpret_cast<const ub1*>(in);
  std::memcpy(&ret, inUb1, sizeof(ub4));
  in = reinterpret_cast<const T*>(inUb1 + sizeof(ub4));
  return ret;
}
template <typename T>
static inline ALWAYS_INLINE ub4 loadUb4PostIncrement2(const T* __restrict__ & in) {
  ub4 ret;
  const ub1* inUb1 = reinterpret_cast<const ub1*>(in);
  ret = *reinterpret_cast<const ub4*>(inUb1);
  in = reinterpret_cast<const T*>(inUb1 + sizeof(ub4));
  return ret;
}
int main(int argc, char **argv)
{
  // each pass does 10240 ub4 loads, i.e. 10240*4 bytes must be addressable
  ub1 *freeptr = (ub1*)malloc(10240*4 + 4*4);
  memset(freeptr, 12, 10240*4 + 4*4);
  ub4 sum = 0;
  for (int j = 0; j < atoi(argv[1]); j++) {
    const ub1 *mem = freeptr;
#ifdef UNROLL
    for (int i = 0; i < 10240 >> 2; ++i) {
#ifdef USE_MEMCPY
      sum += loadUb4PostIncrement(mem);
      sum += loadUb4PostIncrement(mem);
      sum += loadUb4PostIncrement(mem);
      sum += loadUb4PostIncrement(mem);
#else
      sum += loadUb4PostIncrement2(mem);
      sum += loadUb4PostIncrement2(mem);
      sum += loadUb4PostIncrement2(mem);
      sum += loadUb4PostIncrement2(mem);
#endif
    }
#endif
#ifndef UNROLL
    for (int i = 0; i < 10240; ++i) {
#ifdef USE_MEMCPY
      sum += loadUb4PostIncrement(mem);
#else
      sum += loadUb4PostIncrement2(mem);
#endif
    }
#endif
  }
#ifdef USE_MEMCPY
  printf("memcpy:%d\n", sum);
#else
  printf("assign:%d\n", sum);
#endif
  free((void*)freeptr);
  return 0;
}
Makefile:
[dev-rayu ~/tools] $cat Makefile
SHELL := /bin/bash # Use bash for built-in time
all:
	g++ -O2 test.cpp -DUSE_ASSIGN -DUNROLL -o run_assign_unroll
	g++ -O2 test.cpp -DUSE_MEMCPY -DUNROLL -o run_memcpy_unroll
	g++ -O2 test.cpp -DUSE_ASSIGN -o run_assign
	g++ -O2 test.cpp -DUSE_MEMCPY -o run_memcpy
run:
	@time ./run_assign_unroll 102400
	@time ./run_memcpy_unroll 102400
	@time ./run_assign 102400
	@time ./run_memcpy 102400
[dev-rayu ~/tools] $make
g++ -O2 test.cpp -DUSE_ASSIGN -DUNROLL -o run_assign_unroll
g++ -O2 test.cpp -DUSE_MEMCPY -DUNROLL -o run_memcpy_unroll
g++ -O2 test.cpp -DUSE_ASSIGN -o run_assign
g++ -O2 test.cpp -DUSE_MEMCPY -o run_memcpy
[dev-rayu ~/tools] $make run
assign:-1290104832
real 0m0.163s
user 0m0.163s
sys 0m0.000s
memcpy:-1290104832
real 0m0.161s
user 0m0.161s
sys 0m0.000s
assign:-1290104832
real 0m0.453s
user 0m0.454s
sys 0m0.000s
memcpy:-1290104832
real 0m0.449s
user 0m0.449s
sys 0m0.000s
As you can see, manually unrolling 4 times yields a substantial 3x performance improvement. I also tested unrolling 8 times, and the results are even more striking: a 5x improvement.
[dev-rayu ~/tools] $make run
assign:-1800368128
real 0m0.080s
user 0m0.080s
sys 0m0.000s
memcpy:-1800368128
real 0m0.080s
user 0m0.080s
sys 0m0.000s
assign:-1800368128
real 0m0.445s
user 0m0.445s
sys 0m0.000s
memcpy:-1800368128
real 0m0.447s
user 0m0.447s
sys 0m0.000s
Update: a generic unrolling implementation:
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <memory>
#include <cstring>
#define ALWAYS_INLINE __attribute__((always_inline))
#define NEVER_INLINE __attribute__((noinline))
#define COLD __attribute__((cold, noinline))
typedef uint8_t ub1;
typedef uint16_t ub2;
typedef uint32_t ub4;
typedef uint64_t ub8;
typedef __uint128_t ub16;
#include <iostream>
#include <functional>
#include <type_traits>
// Compile-time recursive expansion, resolved entirely statically from UnrollFactor
template<int N, typename Lambda, typename IndexType>
constexpr void ALWAYS_INLINE unroll_impl(IndexType i, Lambda body) {
  if constexpr (N > 0) {
    // expand once at compile time...
    body(i);
    // ...then recurse to expand the remaining N-1 steps
    unroll_impl<N - 1>(i + 1, body);
  }
}
// Public interface: UnrollFactor controls the expansion, fully resolved via constexpr
template<int UnrollFactor, typename Lambda, typename IndexType>
constexpr void NEVER_INLINE unroll_loop(IndexType total, Lambda body) {
  IndexType i = 0;
  // main loop, fully unrolled in groups of UnrollFactor
  for (; i <= total - UnrollFactor; i += UnrollFactor) {
    unroll_impl<UnrollFactor>(i, body);
  }
  // handle the leftover elements that did not fill a full group
  for (; i < total; ++i) {
    body(i);
  }
}
int main(int argc, char **argv) {
  const int total = atoi(argv[1]);
  int v[total] = {0};  // VLA: a GCC extension in C++
  // express the loop body as a lambda; unroll factor set to 8
  unroll_loop<8>(total, [&](int i) {
    v[i] = i + total;
    if (v[i] > 10) {
      v[i] = 32;
    }
  });
  // print the results
  for (int i = 0; i < total; ++i) {
    std::cout << v[i] << " ";
  }
  std::cout << std::endl;
  return 0;
}
The corresponding ARM assembly is:
void unroll_loop<8, main::{lambda(int)#1}>(int, main::{lambda(int)#1}) [clone .isra.0]:
stp x29, x30, [sp, -48]!
sub w8, w0, #7
mov x29, sp
cmp w8, 0
ble .L119
sub w5, w0, #8
add x3, x2, 4
lsr w4, w5, 3
add w4, w4, 1
ubfiz x6, x4, 5, 30
add x6, x1, x6
cmp x2, x6
ccmp x1, x3, 2, cc
ccmp w5, 31, 0, cs
bls .L5
lsr w5, w4, 2
mov w6, 128
adrp x3, .LC0
ld1r {v5.4s}, [x2]
movi v0.4s, 0x20
stp d8, d9, [sp, 16]
umaddl x5, w5, w6, x1
movi v4.4s, 0xa
str d10, [sp, 32]
movi v24.4s, 0x1
movi v23.4s, 0x2
movi v22.4s, 0x3
movi v21.4s, 0x4
movi v20.4s, 0x5
movi v19.4s, 0x6
movi v18.4s, 0x7
ldr q17, [x3, #:lo12:.LC0]
mov x3, x1
.L7:
mov v1.16b, v17.16b
add v17.4s, v17.4s, v0.4s
add v28.4s, v1.4s, v20.4s
add v6.4s, v1.4s, v22.4s
add v27.4s, v1.4s, v18.4s
add v7.4s, v1.4s, v23.4s
add v26.4s, v1.4s, v19.4s
add v25.4s, v1.4s, v21.4s
add v3.4s, v1.4s, v24.4s
add v9.4s, v28.4s, v5.4s
add v8.4s, v6.4s, v5.4s
add v29.4s, v26.4s, v5.4s
add v16.4s, v25.4s, v5.4s
add v3.4s, v3.4s, v5.4s
add v31.4s, v27.4s, v5.4s
add v30.4s, v7.4s, v5.4s
add v1.4s, v1.4s, v5.4s
cmge v28.4s, v4.4s, v9.4s
cmge v6.4s, v4.4s, v8.4s
cmge v26.4s, v4.4s, v29.4s
cmge v25.4s, v4.4s, v16.4s
cmge v2.4s, v4.4s, v1.4s
cmge v10.4s, v4.4s, v3.4s
cmge v27.4s, v4.4s, v31.4s
cmge v7.4s, v4.4s, v30.4s
bsl v28.16b, v9.16b, v0.16b
bsl v26.16b, v29.16b, v0.16b
bsl v25.16b, v16.16b, v0.16b
bif v3.16b, v0.16b, v10.16b
bsl v6.16b, v8.16b, v0.16b
bsl v27.16b, v31.16b, v0.16b
bif v1.16b, v0.16b, v2.16b
bsl v7.16b, v30.16b, v0.16b
zip1 v16.4s, v3.4s, v28.4s
zip2 v2.4s, v3.4s, v28.4s
zip1 v29.4s, v6.4s, v27.4s
zip1 v3.4s, v1.4s, v25.4s
zip1 v28.4s, v7.4s, v26.4s
zip2 v6.4s, v6.4s, v27.4s
zip2 v7.4s, v7.4s, v26.4s
zip1 v27.4s, v16.4s, v29.4s
zip1 v26.4s, v3.4s, v28.4s
zip2 v1.4s, v1.4s, v25.4s
zip2 v16.4s, v16.4s, v29.4s
zip2 v3.4s, v3.4s, v28.4s
zip1 v25.4s, v1.4s, v7.4s
zip2 v1.4s, v1.4s, v7.4s
zip1 v7.4s, v2.4s, v6.4s
zip2 v2.4s, v2.4s, v6.4s
zip1 v6.4s, v26.4s, v27.4s
zip2 v26.4s, v26.4s, v27.4s
zip1 v9.4s, v3.4s, v16.4s
zip1 v8.4s, v25.4s, v7.4s
zip2 v3.4s, v3.4s, v16.4s
zip2 v25.4s, v25.4s, v7.4s
stp q6, q26, [x3]
zip1 v6.4s, v1.4s, v2.4s
zip2 v1.4s, v1.4s, v2.4s
stp q9, q3, [x3, 32]
stp q8, q25, [x3, 64]
stp q6, q1, [x3, 96]
add x3, x3, 128
cmp x3, x5
bne .L7
and w5, w4, -4
lsl w3, w5, 3
cmp w4, w5
beq .L117
ldr w6, [x2]
mov w5, 32
mov w30, w5
mov w11, w5
add w6, w3, w6
mov w10, w5
cmp w6, 11
add w12, w3, 1
csel w6, w6, w5, lt
str w6, [x1, w3, sxtw 2]
add w18, w3, 2
add w9, w3, 3
ldr w5, [x2]
add w16, w3, 4
add w6, w3, 5
add w13, w3, 7
add w5, w12, w5
cmp w5, 11
csel w5, w5, w11, lt
str w5, [x1, w12, sxtw 2]
add w5, w3, 6
add w12, w3, 8
ldr w11, [x2]
add w11, w18, w11
cmp w11, 11
csel w11, w11, w10, lt
str w11, [x1, w18, sxtw 2]
ldr w10, [x2]
add w10, w9, w10
cmp w10, 11
csel w10, w10, w30, lt
str w10, [x1, w9, sxtw 2]
ldr w9, [x2]
add w9, w16, w9
cmp w9, 11
csel w9, w9, w30, lt
str w9, [x1, w16, sxtw 2]
ldr w7, [x2]
add w7, w6, w7
cmp w7, 11
csel w7, w7, w30, lt
str w7, [x1, w6, sxtw 2]
ldr w6, [x2]
add w6, w5, w6
cmp w6, 11
csel w6, w6, w30, lt
str w6, [x1, w5, sxtw 2]
ldr w5, [x2]
add w5, w13, w5
cmp w5, 11
csel w5, w5, w30, lt
str w5, [x1, w13, sxtw 2]
cmp w8, w12
ble .L117
ldr w6, [x2]
add w13, w3, 9
add w18, w3, 10
add w9, w3, 11
add w6, w12, w6
add w16, w3, 12
cmp w6, 11
csel w6, w6, w30, lt
str w6, [x1, w12, sxtw 2]
add w6, w3, 13
add w12, w3, 16
ldr w5, [x2]
add w5, w13, w5
cmp w5, 11
csel w5, w5, w30, lt
str w5, [x1, w13, sxtw 2]
add w5, w3, 14
add w13, w3, 15
ldr w11, [x2]
add w11, w18, w11
cmp w11, 11
csel w11, w11, w30, lt
str w11, [x1, w18, sxtw 2]
ldr w10, [x2]
add w10, w9, w10
cmp w10, 11
csel w10, w10, w30, lt
str w10, [x1, w9, sxtw 2]
ldr w9, [x2]
add w9, w16, w9
cmp w9, 11
csel w9, w9, w30, lt
str w9, [x1, w16, sxtw 2]
ldr w7, [x2]
add w7, w6, w7
cmp w7, 11
csel w7, w7, w30, lt
str w7, [x1, w6, sxtw 2]
ldr w6, [x2]
add w6, w5, w6
cmp w6, 11
csel w6, w6, w30, lt
str w6, [x1, w5, sxtw 2]
ldr w5, [x2]
add w5, w13, w5
cmp w5, 11
csel w5, w5, w30, lt
str w5, [x1, w13, sxtw 2]
cmp w8, w12
ble .L117
ldr w6, [x2]
add w7, w3, 17
add w15, w3, 18
add w14, w3, 20
add w6, w12, w6
add w13, w3, 21
cmp w6, 11
add w11, w3, 22
csel w6, w6, w30, lt
str w6, [x1, w12, sxtw 2]
add w6, w3, 19
add w3, w3, 23
ldr w5, [x2]
ldp d8, d9, [sp, 16]
add w5, w7, w5
cmp w5, 11
ldr d10, [sp, 32]
csel w5, w5, w30, lt
str w5, [x1, w7, sxtw 2]
ldr w10, [x2]
add w10, w15, w10
cmp w10, 11
csel w10, w10, w30, lt
str w10, [x1, w15, sxtw 2]
ldr w9, [x2]
add w9, w6, w9
cmp w9, 11
csel w9, w9, w30, lt
str w9, [x1, w6, sxtw 2]
ldr w8, [x2]
add w8, w14, w8
cmp w8, 11
csel w8, w8, w30, lt
str w8, [x1, w14, sxtw 2]
ldr w7, [x2]
add w7, w13, w7
cmp w7, 11
csel w7, w7, w30, lt
str w7, [x1, w13, sxtw 2]
ldr w6, [x2]
add w6, w11, w6
cmp w6, 11
csel w6, w6, w30, lt
str w6, [x1, w11, sxtw 2]
ldr w5, [x2]
add w5, w3, w5
cmp w5, 11
csel w5, w5, w30, lt
str w5, [x1, w3, sxtw 2]
.L82:
lsl w3, w4, 3
.L8:
cmp w0, w3
ble .L1
sxtw x7, w3
sub w5, w0, w3
add x5, x7, x5
sbfiz x8, x3, 2, 32
add x9, x1, x8
add x6, x2, 4
add x5, x1, x5, lsl 2
sub w4, w0, #1
cmp x2, x5
sub w4, w4, w3
ccmp x6, x9, 0, cc
sub w5, w0, w3
ccmp w4, 4, 0, ls
bls .L84
adrp x4, .LC1
dup v2.4s, w3
ld1r {v4.4s}, [x2]
adrp x6, .LC2
ldr q0, [x4, #:lo12:.LC1]
lsr w4, w5, 2
ldr q5, [x6, #:lo12:.LC2]
add v0.4s, v2.4s, v0.4s
movi v1.4s, 0xa
movi v3.4s, 0x20
add v2.4s, v2.4s, v5.4s
add v0.4s, v0.4s, v4.4s
cmge v5.4s, v1.4s, v0.4s
bif v0.16b, v3.16b, v5.16b
str q0, [x1, x8]
cmp w4, 1
bne .L120
.L85:
and w4, w5, -4
add w3, w4, w3
cmp w4, w5
beq .L1
ldr w4, [x2]
mov w6, 32
add w5, w3, 1
add w4, w3, w4
cmp w4, 11
csel w4, w4, w6, lt
str w4, [x1, w3, sxtw 2]
cmp w0, w5
ble .L1
ldr w4, [x2]
add w3, w3, 2
add w4, w5, w4
cmp w4, 11
csel w4, w4, w6, lt
str w4, [x1, w5, sxtw 2]
cmp w0, w3
ble .L1
ldr w0, [x2]
add w0, w3, w0
cmp w0, 11
csel w0, w0, w6, lt
str w0, [x1, w3, sxtw 2]
.L1:
ldp x29, x30, [sp], 48
ret
.L5:
mov x5, x1
mov w3, 0
mov w6, 32
b .L83
.L121:
str w7, [x5]
add w7, w3, 1
ldr w9, [x2]
add w7, w7, w9
cmp w7, 10
bgt .L68
.L122:
str w7, [x5, 4]
add w7, w3, 2
ldr w9, [x2]
add w7, w7, w9
cmp w7, 10
bgt .L70
.L123:
str w7, [x5, 8]
add w7, w3, 3
ldr w9, [x2]
add w7, w7, w9
cmp w7, 10
bgt .L72
.L124:
str w7, [x5, 12]
add w7, w3, 4
ldr w9, [x2]
add w7, w7, w9
cmp w7, 10
bgt .L74
.L125:
str w7, [x5, 16]
add w7, w3, 5
ldr w9, [x2]
add w7, w7, w9
cmp w7, 10
bgt .L76
.L126:
str w7, [x5, 20]
add w7, w3, 6
ldr w9, [x2]
add w7, w7, w9
cmp w7, 10
bgt .L78
.L127:
str w7, [x5, 24]
add w7, w3, 7
ldr w9, [x2]
add w7, w7, w9
cmp w7, 10
bgt .L80
.L128:
str w7, [x5, 28]
.L81:
add w3, w3, 8
add x5, x5, 32
cmp w8, w3
ble .L82
.L83:
ldr w7, [x2]
add w7, w3, w7
cmp w7, 10
ble .L121
str w6, [x5]
add w7, w3, 1
ldr w9, [x2]
add w7, w7, w9
cmp w7, 10
ble .L122
.L68:
str w6, [x5, 4]
add w7, w3, 2
ldr w9, [x2]
add w7, w7, w9
cmp w7, 10
ble .L123
.L70:
str w6, [x5, 8]
add w7, w3, 3
ldr w9, [x2]
add w7, w7, w9
cmp w7, 10
ble .L124
.L72:
str w6, [x5, 12]
add w7, w3, 4
ldr w9, [x2]
add w7, w7, w9
cmp w7, 10
ble .L125
.L74:
str w6, [x5, 16]
add w7, w3, 5
ldr w9, [x2]
add w7, w7, w9
cmp w7, 10
ble .L126
.L76:
str w6, [x5, 20]
add w7, w3, 6
ldr w9, [x2]
add w7, w7, w9
cmp w7, 10
ble .L127
.L78:
str w6, [x5, 24]
add w7, w3, 7
ldr w9, [x2]
add w7, w7, w9
cmp w7, 10
ble .L128
.L80:
str w6, [x5, 28]
b .L81
.L84:
ldr w4, [x2]
mov w5, 32
add w6, w3, 1
add w4, w3, w4
cmp w4, 11
csel w4, w4, w5, lt
str w4, [x1, x7, lsl 2]
cmp w0, w6
ble .L1
ldr w4, [x2]
add w7, w3, 2
add w4, w6, w4
cmp w4, 10
csel w5, w5, w4, gt
str w5, [x1, w6, sxtw 2]
cmp w0, w7
ble .L1
ldr w4, [x2]
mov w6, 32
add w5, w3, 3
add w4, w7, w4
cmp w4, 11
csel w4, w4, w6, lt
str w4, [x1, w7, sxtw 2]
cmp w0, w5
ble .L1
ldr w4, [x2]
mov w7, w6
add w6, w3, 4
add w4, w5, w4
cmp w4, 11
csel w4, w4, w7, lt
str w4, [x1, w5, sxtw 2]
cmp w0, w6
ble .L1
ldr w4, [x2]
add w5, w3, 5
add w4, w6, w4
cmp w4, 11
csel w4, w4, w7, lt
str w4, [x1, w6, sxtw 2]
cmp w0, w5
ble .L1
ldr w4, [x2]
add w6, w3, 6
add w4, w5, w4
cmp w4, 11
csel w4, w4, w7, lt
str w4, [x1, w5, sxtw 2]
cmp w0, w6
ble .L1
ldr w4, [x2]
add w3, w3, 7
add w4, w6, w4
cmp w4, 11
csel w4, w4, w7, lt
str w4, [x1, w6, sxtw 2]
cmp w0, w3
ble .L1
ldr w0, [x2]
add w0, w3, w0
cmp w0, 11
csel w0, w0, w7, lt
str w0, [x1, w3, sxtw 2]
b .L1
.L120:
add v2.4s, v4.4s, v2.4s
cmge v1.4s, v1.4s, v2.4s
bsl v1.16b, v2.16b, v3.16b
str q1, [x9, 16]
b .L85
.L117:
ldp d8, d9, [sp, 16]
ldr d10, [sp, 32]
b .L82
.L119:
mov w3, 0
b .L8
.LC3:
.string " "
main:
mov x0, x1
stp x29, x30, [sp, -80]!
mov w2, 10
mov x29, sp
ldr x0, [x0, 8]
mov x1, 0
stp x19, x20, [sp, 16]
stp x21, x22, [sp, 32]
str x23, [sp, 48]
bl strtol
mov x19, x0
str w0, [x29, 76]
sbfiz x0, x0, 2, 32
add x0, x0, 15
and x0, x0, -16
sub sp, sp, x0
mov x20, sp
mov x0, x20
str wzr, [x0], 4
cmp w19, 1
ble .L130
sxtw x2, w19
mov w1, 0
sub x2, x2, #1
lsl x2, x2, 2
bl memset
.L130:
mov w0, w19
add x2, x29, 76
mov x1, x20
bl void unroll_loop<8, main::{lambda(int)#1}>(int, main::{lambda(int)#1}) [clone .isra.0]
ldr w0, [x29, 76]
adrp x23, _ZSt4cout
add x21, x23, :lo12:_ZSt4cout
cmp w0, 0
ble .L131
adrp x22, .LC3
add x21, x23, :lo12:_ZSt4cout
add x22, x22, :lo12:.LC3
mov x19, 0
.L132:
ldr w1, [x20, x19, lsl 2]
mov x0, x21
add x19, x19, 1
bl std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
mov x1, x22
mov x2, 1
bl std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)
ldr w0, [x29, 76]
cmp w0, w19
bgt .L132
.L131:
ldr x0, [x23, #:lo12:_ZSt4cout]
ldr x0, [x0, -24]
add x0, x21, x0
ldr x19, [x0, 240]
cbz x19, .L140
ldrb w0, [x19, 56]
cbz w0, .L134
ldrb w1, [x19, 67]
.L135:
mov x0, x21
bl std::basic_ostream<char, std::char_traits<char> >::put(char)
bl std::basic_ostream<char, std::char_traits<char> >::flush()
mov sp, x29
mov w0, 0
ldp x19, x20, [sp, 16]
ldp x21, x22, [sp, 32]
ldr x23, [sp, 48]
ldp x29, x30, [sp], 80
ret
.L134:
mov x0, x19
bl std::ctype<char>::_M_widen_init() const
ldr x2, [x19]
mov w1, 10
mov x0, x19
ldr x2, [x2, 48]
blr x2
and w1, w0, 255
b .L135
.L140:
bl std::__throw_bad_cast()
_GLOBAL__sub_I_main:
stp x29, x30, [sp, -32]!
mov x29, sp
str x19, [sp, 16]
adrp x19, .LANCHOR0
add x19, x19, :lo12:.LANCHOR0
mov x0, x19
bl std::ios_base::Init::Init() [complete object constructor]
mov x1, x19
adrp x2, __dso_handle
ldr x19, [sp, 16]
add x2, x2, :lo12:__dso_handle
ldp x29, x30, [sp], 32
adrp x0, _ZNSt8ios_base4InitD1Ev
add x0, x0, :lo12:_ZNSt8ios_base4InitD1Ev
b __cxa_atexit
.LC0:
.word 0
.word 8
.word 16
.word 24
.LC1:
.word 0
.word 1
.word 2
.word 3
.LC2:
.word 4
.word 5
.word 6
.word 7
.zero 1
Update: unrolling and SIMD
Unrolling brings two benefits:
- it sets the stage for SIMD
- it reduces branch checks
For SIMD specifically, unrolling alone does not guarantee that SIMD operations become possible.
Using the online compiler at https://godbolt.org/ we can see that of the three snippets below, the first is very SIMD-friendly, the second is partially vectorized, and the third produces no SIMD at all.
(1) Very SIMD-friendly
unroll_loop<16>(12008, [&](ub8 i) ALWAYS_INLINE {
  itemCountsPtr[i] = hashCountsPtr[i] + i;
});
(2) Partially vectorized
unroll_loop<16>(12008, [&](ub8 i) ALWAYS_INLINE {
  itemCountsPtr[i] = hashCountsPtr[i << 2] + i;
});
(3) No SIMD code generated
unroll_loop<16>(12008, [&](ub8 i) ALWAYS_INLINE {
  itemCountsPtr[i] = hashCountsPtr[i * i] + i;
});
Why? I believe it comes down to the memory access pattern.
- In (1) the access pattern is perfectly regular: contiguous memory operations, ideal for SIMD.
- In (2) the pattern is more complex: the right-hand side skips through memory. Even so, the compiler can still emit partial SIMD code.
- In (3) no SIMD logic is generated at all, because the addresses touched by hashCountsPtr[i*i] are completely non-contiguous: 1, 4, 9, 16, 25, etc. Ordinary SIMD loads cannot handle such an access pattern.
To see this more clearly, let's look at the execution details of a few SIMD instructions. First, an instruction that loads data from memory:
ld4 {v20.2d - v23.2d}, [x0], 64
Explanation:
- This is a NEON SIMD instruction; ld4 loads four vector registers from memory in one shot, de-interleaving 4-element structures.
- {v20.2d - v23.2d} names the four destination registers v20 through v23, each arranged as two 64-bit lanes.
- [x0] reads from the address held in register x0, and the trailing , 64 post-increments x0 by 64 bytes after the load.
As you can see, ld4 can only load a contiguous run of bytes starting at [x0]; it cannot skip around.
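The de-interleaving behavior can be described with a small scalar emulation (an illustrative sketch, not how the hardware works internally): 8 contiguous 64-bit lanes from memory are distributed round-robin across four two-lane registers.

```cpp
#include <cstdint>

// Scalar emulation of `ld4 {v0.2d - v3.2d}, [x0], 64` (illustrative
// sketch): read 8 contiguous 64-bit lanes and de-interleave them, so
// lane k of register r receives mem[k * 4 + r]. The contiguity of the
// source is baked into the instruction; it cannot skip around.
static void emulate_ld4_2d(const uint64_t *mem, uint64_t reg[4][2]) {
  for (int lane = 0; lane < 2; ++lane)    // 2 lanes per register
    for (int r = 0; r < 4; ++r)           // 4 destination registers
      reg[r][lane] = mem[lane * 4 + r];   // round-robin de-interleave
}
```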
Next, an ARM store instruction:
stp q31, q31, [sp, 256]
Explanation:
- stp is the ARM Store Pair instruction; here it stores two 128-bit vector (NEON) registers to memory in one shot.
- q31, q31: the two 128-bit registers to store. In this case the NEON register q31 is stored twice, i.e. its contents are written to two consecutive memory slots.
- [sp, 256]: the store address is the stack pointer sp plus a 256-byte offset; the data is written to memory starting at sp + 256.
Likewise, writing data out to memory can only target contiguous addresses.
SIMD is very poorly suited to indirect addressing like the following:
s[i] = m[v[i]];
If you insist on doing it with SIMD, the pseudocode looks like this:
ld1 {v0.4s}, [x0]              // load the index vector v[i] (vector instruction)
ldr w1, [m, v0.s[0], lsl #2]   // load m[v[0]] (ordinary instruction)
ldr w2, [m, v0.s[1], lsl #2]   // load m[v[1]] (ordinary instruction)
ldr w3, [m, v0.s[2], lsl #2]   // load m[v[2]] (ordinary instruction)
ldr w4, [m, v0.s[3], lsl #2]   // load m[v[3]] (ordinary instruction)
ins v1.s[0], w1                // insert w1 into a lane of v1 (ordinary instruction)
ins v1.s[1], w2                // insert w2 into a lane of v1 (ordinary instruction)
ins v1.s[2], w3                // insert w3 into a lane of v1 (ordinary instruction)
ins v1.s[3], w4                // insert w4 into a lane of v1 (ordinary instruction)
st1 {v1.4s}, [x1]              // store the result to s[i] (vector instruction)
In other words, non-contiguous memory accesses are only a good fit for ordinary load/store instructions.
Are there SIMD instructions that allow indirect addressing? There are. For example, AVX-512's gather and scatter instructions support efficient access in the s[i] = m[v[i]]; pattern.
Intel introduced the gather feature with the Haswell architecture. It lets the CPU fetch non-contiguous data elements from memory using vector-indexed addressing. The gather instructions add a new form of memory addressing, built from a base address register (still a general-purpose register) plus multiple indices supplied in a vector register (XMM or YMM). Element sizes of 32 and 64 bits are supported, for both floating-point and integer data.
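As a sketch of that pattern (the helper names here are mine; _mm256_i32gather_epi32 is the real AVX2 gather intrinsic, available since Haswell), gathering 8 ints as s[i] = m[v[i]] might look like this. The runtime CPU check and scalar fallback are my additions for portability, not part of the intrinsic itself:

```cpp
#include <immintrin.h>
#include <cstdint>

// Gather 8 ints: s[i] = m[v[i]] via vpgatherdd (sketch; x86-only).
__attribute__((target("avx2")))
static void gather8_avx2(int32_t *s, const int32_t *m, const int32_t *v) {
  __m256i idx = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(v));
  // scale = 4: each index selects a 4-byte element relative to base m
  __m256i vals = _mm256_i32gather_epi32(m, idx, 4);
  _mm256_storeu_si256(reinterpret_cast<__m256i *>(s), vals);
}

static void gather8(int32_t *s, const int32_t *m, const int32_t *v) {
  if (__builtin_cpu_supports("avx2")) {
    gather8_avx2(s, m, v);
  } else {
    for (int i = 0; i < 8; ++i) s[i] = m[v[i]];  // scalar fallback
  }
}
```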
More on AVX-512: https://blog.youkuaiyun.com/zenny_chen/article/details/130582827