Manual Unrolling for Vectorized Execution
Evaluating the Effect of Unrolling
A common misconception in database implementation is that once we organize our data into vectors, we just need to write a clean for loop and the compiler will automatically take care of SIMD optimization for us.
Compilers are rarely that clever, because our for loops usually have variable bounds at both ends, which makes it hard for the compiler to apply loop unrolling. For example, here is how OceanBase iterates over vector elements to do a computation (say, int + int addition):
for (int64_t idx = bound.start(); OB_SUCC(ret) && idx < bound.end(); ++idx) {
  ret = ArithOp::vector_op(*res_vec, *left_vec, *right_vec, idx, args...);
}
This code gives the compiler little to work with: how many elements lie between bound.start() and bound.end()? If there is only one element but the loop was statically unrolled for several, the accesses would run out of bounds. If the compiler knew there were 1024 elements between bound.start() and bound.end(), it could unroll boldly.
Could the compiler insert dynamic branches to check? In theory, yes: take the unrolled path when there are many elements, and skip it when there are few. But remember, the compiler has no idea this is a vectorized computation, so why would it casually make such an optimization?
The more reliable approach, then, is to unroll by hand. One can also write unroll template helpers to simplify the manual unrolling logic; we will not go into that here.
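As a minimal sketch of the idea (the function name, element type, and signature here are my own assumptions, not OceanBase code), a 4x manual unroll of an element-wise addition can be structured as a wide main loop plus a scalar tail:

```cpp
#include <cstdint>

// Hypothetical sketch: manually unrolled element-wise addition over
// [start, end). The main loop handles 4 elements per iteration; the
// tail loop picks up the remaining (end - idx) < 4 elements, so no
// access can run past the end regardless of the element count.
static void add_vectors_unroll4(int32_t *res, const int32_t *left,
                                const int32_t *right,
                                int64_t start, int64_t end) {
  int64_t idx = start;
  // Main loop: 4 independent additions per iteration.
  for (; idx + 4 <= end; idx += 4) {
    res[idx]     = left[idx]     + right[idx];
    res[idx + 1] = left[idx + 1] + right[idx + 1];
    res[idx + 2] = left[idx + 2] + right[idx + 2];
    res[idx + 3] = left[idx + 3] + right[idx + 3];
  }
  // Tail loop: leftover elements, one at a time.
  for (; idx < end; ++idx) {
    res[idx] = left[idx] + right[idx];
  }
}
```

The four statements per iteration have no data dependencies between them, which is exactly what gives the compiler license to emit SIMD instructions for the main loop.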
下面,主要评测下手工 unroll 的效果:
代码:
[dev-rayu ~/tools] $cat test.cpp
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <memory>
#include <cstring>
#define ALWAYS_INLINE __attribute__((always_inline))
#define NEVER_INLINE __attribute__((noinline))
#define SF_COLD __attribute__((cold, noinline))
typedef uint8_t ub1;
typedef uint16_t ub2;
typedef uint32_t ub4;
typedef uint64_t ub8;
typedef __uint128_t ub16;
template <typename T>
static inline ALWAYS_INLINE ub4 loadUb4PostIncrement(const T* __restrict__ & in) {
  ub4 ret;
  const ub1* inUb1 = reinterpret_cast<const ub1*>(in);
  std::memcpy(&ret, inUb1, sizeof(ub4));
  in = reinterpret_cast<const T*>(inUb1 + sizeof(ub4));
  return ret;
}
template <typename T>
static inline ALWAYS_INLINE ub4 loadUb4PostIncrement2(const T* __restrict__ & in) {
  ub4 ret;
  const ub1* inUb1 = reinterpret_cast<const ub1*>(in);
  ret = *reinterpret_cast<const ub4*>(inUb1);
  in = reinterpret_cast<const T*>(inUb1 + sizeof(ub4));
  return ret;
}
int main(int argc, char **argv)
{
  // each pass does 10240 ub4 loads, i.e. 10240*4 bytes must be addressable
  ub1 *freeptr = (ub1*)malloc(10240*4 + 4*4);
  memset(freeptr, 12, 10240*4 + 4*4);
  ub4 sum = 0;
  for (int j = 0; j < atoi(argv[1]); j++) {
    const ub1 *mem = freeptr;
#ifdef UNROLL
    for (int i = 0; i < 10240 >> 2; ++i) {
#ifdef USE_MEMCPY
      sum += loadUb4PostIncrement(mem);
      sum += loadUb4PostIncrement(mem);
      sum += loadUb4PostIncrement(mem);
      sum += loadUb4PostIncrement(mem);
#else
      sum += loadUb4PostIncrement2(mem);
      sum += loadUb4PostIncrement2(mem);
      sum += loadUb4PostIncrement2(mem);
      sum += loadUb4PostIncrement2(mem);
#endif
    }
#endif
#ifndef UNROLL
    for (int i = 0; i < 10240; ++i) {
#ifdef USE_MEMCPY
      sum += loadUb4PostIncrement(mem);
#else
      sum += loadUb4PostIncrement2(mem);
#endif
    }
#endif
  }
#ifdef USE_MEMCPY
  printf("memcpy:%d\n", sum);
#else
  printf("assign:%d\n", sum);
#endif
  free((void*)freeptr);
  return 0;
}
Makefile:
[dev-rayu ~/tools] $cat Makefile
SHELL := /bin/bash # Use bash for built-in time
all:
	g++ -O2 test.cpp -DUSE_ASSIGN -DUNROLL -o run_assign_unroll
	g++ -O2 test.cpp -DUSE_MEMCPY -DUNROLL -o run_memcpy_unroll
	g++ -O2 test.cpp -DUSE_ASSIGN -o run_assign
	g++ -O2 test.cpp -DUSE_MEMCPY -o run_memcpy
run:
	@time ./run_assign_unroll 102400
	@time ./run_memcpy_unroll 102400
	@time ./run_assign 102400
	@time ./run_memcpy 102400
[dev-rayu ~/tools] $make
g++ -O2 test.cpp -DUSE_ASSIGN -DUNROLL -o run_assign_unroll
g++ -O2 test.cpp -DUSE_MEMCPY -DUNROLL -o run_memcpy_unroll
g++ -O2 test.cpp -DUSE_ASSIGN -o run_assign
g++ -O2 test.cpp -DUSE_MEMCPY -o run_memcpy
[dev-rayu ~/tools] $make run
assign:-1290104832
real 0m0.163s
user 0m0.163s
sys 0m0.000s
memcpy:-1290104832
real 0m0.161s
user 0m0.161s
sys 0m0.000s
assign:-1290104832
real 0m0.453s
user 0m0.454s
sys 0m0.000s
memcpy:-1290104832
real 0m0.449s
user 0m0.449s
sys 0m0.000s
As you can see, manually unrolling 4 times yields a substantial 3x performance improvement. I also tested unrolling 8 times, and the results are even more striking: a 5x improvement.
[dev-rayu ~/tools] $make run
assign:-1800368128
real 0m0.080s
user 0m0.080s
sys 0m0.000s
memcpy:-1800368128
real 0m0.080s
user 0m0.080s
sys 0m0.000s
assign:-1800368128
real 0m0.445s
user 0m0.445s
sys 0m0.000s
memcpy:-1800368128
real 0m0.447s
user 0m0.447s
sys 0m0.000s
Update: a generic unrolling implementation:
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <memory>
#include <cstring>
#define ALWAYS_INLINE __attribute__((always_inline))
#define NEVER_INLINE __attribute__((noinline))
#define COLD __attribute__((cold, noinline))
typedef uint8_t ub1;
typedef uint16_t ub2;
typedef uint32_t ub4;
typedef uint64_t ub8;
typedef __uint128_t ub16;
#include <iostream>
#include <functional>
#include <type_traits>
// Compile-time recursive expansion, resolved entirely statically from UnrollFactor
template<int N, typename Lambda, typename IndexType>
constexpr void ALWAYS_INLINE unroll_impl(IndexType i, Lambda body) {
  if constexpr (N > 0) {
    // expand once at compile time...
    body(i);
    // ...then recurse to expand the remaining N-1 steps
    unroll_impl<N - 1>(i + 1, body);
  }
}
// Public interface: UnrollFactor controls the expansion, fully resolved via constexpr
template<int UnrollFactor, typename Lambda, typename IndexType>
constexpr void NEVER_INLINE unroll_loop(IndexType total, Lambda body) {
  IndexType i = 0;
  // main loop, fully unrolled in groups of UnrollFactor
  for (; i <= total - UnrollFactor; i += UnrollFactor) {
    unroll_impl<UnrollFactor>(i, body);
  }
  // handle the leftover elements that did not fill a full group
  for (; i < total; ++i) {
    body(i);
  }
}
int main(int argc, char **argv) {
  const int total = atoi(argv[1]);
  int v[total] = {0};  // VLA: a GCC extension in C++
  // express the loop body as a lambda; unroll factor set to 8
  unroll_loop<8>(total, [&](int i) {
    v[i] = i + total;
    if (v[i] > 10) {
      v[i] = 32;
    }
  });
  // print the results
  for (int i = 0; i < total; ++i) {
    std::cout << v[i] << " ";
  }
  std::cout << std::endl;
  return 0;
}
The corresponding ARM assembly is:
void unroll_loop<8, main::{lambda(int)#1}>(int, main::{lambda(int)#1}) [clone .isra.0]:
stp x29, x30, [sp, -48]!
sub w8, w0, #7
mov x29, sp
cmp w8, 0
ble .L119
sub w5, w0, #8
add x3, x2, 4
lsr w4, w5, 3
add w4, w4, 1
ubfiz x6, x4, 5, 30
add x6, x1, x6
cmp x2, x6
ccmp x1, x3, 2, cc
ccmp w5, 31, 0, cs
bls .L5
lsr w5, w4, 2
mov w6, 128
adrp x3, .LC0
ld1r {v5.4s}, [x2]
movi v0.4s, 0x20
stp d8, d9, [sp, 16]
umaddl x5, w5, w6, x1
movi v4.4s, 0xa
str d10, [sp, 32]
movi v24.4s, 0x1
movi v23.4s, 0x2
movi v22.4s, 0x3
movi v21.4s, 0x4
movi v20.4s, 0x5
movi v19.4s, 0x6
movi v18.4s, 0x7
ldr q17, [x3, #:lo12:.LC0]
mov x3, x1
.L7:
mov v1.16b, v17.16b
add v17.4s, v17.4s, v0.4s
add v28.4s, v1.4s, v20.4s
add v6.4s, v1.4s, v22.4s
add v27.4s, v1.4s, v18.4s
add v7.4s, v1.4s, v23.4s
add v26.4s, v1.4s, v19.4s
add v25.4s, v1.4s, v21.4s
add v3.4s, v1.4s, v24.4s
add v9.4s, v28.4s, v5.4s
add v8.4s, v6.4s, v5.4s
add v29.4s, v26.4s, v5.4s
add v16.4s, v25.4s, v5.4s
add v3.4s, v3.4s, v5.4s
add v31.4s, v27.4s, v5.4s
add v30.4s, v7.4s, v5.4s
add v1.4s, v1.4s, v5.4s
cmge v28.4s, v4.4s, v9.4s
cmge v6.4s, v4.4s, v8.4s
cmge v26.4s, v4.4s, v29.4s
cmge v25.4s, v4.4s, v16.4s
cmge v2.4s, v4.4s, v1.4s
cmge v10.4s, v4.4s, v3.4s
cmge v27.4s, v4.4s, v31.4s
cmge v7.4s, v4.4s, v30.4s
bsl v28.16b, v9.16b, v0.16b
bsl v26.16b, v29.16b, v0.16b
bsl v25.16b, v16.16b, v0.16b
bif v3.16b, v0.16b, v10.16b
bsl v6.16b, v8.16b, v0.16b
bsl v27.16b, v31.16b, v0.16b
bif v1.16b, v0.16b, v2.16b
bsl v7.16b, v30.16b, v0.16b
zip1 v16.4s, v3.4s, v28.4s
zip2 v2.4s, v3.4s, v28.4s
zip1 v29.4s, v6.4s, v27.4s
zip1 v3.4s, v1.4s, v25.4s
zip1 v28.4s, v7.4s, v26.4s
zip2 v6.4s, v6.4s, v27.4s
zip2 v7.4s, v7.4s, v26.4s
zip1 v27.4s, v16.4s, v29.4s
zip1 v26.4s, v3.4s, v28.4s
zip2 v1.4s, v1.4s, v25.4s
zip2 v16.4s, v16.4s, v29.4s
zip2 v3.4s, v3.4s, v28.4s
zip1 v25.4s, v1.4s, v7.4s
zip2 v1.4s, v1.4s, v7.4s
zip1 v7.4s, v2.4s, v6.4s
zip2 v2.4s, v2.4s, v6.4s
zip1 v6.4s, v26.4s, v27.4s
zip2 v26.4s, v26.4s, v27.4s
zip1 v9.4s, v3.4s, v16.4s
zip1 v8.4s, v25.4s, v7.4s
zip2 v3.4s, v3.4s, v16.4s
zip2 v25.4s, v25.4s, v7.4s
stp q6, q26, [x3]
zip1 v6.4s, v1.4s, v2.4s
zip2 v1.4s, v1.4s, v2.4s
stp q9, q3, [x3, 32]
stp q8, q25, [x3, 64]
stp q6, q1, [x3, 96]
add x3, x3, 128
cmp x3, x5
bne .L7
and w5, w4, -4
lsl w3, w5, 3
cmp w4, w5
beq .L117
ldr w6, [x2]
mov w5, 32
mov w30, w5
mov w11, w5
add w6, w3, w6
mov w10, w5
cmp w6, 11
add w12, w3, 1
csel w6, w6, w5, lt
str w6, [x1, w3, sxtw 2]
add w18, w3, 2
add w9, w3, 3
ldr w5, [x2]
add w16, w3, 4
add w6, w3, 5
add w13, w3, 7
add w5, w12, w5
cmp w5, 11
csel w5, w5, w11, lt
str w5, [x1, w12, sxtw 2]
add w5, w3, 6
add w12, w3, 8
ldr w11, [x2]
add w11, w18, w11
cmp w11, 11
csel w11, w11, w10, lt
str w11, [x1, w18, sxtw 2]
ldr w10, [x2]
add w10, w9, w10
cmp w10, 11
csel w10, w10, w30, lt
str w10, [x1, w9, sxtw 2]
ldr w9, [x2]
add w9, w16, w9
cmp w9, 11
csel w9, w9, w30, lt
str w9, [x1, w16, sxtw 2]
ldr w7, [x2]
add w7, w6, w7
cmp w7, 11
csel w7, w7, w30, lt
str w7, [x1, w6, sxtw 2]
ldr w6, [x2]
add w6, w5, w6
cmp w6, 11
csel w6, w6, w30, lt
str w6, [x1, w5, sxtw 2]
ldr w5, [x2]
add w5, w13, w5
cmp w5, 11
csel w5, w5, w30, lt
str w5, [x1, w13, sxtw 2]
cmp w8, w12
ble .L117
ldr w6, [x2]
add w13, w3, 9
add w18, w3, 10
add w9, w3, 11
add w6, w12, w6
add w16, w3, 12
cmp w6, 11
csel w6, w6, w30, lt
str w6, [x1, w12, sxtw 2]
add w6, w3, 13
add w12, w3, 16
ldr w5, [x2]
add w5, w13, w5
cmp w5, 11
csel w5, w5, w30, lt
str w5, [x1, w13, sxtw 2]
add w5, w3, 14
add w13, w3, 15
ldr w11, [x2]
add w11, w18, w11
cmp w11, 11
csel w11, w11, w30, lt
str w11, [x1, w18, sxtw 2]
ldr w10, [x2]
add w10, w9, w10
cmp w10, 11
csel w10, w10, w30, lt
str w10, [x1, w9, sxtw 2]
ldr w9, [x2]
add w9, w16, w9
cmp w9, 11
csel w9, w9, w30, lt
str w9, [x1, w16, sxtw 2]
ldr w7, [x2]
add w7, w6, w7
cmp w7, 11
csel w7, w7, w30, lt
str w7, [x1, w6, sxtw 2]
ldr w6, [x2]
add w6, w5, w6
cmp w6, 11
csel w6, w6, w30, lt
str w6, [x1, w5, sxtw 2]
ldr w5, [x2]
add w5, w13, w5
cmp w5, 11
csel w5, w5, w30, lt
str w5, [x1, w13, sxtw 2]
cmp w8, w12
ble .L117
ldr w6, [x2]
add w7, w3, 17
add w15, w3, 18
add w14, w3, 20
add w6, w12, w6
add w13, w3, 21
cmp w6, 11
add w11, w3, 22
csel w6, w6, w30, lt
str w6, [x1, w12, sxtw 2]
add w6, w3, 19
add w3, w3, 23
ldr w5, [x2]
ldp d8, d9, [sp, 16]
add w5, w7, w5
cmp w5, 11
ldr d10, [sp, 32]
csel w5, w5, w30, lt
str w5, [x1, w7, sxtw 2]
ldr w10, [x2]
add w10, w15, w10
cmp w10, 11
csel w10, w10, w30, lt
str w10, [x1, w15, sxtw 2]
ldr w9, [x2]
add w9, w6, w9
cmp w9, 11
csel w9, w9, w30, lt
str w9, [x1, w6, sxtw 2]
ldr w8, [x2]
add w8, w14, w8
cmp w8, 11
csel w8, w8, w30, lt
str w8, [x1, w14, sxtw 2]
ldr w7, [x2]
add w7, w13, w7
cmp w7, 11
csel w7, w7, w30, lt
str w7, [x1, w13, sxtw 2]
ldr w6, [x2]
add w6, w11, w6
cmp w6, 11
csel w6, w6, w30, lt
str w6, [x1, w11, sxtw 2]
ldr w5, [x2]
add w5, w3, w5
cmp w5, 11
csel w5, w5, w30, lt
str w5, [x1, w3, sxtw 2]
.L82:
lsl w3, w4, 3
.L8:
cmp w0, w3
ble .L1
sxtw x7, w3
sub w5, w0, w3
add x5, x7, x5
sbfiz x8, x3, 2, 32
add x9, x1, x8
add x6, x2, 4
add x5, x1, x5, lsl 2
sub w4, w0, #1
cmp x2, x5
sub w4, w4, w3
ccmp x6, x9, 0, cc
sub w5, w0, w3
ccmp w4, 4, 0, ls
bls .L84
adrp x4, .LC1
dup v2.4s, w3
ld1r {v4.4s}, [x2]
adrp x6, .LC2
ldr q0, [x4, #:lo12:.LC1]
lsr w4, w5, 2
ldr q5, [x6, #:lo12:.LC2]
add v0.4s, v2.4s, v0.4s
movi v1.4s, 0xa
movi v3.4s, 0x20
add v2.4s, v2.4s, v5.4s
add v0.4s, v0.4s, v4.4s
cmge v5.4s, v1.4s, v0.4s
bif v0.16b, v3.16b, v5.16b
str q0, [x1, x8]
cmp w4, 1
bne .L120
.L85:
and w4, w5, -4
add w3, w4, w3
cmp w4, w5
beq .L1
ldr w4, [x2]
mov w6, 32
add w5, w3, 1
add w4, w3, w4
cmp w4, 11
csel w4, w4, w6, lt
str w4, [x1, w3, sxtw 2]
cmp w0, w5
ble .L1
ldr w4, [x2]
add w3, w3, 2
add w4, w5, w4
cmp w4, 11
csel w4, w4, w6, lt
str w4, [x1, w5, sxtw 2]
cmp w0, w3
ble .L1
ldr w0, [x2]
add w0, w3, w0
cmp w0, 11
csel w0, w0, w6, lt
str w0, [x1, w3, sxtw 2]
.L1:
ldp x29, x30, [sp], 48
ret
.L5:
mov x5, x1
mov w3, 0
mov w6, 32
b .L83
.L121:
str w7, [x5]
add w7, w3, 1
ldr w9, [x2]
add w7, w7, w9
cmp w7, 10
bgt .L68
.L122:
str w7, [x5, 4]
add w7, w3, 2
ldr w9, [x2]
add w7, w7, w9
cmp w7, 10
bgt .L70
.L123:
str w7, [x5, 8]
add w7, w3, 3
ldr w9, [x2]
add w7, w7, w9
cmp w7, 10
bgt .L72
.L124:
str w7, [x5, 12]
add w7, w3, 4
ldr w9, [x2]
add w7, w7, w9
cmp w7, 10
bgt .L74
.L125:
str w7, [x5, 16]
add w7, w3, 5
ldr w9, [x2]
add w7, w7, w9
cmp w7, 10
bgt .L76
.L126:
str w7, [x5, 20]
add w7, w3, 6
ldr w9, [x2]
add w7, w7, w9
cmp w7, 10
bgt .L78
.L127:
str w7, [x5, 24]
add w7, w3, 7
ldr w9, [x2]
add w7, w7, w9
cmp w7, 10
bgt .L80
.L128:
str w7, [x5, 28]
.L81:
add w3, w3, 8
add x5, x5, 32
cmp w8, w3
ble .L82
.L83:
ldr w7, [x2]
add w7, w3, w7
cmp w7, 10
ble .L121
str w6, [x5]
add w7, w3, 1
ldr w9, [x2]
add w7, w7, w9
cmp w7, 10
ble .L122
.L68:
str w6, [x5, 4]
add w7, w3, 2
ldr w9, [x2]
add w7, w7, w9
cmp w7, 10
ble .L123
.L70:
str w6, [x5, 8]
add w7, w3, 3
ldr w9, [x2]
add w7, w7, w9
cmp w7, 10
ble .L124
.L72:
str w6, [x5, 12]
add w7, w3, 4
ldr w9, [x2]
add w7, w7, w9
cmp w7, 10
ble .L125
.L74:
str w6, [x5, 16]
add w7, w3, 5
ldr w9, [x2]
add w7, w7, w9
cmp w7, 10
ble .L126
.L76:
str w6, [x5, 20]
add w7, w3, 6
ldr w9, [x2]
add w7, w7, w9
cmp w7, 10
ble .L127
.L78:
str w6, [x5, 24]
add w7, w3, 7
ldr w9, [x2]
add w7, w7, w9
cmp w7, 10
ble .L128
.L80:
str w6, [x5, 28]
b .L81
.L84:
ldr w4, [x2]
mov w5, 32
add w6, w3, 1
add w4, w3, w4
cmp w4, 11
csel w4, w4, w5, lt
str w4, [x1, x7, lsl 2]
cmp w0, w6
ble .L1
ldr w4, [x2]
add w7, w3, 2
add w4, w6, w4
cmp w4, 10
csel w5, w5, w4, gt
str w5, [x1, w6, sxtw 2]
cmp w0, w7
ble .L1
ldr w4, [x2]
mov w6, 32
add w5, w3, 3
add w4, w7, w4
cmp w4, 11
csel w4, w4, w6, lt
str w4, [x1, w7, sxtw 2]
cmp w0, w5
ble .L1
ldr w4, [x2]
mov w7, w6
add w6, w3, 4
add w4, w5, w4
cmp w4, 11
csel w4, w4, w7, lt
str w4, [x1, w5, sxtw 2]
cmp w0, w6
ble .L1
ldr w4, [x2]
add w5, w3, 5
add w4, w6, w4
cmp w4, 11
csel w4, w4, w7, lt
str w4, [x1, w6, sxtw 2]
cmp w0, w5
ble .L1
ldr w4, [x2]
add w6, w3, 6
add w4, w5, w4
cmp w4, 11
csel w4, w4, w7, lt
str w4, [x1, w5, sxtw 2]
cmp w0, w6
ble .L1
ldr w4, [x2]
add w3, w3, 7
add w4, w6, w4
cmp w4, 11
csel w4, w4, w7, lt
str w4, [x1, w6, sxtw 2]
cmp w0, w3
ble .L1
ldr w0, [x2]
add w0, w3, w0
cmp w0, 11
csel w0, w0, w7, lt
str w0, [x1, w3, sxtw 2]
b .L1
.L120:
add v2.4s, v4.4s, v2.4s
cmge v1.4s, v1.4s, v2.4s
bsl v1.16b, v2.16b, v3.16b
str q1, [x9, 16]
b .L85
.L117:
ldp d8, d9, [sp, 16]
ldr d10, [sp, 32]
b .L82
.L119:
mov w3, 0
b .L8
.LC3:
.string " "
main:
mov x0, x1
stp x29, x30, [sp, -80]!
mov w2, 10
mov x29, sp
ldr x0, [x0, 8]
mov x1, 0
stp x19, x20, [sp, 16]
stp x21, x22, [sp, 32]
str x23, [sp, 48]
bl strtol
mov x19, x0
str w0, [x29, 76]
sbfiz x0, x0, 2, 32
add x0, x0, 15
and x0, x0, -16
sub sp, sp, x0
mov x20, sp
mov x0, x20
str wzr, [x0], 4
cmp w19, 1
ble .L130
sxtw x2, w19
mov w1, 0
sub x2, x2, #1
lsl x2, x2, 2
bl memset
.L130:
mov w0, w19
add x2, x29, 76
mov x1, x20
bl void unroll_loop<8, main::{lambda(int)#1}>(int, main::{lambda(int)#1}) [clone .isra.0]
ldr w0, [x29, 76]
adrp x23, _ZSt4cout
add x21, x23, :lo12:_ZSt4cout
cmp w0, 0
ble .L131
adrp x22, .LC3
add x21, x23, :lo12:_ZSt4cout
add x22, x22, :lo12:.LC3
mov x19, 0
.L132:
ldr w1, [x20, x19, lsl 2]
mov x0, x21
add x19, x19, 1
bl std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
mov x1, x22
mov x2, 1
bl std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)
ldr w0, [x29, 76]
cmp w0, w19
bgt .L132
.L131:
ldr x0, [x23, #:lo12:_ZSt4cout]
ldr x0, [x0, -24]
add x0, x21, x0
ldr x19, [x0, 240]
cbz x19, .L140
ldrb w0, [x19, 56]
cbz w0, .L134
ldrb w1, [x19, 67]
.L135:
mov x0, x21
bl std::basic_ostream<char, std::char_traits<char> >::put(char)
bl std::basic_ostream<char, std::char_traits<char> >::flush()
mov sp, x29
mov w0, 0
ldp x19, x20, [sp, 16]
ldp x21, x22, [sp, 32]
ldr x23, [sp, 48]
ldp x29, x30, [sp], 80
ret
.L134:
mov x0, x19
bl std::ctype<char>::_M_widen_init() const
ldr x2, [x19]
mov w1, 10
mov x0, x19
ldr x2, [x2, 48]
blr x2
and w1, w0, 255
b .L135
.L140:
bl std::__throw_bad_cast()
_GLOBAL__sub_I_main:
stp x29, x30, [sp, -32]!
mov x29, sp
str x19, [sp, 16]
adrp x19, .LANCHOR0
add x19, x19, :lo12:.LANCHOR0
mov x0, x19
bl std::ios_base::Init::Init() [complete object constructor]
mov x1, x19
adrp x2, __dso_handle
ldr x19, [sp, 16]
add x2, x2, :lo12:__dso_handle
ldp x29, x30, [sp], 32
adrp x0, _ZNSt8ios_base4InitD1Ev
add x0, x0, :lo12:_ZNSt8ios_base4InitD1Ev
b __cxa_atexit
.LC0:
.word 0
.word 8
.word 16
.word 24
.LC1:
.word 0
.word 1
.word 2
.word 3
.LC2:
.word 4
.word 5
.word 6
.word 7
.zero 1
Update: unrolling and SIMD
Unrolling brings two benefits:
- it sets the stage for SIMD
- it reduces branch checks
For SIMD specifically, unrolling alone does not guarantee that SIMD operations become possible.
Using the online compiler at https://godbolt.org/ we can see that of the three snippets below, the first is very SIMD-friendly, the second is partially vectorized, and the third produces no SIMD at all.
(1) Very SIMD-friendly
unroll_loop<16>(12008, [&](ub8 i) ALWAYS_INLINE {
  itemCountsPtr[i] = hashCountsPtr[i] + i;
});
(2) Partially vectorized
unroll_loop<16>(12008, [&](ub8 i) ALWAYS_INLINE {
  itemCountsPtr[i] = hashCountsPtr[i << 2] + i;
});
(3) No SIMD code generated
unroll_loop<16>(12008, [&](ub8 i) ALWAYS_INLINE {
  itemCountsPtr[i] = hashCountsPtr[i * i] + i;
});
Why? I believe it comes down to the memory access pattern.
- In (1) the access pattern is perfectly regular: contiguous memory operations, ideal for SIMD.
- In (2) the pattern is more complex: the right-hand side skips through memory. Even so, the compiler can still emit partial SIMD code.
- In (3) no SIMD logic is generated at all, because the addresses touched by hashCountsPtr[i*i] are completely non-contiguous: 1, 4, 9, 16, 25, etc. Ordinary SIMD loads cannot handle such an access pattern.
To see this more clearly, let's look at the execution details of a few SIMD instructions. First, an instruction that loads data from memory:
ld4 {v20.2d - v23.2d}, [x0], 64
Explanation:
- This is a NEON SIMD instruction; ld4 loads four vector registers from memory in one shot, de-interleaving 4-element structures.
- {v20.2d - v23.2d} names the four destination registers v20 through v23, each arranged as two 64-bit lanes.
- [x0] reads from the address held in register x0, and the trailing , 64 post-increments x0 by 64 bytes after the load.
As you can see, ld4 can only load a contiguous run of bytes starting at [x0]; it cannot skip around.
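The de-interleaving behavior can be described with a small scalar emulation (an illustrative sketch, not how the hardware works internally): 8 contiguous 64-bit lanes from memory are distributed round-robin across four two-lane registers.

```cpp
#include <cstdint>

// Scalar emulation of `ld4 {v0.2d - v3.2d}, [x0], 64` (illustrative
// sketch): read 8 contiguous 64-bit lanes and de-interleave them, so
// lane k of register r receives mem[k * 4 + r]. The contiguity of the
// source is baked into the instruction; it cannot skip around.
static void emulate_ld4_2d(const uint64_t *mem, uint64_t reg[4][2]) {
  for (int lane = 0; lane < 2; ++lane)    // 2 lanes per register
    for (int r = 0; r < 4; ++r)           // 4 destination registers
      reg[r][lane] = mem[lane * 4 + r];   // round-robin de-interleave
}
```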
Next, an ARM store instruction:
stp q31, q31, [sp, 256]
Explanation:
- stp is the ARM Store Pair instruction; here it stores two 128-bit vector (NEON) registers to memory in one shot.
- q31, q31: the two 128-bit registers to store. In this case the NEON register q31 is stored twice, i.e. its contents are written to two consecutive memory slots.
- [sp, 256]: the store address is the stack pointer sp plus a 256-byte offset; the data is written to memory starting at sp + 256.
Likewise, writing data out to memory can only target contiguous addresses.
SIMD is very poorly suited to indirect addressing like the following:
s[i] = m[v[i]];
If you insist on doing it with SIMD, the pseudocode looks like this:
ld1 {v0.4s}, [x0]              // load the index vector v[i] (vector instruction)
ldr w1, [m, v0.s[0], lsl #2]   // load m[v[0]] (ordinary instruction)
ldr w2, [m, v0.s[1], lsl #2]   // load m[v[1]] (ordinary instruction)
ldr w3, [m, v0.s[2], lsl #2]   // load m[v[2]] (ordinary instruction)
ldr w4, [m, v0.s[3], lsl #2]   // load m[v[3]] (ordinary instruction)
ins v1.s[0], w1                // insert w1 into a lane of v1 (ordinary instruction)
ins v1.s[1], w2                // insert w2 into a lane of v1 (ordinary instruction)
ins v1.s[2], w3                // insert w3 into a lane of v1 (ordinary instruction)
ins v1.s[3], w4                // insert w4 into a lane of v1 (ordinary instruction)
st1 {v1.4s}, [x1]              // store the result to s[i] (vector instruction)
In other words, non-contiguous memory accesses are only a good fit for ordinary load/store instructions.
Are there SIMD instructions that allow indirect addressing? There are. For example, AVX-512's gather and scatter instructions support efficient access in the s[i] = m[v[i]]; pattern.
Intel introduced the gather feature with the Haswell architecture. It lets the CPU fetch non-contiguous data elements from memory using vector-indexed addressing. The gather instructions add a new form of memory addressing, built from a base address register (still a general-purpose register) plus multiple indices supplied in a vector register (XMM or YMM). Element sizes of 32 and 64 bits are supported, for both floating-point and integer data.
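As a sketch of that pattern (the helper names here are mine; _mm256_i32gather_epi32 is the real AVX2 gather intrinsic, available since Haswell), gathering 8 ints as s[i] = m[v[i]] might look like this. The runtime CPU check and scalar fallback are my additions for portability, not part of the intrinsic itself:

```cpp
#include <immintrin.h>
#include <cstdint>

// Gather 8 ints: s[i] = m[v[i]] via vpgatherdd (sketch; x86-only).
__attribute__((target("avx2")))
static void gather8_avx2(int32_t *s, const int32_t *m, const int32_t *v) {
  __m256i idx = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(v));
  // scale = 4: each index selects a 4-byte element relative to base m
  __m256i vals = _mm256_i32gather_epi32(m, idx, 4);
  _mm256_storeu_si256(reinterpret_cast<__m256i *>(s), vals);
}

static void gather8(int32_t *s, const int32_t *m, const int32_t *v) {
  if (__builtin_cpu_supports("avx2")) {
    gather8_avx2(s, m, v);
  } else {
    for (int i = 0; i < 8; ++i) s[i] = m[v[i]];  // scalar fallback
  }
}
```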
More on AVX-512: https://blog.youkuaiyun.com/zenny_chen/article/details/130582827