instruction latency 1/throughput
06_2A
VPERM2F128 ymm1, ymm2, ymm3, imm 1 1
VPERMILPD/PS ymm1, ymm2, ymm3 1 1
VSHUFPD/PS ymm1, ymm2, ymm3, imm 1 1
PSHUFB xmm1,xmm2 1 1 1 3 0.5 0.5 1 2
This one varies; best case is latency 1 and 1/throughput 0.5, i.e. two per cycle.
SHUFPD xmm, xmm, imm8 1 1 1 1 1 1 1 1
Best case: latency 1, 1/throughput 1.
Somewhat slower than the integer shuffle.
shufps
SHUFPS xmm, xmm, imm8 1 1 1 2 1 1 1 1
On some machines it is slightly slower than SHUFPD.
INSERTPS xmm1, xmm2, imm 1 1 1 1 1 1
EXTRACTPS xmm1, xmm2, imm 3 2 5 1 1 1
VEXTRACTF128 ymm1, ymm2, imm 1 1 avx
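
A rough sketch in C of how to read the two columns. The function names per_cycle and estimate_cycles are made up for illustration, and the model is a simplification: it ignores port contention, front-end limits, and any surrounding code. A 1/throughput of 0.5 means two such instructions can start per cycle; latency only accumulates when each instruction depends on the previous result.

```c
#include <assert.h>

/* Instructions that can start per cycle, given the table's 1/throughput value.
   Hypothetical helper, not taken from any manual or library. */
double per_cycle(double recip_throughput) {
    assert(recip_throughput > 0.0);
    return 1.0 / recip_throughput;
}

/* Rough cycle estimate for n back-to-back instructions of one kind.
   dependent != 0: each instruction consumes the previous result,
                   so the chain costs n * latency.
   dependent == 0: instructions are independent, so one starts every
                   recip_throughput cycles and the last one still needs
                   its full latency to finish. */
double estimate_cycles(int n, double latency, double recip_throughput,
                       int dependent) {
    assert(n > 0 && recip_throughput > 0.0);
    if (dependent) {
        return n * latency;
    }
    return (n - 1) * recip_throughput + latency;
}
```

With the best-case PSHUFB figures above (latency 1, 1/throughput 0.5), eight independent shuffles come out at about 4.5 cycles under this model, versus 8 cycles for a dependent chain.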