natural assembly
- no register dependency, no penalty
ld1 {v0.4s}, [x0], #16
fmla v10.4s, v16.4s, v24.s[0]
fmla v11.4s, v16.4s, v24.s[1]
fmla v12.4s, v16.4s, v24.s[2]
fmla v13.4s, v16.4s, v24.s[3]
A53
- 128bit vector load cannot be dual issued with fmla, wait 2 cycles
- 64bit vector load cannot be dual issued with fmla, wait 1 cycle
- 64bit integer load can be dual issued with fmla, no penalty
- pointer update can be dual issued with fmla, no penalty
- 64bit vector load and 64bit vector insert can be dual issued, no penalty
- no vector load can be issued on the 4th cycle of an fmla (when the fmla enters the accumulator pipeline)
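A rough sketch of how these rules might combine, not a tuned schedule: the 128-bit ld1 from the natural assembly is split into a 64-bit vector load, a 64-bit integer load and a 64-bit vector insert, spread across the fmla sequence.
The base register x0, the scratch register x9 and the instruction order are assumptions for illustration only; actual cycle placement should be measured on the target core.
ldr d0, [x0]                     // 64-bit vector load: low half of v0
fmla v10.4s, v16.4s, v24.s[0]
ldr x9, [x0, #8]                 // 64-bit integer load: high half, can dual issue with fmla
fmla v11.4s, v16.4s, v24.s[1]
add x0, x0, #16                  // pointer update, can dual issue with fmla
fmla v12.4s, v16.4s, v24.s[2]
ins v0.d[1], x9                  // 64-bit vector insert: completes v0
fmla v13.4s, v16.4s, v24.s[3]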
practical guide
- use 64bit vector load only
- issue one vector load every three fmla
- 1 cycle to load 64bit, dual issue with the previous interleaved 64bit insert
- load the remaining 64bit into integer register,