PERF: sidh-p503: Split sub and add into 2 uops instead of 3 (#8)
The performance improvement comes from how Skylake splits add and sub
with a memory operand into uops (operands in source, destination order).
"add mem, reg" splits into 2 uops: one to load the value from mem and
one for the arithmetic.
Reversing the operand order to "add reg, mem" splits into 3 uops: one
for the load, one for the arithmetic, and an additional one to store
the result back to mem.
Using separate instructions for the load/store and for the arithmetic
helps parallelize execution: the memory access and the arithmetic can
run in parallel where possible.
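
As an illustration only, here is a minimal Go sketch of the same data flow:
load both limbs, do the arithmetic on local values, then store the result
with a separate write. The names limbs and subInPlace are hypothetical and
are not taken from this change, which is made in the assembly code. The Go
compiler chooses its own instructions, so this shows the pattern rather
than the exact uops.

```go
package main

import (
	"fmt"
	"math/bits"
)

// limbs is a hypothetical size: 8 x 64-bit words hold a 503-bit value.
const limbs = 8

// subInPlace computes x -= y over all limbs and returns the final borrow.
// Each step loads one limb of x and y into locals, subtracts with borrow,
// and writes the result back with a separate store.
func subInPlace(x, y *[limbs]uint64) uint64 {
	var borrow uint64
	for i := 0; i < limbs; i++ {
		xi, yi := x[i], y[i]                    // explicit loads
		xi, borrow = bits.Sub64(xi, yi, borrow) // arithmetic on local values
		x[i] = xi                               // explicit store
	}
	return borrow
}

func main() {
	x := [limbs]uint64{0, 1}
	y := [limbs]uint64{1}
	b := subInPlace(&x, &y)
	fmt.Println(x[0] == ^uint64(0), x[1] == 0, b)
}
```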
For details, see: https://www.agner.org/optimize/instruction_tables.pdf
New: BenchmarkFp503StrongReduce-4 300000000 5.57 ns/op
Old: BenchmarkFp503StrongReduce-4 200000000 8.60 ns/op
This change improves only one function; more functions can be improved
in the same way.