* The xorIn and copyOut function pointers cause input and output data
to escape to the heap. This degrades the performance of calling code.
* This change removes those function pointers. We will always use the
unaligned implementation, as it is faster (but it may crash on systems
that do not support unaligned memory access).
* The benchmark compares the generic and unaligned implementations of
xorIn and copyOut:
benchmark                          old ns/op     new ns/op     delta
BenchmarkPermutationFunction-4     463           815           +76.03%
BenchmarkShake128_MTU-4            4443          8180          +84.11%
BenchmarkShake256_MTU-4            4739          9060          +91.18%
BenchmarkShake256_16x-4            71886         132629        +84.50%
BenchmarkShake256_1MiB-4           3695138       6649012       +79.94%
BenchmarkCShake128_448_16x-4       21210         24611         +16.03%
BenchmarkCShake128_1MiB-4          3009342       3396496       +12.87%
BenchmarkCShake256_448_16x-4       26034         27785         +6.73%
BenchmarkCShake256_1MiB-4          3654713       3829404       +4.78%
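The heap-escape problem described for xorIn and copyOut can be illustrated with a minimal sketch. The `state` type, the 136-byte block size, and all function names here are hypothetical stand-ins, not the code touched by this change. The sketch assumes the usual behaviour of Go's escape analysis, where arguments passed through a function value are treated as escaping because the callee is unknown at compile time.

```go
package main

import "encoding/binary"

// state is a hypothetical stand-in for a Keccak-style sponge state.
type state struct {
	a [25]uint64
}

// xorInGeneric XORs buf into the state, 8 bytes at a time, little-endian.
func xorInGeneric(d *state, buf []byte) {
	for i := 0; 8*i+8 <= len(buf) && i < len(d.a); i++ {
		d.a[i] ^= binary.LittleEndian.Uint64(buf[8*i:])
	}
}

// copyOutGeneric serializes the state into buf, little-endian.
func copyOutGeneric(d *state, buf []byte) {
	for i := 0; 8*i+8 <= len(buf) && i < len(d.a); i++ {
		binary.LittleEndian.PutUint64(buf[8*i:], d.a[i])
	}
}

// xorInIndirect is a function variable. The compiler cannot see which
// function it holds, so it has to assume the arguments escape.
var xorInIndirect = xorInGeneric

// absorbIndirect calls through the function variable; block is likely
// to be moved to the heap.
func absorbIndirect(d *state, block [136]byte) {
	xorInIndirect(d, block[:])
}

// absorbDirect calls the implementation directly; block can stay on
// the stack.
func absorbDirect(d *state, block [136]byte) {
	xorInGeneric(d, block[:])
}
```

Building with `go build -gcflags=-m` prints the escape-analysis decisions, which should show the difference between the two absorb variants.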
The performance improvement comes from the fact that on Skylake
"add mem, reg" (operands written source first, destination last) splits
into 2 uops: one arithmetic uop and one that loads the value from memory.
However, reversing the operand order to "add reg, mem" splits into 3 uops:
one for the arithmetic operation, one for the load, and an additional one
for storing the result back to memory.
Using separate instructions for loads and stores helps to parallelize
execution (load/store and arithmetic instructions execute in parallel
where possible).
For details, see: https://www.agner.org/optimize/instruction_tables.pdf
New: BenchmarkFp503StrongReduce-4 300000000 5.57 ns/op
Old: BenchmarkFp503StrongReduce-4 200000000 8.60 ns/op
This change improves only one function, but more functions can be improved
in the same way.
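For context, the benchmarked function can be sketched in plain Go. This is a minimal sketch, not the assembly changed here: it assumes that a strong reduction maps a value in [0, 2p) to [0, p) by subtracting p and adding it back under a mask when the subtraction borrows. The two-limb modulus `p` and the name `strongReduce` are hypothetical; the real field has 503-bit elements.

```go
package main

import "math/bits"

// p is a toy modulus in two little-endian 64-bit limbs (hypothetical,
// chosen only to keep the example short).
var p = [2]uint64{0xFFFFFFFFFFFFFFC5, 0x7FFFFFFFFFFFFFFF}

// strongReduce maps x from [0, 2p) to [0, p) without branching on the
// value of x.
func strongReduce(x *[2]uint64) {
	var borrow, carry uint64

	// x = x - p, keeping the final borrow.
	x[0], borrow = bits.Sub64(x[0], p[0], 0)
	x[1], borrow = bits.Sub64(x[1], p[1], borrow)

	// mask is all ones if the subtraction borrowed (x was below p),
	// all zeros otherwise.
	mask := 0 - borrow

	// Add p back only when the subtraction borrowed.
	x[0], carry = bits.Add64(x[0], p[0]&mask, 0)
	x[1], _ = bits.Add64(x[1], p[1]&mask, carry)
}
```

Each limb operation above is a load, an arithmetic step, and a store, which is the kind of sequence whose instruction selection the assembly change is concerned with.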
* Makefile
* makefile: tools for profiling
* sidh: use SIMD for performing CSWAP
Loads data into 128-bit XMM registers and performs the conditional swap
there. This is probably less useful for SIDH, but will be useful for cSIDH.
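The conditional swap can be sketched in portable Go as a masked XOR swap. This is a scalar illustration under stated assumptions, not the SIMD code from this change: the name `condSwap` and the slice-of-limbs representation are hypothetical, and `y` is assumed to be at least as long as `x`. A version using 128-bit XMM registers would apply the same masked XOR to two 64-bit limbs per register.

```go
package main

// condSwap swaps the contents of x and y when choice is 1 and leaves
// them unchanged when choice is 0. It does not branch on choice.
// y must be at least as long as x.
func condSwap(x, y []uint64, choice uint8) {
	// mask is all ones when choice is 1, all zeros when choice is 0.
	mask := 0 - uint64(choice&1)
	for i := range x {
		t := mask & (x[i] ^ y[i])
		x[i] ^= t
		y[i] ^= t
	}
}
```

For example, calling `condSwap(x, y, 1)` exchanges the two operands, while `condSwap(x, y, 0)` executes the same instructions and leaves both untouched.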