On x86 there are several implementations of Montgomery reduction and
multiplication. At runtime the library chooses the most performant
implementation according to the information returned by CPUID. The
chosen implementation is then assigned to a function pointer, which is
CALLed during program execution.
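
For illustration, the function-pointer scheme described above might look
roughly like the sketch below. All names (elem, mulGeneric, mulFast,
selectMul, hasFastPath) are hypothetical and not taken from the library,
and the two "implementations" are pure Go stand-ins for the assembly
routines; they are not Montgomery multiplication.

```go
package main

// elem is a hypothetical field element; the limb count is illustrative only.
type elem [8]uint64

// mulGeneric and mulFast are stand-ins for the assembly routines. Both
// compute the same limb-wise wrapping product, so either can be selected.
func mulGeneric(z, x, y *elem) {
	for i := range z {
		z[i] = x[i] * y[i]
	}
}

func mulFast(z, x, y *elem) {
	for i := len(z) - 1; i >= 0; i-- {
		z[i] = x[i] * y[i]
	}
}

// mul is the function pointer that callers invoke indirectly.
var mul func(z, x, y *elem)

// selectMul picks an implementation once. hasFastPath is an assumed flag
// that real code would derive from CPUID.
func selectMul(hasFastPath bool) {
	if hasFastPath {
		mul = mulFast
	} else {
		mul = mulGeneric
	}
}
```

Callers would run selectMul once at start-up and then invoke mul(&z, &x, &y)
for every multiplication.
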
The problem is a combination of the following: the function pointer
points to an assembly function, its arguments are pointers to data, and
those arguments are allocated on the stack before the call. As a
variable can't be tagged with go:noescape, the Go compiler moves the
arguments passed to the call from the stack to the heap. This causes
significant performance degradation.
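
The two call shapes can be compared with a small pure Go sketch. The
names (elem, double, doublePtr, viaPointer, direct) are hypothetical,
and double is a placeholder for an assembly routine. Building with
"go build -gcflags=-m" prints the compiler's escape analysis decisions;
one would typically expect x and z to be reported as moved to the heap
in viaPointer but not in direct, though this may vary between compiler
versions.

```go
package main

// elem is a hypothetical field element; the limb count is illustrative only.
type elem [8]uint64

// double is a placeholder for an assembly routine taking pointer arguments.
//
//go:noinline
func double(z, x *elem) {
	for i := range z {
		z[i] = 2 * x[i]
	}
}

// doublePtr is a package-level function variable. The compiler cannot see
// which function it will hold at the call site.
var doublePtr = double

// viaPointer calls through the function variable. Because the callee is
// unknown, the compiler has to assume that &z and &x may escape.
func viaPointer(v uint64) uint64 {
	var x, z elem
	x[0] = v
	doublePtr(&z, &x)
	return z[0]
}

// direct calls the function by name, so the compiler can analyse the
// callee (or honour go:noescape on an assembly declaration).
func direct(v uint64) uint64 {
	var x, z elem
	x[0] = v
	double(&z, &x)
	return z[0]
}
```
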
The solution is not to use a function pointer. Instead, both redc and
mul check the CPU capabilities at runtime and JMP to the correct part
of the text. Thanks to branch prediction, the cost of this check is
minimal and smaller than the cost of a function call. This patch also
removes all heap allocations done by functions operating on the prime
field.
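
A Go-level analogue of this dispatch is sketched below. In the patch the
check and the jump live inside the assembly routines themselves; here
the branch in redc only plays that role. The names useFastPath,
redcGeneric and redcFast are hypothetical, and both bodies are stand-ins
rather than a real Montgomery reduction.

```go
package main

// elem is a hypothetical field element; the limb count is illustrative only.
type elem [8]uint64

// useFastPath is an assumed flag, set once at start-up from CPUID.
var useFastPath bool

// redcGeneric and redcFast are stand-ins for the two code paths. Both
// write x+1 (limb-wise, wrapping) into z.
func redcGeneric(z, x *elem) {
	for i := range z {
		z[i] = x[i] + 1
	}
}

func redcFast(z, x *elem) {
	*z = *x
	for i := range z {
		z[i]++
	}
}

// redc is called directly by name. The flag test corresponds to the
// compare-and-branch that the assembly performs before jumping to the
// selected code path.
func redc(z, x *elem) {
	if useFastPath {
		redcFast(z, x)
		return
	}
	redcGeneric(z, x)
}
```

Because redc is an ordinary named function, its callers no longer go
through a function variable.
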
The other goal of this patch is to remove x86-specific code from
arith_decl.go, which will also be compiled for ARM in the near future.
Results:
--------
benchmark                          old ns/op     new ns/op     delta
BenchmarkFp2ElementMul             428           194           -54.67%
BenchmarkFp2ElementInv             67353         34447         -48.86%
BenchmarkFp2ElementSquare          335           139           -58.51%
BenchmarkFp503MontgomeryReduce     26.3          22.8          -13.31%
BenchmarkSidhKeyAgreementP503      12451199      6402396       -48.58%
BenchmarkAliceKeyGenPubP503        7349333       3590954       -51.14%
BenchmarkBobKeyGenPubP503          8253676       4094141       -50.40%
BenchmarkSharedSecretAliceP503     5888022       2916821       -50.46%
BenchmarkSharedSecretBobP503       6908018       3436713       -50.25%
Comparison with P751:
---------------------
benchmark                          ns/op
BenchmarkBobKeyGenPubP751          13616876
BenchmarkBobKeyGenPubP503          4094141
BenchmarkSharedSecretAliceP751     9870216
BenchmarkSharedSecretAliceP503     2916821
There is a cost: possibly a CMPB and a JE each time mul or redc is
called. The patch also introduces the file arith_amd64.go, which keeps
some variables; currently this needs to be done for each field. It may
be possible to get rid of it at some point.