(The issue was reported by Shay Gueron.)
The final reduction in Montgomery multiplication computes if (X >= m) then X =
X - m else X = X
In OpenSSL, this was done by computing T = X - m, doing a constant-time
selection of the *addresses* of X and T, and loading from the resulting
address. But this is not cache-neutral.
This patch changes the behaviour by loading both X and T into registers, and
doing a constant-time selection of the *values*.
TODO(fork): only some of the fixes from the original patch still apply to
the 1.0.2 code.