This introduces a hook for the OpenSSL assembly.
Change-Id: I35e0588f0ed5bed375b12f738d16c9f46ceedeea
Reviewed-on: https://boringssl-review.googlesource.com/27592
Reviewed-by: Adam Langley <alangley@gmail.com>
Rather than writing the answer into the output, it wrote it into some
awkwardly-named temporaries. Thanks to Daniel Hirche for reporting this
issue!
Bug: chromium:825273
Change-Id: I5def4be045cd1925453c9873218e5449bf25e3f5
Reviewed-on: https://boringssl-review.googlesource.com/26785
Reviewed-by: Adam Langley <agl@google.com>
Commit-Queue: David Benjamin <davidben@google.com>
CQ-Verified: CQ bot account: commit-bot@chromium.org <commit-bot@chromium.org>
Change-Id: Ie2368dc9f6be791b7c3ad1c610dcd603634be6e4
Reviewed-on: https://boringssl-review.googlesource.com/26244
Reviewed-by: David Benjamin <davidben@google.com>
Commit-Queue: David Benjamin <davidben@google.com>
CQ-Verified: CQ bot account: commit-bot@chromium.org <commit-bot@chromium.org>
Change-Id: Ic2e9f54f5ced053c1463d5c09a74db5b2a3ea098
Reviewed-on: https://boringssl-review.googlesource.com/26224
Reviewed-by: David Benjamin <davidben@google.com>
Commit-Queue: David Benjamin <davidben@google.com>
CQ-Verified: CQ bot account: commit-bot@chromium.org <commit-bot@chromium.org>
This reuses wnaf.c's window scheduling, but has access to the tuned
field arithmetic and pre-computed base point table. Unlike wnaf.c, we
do not make the points affine as it's not worth it for a single table.
(We already precomputed the base point table.)
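For readers unfamiliar with window scheduling, a width-w NAF recoding can
be sketched roughly as below. This is a generic textbook version in Python,
not the code in wnaf.c; the names wnaf and from_digits and the window width
in the usage note are assumptions made for illustration.

```python
def wnaf(k, w):
    """Return the width-w NAF digits of a non-negative k, least significant first.

    Every nonzero digit is odd and lies strictly between -2**(w-1) and
    2**(w-1), and each nonzero digit is followed by at least w-1 zeros.
    """
    digits = []
    while k > 0:
        if k & 1:
            d = k % (1 << w)           # low w bits of k
            if d >= 1 << (w - 1):      # map into the signed digit range
                d -= 1 << w
            k -= d                     # k is now divisible by 2**w
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits


def from_digits(digits):
    """Recombine signed digits into the integer they represent."""
    return sum(d << i for i, d in enumerate(digits))
```

For example, wnaf(15, 4) yields [-1, 0, 0, 0, 1], i.e. 15 = 16 - 1, so a
scalar multiplication needs one table lookup per nonzero digit.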
Annoyingly, 32-bit x86 gets slower by a bit, but the other platforms are
faster. My guess is that the generic code gets to use the
bn_mul_mont assembly and the compiler, faced with the increased 32-bit
register pressure and the extremely register-poor x86, is making
bad decisions on the otherwise P-256-tuned C code. The three platforms
that see much larger gains are significantly more important than 32-bit
x86 at this point, so go with this change.
armv7a (Nexus 5X) before/after [+14.4%]:
Did 2703 ECDSA P-256 verify operations in 5034539us (536.9 ops/sec)
Did 3127 ECDSA P-256 verify operations in 5091379us (614.2 ops/sec)
aarch64 (Nexus 5X) before/after [+9.2%]:
Did 6783 ECDSA P-256 verify operations in 5031324us (1348.2 ops/sec)
Did 7410 ECDSA P-256 verify operations in 5033291us (1472.2 ops/sec)
x86 before/after [-3.7%]:
Did 8961 ECDSA P-256 verify operations in 10075901us (889.3 ops/sec)
Did 8568 ECDSA P-256 verify operations in 10003001us (856.5 ops/sec)
x86_64 before/after [+8.6%]:
Did 29808 ECDSA P-256 verify operations in 10008662us (2978.2 ops/sec)
Did 32528 ECDSA P-256 verify operations in 10057137us (3234.3 ops/sec)
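The percentages above can be recomputed from the raw speed lines. A minimal
sketch, assuming the "Did N ... operations in Xus" line format shown here;
the helper names ops_per_sec and percent_change are hypothetical and not
part of the bssl tool.

```python
import re

# Matches the operation count and elapsed microseconds in a speed line.
LINE_RE = re.compile(r"Did (\d+) .* operations in (\d+)us")


def ops_per_sec(line):
    """Recompute operations per second from the count and elapsed time."""
    m = LINE_RE.search(line)
    if m is None:
        raise ValueError("not a speed line: %r" % line)
    count, micros = int(m.group(1)), int(m.group(2))
    return count * 1e6 / micros


def percent_change(before, after):
    """Relative change in throughput from before to after, in percent."""
    return (ops_per_sec(after) / ops_per_sec(before) - 1) * 100
```

Feeding in the two armv7a lines gives roughly +14.4, matching the figure
quoted in the heading.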
Change-Id: I5fa643149f5bfbbda9533e3008baadfee9979b93
Reviewed-on: https://boringssl-review.googlesource.com/25684
Reviewed-by: Adam Langley <agl@google.com>
Commit-Queue: Adam Langley <agl@google.com>
CQ-Verified: CQ bot account: commit-bot@chromium.org <commit-bot@chromium.org>
Now that we have 64-bit C code, courtesy of fiat-crypto, the tradeoff
for carrying the assembly changes:
Assembly:
Did 16000 Curve25519 base-point multiplication operations in 1059932us (15095.3 ops/sec)
Did 16000 Curve25519 arbitrary point multiplication operations in 1060023us (15094.0 ops/sec)
fiat64:
Did 39000 Curve25519 base-point multiplication operations in 1004712us (38817.1 ops/sec)
Did 14000 Curve25519 arbitrary point multiplication operations in 1006827us (13905.1 ops/sec)
The assembly is still about 9% faster than fiat64, but fiat64 gets to
use the Ed25519 tables for the base point multiplication, so overall it
is actually faster to disable the assembly:
>>> 1/(1/15094.0 + 1/15095.3)
7547.324986004976
>>> 1/(1/38817.1 + 1/13905.1)
10237.73016319501
(At the cost of touching a 30kB table.)
The assembly implementation is no longer pulling its weight. Remove it
and use the fiat code in all build configurations.
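The combined figures in the interpreter session above are the throughput of
doing one of each operation back to back. As a small sketch, assuming a
workload of one base-point and one arbitrary-point multiplication per
iteration (the function name is made up for illustration):

```python
def combined_ops_per_sec(*rates):
    """Throughput when one operation of each kind is performed in sequence.

    The time per iteration is the sum of the individual times 1/rate, so the
    combined rate is the reciprocal of that sum.
    """
    return 1 / sum(1 / r for r in rates)


assembly = combined_ops_per_sec(15095.3, 15094.0)
fiat64 = combined_ops_per_sec(38817.1, 13905.1)
```

With the numbers measured above, fiat64 comes out ahead of the assembly
even though its arbitrary-point multiplication is slower.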
Change-Id: Id736873177d5568bb16ea06994b9fcb1af104e33
Reviewed-on: https://boringssl-review.googlesource.com/25524
Reviewed-by: Adam Langley <agl@google.com>
Our 64-bit performance was much lower than it could have been, since we
weren't using the 64-bit multipliers. Fortunately, fiat-crypto is
awesome, so this is just a matter of synthesizing new code and
integration work.
Functions without the signature fiat-crypto curly braces were written by
hand and warrant more review. (It's just redistributing some bits.)
These use the donna variants, which take (and prove) some of the
instruction scheduling from donna as that's significantly faster.
Glancing over things, I suspect but have not confirmed the gap is due to
this:
https://github.com/mit-plv/fiat-crypto/pull/295#issuecomment-356892413
Clang without OPENSSL_SMALL (ECDH omitted since that uses assembly and
is unaffected by this CL).
Before:
Did 105149 Ed25519 key generation operations in 5025208us (20924.3 ops/sec)
Did 125000 Ed25519 signing operations in 5024003us (24880.6 ops/sec)
Did 37642 Ed25519 verify operations in 5072539us (7420.7 ops/sec)
After:
Did 206000 Ed25519 key generation operations in 5020547us (41031.4 ops/sec)
Did 227000 Ed25519 signing operations in 5005232us (45352.5 ops/sec)
Did 69840 Ed25519 verify operations in 5004769us (13954.7 ops/sec)
Clang + OPENSSL_SMALL:
Before:
Did 68598 Ed25519 key generation operations in 5024629us (13652.4 ops/sec)
Did 73000 Ed25519 signing operations in 5067837us (14404.6 ops/sec)
Did 36765 Ed25519 verify operations in 5078684us (7239.1 ops/sec)
Did 74000 Curve25519 base-point multiplication operations in 5016465us (14751.4 ops/sec)
Did 45600 Curve25519 arbitrary point multiplication operations in 5034680us (9057.2 ops/sec)
After:
Did 117315 Ed25519 key generation operations in 5021860us (23360.9 ops/sec)
Did 126000 Ed25519 signing operations in 5003521us (25182.3 ops/sec)
Did 64974 Ed25519 verify operations in 5047790us (12871.8 ops/sec)
Did 134000 Curve25519 base-point multiplication operations in 5058946us (26487.7 ops/sec)
Did 86000 Curve25519 arbitrary point multiplication operations in 5050478us (17028.1 ops/sec)
GCC without OPENSSL_SMALL (ECDH omitted since that uses assembly and
is unaffected by this CL).
Before:
Did 35552 Ed25519 key generation operations in 5030756us (7066.9 ops/sec)
Did 38286 Ed25519 signing operations in 5001648us (7654.7 ops/sec)
Did 10584 Ed25519 verify operations in 5068158us (2088.3 ops/sec)
After:
Did 92158 Ed25519 key generation operations in 5024021us (18343.5 ops/sec)
Did 99000 Ed25519 signing operations in 5011908us (19753.0 ops/sec)
Did 31122 Ed25519 verify operations in 5069878us (6138.6 ops/sec)
Change-Id: Ic0c24d50b4ee2bbc408b94965e9d63319936107d
Reviewed-on: https://boringssl-review.googlesource.com/24805
Commit-Queue: David Benjamin <davidben@google.com>
CQ-Verified: CQ bot account: commit-bot@chromium.org <commit-bot@chromium.org>
Reviewed-by: Adam Langley <agl@google.com>
Adding 51-bit limbs will require two implementations of most of the
field operations. Group them together to make this more manageable. Also
move the representation-independent functions to the end.
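To illustrate why the two representations need separate field code, here is
a rough Python sketch of both limb layouts. It assumes the existing layout
is ten limbs with weights 2^ceil(25.5*i) and the new one is five 51-bit
limbs; all function names are hypothetical.

```python
P = 2**255 - 19
MASK51 = (1 << 51) - 1


def to_limbs51(v):
    """Split v (0 <= v < 2**255) into five 51-bit limbs, least significant first."""
    return [(v >> (51 * i)) & MASK51 for i in range(5)]


def from_limbs51(limbs):
    return sum(x << (51 * i) for i, x in enumerate(limbs))


def weight25_5(i):
    # ceil(25.5 * i) computed in integers: limbs alternate 26 and 25 bits.
    return 2 ** ((51 * i + 1) // 2)


def to_limbs25_5(v):
    """Split v (0 <= v < 2**255) into ten limbs of alternating 26/25 bits."""
    return [(v % weight25_5(i + 1)) // weight25_5(i) for i in range(10)]


def from_limbs25_5(limbs):
    return sum(weight25_5(i) * x for i, x in enumerate(limbs))
```

Both layouts describe the same integer, but carries and multiplication
cross-terms fall on different bit boundaries, which is why most operations
need one implementation per layout.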
Change-Id: I264e8ac64318a1d5fa72e6ad6f7ccf2f0a2c2be9
Reviewed-on: https://boringssl-review.googlesource.com/24804
Commit-Queue: David Benjamin <davidben@google.com>
CQ-Verified: CQ bot account: commit-bot@chromium.org <commit-bot@chromium.org>
Reviewed-by: Adam Langley <agl@google.com>
These are also constants that depend on the field representation.
Change-Id: I22333c099352ad64eb27fe15ffdc38c6ae7c07ff
Reviewed-on: https://boringssl-review.googlesource.com/24746
Commit-Queue: David Benjamin <davidben@google.com>
CQ-Verified: CQ bot account: commit-bot@chromium.org <commit-bot@chromium.org>
Reviewed-by: Adam Langley <agl@google.com>
This is to make it easier to add new field element representations. The
Ed25519 logic in the script is partially adapted from RFC 8032's Python
code, but I replaced the point addition logic with the naive textbook
formula since this script only cares about being obviously correct.
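For reference, the naive textbook addition on the Ed25519 curve looks
roughly like the following. This is an independent sketch in affine
coordinates, not the script from this change; the helper names are
assumptions, and the base point is derived from y = 4/5 rather than
hard-coded.

```python
P = 2**255 - 19
D = -121665 * pow(121666, P - 2, P) % P   # curve constant d
SQRT_M1 = pow(2, (P - 1) // 4, P)         # a square root of -1 mod P


def inv(x):
    """Modular inverse via Fermat's little theorem."""
    return pow(x, P - 2, P)


def on_curve(pt):
    x, y = pt
    return (-x * x + y * y - 1 - D * x * x * y * y) % P == 0


def add(p1, p2):
    """Affine addition on -x^2 + y^2 = 1 + d*x^2*y^2."""
    x1, y1 = p1
    x2, y2 = p2
    t = D * x1 * x2 * y1 * y2 % P
    x3 = (x1 * y2 + x2 * y1) * inv(1 + t) % P
    y3 = (y1 * y2 + x1 * x2) * inv(1 - t) % P
    return (x3, y3)


def recover_x(y):
    """Return the even x such that (x, y) lies on the curve."""
    x2 = (y * y - 1) * inv(D * y * y + 1) % P
    x = pow(x2, (P + 3) // 8, P)
    if (x * x - x2) % P != 0:
        x = x * SQRT_M1 % P
    assert (x * x - x2) % P == 0, "y is not on the curve"
    return P - x if x & 1 else x


BASE_Y = 4 * inv(5) % P
BASE = (recover_x(BASE_Y), BASE_Y)
IDENTITY = (0, 1)
```

The formula costs two inversions per addition, which is slow but easy to
check against the curve equation.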
Change-Id: I0b90bf470993c177070fd1010ac5865fedb46c82
Reviewed-on: https://boringssl-review.googlesource.com/24745
Commit-Queue: David Benjamin <davidben@google.com>
CQ-Verified: CQ bot account: commit-bot@chromium.org <commit-bot@chromium.org>
Reviewed-by: Adam Langley <agl@google.com>
This is in preparation for writing a script to generate them. I'm
manually moving the existing tables over so it will be easier to confirm
the script didn't change the values.
Change-Id: Id83e95c80d981e19d1179d45bf47559b3e1fc86e
Reviewed-on: https://boringssl-review.googlesource.com/24744
Commit-Queue: David Benjamin <davidben@google.com>
CQ-Verified: CQ bot account: commit-bot@chromium.org <commit-bot@chromium.org>
Reviewed-by: Adam Langley <agl@google.com>
fiat-crypto only generates fe_mul and fe_sq, but the original Ed25519
implementation we had also had fe_sq2 for computing 2*f^2. Previously,
we inlined a version of fe_mul.
Instead, we could implement it with fe_sq and fe_add. Performance-wise,
this seems to not regress. If anything, it makes it faster?
Before (clang, run for 10 seconds):
Did 243000 Ed25519 key generation operations in 10025910us (24237.2 ops/sec)
Did 250000 Ed25519 signing operations in 10035580us (24911.4 ops/sec)
Did 73305 Ed25519 verify operations in 10071101us (7278.7 ops/sec)
Did 184000 Curve25519 base-point multiplication operations in 10040138us (18326.4 ops/sec)
Did 186000 Curve25519 arbitrary point multiplication operations in 10052721us (18502.5 ops/sec)
After (clang, run for 10 seconds):
Did 242424 Ed25519 key generation operations in 10013117us (24210.6 ops/sec)
Did 253000 Ed25519 signing operations in 10011744us (25270.3 ops/sec)
Did 73899 Ed25519 verify operations in 10048040us (7354.6 ops/sec)
Did 194000 Curve25519 base-point multiplication operations in 10005389us (19389.6 ops/sec)
Did 195000 Curve25519 arbitrary point multiplication operations in 10028443us (19444.7 ops/sec)
Before (clang + OPENSSL_SMALL, run for 10 seconds):
Did 144000 Ed25519 key generation operations in 10019344us (14372.2 ops/sec)
Did 146000 Ed25519 signing operations in 10011653us (14583.0 ops/sec)
Did 74052 Ed25519 verify operations in 10005789us (7400.9 ops/sec)
Did 150000 Curve25519 base-point multiplication operations in 10007468us (14988.8 ops/sec)
Did 91392 Curve25519 arbitrary point multiplication operations in 10057678us (9086.8 ops/sec)
After (clang + OPENSSL_SMALL, run for 10 seconds):
Did 144000 Ed25519 key generation operations in 10066724us (14304.6 ops/sec)
Did 148000 Ed25519 signing operations in 10062043us (14708.7 ops/sec)
Did 74820 Ed25519 verify operations in 10058557us (7438.4 ops/sec)
Did 151000 Curve25519 base-point multiplication operations in 10063492us (15004.7 ops/sec)
Did 90402 Curve25519 arbitrary point multiplication operations in 10049141us (8996.0 ops/sec)
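The replacement described above amounts to the identity 2*f^2 = f^2 + f^2.
A minimal model in Python, treating field elements as plain integers mod
2^255 - 19 (the real code works on limbs with bounds, which this sketch
ignores):

```python
P = 2**255 - 19


def fe_add(f, g):
    return (f + g) % P


def fe_sq(f):
    return f * f % P


def fe_sq2(f):
    """Compute 2*f^2 as one squaring followed by one addition."""
    t = fe_sq(f)
    return fe_add(t, t)
```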
Change-Id: I31e9f61833492c3ff2dfd78e1dee5e06f43c850f
Reviewed-on: https://boringssl-review.googlesource.com/24724
Reviewed-by: Adam Langley <agl@google.com>
Chromium's licenses.py is a little finicky.
Change-Id: I015a3565eb8f3cfecb357d142facc796a9c80888
Reviewed-on: https://boringssl-review.googlesource.com/24784
Reviewed-by: Adam Langley <agl@google.com>
The P-224 implementation was missing the optimization to avoid doing
extra work when asking for only one coordinate (ECDH and ECDSA both
involve an x-coordinate query). The P-256 implementation was missing the
optimization to do one less Montgomery reduction.
TODO - Benchmarks
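The x-coordinate-only optimization can be illustrated as follows. This is a
sketch assuming Jacobian coordinates (x = X/Z^2, y = Y/Z^3) over the P-224
prime; the function name and the want_y flag are hypothetical and do not
correspond to the actual interface.

```python
P224 = 2**224 - 2**96 + 1


def jacobian_to_affine(X, Y, Z, want_y=True):
    """Convert a Jacobian point to affine, optionally skipping y."""
    z_inv = pow(Z, P224 - 2, P224)
    z_inv2 = z_inv * z_inv % P224
    x = X * z_inv2 % P224
    if not want_y:
        # Skips the two multiplications needed for Y / Z^3.
        return x, None
    y = Y * z_inv2 % P224 * z_inv % P224
    return x, y
```

ECDH and ECDSA would both take the want_y=False path in this model.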
Change-Id: I268d9c24737c6da9efaf1c73395b73dd97355de7
Reviewed-on: https://boringssl-review.googlesource.com/24690
Reviewed-by: Adam Langley <agl@google.com>
Commit-Queue: Adam Langley <agl@google.com>
CQ-Verified: CQ bot account: commit-bot@chromium.org <commit-bot@chromium.org>
These are remnants of the old code which had a bunch of ftmp variables.
Change-Id: Id14cf414cb67ff08e240970767f7a5a58e883ce4
Reviewed-on: https://boringssl-review.googlesource.com/24689
Reviewed-by: Adam Langley <agl@google.com>
It requires a handful of additional intrinsics for now.
Fiat's freeze function only works on the tight bounds, so fe_isnonzero
gains an extra fe_carry. But all other calls of fe_tobytes are of tight
bounds anyway.
Change-Id: I834858cee7863c7344e456d7a7dbf4f414f04ae5
Reviewed-on: https://boringssl-review.googlesource.com/24545
Reviewed-by: Adam Langley <agl@google.com>
These date to the old code and have been replaced by the fe and fe_loose
bounds in the header file. Also fix up a comment that the comment
converter didn't manage to convert.
Change-Id: I2e3ea867a8cea2b347d09c304a17e532b2e36545
Reviewed-on: https://boringssl-review.googlesource.com/24525
Commit-Queue: Adam Langley <agl@google.com>
Reviewed-by: Adam Langley <agl@google.com>
CQ-Verified: CQ bot account: commit-bot@chromium.org <commit-bot@chromium.org>
Change-Id: Ie4060121f6bc8da07d87db8ec8133ea17e99e1fe
Reviewed-on: https://boringssl-review.googlesource.com/24344
Reviewed-by: David Benjamin <davidben@google.com>
Commit-Queue: David Benjamin <davidben@google.com>
CQ-Verified: CQ bot account: commit-bot@chromium.org <commit-bot@chromium.org>
It actually works fine. I just forgot one of the typedefs last time.
This gives a roughly 2x improvement on P-256 in clang-cl +
OPENSSL_SMALL, the configuration used by Chrome.
Before:
Did 1302 ECDH P-256 operations in 1015000us (1282.8 ops/sec)
Did 4250 ECDSA P-256 signing operations in 1047000us (4059.2 ops/sec)
Did 1750 ECDSA P-256 verify operations in 1094000us (1599.6 ops/sec)
After:
Did 3250 ECDH P-256 operations in 1078000us (3014.8 ops/sec)
Did 8250 ECDSA P-256 signing operations in 1016000us (8120.1 ops/sec)
Did 3250 ECDSA P-256 verify operations in 1063000us (3057.4 ops/sec)
(These were taken on a VM, so the measurements are extremely noisy, but
this sort of improvement is visible regardless.)
Alas, we do need a little extra bit of fiddling because division does
not work (crbug.com/787617).
Bug: chromium:787617
Update-Note: This removes the MSan uint128_t workaround which does not
appear to be necessary anymore.
Change-Id: I8361314608521e5bdaf0e7eeae7a02c33f55c69f
Reviewed-on: https://boringssl-review.googlesource.com/23984
Reviewed-by: Adam Langley <agl@google.com>
Commit-Queue: Adam Langley <agl@google.com>
CQ-Verified: CQ bot account: commit-bot@chromium.org <commit-bot@chromium.org>
The fiat-crypto-generated code uses the Montgomery form implementation
strategy, for both 32-bit and 64-bit code.
64-bit throughput seems slower, but the difference is smaller than the
noise between repetitions (-2%?).
32-bit throughput has decreased significantly for ECDH (-40%). I am
attributing this to the change from variable-time scalar multiplication
to constant-time scalar multiplication. Due to the same bottleneck,
ECDSA verification still uses the old code (otherwise there would have
been a 60% throughput decrease). On the other hand, ECDSA signing
throughput has increased slightly (+10%), perhaps due to the use of a
precomputed table of multiples of the base point.
64-bit benchmarks (Google Cloud Haswell):
with this change:
Did 9126 ECDH P-256 operations in 1009572us (9039.5 ops/sec)
Did 23000 ECDSA P-256 signing operations in 1039832us (22119.0 ops/sec)
Did 8820 ECDSA P-256 verify operations in 1024242us (8611.2 ops/sec)
master (40e8c921ca):
Did 9340 ECDH P-256 operations in 1017975us (9175.1 ops/sec)
Did 23000 ECDSA P-256 signing operations in 1039820us (22119.2 ops/sec)
Did 8688 ECDSA P-256 verify operations in 1021108us (8508.4 ops/sec)
benchmarks on ARMv7 (LG Nexus 4):
with this change:
Did 150 ECDH P-256 operations in 1029726us (145.7 ops/sec)
Did 506 ECDSA P-256 signing operations in 1065192us (475.0 ops/sec)
Did 363 ECDSA P-256 verify operations in 1033298us (351.3 ops/sec)
master (2fce1beda0):
Did 245 ECDH P-256 operations in 1017518us (240.8 ops/sec)
Did 473 ECDSA P-256 signing operations in 1086281us (435.4 ops/sec)
Did 360 ECDSA P-256 verify operations in 1003846us (358.6 ops/sec)
64-bit tables converted as follows:
import re, sys, math
p = 2**256 - 2**224 + 2**192 + 2**96 - 1
R = 2**256
def convert(t):
    x0, s1, x1, s2, x2, s3, x3 = t.groups()
    v = int(x0, 0) + 2**64 * (int(x1, 0) + 2**64*(int(x2,0) + 2**64*(int(x3, 0)) ))
    w = v*R%p
    y0 = hex(w%(2**64))
    y1 = hex((w>>64)%(2**64))
    y2 = hex((w>>(2*64))%(2**64))
    y3 = hex((w>>(3*64))%(2**64))
    ww = int(y0, 0) + 2**64 * (int(y1, 0) + 2**64*(int(y2,0) + 2**64*(int(y3, 0)) ))
    if ww != v*R%p:
        print(x0,x1,x2,x3)
        print(hex(v))
        print(y0,y1,y2,y3)
        print(hex(w))
        print(hex(ww))
        assert 0
    return '{'+y0+s1+y1+s2+y2+s3+y3+'}'
fe_re = re.compile('{'+r'(\s*,\s*)'.join(r'(\d+|0x[abcdefABCDEF0123456789]+)' for i in range(4)) + '}')
print (re.sub(fe_re, convert, sys.stdin.read()).rstrip('\n'))
32-bit tables converted from 64-bit tables
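As background for the table conversion, Montgomery form stores a field
element a as a*R mod p. A minimal sketch of the round trip, assuming
R = 2^256 as in the script; the function names are illustrative, and real
implementations reduce word by word rather than multiplying by R^-1.

```python
P = 2**256 - 2**224 + 2**192 + 2**96 - 1
R = 2**256
R_INV = pow(R, P - 2, P)   # R^-1 mod P via Fermat's little theorem


def to_mont(a):
    """Convert a to Montgomery form a*R mod P."""
    return a * R % P


def from_mont(a):
    """Convert back out of Montgomery form."""
    return a * R_INV % P


def mont_mul(a, b):
    """Multiply two Montgomery-form values; the result stays in Montgomery form."""
    return a * b * R_INV % P
```

Tables stored in Montgomery form can be fed straight into mont_mul without
a conversion on every use.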
Change-Id: I52d6e5504fcb6ca2e8b0ee13727f4500c80c1799
Reviewed-on: https://boringssl-review.googlesource.com/23244
Commit-Queue: Adam Langley <agl@google.com>
Reviewed-by: Adam Langley <agl@google.com>
CQ-Verified: CQ bot account: commit-bot@chromium.org <commit-bot@chromium.org>
Each operation was translated from fiat-crypto output using fiat-crypto
prettyprint.py. For example fe_mul is synthesized in
https://github.com/mit-plv/fiat-crypto/blob/master/src/Specific/X25519/C32/femul.v,
and shown in the last Coq-compatible form at
https://github.com/mit-plv/fiat-crypto/blob/master/src/Specific/X25519/C32/femulDisplay.log.
Benchmarks on Google Cloud's unidentified Intel Xeon with AVX2:
git checkout $VARIANT && ( cd build && rm -rf * && CC=clang CXX=clang++ cmake -GNinja -DCMAKE_TOOLCHAIN_FILE=../util/32-bit-toolchain.cmake -DCMAKE_BUILD_TYPE=Release .. && ninja && ./tool/bssl speed -filter 25519 )
this branch:
Did 11382 Ed25519 key generation operations in 1053046us (10808.6 ops/sec)
Did 11169 Ed25519 signing operations in 1038080us (10759.3 ops/sec)
Did 2925 Ed25519 verify operations in 1001346us (2921.1 ops/sec)
Did 12000 Curve25519 base-point multiplication operations in 1084851us (11061.4 ops/sec)
Did 3850 Curve25519 arbitrary point multiplication operations in 1085565us (3546.5 ops/sec)
Did 11466 Ed25519 key generation operations in 1049821us (10921.9 ops/sec)
Did 11000 Ed25519 signing operations in 1013317us (10855.4 ops/sec)
Did 3047 Ed25519 verify operations in 1043846us (2919.0 ops/sec)
Did 12000 Curve25519 base-point multiplication operations in 1068924us (11226.2 ops/sec)
Did 3850 Curve25519 arbitrary point multiplication operations in 1090598us (3530.2 ops/sec)
Did 10309 Ed25519 key generation operations in 1003320us (10274.9 ops/sec)
Did 11000 Ed25519 signing operations in 1017862us (10807.0 ops/sec)
Did 3135 Ed25519 verify operations in 1098624us (2853.6 ops/sec)
Did 9000 Curve25519 base-point multiplication operations in 1046608us (8599.2 ops/sec)
Did 3132 Curve25519 arbitrary point multiplication operations in 1038963us (3014.5 ops/sec)
master:
Did 11564 Ed25519 key generation operations in 1068762us (10820.0 ops/sec)
Did 11104 Ed25519 signing operations in 1024278us (10840.8 ops/sec)
Did 3206 Ed25519 verify operations in 1049179us (3055.7 ops/sec)
Did 12000 Curve25519 base-point multiplication operations in 1073619us (11177.1 ops/sec)
Did 3550 Curve25519 arbitrary point multiplication operations in 1000279us (3549.0 ops/sec)
andreser@linux-andreser:~/boringssl$ build/tool/bssl speed -filter 25519
Did 11760 Ed25519 key generation operations in 1072495us (10965.1 ops/sec)
Did 10800 Ed25519 signing operations in 1003486us (10762.5 ops/sec)
Did 3245 Ed25519 verify operations in 1080399us (3003.5 ops/sec)
Did 12000 Curve25519 base-point multiplication operations in 1076021us (11152.2 ops/sec)
Did 3570 Curve25519 arbitrary point multiplication operations in 1005087us (3551.9 ops/sec)
andreser@linux-andreser:~/boringssl$ build/tool/bssl speed -filter 25519
Did 11438 Ed25519 key generation operations in 1041115us (10986.3 ops/sec)
Did 11000 Ed25519 signing operations in 1012589us (10863.2 ops/sec)
Did 3312 Ed25519 verify operations in 1082834us (3058.6 ops/sec)
Did 12000 Curve25519 base-point multiplication operations in 1061318us (11306.7 ops/sec)
Did 3580 Curve25519 arbitrary point multiplication operations in 1004923us (3562.5 ops/sec)
squashed: curve25519: convert field constants to unsigned.
import re, sys, math
def weight(i):
    return 2**int(math.ceil(25.5*i))
def convert(t):
    limbs = [x for x in t.groups() if x.replace('-','').isdigit()]
    v = sum(weight(i)*x for (i,x) in enumerate(map(int, limbs))) % (2**255-19)
    limbs = [(v % weight(i+1)) // weight(i) for i in range(10)]
    assert v == sum(weight(i)*x for (i,x) in enumerate(limbs))
    i = 0
    ret = ''
    for s in t.groups():
        if s.replace('-','').isdigit():
            ret += str(limbs[i])
            i += 1
        else:
            ret += s
    return ret
fe_re = re.compile(r'(\s*,\s*)'.join(r'(-?\d+)' for i in range(10)))
print (re.sub(fe_re, convert, sys.stdin.read()))
Change-Id: Ibd4f7f5c38e5c4d61c9826afb406baebe2be5168
Reviewed-on: https://boringssl-review.googlesource.com/22385
Reviewed-by: Adam Langley <agl@google.com>
Commit-Queue: Adam Langley <agl@google.com>
CQ-Verified: CQ bot account: commit-bot@chromium.org <commit-bot@chromium.org>
This change doesn't actually introduce any Fiat code yet. It sets up the
directory structure to make the diffs in the next change clearer.
Change-Id: I38a21fb36b18a08b0907f9d37b7ef5d7d3137ede
Reviewed-on: https://boringssl-review.googlesource.com/22624
Reviewed-by: David Benjamin <davidben@google.com>