boringssl/crypto
David Benjamin 32ce6032ff Add an optimized x86_64 vpaes ctr128_f and remove bsaes.
Brian Smith suggested applying vpaes-armv8's "2x" optimization to
vpaes-x86_64. The registers are a little tight (aarch64 has a whole 32
SIMD registers, while x86_64 only has 16), but it's doable with some
spills and makes vpaes much more competitive with bsaes. At small- and
medium-sized inputs, vpaes now matches bsaes. At large inputs, it's a
~10% perf hit.

bsaes is thus pulling much less weight. Losing an entire AES
implementation and having constant-time AES for SSSE3 is attractive.
Some notes:

- The fact that these are older CPUs tempers the perf hit, but CPUs
  without AES-NI are still common enough to matter.

- This CL does regress CBC decrypt performance nontrivially (see below).
  If this matters, we can double-up CBC decryption too. CBC in TLS is
  legacy and already pays a costly Lucky13 mitigation.

- The difference between 1350 and 8192 bytes is likely bsaes AES-GCM
  paying for two slow (and variable-time!) aes_nohw_encrypt
  calls for EK0 and the trailing partial block. At larger inputs, those
  two calls are more amortized.

- To that end, bsaes would likely be much faster on AES-GCM with smarter
  use of bsaes. (Fold one-off calls above into bulk data.) Implementing
  this is a bit of a nuisance though, especially considering we don't
  wish to regress hwaes.

- I'd discarded the key conversion idea, but I think I did it wrong.
  Benchmarks from
  https://boringssl-review.googlesource.com/c/boringssl/+/33589 suggest
  converting to bsaes format on-demand for large ctr32 inputs should
  give the best of both worlds, but at the cost of an entire AES
  implementation relative to this CL.

- ARMv7 still depends on bsaes and has no vpaes. It also has 16 SIMD
  registers, so my plan is to translate it, with the same 2x
  optimization, and see how it compares. Hopefully that, or some
  combination of the above, will work for ARMv7.

Sandy Bridge
bsaes (before):
Did 3144750 AES-128-GCM (16 bytes) seal operations in 5016000us (626943.8 ops/sec): 10.0 MB/s
Did 2053750 AES-128-GCM (256 bytes) seal operations in 5016000us (409439.8 ops/sec): 104.8 MB/s
Did 469000 AES-128-GCM (1350 bytes) seal operations in 5015000us (93519.4 ops/sec): 126.3 MB/s
Did 92500 AES-128-GCM (8192 bytes) seal operations in 5016000us (18441.0 ops/sec): 151.1 MB/s
Did 46750 AES-128-GCM (16384 bytes) seal operations in 5032000us (9290.5 ops/sec): 152.2 MB/s
vpaes-1x (for reference, not this CL):
Did 8684750 AES-128-GCM (16 bytes) seal operations in 5015000us (1731754.7 ops/sec): 27.7 MB/s [+177%]
Did 1731500 AES-128-GCM (256 bytes) seal operations in 5016000us (345195.4 ops/sec): 88.4 MB/s [-15.6%]
Did 346500 AES-128-GCM (1350 bytes) seal operations in 5016000us (69078.9 ops/sec): 93.3 MB/s [-26.1%]
Did 61250 AES-128-GCM (8192 bytes) seal operations in 5015000us (12213.4 ops/sec): 100.1 MB/s [-33.8%]
Did 32500 AES-128-GCM (16384 bytes) seal operations in 5031000us (6459.9 ops/sec): 105.8 MB/s [-30.5%]
vpaes-2x (this CL):
Did 8840000 AES-128-GCM (16 bytes) seal operations in 5015000us (1762711.9 ops/sec): 28.2 MB/s [+182%]
Did 2167750 AES-128-GCM (256 bytes) seal operations in 5016000us (432167.1 ops/sec): 110.6 MB/s [+5.5%]
Did 474000 AES-128-GCM (1350 bytes) seal operations in 5016000us (94497.6 ops/sec): 127.6 MB/s [+1.0%]
Did 81750 AES-128-GCM (8192 bytes) seal operations in 5015000us (16301.1 ops/sec): 133.5 MB/s [-11.6%]
Did 41750 AES-128-GCM (16384 bytes) seal operations in 5031000us (8298.5 ops/sec): 136.0 MB/s [-10.6%]

Penryn
bsaes (before):
Did 958000 AES-128-GCM (16 bytes) seal operations in 1000264us (957747.2 ops/sec): 15.3 MB/s
Did 420000 AES-128-GCM (256 bytes) seal operations in 1000480us (419798.5 ops/sec): 107.5 MB/s
Did 96000 AES-128-GCM (1350 bytes) seal operations in 1001083us (95896.1 ops/sec): 129.5 MB/s
Did 18000 AES-128-GCM (8192 bytes) seal operations in 1042491us (17266.3 ops/sec): 141.4 MB/s
Did 9482 AES-128-GCM (16384 bytes) seal operations in 1095703us (8653.8 ops/sec): 141.8 MB/s
Did 758000 AES-256-GCM (16 bytes) seal operations in 1000769us (757417.5 ops/sec): 12.1 MB/s
Did 359000 AES-256-GCM (256 bytes) seal operations in 1001993us (358285.9 ops/sec): 91.7 MB/s
Did 82000 AES-256-GCM (1350 bytes) seal operations in 1009583us (81221.7 ops/sec): 109.6 MB/s
Did 15000 AES-256-GCM (8192 bytes) seal operations in 1022294us (14672.9 ops/sec): 120.2 MB/s
Did 7884 AES-256-GCM (16384 bytes) seal operations in 1070934us (7361.8 ops/sec): 120.6 MB/s
vpaes-1x (for reference, not this CL):
Did 2030000 AES-128-GCM (16 bytes) seal operations in 1000227us (2029539.3 ops/sec): 32.5 MB/s [+112%]
Did 382000 AES-128-GCM (256 bytes) seal operations in 1001949us (381256.9 ops/sec): 97.6 MB/s [-9.2%]
Did 81000 AES-128-GCM (1350 bytes) seal operations in 1007297us (80413.2 ops/sec): 108.6 MB/s [-16.1%]
Did 14000 AES-128-GCM (8192 bytes) seal operations in 1031499us (13572.5 ops/sec): 111.2 MB/s [-21.4%]
Did 7008 AES-128-GCM (16384 bytes) seal operations in 1030706us (6799.2 ops/sec): 111.4 MB/s [-21.4%]
Did 1838000 AES-256-GCM (16 bytes) seal operations in 1000238us (1837562.7 ops/sec): 29.4 MB/s [+143%]
Did 321000 AES-256-GCM (256 bytes) seal operations in 1001666us (320466.1 ops/sec): 82.0 MB/s [-10.6%]
Did 67000 AES-256-GCM (1350 bytes) seal operations in 1010359us (66313.1 ops/sec): 89.5 MB/s [-18.3%]
Did 12000 AES-256-GCM (8192 bytes) seal operations in 1072706us (11186.7 ops/sec): 91.6 MB/s [-23.8%]
Did 5680 AES-256-GCM (16384 bytes) seal operations in 1009214us (5628.1 ops/sec): 92.2 MB/s [-23.5%]
vpaes-2x (this CL):
Did 2072000 AES-128-GCM (16 bytes) seal operations in 1000066us (2071863.3 ops/sec): 33.1 MB/s [+116%]
Did 432000 AES-128-GCM (256 bytes) seal operations in 1000732us (431684.0 ops/sec): 110.5 MB/s [+2.8%]
Did 92000 AES-128-GCM (1350 bytes) seal operations in 1000580us (91946.7 ops/sec): 124.1 MB/s [-4.2%]
Did 16000 AES-128-GCM (8192 bytes) seal operations in 1016422us (15741.5 ops/sec): 129.0 MB/s [-8.8%]
Did 8448 AES-128-GCM (16384 bytes) seal operations in 1073962us (7866.2 ops/sec): 128.9 MB/s [-9.1%]
Did 1865000 AES-256-GCM (16 bytes) seal operations in 1000043us (1864919.8 ops/sec): 29.8 MB/s [+146%]
Did 364000 AES-256-GCM (256 bytes) seal operations in 1001561us (363432.7 ops/sec): 93.0 MB/s [+1.4%]
Did 77000 AES-256-GCM (1350 bytes) seal operations in 1004123us (76683.8 ops/sec): 103.5 MB/s [-5.6%]
Did 14000 AES-256-GCM (8192 bytes) seal operations in 1071179us (13069.7 ops/sec): 107.1 MB/s [-10.9%]
Did 7008 AES-256-GCM (16384 bytes) seal operations in 1074125us (6524.4 ops/sec): 106.9 MB/s [-11.4%]

Penryn, CBC mode decryption
bsaes (before):
Did 159000 AES-128-CBC-SHA1 (16 bytes) open operations in 1001019us (158838.1 ops/sec): 2.5 MB/s
Did 114000 AES-128-CBC-SHA1 (256 bytes) open operations in 1006485us (113265.5 ops/sec): 29.0 MB/s
Did 65000 AES-128-CBC-SHA1 (1350 bytes) open operations in 1008441us (64455.9 ops/sec): 87.0 MB/s
Did 17000 AES-128-CBC-SHA1 (8192 bytes) open operations in 1005440us (16908.0 ops/sec): 138.5 MB/s
vpaes (after):
Did 167000 AES-128-CBC-SHA1 (16 bytes) open operations in 1003556us (166408.3 ops/sec): 2.7 MB/s [+8%]
Did 112000 AES-128-CBC-SHA1 (256 bytes) open operations in 1005673us (111368.2 ops/sec): 28.5 MB/s [-1.7%]
Did 56000 AES-128-CBC-SHA1 (1350 bytes) open operations in 1005647us (55685.5 ops/sec): 75.2 MB/s [-13.6%]
Did 13635 AES-128-CBC-SHA1 (8192 bytes) open operations in 1020486us (13361.3 ops/sec): 109.5 MB/s [-20.9%]

Bug: 256
Change-Id: I11ed773323ec7a5ee61080c9ed9ed4761849828a
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/35364
Commit-Queue: David Benjamin <davidben@google.com>
Reviewed-by: Adam Langley <agl@google.com>
2019-03-23 06:59:22 +00:00
..
asn1 Reject long inputs in c2i_ASN1_INTEGER. 2019-03-18 17:19:52 +00:00
base64 Modernize OPENSSL_COMPILE_ASSERT, part 2. 2018-11-14 16:06:37 +00:00
bio Fix d2i_*_bio on partial reads. 2018-12-05 22:05:28 +00:00
bn_extra Add some Node compatibility functions. 2019-01-25 16:50:30 +00:00
buf Flatten most of the crypto target. 2018-09-05 23:41:25 +00:00
bytestring Add uint64_t support in CBS and CBB. 2019-02-22 20:38:17 +00:00
chacha Add ABI tests for ChaCha20_ctr32. 2019-01-09 03:11:45 +00:00
cipher_extra Prefer vpaes over bsaes in AES-GCM-SIV and AES-CCM. 2019-03-05 17:55:03 +00:00
cmac Flatten most of the crypto target. 2018-09-05 23:41:25 +00:00
conf Use proper functions for lh_*. 2018-10-15 23:37:04 +00:00
curve25519 Automatically disable assembly with MSAN. 2018-09-07 21:12:37 +00:00
dh Flatten most of the crypto target. 2018-09-05 23:41:25 +00:00
digest_extra Fix undefined pointer casts in SHA-512 code. 2019-01-22 23:18:36 +00:00
dsa Tidy up dsa_sign_setup. 2018-10-25 21:51:57 +00:00
ec_extra Use EC_RAW_POINT in ECDSA. 2018-11-13 02:06:46 +00:00
ecdh_extra Clean up EC_POINT to byte conversions. 2018-11-13 17:27:59 +00:00
ecdsa_extra Remove unreachable code. 2018-11-12 23:34:36 +00:00
engine Flatten most of the crypto target. 2018-09-05 23:41:25 +00:00
err Enforce key usage for RSA keys in TLS 1.2. 2019-01-30 21:28:34 +00:00
evp Add a very roundabout EC keygen API. 2019-01-25 23:08:12 +00:00
fipsmodule Add an optimized x86_64 vpaes ctr128_f and remove bsaes. 2019-03-23 06:59:22 +00:00
hkdf Flatten most of the crypto target. 2018-09-05 23:41:25 +00:00
hmac_extra Convert a number of tests to GTest. 2017-06-01 17:02:13 +00:00
hrss HRSS: flatten sample distribution. 2019-01-22 22:06:43 +00:00
lhash Fix undefined function pointer casts in LHASH. 2018-10-15 23:53:24 +00:00
obj Add initial HRSS support. 2018-12-12 17:35:02 +00:00
pem Rewrite PEM_X509_INFO_read_bio. 2018-10-01 17:35:10 +00:00
perlasm Fix x86_64-xlate.pl comment regex. 2019-02-21 16:50:17 +00:00
pkcs7 Fix undefined function pointer casts in {d2i,i2d}_Foo_{bio,fp} 2018-10-01 17:34:53 +00:00
pkcs8 Fix undefined function pointer casts in {d2i,i2d}_Foo_{bio,fp} 2018-10-01 17:34:53 +00:00
poly1305 Automatically disable assembly with MSAN. 2018-09-07 21:12:37 +00:00
pool Clear out a bunch of -Wextra-semi warnings. 2019-02-21 19:12:39 +00:00
rand_extra Unwind RDRAND functions correctly on Windows. 2019-02-12 20:24:27 +00:00
rc4 Flatten most of the crypto target. 2018-09-05 23:41:25 +00:00
rsa_extra Rename OPENSSL_NO_THREADS, part 1. 2018-09-26 19:10:02 +00:00
stack Implement sk_find manually. 2019-03-14 15:21:48 +00:00
test Add a reference for Linux ARM ABI. 2019-02-27 17:18:02 +00:00
x509 Fix d2i_*_bio on partial reads. 2018-12-05 22:05:28 +00:00
x509v3 Unexport and rename hex_to_string, string_to_hex, and name_cmp. 2018-11-27 00:08:39 +00:00
abi_self_test.cc Use Windows symbol APIs in the unwind tester. 2019-02-12 20:42:47 +00:00
CMakeLists.txt Implement ABI testing for aarch64. 2019-02-05 21:44:04 +00:00
compiler_test.cc Test that nullptr has the obvious memory representation. 2017-07-28 17:39:28 +00:00
constant_time_test.cc Add a test for CRYPTO_memcmp. 2018-03-27 16:22:47 +00:00
cpu-aarch64-fuchsia.c Add cpu-aarch64-fuchsia.c 2018-02-13 20:12:47 +00:00
cpu-aarch64-linux.c Add cpu-aarch64-fuchsia.c 2018-02-13 20:12:47 +00:00
cpu-arm-linux_test.cc Move ARM cpuinfo functions to the header. 2018-11-21 00:46:57 +00:00
cpu-arm-linux.c Move ARM cpuinfo functions to the header. 2018-11-21 00:46:57 +00:00
cpu-arm-linux.h Move ARM cpuinfo functions to the header. 2018-11-21 00:46:57 +00:00
cpu-arm.c Rewrite ARM feature detection. 2016-03-26 04:54:44 +00:00
cpu-intel.c Pretend AMD XOP was never a thing. 2018-12-03 22:59:55 +00:00
cpu-ppc64le.c Run the comment converter on libcrypto. 2017-08-18 21:49:04 +00:00
crypto.c Add test of assembly code dispatch. 2019-01-22 20:22:53 +00:00
ex_data.c Unexport more of lhash. 2017-10-25 04:17:18 +00:00
impl_dispatch_test.cc Enable vpaes for AES_* functions. 2019-02-22 23:09:19 +00:00
internal.h Fix header file for _byteswap_ulong and _byteswap_uint64 from MSVC CRT 2019-01-14 19:49:39 +00:00
mem.c Tell ASan about the OPENSSL_malloc prefix. 2019-03-05 17:53:16 +00:00
refcount_c11.c Run the comment converter on libcrypto. 2017-08-18 21:49:04 +00:00
refcount_lock.c Modernize OPENSSL_COMPILE_ASSERT, part 2. 2018-11-14 16:06:37 +00:00
refcount_test.cc Rename OPENSSL_NO_THREADS, part 1. 2018-09-26 19:10:02 +00:00
self_test.cc Extract FIPS KAT tests into a function. 2018-01-22 20:16:38 +00:00
thread_none.c Rename OPENSSL_NO_THREADS, part 1. 2018-09-26 19:10:02 +00:00
thread_pthread.c Modernize OPENSSL_COMPILE_ASSERT, part 2. 2018-11-14 16:06:37 +00:00
thread_test.cc Rename OPENSSL_NO_THREADS, part 1. 2018-09-26 19:10:02 +00:00
thread_win.c Replace the last CRITICAL_SECTION with SRWLOCK. 2018-12-03 20:37:35 +00:00
thread.c Remove a bunch of unnecessary includes. 2016-06-28 20:31:14 +00:00