boringssl/crypto
David Benjamin 35be688078 Enable upstream's ChaCha20 assembly for x86 and ARM (32- and 64-bit).
This removes chacha_vec_arm.S and chacha_vec.c in favor of unifying on
upstream's code. Upstream's is faster and this cuts down on the number of
distinct codepaths. Our old scheme also didn't give vectorized code on
Windows or aarch64.

BoringSSL-specific modifications made to the assembly:

- As usual, the shelling out to $CC is replaced with hardcoding $avx. I've
  tested up to the AVX2 codepath, so enable it all.

- I've removed the AMD XOP code as I have not tested it.

- As usual, the ARM file need the arm_arch.h include tweaked.

Speed numbers follow. We can hope for further wins on these benchmarks after
importing the Poly1305 assembly.

x86
---
Old:
Did 1422000 ChaCha20-Poly1305 (16 bytes) seal operations in 1000433us (1421384.5 ops/sec): 22.7 MB/s
Did 123000 ChaCha20-Poly1305 (1350 bytes) seal operations in 1003803us (122534.0 ops/sec): 165.4 MB/s
Did 22000 ChaCha20-Poly1305 (8192 bytes) seal operations in 1000282us (21993.8 ops/sec): 180.2 MB/s
Did 1428000 ChaCha20-Poly1305-Old (16 bytes) seal operations in 1000214us (1427694.5 ops/sec): 22.8 MB/s
Did 124000 ChaCha20-Poly1305-Old (1350 bytes) seal operations in 1006332us (123219.8 ops/sec): 166.3 MB/s
Did 22000 ChaCha20-Poly1305-Old (8192 bytes) seal operations in 1020771us (21552.3 ops/sec): 176.6 MB/s
New:
Did 1520000 ChaCha20-Poly1305 (16 bytes) seal operations in 1000567us (1519138.6 ops/sec): 24.3 MB/s
Did 152000 ChaCha20-Poly1305 (1350 bytes) seal operations in 1004216us (151361.9 ops/sec): 204.3 MB/s
Did 31000 ChaCha20-Poly1305 (8192 bytes) seal operations in 1009085us (30720.9 ops/sec): 251.7 MB/s
Did 1797000 ChaCha20-Poly1305-Old (16 bytes) seal operations in 1000141us (1796746.7 ops/sec): 28.7 MB/s
Did 171000 ChaCha20-Poly1305-Old (1350 bytes) seal operations in 1003204us (170453.9 ops/sec): 230.1 MB/s
Did 31000 ChaCha20-Poly1305-Old (8192 bytes) seal operations in 1005349us (30835.1 ops/sec): 252.6 MB/s

x86_64, no AVX2
---
Old:
Did 1782000 ChaCha20-Poly1305 (16 bytes) seal operations in 1000204us (1781636.5 ops/sec): 28.5 MB/s
Did 317000 ChaCha20-Poly1305 (1350 bytes) seal operations in 1001579us (316500.2 ops/sec): 427.3 MB/s
Did 62000 ChaCha20-Poly1305 (8192 bytes) seal operations in 1012146us (61256.0 ops/sec): 501.8 MB/s
Did 1778000 ChaCha20-Poly1305-Old (16 bytes) seal operations in 1000220us (1777608.9 ops/sec): 28.4 MB/s
Did 315000 ChaCha20-Poly1305-Old (1350 bytes) seal operations in 1002886us (314093.5 ops/sec): 424.0 MB/s
Did 71000 ChaCha20-Poly1305-Old (8192 bytes) seal operations in 1014606us (69977.9 ops/sec): 573.3 MB/s
New:
Did 1866000 ChaCha20-Poly1305 (16 bytes) seal operations in 1000019us (1865964.5 ops/sec): 29.9 MB/s
Did 399000 ChaCha20-Poly1305 (1350 bytes) seal operations in 1001017us (398594.6 ops/sec): 538.1 MB/s
Did 84000 ChaCha20-Poly1305 (8192 bytes) seal operations in 1005645us (83528.5 ops/sec): 684.3 MB/s
Did 1881000 ChaCha20-Poly1305-Old (16 bytes) seal operations in 1000325us (1880388.9 ops/sec): 30.1 MB/s
Did 404000 ChaCha20-Poly1305-Old (1350 bytes) seal operations in 1000004us (403998.4 ops/sec): 545.4 MB/s
Did 85000 ChaCha20-Poly1305-Old (8192 bytes) seal operations in 1010048us (84154.4 ops/sec): 689.4 MB/s

x86_64, AVX2
---
Old:
Did 2375000 ChaCha20-Poly1305 (16 bytes) seal operations in 1000282us (2374330.4 ops/sec): 38.0 MB/s
Did 448000 ChaCha20-Poly1305 (1350 bytes) seal operations in 1001865us (447166.0 ops/sec): 603.7 MB/s
Did 88000 ChaCha20-Poly1305 (8192 bytes) seal operations in 1005217us (87543.3 ops/sec): 717.2 MB/s
Did 2409000 ChaCha20-Poly1305-Old (16 bytes) seal operations in 1000188us (2408547.2 ops/sec): 38.5 MB/s
Did 446000 ChaCha20-Poly1305-Old (1350 bytes) seal operations in 1001003us (445553.1 ops/sec): 601.5 MB/s
Did 90000 ChaCha20-Poly1305-Old (8192 bytes) seal operations in 1006722us (89399.1 ops/sec): 732.4 MB/s
New:
Did 2622000 ChaCha20-Poly1305 (16 bytes) seal operations in 1000266us (2621302.7 ops/sec): 41.9 MB/s
Did 794000 ChaCha20-Poly1305 (1350 bytes) seal operations in 1000783us (793378.8 ops/sec): 1071.1 MB/s
Did 173000 ChaCha20-Poly1305 (8192 bytes) seal operations in 1000176us (172969.6 ops/sec): 1417.0 MB/s
Did 2623000 ChaCha20-Poly1305-Old (16 bytes) seal operations in 1000330us (2622134.7 ops/sec): 42.0 MB/s
Did 783000 ChaCha20-Poly1305-Old (1350 bytes) seal operations in 1000531us (782584.4 ops/sec): 1056.5 MB/s
Did 174000 ChaCha20-Poly1305-Old (8192 bytes) seal operations in 1000840us (173854.0 ops/sec): 1424.2 MB/s

arm, Nexus 4
---
Old:
Did 388550 ChaCha20-Poly1305 (16 bytes) seal operations in 1000580us (388324.8 ops/sec): 6.2 MB/s
Did 90000 ChaCha20-Poly1305 (1350 bytes) seal operations in 1003816us (89657.9 ops/sec): 121.0 MB/s
Did 19000 ChaCha20-Poly1305 (8192 bytes) seal operations in 1045750us (18168.8 ops/sec): 148.8 MB/s
Did 398500 ChaCha20-Poly1305-Old (16 bytes) seal operations in 1000305us (398378.5 ops/sec): 6.4 MB/s
Did 90500 ChaCha20-Poly1305-Old (1350 bytes) seal operations in 1000305us (90472.4 ops/sec): 122.1 MB/s
Did 19000 ChaCha20-Poly1305-Old (8192 bytes) seal operations in 1043278us (18211.8 ops/sec): 149.2 MB/s
New:
Did 424788 ChaCha20-Poly1305 (16 bytes) seal operations in 1000641us (424515.9 ops/sec): 6.8 MB/s
Did 115000 ChaCha20-Poly1305 (1350 bytes) seal operations in 1001526us (114824.8 ops/sec): 155.0 MB/s
Did 27000 ChaCha20-Poly1305 (8192 bytes) seal operations in 1033023us (26136.9 ops/sec): 214.1 MB/s
Did 447750 ChaCha20-Poly1305-Old (16 bytes) seal operations in 1000549us (447504.3 ops/sec): 7.2 MB/s
Did 117500 ChaCha20-Poly1305-Old (1350 bytes) seal operations in 1001923us (117274.5 ops/sec): 158.3 MB/s
Did 27000 ChaCha20-Poly1305-Old (8192 bytes) seal operations in 1025118us (26338.4 ops/sec): 215.8 MB/s

aarch64, Nexus 6p
(Note we didn't have aarch64 assembly before at all, and still don't have it
for Poly1305. Hopefully once that's added this will be faster than the arm
numbers...)
---
Old:
Did 145040 ChaCha20-Poly1305 (16 bytes) seal operations in 1003065us (144596.8 ops/sec): 2.3 MB/s
Did 14000 ChaCha20-Poly1305 (1350 bytes) seal operations in 1042605us (13427.9 ops/sec): 18.1 MB/s
Did 2618 ChaCha20-Poly1305 (8192 bytes) seal operations in 1093241us (2394.7 ops/sec): 19.6 MB/s
Did 148000 ChaCha20-Poly1305-Old (16 bytes) seal operations in 1000709us (147895.1 ops/sec): 2.4 MB/s
Did 14000 ChaCha20-Poly1305-Old (1350 bytes) seal operations in 1047294us (13367.8 ops/sec): 18.0 MB/s
Did 2607 ChaCha20-Poly1305-Old (8192 bytes) seal operations in 1090745us (2390.1 ops/sec): 19.6 MB/s
New:
Did 358000 ChaCha20-Poly1305 (16 bytes) seal operations in 1000769us (357724.9 ops/sec): 5.7 MB/s
Did 45000 ChaCha20-Poly1305 (1350 bytes) seal operations in 1021267us (44062.9 ops/sec): 59.5 MB/s
Did 8591 ChaCha20-Poly1305 (8192 bytes) seal operations in 1047136us (8204.3 ops/sec): 67.2 MB/s
Did 343000 ChaCha20-Poly1305-Old (16 bytes) seal operations in 1000489us (342832.4 ops/sec): 5.5 MB/s
Did 44000 ChaCha20-Poly1305-Old (1350 bytes) seal operations in 1008326us (43636.7 ops/sec): 58.9 MB/s
Did 8866 ChaCha20-Poly1305-Old (8192 bytes) seal operations in 1083341us (8183.9 ops/sec): 67.0 MB/s

Change-Id: I629fe195d072f2c99e8f947578fad6d70823c4c8
Reviewed-on: https://boringssl-review.googlesource.com/7202
Reviewed-by: Adam Langley <agl@google.com>
2016-02-23 17:19:45 +00:00
..
aes Mark ARM assembly globals hidden uniformly in arm-xlate.pl. 2016-02-11 17:28:03 +00:00
asn1 OpenSSL reformat x509/, x509v3/, pem/ and asn1/. 2016-01-19 17:01:51 +00:00
base64 Remove calls to ERR_load_crypto_strings. 2016-01-25 23:09:08 +00:00
bio Remove CP_UTF8 code for Windows filenames. 2016-02-22 17:19:33 +00:00
bn Consistently use named constants in ARM assembly files. 2016-02-23 17:18:18 +00:00
buf Make |BUF_memdup| look for zero length, not NULL. 2015-10-06 18:11:33 -07:00
bytestring Add a convenience function for i2d compatibility wrappers. 2016-02-16 19:40:53 +00:00
chacha Enable upstream's ChaCha20 assembly for x86 and ARM (32- and 64-bit). 2016-02-23 17:19:45 +00:00
cipher Clarify some confusing casts involving |size_t|. 2016-02-12 15:37:15 +00:00
cmac Add a run_tests target to run all tests. 2015-10-26 20:33:44 +00:00
conf Also add a no-op stub for OPENSSL_config. 2016-01-26 15:48:51 +00:00
curve25519 ed25519: Don't negate output when decoding. 2016-02-16 21:07:44 +00:00
des Use the straight-forward ROTATE macro. 2015-12-16 19:57:31 +00:00
dh Don't cast |OPENSSL_malloc|/|OPENSSL_realloc| result. 2016-02-11 22:07:56 +00:00
digest Remove the arch-specific HOST_c2l/HOST_l2c implementations. 2016-01-27 22:26:32 +00:00
dsa Remove dead header file. 2016-02-17 01:34:15 +00:00
ec Remove flags field from EC_KEY. 2016-02-16 23:51:53 +00:00
ecdh Clean up |ECDH_compute_key|. 2015-10-27 17:00:25 +00:00
ecdsa Add a convenience function for i2d compatibility wrappers. 2016-02-16 19:40:53 +00:00
engine Unwind DH_METHOD and DSA_METHOD. 2015-11-03 22:54:36 +00:00
err Implement new SPKI parsers. 2016-02-17 16:28:07 +00:00
evp Implement new PKCS#8 parsers. 2016-02-17 17:24:10 +00:00
hkdf Remove calls to ERR_load_crypto_strings. 2016-01-25 23:09:08 +00:00
hmac Remove condition which always evaluates to true (size_t >= 0). 2015-11-11 22:20:19 +00:00
lhash Add a run_tests target to run all tests. 2015-10-26 20:33:44 +00:00
md4 Make HOST_l2c return void. 2015-12-16 20:02:37 +00:00
md5 Make HOST_l2c return void. 2015-12-16 20:02:37 +00:00
modes Restore |xmm7| correctly on Win64 in aesni-gcm-x86_64. 2016-02-18 15:50:46 +00:00
obj Allocate a NID for X25519. 2015-12-22 18:56:53 +00:00
pem Remove param_decode and param_encode EVP_PKEY hooks. 2016-02-17 16:30:29 +00:00
perlasm Mark ARM assembly globals hidden uniformly in arm-xlate.pl. 2016-02-11 17:28:03 +00:00
pkcs8 Implement new PKCS#8 parsers. 2016-02-17 17:24:10 +00:00
poly1305 Fix |-Werror=old-style-declaration| violations in poly1305_vec.c. 2016-01-28 23:58:45 +00:00
rand Add a few more no-op stubs for cURL compatibility. 2016-01-26 15:48:41 +00:00
rc4 Remove the stitched RC4-MD5 code and use the generic one. 2015-12-16 23:57:42 +00:00
rsa Add missing " in comment. 2016-02-17 21:17:26 +00:00
sha Consistently use named constants in ARM assembly files. 2016-02-23 17:18:18 +00:00
stack Move arm_arch.h and fix up lots of include paths. 2015-08-26 01:57:59 +00:00
test Remove support for blocks in file_test.h. 2016-02-17 17:24:57 +00:00
x509 Slightly simplify and deprecate i2d_{Public,Private}Key. 2016-02-17 16:31:26 +00:00
x509v3 OpenSSL reformat x509/, x509v3/, pem/ and asn1/. 2016-01-19 17:01:51 +00:00
CMakeLists.txt Add X25519 and Ed25519 support. 2015-11-17 21:56:12 +00:00
constant_time_test.c
cpu-arm-asm.S
cpu-arm.c Allow |CRYPTO_is_NEON_capable| to be known at compile time, if possible. 2015-11-19 00:15:11 +00:00
cpu-intel.c Fix |sscanf| format string in cpu-intel.c. 2016-01-21 20:59:35 +00:00
crypto.c Add a few more no-op stubs for cURL compatibility. 2016-01-26 15:48:41 +00:00
directory_posix.c
directory_win.c
directory.h
ex_data.c Skip free callbacks on empty CRYPTO_EX_DATAs. 2015-12-15 21:32:14 +00:00
internal.h Fix 32-bit build. 2016-01-27 22:29:52 +00:00
mem.c Fix some indentation. 2016-01-28 00:51:45 +00:00
refcount_c11.c Cast refcounts to _Atomic before use. 2015-05-20 13:39:22 -07:00
refcount_lock.c
refcount_test.c
thread_none.c
thread_pthread.c Make sure pthread_once() succeeds. 2015-11-17 21:44:40 +00:00
thread_test.c
thread_win.c Fix data <-> function pointer casts in thread_win.c. 2016-01-27 22:08:26 +00:00
thread.c
time_support.c Remove some mingw support cruft. 2016-01-25 23:05:45 +00:00