
Enable upstream's ChaCha20 assembly for x86 and ARM (32- and 64-bit).

This removes chacha_vec_arm.S and chacha_vec.c in favor of unifying on
upstream's code. Upstream's code is faster, and this cuts down on the number
of distinct codepaths. Our old scheme also didn't give us vectorized code on
Windows or aarch64.

BoringSSL-specific modifications made to the assembly:

- As usual, shelling out to $CC is replaced with hardcoding $avx. I've
  tested up to the AVX2 codepath, so all of it is enabled.

- I've removed the AMD XOP code as I have not tested it.

- As usual, the ARM files need the arm_arch.h include tweaked.

Speed numbers follow. We can hope for further wins on these benchmarks after
importing the Poly1305 assembly.
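
As a reading aid, the throughput column is just arithmetic on the raw counts:
ops/sec is operations divided by elapsed seconds, and MB/s is ops/sec times the
payload size in decimal megabytes. A minimal sketch of that arithmetic, using
the first x86 measurement below:

```
#include <stdio.h>

int main(void) {
  /* First x86 measurement below: 1422000 ops of 16 bytes in 1000433us. */
  double ops = 1422000.0, elapsed_us = 1000433.0, payload_bytes = 16.0;
  double ops_per_sec = ops / (elapsed_us / 1e6);          /* ~1421384.5 */
  double mb_per_sec = ops_per_sec * payload_bytes / 1e6;  /* ~22.7 MB/s */
  printf("%.1f ops/sec: %.1f MB/s\n", ops_per_sec, mb_per_sec);
  return 0;
}
```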

x86
---
Old:
Did 1422000 ChaCha20-Poly1305 (16 bytes) seal operations in 1000433us (1421384.5 ops/sec): 22.7 MB/s
Did 123000 ChaCha20-Poly1305 (1350 bytes) seal operations in 1003803us (122534.0 ops/sec): 165.4 MB/s
Did 22000 ChaCha20-Poly1305 (8192 bytes) seal operations in 1000282us (21993.8 ops/sec): 180.2 MB/s
Did 1428000 ChaCha20-Poly1305-Old (16 bytes) seal operations in 1000214us (1427694.5 ops/sec): 22.8 MB/s
Did 124000 ChaCha20-Poly1305-Old (1350 bytes) seal operations in 1006332us (123219.8 ops/sec): 166.3 MB/s
Did 22000 ChaCha20-Poly1305-Old (8192 bytes) seal operations in 1020771us (21552.3 ops/sec): 176.6 MB/s
New:
Did 1520000 ChaCha20-Poly1305 (16 bytes) seal operations in 1000567us (1519138.6 ops/sec): 24.3 MB/s
Did 152000 ChaCha20-Poly1305 (1350 bytes) seal operations in 1004216us (151361.9 ops/sec): 204.3 MB/s
Did 31000 ChaCha20-Poly1305 (8192 bytes) seal operations in 1009085us (30720.9 ops/sec): 251.7 MB/s
Did 1797000 ChaCha20-Poly1305-Old (16 bytes) seal operations in 1000141us (1796746.7 ops/sec): 28.7 MB/s
Did 171000 ChaCha20-Poly1305-Old (1350 bytes) seal operations in 1003204us (170453.9 ops/sec): 230.1 MB/s
Did 31000 ChaCha20-Poly1305-Old (8192 bytes) seal operations in 1005349us (30835.1 ops/sec): 252.6 MB/s

x86_64, no AVX2
---
Old:
Did 1782000 ChaCha20-Poly1305 (16 bytes) seal operations in 1000204us (1781636.5 ops/sec): 28.5 MB/s
Did 317000 ChaCha20-Poly1305 (1350 bytes) seal operations in 1001579us (316500.2 ops/sec): 427.3 MB/s
Did 62000 ChaCha20-Poly1305 (8192 bytes) seal operations in 1012146us (61256.0 ops/sec): 501.8 MB/s
Did 1778000 ChaCha20-Poly1305-Old (16 bytes) seal operations in 1000220us (1777608.9 ops/sec): 28.4 MB/s
Did 315000 ChaCha20-Poly1305-Old (1350 bytes) seal operations in 1002886us (314093.5 ops/sec): 424.0 MB/s
Did 71000 ChaCha20-Poly1305-Old (8192 bytes) seal operations in 1014606us (69977.9 ops/sec): 573.3 MB/s
New:
Did 1866000 ChaCha20-Poly1305 (16 bytes) seal operations in 1000019us (1865964.5 ops/sec): 29.9 MB/s
Did 399000 ChaCha20-Poly1305 (1350 bytes) seal operations in 1001017us (398594.6 ops/sec): 538.1 MB/s
Did 84000 ChaCha20-Poly1305 (8192 bytes) seal operations in 1005645us (83528.5 ops/sec): 684.3 MB/s
Did 1881000 ChaCha20-Poly1305-Old (16 bytes) seal operations in 1000325us (1880388.9 ops/sec): 30.1 MB/s
Did 404000 ChaCha20-Poly1305-Old (1350 bytes) seal operations in 1000004us (403998.4 ops/sec): 545.4 MB/s
Did 85000 ChaCha20-Poly1305-Old (8192 bytes) seal operations in 1010048us (84154.4 ops/sec): 689.4 MB/s

x86_64, AVX2
---
Old:
Did 2375000 ChaCha20-Poly1305 (16 bytes) seal operations in 1000282us (2374330.4 ops/sec): 38.0 MB/s
Did 448000 ChaCha20-Poly1305 (1350 bytes) seal operations in 1001865us (447166.0 ops/sec): 603.7 MB/s
Did 88000 ChaCha20-Poly1305 (8192 bytes) seal operations in 1005217us (87543.3 ops/sec): 717.2 MB/s
Did 2409000 ChaCha20-Poly1305-Old (16 bytes) seal operations in 1000188us (2408547.2 ops/sec): 38.5 MB/s
Did 446000 ChaCha20-Poly1305-Old (1350 bytes) seal operations in 1001003us (445553.1 ops/sec): 601.5 MB/s
Did 90000 ChaCha20-Poly1305-Old (8192 bytes) seal operations in 1006722us (89399.1 ops/sec): 732.4 MB/s
New:
Did 2622000 ChaCha20-Poly1305 (16 bytes) seal operations in 1000266us (2621302.7 ops/sec): 41.9 MB/s
Did 794000 ChaCha20-Poly1305 (1350 bytes) seal operations in 1000783us (793378.8 ops/sec): 1071.1 MB/s
Did 173000 ChaCha20-Poly1305 (8192 bytes) seal operations in 1000176us (172969.6 ops/sec): 1417.0 MB/s
Did 2623000 ChaCha20-Poly1305-Old (16 bytes) seal operations in 1000330us (2622134.7 ops/sec): 42.0 MB/s
Did 783000 ChaCha20-Poly1305-Old (1350 bytes) seal operations in 1000531us (782584.4 ops/sec): 1056.5 MB/s
Did 174000 ChaCha20-Poly1305-Old (8192 bytes) seal operations in 1000840us (173854.0 ops/sec): 1424.2 MB/s

arm, Nexus 4
---
Old:
Did 388550 ChaCha20-Poly1305 (16 bytes) seal operations in 1000580us (388324.8 ops/sec): 6.2 MB/s
Did 90000 ChaCha20-Poly1305 (1350 bytes) seal operations in 1003816us (89657.9 ops/sec): 121.0 MB/s
Did 19000 ChaCha20-Poly1305 (8192 bytes) seal operations in 1045750us (18168.8 ops/sec): 148.8 MB/s
Did 398500 ChaCha20-Poly1305-Old (16 bytes) seal operations in 1000305us (398378.5 ops/sec): 6.4 MB/s
Did 90500 ChaCha20-Poly1305-Old (1350 bytes) seal operations in 1000305us (90472.4 ops/sec): 122.1 MB/s
Did 19000 ChaCha20-Poly1305-Old (8192 bytes) seal operations in 1043278us (18211.8 ops/sec): 149.2 MB/s
New:
Did 424788 ChaCha20-Poly1305 (16 bytes) seal operations in 1000641us (424515.9 ops/sec): 6.8 MB/s
Did 115000 ChaCha20-Poly1305 (1350 bytes) seal operations in 1001526us (114824.8 ops/sec): 155.0 MB/s
Did 27000 ChaCha20-Poly1305 (8192 bytes) seal operations in 1033023us (26136.9 ops/sec): 214.1 MB/s
Did 447750 ChaCha20-Poly1305-Old (16 bytes) seal operations in 1000549us (447504.3 ops/sec): 7.2 MB/s
Did 117500 ChaCha20-Poly1305-Old (1350 bytes) seal operations in 1001923us (117274.5 ops/sec): 158.3 MB/s
Did 27000 ChaCha20-Poly1305-Old (8192 bytes) seal operations in 1025118us (26338.4 ops/sec): 215.8 MB/s

aarch64, Nexus 6p
(Note we didn't have aarch64 assembly before at all, and still don't have it
for Poly1305. Hopefully once that's added this will be faster than the arm
numbers...)
---
Old:
Did 145040 ChaCha20-Poly1305 (16 bytes) seal operations in 1003065us (144596.8 ops/sec): 2.3 MB/s
Did 14000 ChaCha20-Poly1305 (1350 bytes) seal operations in 1042605us (13427.9 ops/sec): 18.1 MB/s
Did 2618 ChaCha20-Poly1305 (8192 bytes) seal operations in 1093241us (2394.7 ops/sec): 19.6 MB/s
Did 148000 ChaCha20-Poly1305-Old (16 bytes) seal operations in 1000709us (147895.1 ops/sec): 2.4 MB/s
Did 14000 ChaCha20-Poly1305-Old (1350 bytes) seal operations in 1047294us (13367.8 ops/sec): 18.0 MB/s
Did 2607 ChaCha20-Poly1305-Old (8192 bytes) seal operations in 1090745us (2390.1 ops/sec): 19.6 MB/s
New:
Did 358000 ChaCha20-Poly1305 (16 bytes) seal operations in 1000769us (357724.9 ops/sec): 5.7 MB/s
Did 45000 ChaCha20-Poly1305 (1350 bytes) seal operations in 1021267us (44062.9 ops/sec): 59.5 MB/s
Did 8591 ChaCha20-Poly1305 (8192 bytes) seal operations in 1047136us (8204.3 ops/sec): 67.2 MB/s
Did 343000 ChaCha20-Poly1305-Old (16 bytes) seal operations in 1000489us (342832.4 ops/sec): 5.5 MB/s
Did 44000 ChaCha20-Poly1305-Old (1350 bytes) seal operations in 1008326us (43636.7 ops/sec): 58.9 MB/s
Did 8866 ChaCha20-Poly1305-Old (8192 bytes) seal operations in 1083341us (8183.9 ops/sec): 67.0 MB/s

Change-Id: I629fe195d072f2c99e8f947578fad6d70823c4c8
Reviewed-on: https://boringssl-review.googlesource.com/7202
Reviewed-by: Adam Langley <agl@google.com>
David Benjamin (committed by Adam Langley)
commit 35be688078
11 changed files with 83 additions and 2819 deletions
  1. +0   -10    BUILDING.md
  2. +31  -3     crypto/chacha/CMakeLists.txt
  3. +1   -1     crypto/chacha/asm/chacha-armv4.pl
  4. +1   -1     crypto/chacha/asm/chacha-armv8.pl
  5. +3   -382   crypto/chacha/asm/chacha-x86.pl
  6. +3   -473   crypto/chacha/asm/chacha-x86_64.pl
  7. +44  -20    crypto/chacha/chacha.c
  8. +0   -328   crypto/chacha/chacha_vec.c
  9. +0   -1447  crypto/chacha/chacha_vec_arm.S
  10. +0  -153   crypto/chacha/chacha_vec_arm_generate.go
  11. +0  -1     util/generate_build_files.py

+0 -10   BUILDING.md

@@ -31,16 +31,6 @@
* [Go](https://golang.org/dl/) is required. If not found by CMake, the go
executable may be configured explicitly by setting `GO_EXECUTABLE`.

* If you change crypto/chacha/chacha\_vec.c, you will need the
arm-linux-gnueabihf-gcc compiler:

```
wget https://releases.linaro.org/14.11/components/toolchain/binaries/arm-linux-gnueabihf/gcc-linaro-4.9-2014.11-x86_64_arm-linux-gnueabihf.tar.xz && \
echo bc4ca2ced084d2dc12424815a4442e19cb1422db87068830305d90075feb1a3b gcc-linaro-4.9-2014.11-x86_64_arm-linux-gnueabihf.tar.xz | sha256sum -c && \
tar xf gcc-linaro-4.9-2014.11-x86_64_arm-linux-gnueabihf.tar.xz && \
sudo mv gcc-linaro-4.9-2014.11-x86_64_arm-linux-gnueabihf /opt/
```

## Building

Using Ninja (note the 'N' is capitalized in the cmake invocation):


+31 -3   crypto/chacha/CMakeLists.txt

@@ -4,7 +4,31 @@ if (${ARCH} STREQUAL "arm")
set(
CHACHA_ARCH_SOURCES

chacha_vec_arm.S
chacha-armv4.${ASM_EXT}
)
endif()

if (${ARCH} STREQUAL "aarch64")
set(
CHACHA_ARCH_SOURCES

chacha-armv8.${ASM_EXT}
)
endif()

if (${ARCH} STREQUAL "x86")
set(
CHACHA_ARCH_SOURCES

chacha-x86.${ASM_EXT}
)
endif()

if (${ARCH} STREQUAL "x86_64")
set(
CHACHA_ARCH_SOURCES

chacha-x86_64.${ASM_EXT}
)
endif()

@@ -13,8 +37,12 @@ add_library(

OBJECT

chacha_generic.c
chacha_vec.c
chacha.c

${CHACHA_ARCH_SOURCES}
)

perlasm(chacha-armv4.${ASM_EXT} asm/chacha-armv4.pl)
perlasm(chacha-armv8.${ASM_EXT} asm/chacha-armv8.pl)
perlasm(chacha-x86.${ASM_EXT} asm/chacha-x86.pl)
perlasm(chacha-x86_64.${ASM_EXT} asm/chacha-x86_64.pl)

+1 -1   crypto/chacha/asm/chacha-armv4.pl

@@ -162,7 +162,7 @@ my @ret;
}

$code.=<<___;
#include "arm_arch.h"
#include <openssl/arm_arch.h>

.text
#if defined(__thumb2__)


+1 -1   crypto/chacha/asm/chacha-armv8.pl

@@ -111,7 +111,7 @@ my ($a3,$b3,$c3,$d3)=map(($_&~3)+(($_+1)&3),($a2,$b2,$c2,$d2));
}

$code.=<<___;
#include "arm_arch.h"
#include <openssl/arm_arch.h>

.text



+3 -382   crypto/chacha/asm/chacha-x86.pl

@@ -26,6 +26,8 @@
# Bulldozer 13.4/+50% 4.38(*)
#
# (*) Bulldozer actually executes 4xXOP code path that delivers 3.55;
#
# Modified from upstream OpenSSL to remove the XOP code.

$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1;
push(@INC,"${dir}","${dir}../../perlasm");
@@ -36,22 +38,7 @@ require "x86asm.pl";
$xmm=$ymm=0;
for (@ARGV) { $xmm=1 if (/-DOPENSSL_IA32_SSE2/); }

$ymm=1 if ($xmm &&
`$ENV{CC} -Wa,-v -c -o /dev/null -x assembler /dev/null 2>&1`
=~ /GNU assembler version ([2-9]\.[0-9]+)/ &&
$1>=2.19); # first version supporting AVX

$ymm=1 if ($xmm && !$ymm && $ARGV[0] eq "win32n" &&
`nasm -v 2>&1` =~ /NASM version ([2-9]\.[0-9]+)/ &&
$1>=2.03); # first version supporting AVX

$ymm=1 if ($xmm && !$ymm && $ARGV[0] eq "win32" &&
`ml 2>&1` =~ /Version ([0-9]+)\./ &&
$1>=10); # first version supporting AVX

$ymm=1 if ($xmm && !$ymm &&
`$ENV{CC} -v 2>&1` =~ /(^clang version|based on LLVM) ([3-9]\.[0-9]+)/ &&
$2>=3.0); # first version supporting AVX
$ymm=$xmm;

$a="eax";
($b,$b_)=("ebx","ebp");
@@ -118,7 +105,6 @@ my ($ap,$bp,$cp,$dp)=map(($_&~3)+(($_-1)&3),($ai,$bi,$ci,$di)); # previous
}

&static_label("ssse3_shortcut");
&static_label("xop_shortcut");
&static_label("ssse3_data");
&static_label("pic_point");

@@ -434,9 +420,6 @@ my ($ap,$bp,$cp,$dp)=map(($_&~3)+(($_-1)&3),($ai,$bi,$ci,$di)); # previous

&function_begin("ChaCha20_ssse3");
&set_label("ssse3_shortcut");
&test (&DWP(4,"ebp"),1<<11); # test XOP bit
&jnz (&label("xop_shortcut"));

&mov ($out,&wparam(0));
&mov ($inp,&wparam(1));
&mov ($len,&wparam(2));
@@ -767,366 +750,4 @@ sub SSSE3ROUND { # critical path is 20 "SIMD ticks" per round
}
&asciz ("ChaCha20 for x86, CRYPTOGAMS by <appro\@openssl.org>");

if ($xmm) {
my ($xa,$xa_,$xb,$xb_,$xc,$xc_,$xd,$xd_)=map("xmm$_",(0..7));
my ($out,$inp,$len)=("edi","esi","ecx");

sub QUARTERROUND_XOP {
my ($ai,$bi,$ci,$di,$i)=@_;
my ($an,$bn,$cn,$dn)=map(($_&~3)+(($_+1)&3),($ai,$bi,$ci,$di)); # next
my ($ap,$bp,$cp,$dp)=map(($_&~3)+(($_-1)&3),($ai,$bi,$ci,$di)); # previous

# a b c d
#
# 0 4 8 12 < even round
# 1 5 9 13
# 2 6 10 14
# 3 7 11 15
# 0 5 10 15 < odd round
# 1 6 11 12
# 2 7 8 13
# 3 4 9 14

if ($i==0) {
my $j=4;
($ap,$bp,$cp,$dp)=map(($_&~3)+(($_-$j--)&3),($ap,$bp,$cp,$dp));
} elsif ($i==3) {
my $j=0;
($an,$bn,$cn,$dn)=map(($_&~3)+(($_+$j++)&3),($an,$bn,$cn,$dn));
} elsif ($i==4) {
my $j=4;
($ap,$bp,$cp,$dp)=map(($_&~3)+(($_+$j--)&3),($ap,$bp,$cp,$dp));
} elsif ($i==7) {
my $j=0;
($an,$bn,$cn,$dn)=map(($_&~3)+(($_-$j++)&3),($an,$bn,$cn,$dn));
}

#&vpaddd ($xa,$xa,$xb); # see elsewhere
#&vpxor ($xd,$xd,$xa); # see elsewhere
&vmovdqa (&QWP(16*$cp-128,"ebx"),$xc_) if ($ai>0 && $ai<3);
&vprotd ($xd,$xd,16);
&vmovdqa (&QWP(16*$bp-128,"ebx"),$xb_) if ($i!=0);
&vpaddd ($xc,$xc,$xd);
&vmovdqa ($xc_,&QWP(16*$cn-128,"ebx")) if ($ai>0 && $ai<3);
&vpxor ($xb,$i!=0?$xb:$xb_,$xc);
&vmovdqa ($xa_,&QWP(16*$an-128,"ebx"));
&vprotd ($xb,$xb,12);
&vmovdqa ($xb_,&QWP(16*$bn-128,"ebx")) if ($i<7);
&vpaddd ($xa,$xa,$xb);
&vmovdqa ($xd_,&QWP(16*$dn-128,"ebx")) if ($di!=$dn);
&vpxor ($xd,$xd,$xa);
&vpaddd ($xa_,$xa_,$xb_) if ($i<7); # elsewhere
&vprotd ($xd,$xd,8);
&vmovdqa (&QWP(16*$ai-128,"ebx"),$xa);
&vpaddd ($xc,$xc,$xd);
&vmovdqa (&QWP(16*$di-128,"ebx"),$xd) if ($di!=$dn);
&vpxor ($xb,$xb,$xc);
&vpxor ($xd_,$di==$dn?$xd:$xd_,$xa_) if ($i<7); # elsewhere
&vprotd ($xb,$xb,7);

($xa,$xa_)=($xa_,$xa);
($xb,$xb_)=($xb_,$xb);
($xc,$xc_)=($xc_,$xc);
($xd,$xd_)=($xd_,$xd);
}

&function_begin("ChaCha20_xop");
&set_label("xop_shortcut");
&mov ($out,&wparam(0));
&mov ($inp,&wparam(1));
&mov ($len,&wparam(2));
&mov ("edx",&wparam(3)); # key
&mov ("ebx",&wparam(4)); # counter and nonce
&vzeroupper ();

&mov ("ebp","esp");
&stack_push (131);
&and ("esp",-64);
&mov (&DWP(512,"esp"),"ebp");

&lea ("eax",&DWP(&label("ssse3_data")."-".
&label("pic_point"),"eax"));
&vmovdqu ("xmm3",&QWP(0,"ebx")); # counter and nonce

&cmp ($len,64*4);
&jb (&label("1x"));

&mov (&DWP(512+4,"esp"),"edx"); # offload pointers
&mov (&DWP(512+8,"esp"),"ebx");
&sub ($len,64*4); # bias len
&lea ("ebp",&DWP(256+128,"esp")); # size optimization

&vmovdqu ("xmm7",&QWP(0,"edx")); # key
&vpshufd ("xmm0","xmm3",0x00);
&vpshufd ("xmm1","xmm3",0x55);
&vpshufd ("xmm2","xmm3",0xaa);
&vpshufd ("xmm3","xmm3",0xff);
&vpaddd ("xmm0","xmm0",&QWP(16*3,"eax")); # fix counters
&vpshufd ("xmm4","xmm7",0x00);
&vpshufd ("xmm5","xmm7",0x55);
&vpsubd ("xmm0","xmm0",&QWP(16*4,"eax"));
&vpshufd ("xmm6","xmm7",0xaa);
&vpshufd ("xmm7","xmm7",0xff);
&vmovdqa (&QWP(16*12-128,"ebp"),"xmm0");
&vmovdqa (&QWP(16*13-128,"ebp"),"xmm1");
&vmovdqa (&QWP(16*14-128,"ebp"),"xmm2");
&vmovdqa (&QWP(16*15-128,"ebp"),"xmm3");
&vmovdqu ("xmm3",&QWP(16,"edx")); # key
&vmovdqa (&QWP(16*4-128,"ebp"),"xmm4");
&vmovdqa (&QWP(16*5-128,"ebp"),"xmm5");
&vmovdqa (&QWP(16*6-128,"ebp"),"xmm6");
&vmovdqa (&QWP(16*7-128,"ebp"),"xmm7");
&vmovdqa ("xmm7",&QWP(16*2,"eax")); # sigma
&lea ("ebx",&DWP(128,"esp")); # size optimization

&vpshufd ("xmm0","xmm3",0x00);
&vpshufd ("xmm1","xmm3",0x55);
&vpshufd ("xmm2","xmm3",0xaa);
&vpshufd ("xmm3","xmm3",0xff);
&vpshufd ("xmm4","xmm7",0x00);
&vpshufd ("xmm5","xmm7",0x55);
&vpshufd ("xmm6","xmm7",0xaa);
&vpshufd ("xmm7","xmm7",0xff);
&vmovdqa (&QWP(16*8-128,"ebp"),"xmm0");
&vmovdqa (&QWP(16*9-128,"ebp"),"xmm1");
&vmovdqa (&QWP(16*10-128,"ebp"),"xmm2");
&vmovdqa (&QWP(16*11-128,"ebp"),"xmm3");
&vmovdqa (&QWP(16*0-128,"ebp"),"xmm4");
&vmovdqa (&QWP(16*1-128,"ebp"),"xmm5");
&vmovdqa (&QWP(16*2-128,"ebp"),"xmm6");
&vmovdqa (&QWP(16*3-128,"ebp"),"xmm7");

&lea ($inp,&DWP(128,$inp)); # size optimization
&lea ($out,&DWP(128,$out)); # size optimization
&jmp (&label("outer_loop"));

&set_label("outer_loop",32);
#&vmovdqa ("xmm0",&QWP(16*0-128,"ebp")); # copy key material
&vmovdqa ("xmm1",&QWP(16*1-128,"ebp"));
&vmovdqa ("xmm2",&QWP(16*2-128,"ebp"));
&vmovdqa ("xmm3",&QWP(16*3-128,"ebp"));
#&vmovdqa ("xmm4",&QWP(16*4-128,"ebp"));
&vmovdqa ("xmm5",&QWP(16*5-128,"ebp"));
&vmovdqa ("xmm6",&QWP(16*6-128,"ebp"));
&vmovdqa ("xmm7",&QWP(16*7-128,"ebp"));
#&vmovdqa (&QWP(16*0-128,"ebx"),"xmm0");
&vmovdqa (&QWP(16*1-128,"ebx"),"xmm1");
&vmovdqa (&QWP(16*2-128,"ebx"),"xmm2");
&vmovdqa (&QWP(16*3-128,"ebx"),"xmm3");
#&vmovdqa (&QWP(16*4-128,"ebx"),"xmm4");
&vmovdqa (&QWP(16*5-128,"ebx"),"xmm5");
&vmovdqa (&QWP(16*6-128,"ebx"),"xmm6");
&vmovdqa (&QWP(16*7-128,"ebx"),"xmm7");
#&vmovdqa ("xmm0",&QWP(16*8-128,"ebp"));
#&vmovdqa ("xmm1",&QWP(16*9-128,"ebp"));
&vmovdqa ("xmm2",&QWP(16*10-128,"ebp"));
&vmovdqa ("xmm3",&QWP(16*11-128,"ebp"));
&vmovdqa ("xmm4",&QWP(16*12-128,"ebp"));
&vmovdqa ("xmm5",&QWP(16*13-128,"ebp"));
&vmovdqa ("xmm6",&QWP(16*14-128,"ebp"));
&vmovdqa ("xmm7",&QWP(16*15-128,"ebp"));
&vpaddd ("xmm4","xmm4",&QWP(16*4,"eax")); # counter value
#&vmovdqa (&QWP(16*8-128,"ebx"),"xmm0");
#&vmovdqa (&QWP(16*9-128,"ebx"),"xmm1");
&vmovdqa (&QWP(16*10-128,"ebx"),"xmm2");
&vmovdqa (&QWP(16*11-128,"ebx"),"xmm3");
&vmovdqa (&QWP(16*12-128,"ebx"),"xmm4");
&vmovdqa (&QWP(16*13-128,"ebx"),"xmm5");
&vmovdqa (&QWP(16*14-128,"ebx"),"xmm6");
&vmovdqa (&QWP(16*15-128,"ebx"),"xmm7");
&vmovdqa (&QWP(16*12-128,"ebp"),"xmm4"); # save counter value

&vmovdqa ($xa, &QWP(16*0-128,"ebp"));
&vmovdqa ($xd, "xmm4");
&vmovdqa ($xb_,&QWP(16*4-128,"ebp"));
&vmovdqa ($xc, &QWP(16*8-128,"ebp"));
&vmovdqa ($xc_,&QWP(16*9-128,"ebp"));

&mov ("edx",10); # loop counter
&nop ();

&set_label("loop",32);
&vpaddd ($xa,$xa,$xb_); # elsewhere
&vpxor ($xd,$xd,$xa); # elsewhere
&QUARTERROUND_XOP(0, 4, 8, 12, 0);
&QUARTERROUND_XOP(1, 5, 9, 13, 1);
&QUARTERROUND_XOP(2, 6,10, 14, 2);
&QUARTERROUND_XOP(3, 7,11, 15, 3);
&QUARTERROUND_XOP(0, 5,10, 15, 4);
&QUARTERROUND_XOP(1, 6,11, 12, 5);
&QUARTERROUND_XOP(2, 7, 8, 13, 6);
&QUARTERROUND_XOP(3, 4, 9, 14, 7);
&dec ("edx");
&jnz (&label("loop"));

&vmovdqa (&QWP(16*4-128,"ebx"),$xb_);
&vmovdqa (&QWP(16*8-128,"ebx"),$xc);
&vmovdqa (&QWP(16*9-128,"ebx"),$xc_);
&vmovdqa (&QWP(16*12-128,"ebx"),$xd);
&vmovdqa (&QWP(16*14-128,"ebx"),$xd_);

my ($xa0,$xa1,$xa2,$xa3,$xt0,$xt1,$xt2,$xt3)=map("xmm$_",(0..7));

#&vmovdqa ($xa0,&QWP(16*0-128,"ebx")); # it's there
&vmovdqa ($xa1,&QWP(16*1-128,"ebx"));
&vmovdqa ($xa2,&QWP(16*2-128,"ebx"));
&vmovdqa ($xa3,&QWP(16*3-128,"ebx"));

for($i=0;$i<256;$i+=64) {
&vpaddd ($xa0,$xa0,&QWP($i+16*0-128,"ebp")); # accumulate key material
&vpaddd ($xa1,$xa1,&QWP($i+16*1-128,"ebp"));
&vpaddd ($xa2,$xa2,&QWP($i+16*2-128,"ebp"));
&vpaddd ($xa3,$xa3,&QWP($i+16*3-128,"ebp"));

&vpunpckldq ($xt2,$xa0,$xa1); # "de-interlace" data
&vpunpckldq ($xt3,$xa2,$xa3);
&vpunpckhdq ($xa0,$xa0,$xa1);
&vpunpckhdq ($xa2,$xa2,$xa3);
&vpunpcklqdq ($xa1,$xt2,$xt3); # "a0"
&vpunpckhqdq ($xt2,$xt2,$xt3); # "a1"
&vpunpcklqdq ($xt3,$xa0,$xa2); # "a2"
&vpunpckhqdq ($xa3,$xa0,$xa2); # "a3"

&vpxor ($xt0,$xa1,&QWP(64*0-128,$inp));
&vpxor ($xt1,$xt2,&QWP(64*1-128,$inp));
&vpxor ($xt2,$xt3,&QWP(64*2-128,$inp));
&vpxor ($xt3,$xa3,&QWP(64*3-128,$inp));
&lea ($inp,&QWP($i<192?16:(64*4-16*3),$inp));
&vmovdqa ($xa0,&QWP($i+16*4-128,"ebx")) if ($i<192);
&vmovdqa ($xa1,&QWP($i+16*5-128,"ebx")) if ($i<192);
&vmovdqa ($xa2,&QWP($i+16*6-128,"ebx")) if ($i<192);
&vmovdqa ($xa3,&QWP($i+16*7-128,"ebx")) if ($i<192);
&vmovdqu (&QWP(64*0-128,$out),$xt0); # store output
&vmovdqu (&QWP(64*1-128,$out),$xt1);
&vmovdqu (&QWP(64*2-128,$out),$xt2);
&vmovdqu (&QWP(64*3-128,$out),$xt3);
&lea ($out,&QWP($i<192?16:(64*4-16*3),$out));
}
&sub ($len,64*4);
&jnc (&label("outer_loop"));

&add ($len,64*4);
&jz (&label("done"));

&mov ("ebx",&DWP(512+8,"esp")); # restore pointers
&lea ($inp,&DWP(-128,$inp));
&mov ("edx",&DWP(512+4,"esp"));
&lea ($out,&DWP(-128,$out));

&vmovd ("xmm2",&DWP(16*12-128,"ebp")); # counter value
&vmovdqu ("xmm3",&QWP(0,"ebx"));
&vpaddd ("xmm2","xmm2",&QWP(16*6,"eax"));# +four
&vpand ("xmm3","xmm3",&QWP(16*7,"eax"));
&vpor ("xmm3","xmm3","xmm2"); # counter value
{
my ($a,$b,$c,$d,$t,$t1,$rot16,$rot24)=map("xmm$_",(0..7));

sub XOPROUND {
&vpaddd ($a,$a,$b);
&vpxor ($d,$d,$a);
&vprotd ($d,$d,16);

&vpaddd ($c,$c,$d);
&vpxor ($b,$b,$c);
&vprotd ($b,$b,12);

&vpaddd ($a,$a,$b);
&vpxor ($d,$d,$a);
&vprotd ($d,$d,8);

&vpaddd ($c,$c,$d);
&vpxor ($b,$b,$c);
&vprotd ($b,$b,7);
}

&set_label("1x");
&vmovdqa ($a,&QWP(16*2,"eax")); # sigma
&vmovdqu ($b,&QWP(0,"edx"));
&vmovdqu ($c,&QWP(16,"edx"));
#&vmovdqu ($d,&QWP(0,"ebx")); # already loaded
&vmovdqa ($rot16,&QWP(0,"eax"));
&vmovdqa ($rot24,&QWP(16,"eax"));
&mov (&DWP(16*3,"esp"),"ebp");

&vmovdqa (&QWP(16*0,"esp"),$a);
&vmovdqa (&QWP(16*1,"esp"),$b);
&vmovdqa (&QWP(16*2,"esp"),$c);
&vmovdqa (&QWP(16*3,"esp"),$d);
&mov ("edx",10);
&jmp (&label("loop1x"));

&set_label("outer1x",16);
&vmovdqa ($d,&QWP(16*5,"eax")); # one
&vmovdqa ($a,&QWP(16*0,"esp"));
&vmovdqa ($b,&QWP(16*1,"esp"));
&vmovdqa ($c,&QWP(16*2,"esp"));
&vpaddd ($d,$d,&QWP(16*3,"esp"));
&mov ("edx",10);
&vmovdqa (&QWP(16*3,"esp"),$d);
&jmp (&label("loop1x"));

&set_label("loop1x",16);
&XOPROUND();
&vpshufd ($c,$c,0b01001110);
&vpshufd ($b,$b,0b00111001);
&vpshufd ($d,$d,0b10010011);

&XOPROUND();
&vpshufd ($c,$c,0b01001110);
&vpshufd ($b,$b,0b10010011);
&vpshufd ($d,$d,0b00111001);

&dec ("edx");
&jnz (&label("loop1x"));

&vpaddd ($a,$a,&QWP(16*0,"esp"));
&vpaddd ($b,$b,&QWP(16*1,"esp"));
&vpaddd ($c,$c,&QWP(16*2,"esp"));
&vpaddd ($d,$d,&QWP(16*3,"esp"));

&cmp ($len,64);
&jb (&label("tail"));

&vpxor ($a,$a,&QWP(16*0,$inp)); # xor with input
&vpxor ($b,$b,&QWP(16*1,$inp));
&vpxor ($c,$c,&QWP(16*2,$inp));
&vpxor ($d,$d,&QWP(16*3,$inp));
&lea ($inp,&DWP(16*4,$inp)); # inp+=64

&vmovdqu (&QWP(16*0,$out),$a); # write output
&vmovdqu (&QWP(16*1,$out),$b);
&vmovdqu (&QWP(16*2,$out),$c);
&vmovdqu (&QWP(16*3,$out),$d);
&lea ($out,&DWP(16*4,$out)); # inp+=64

&sub ($len,64);
&jnz (&label("outer1x"));

&jmp (&label("done"));

&set_label("tail");
&vmovdqa (&QWP(16*0,"esp"),$a);
&vmovdqa (&QWP(16*1,"esp"),$b);
&vmovdqa (&QWP(16*2,"esp"),$c);
&vmovdqa (&QWP(16*3,"esp"),$d);

&xor ("eax","eax");
&xor ("edx","edx");
&xor ("ebp","ebp");

&set_label("tail_loop");
&movb ("al",&BP(0,"esp","ebp"));
&movb ("dl",&BP(0,$inp,"ebp"));
&lea ("ebp",&DWP(1,"ebp"));
&xor ("al","dl");
&movb (&BP(-1,$out,"ebp"),"al");
&dec ($len);
&jnz (&label("tail_loop"));
}
&set_label("done");
&vzeroupper ();
&mov ("esp",&DWP(512,"esp"));
&function_end("ChaCha20_xop");
}

&asm_finish();

+3 -473   crypto/chacha/asm/chacha-x86_64.pl

@@ -36,6 +36,8 @@
# limitations, SSE2 can do better, but gain is considered too
# low to justify the [maintenance] effort;
# (iv) Bulldozer actually executes 4xXOP code path that delivers 2.20;
#
# Modified from upstream OpenSSL to remove the XOP code.

$flavour = shift;
$output = shift;
@@ -48,24 +50,7 @@ $0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1;
( $xlate="${dir}../../perlasm/x86_64-xlate.pl" and -f $xlate) or
die "can't locate x86_64-xlate.pl";

if (`$ENV{CC} -Wa,-v -c -o /dev/null -x assembler /dev/null 2>&1`
=~ /GNU assembler version ([2-9]\.[0-9]+)/) {
$avx = ($1>=2.19) + ($1>=2.22);
}

if (!$avx && $win64 && ($flavour =~ /nasm/ || $ENV{ASM} =~ /nasm/) &&
`nasm -v 2>&1` =~ /NASM version ([2-9]\.[0-9]+)/) {
$avx = ($1>=2.09) + ($1>=2.10);
}

if (!$avx && $win64 && ($flavour =~ /masm/ || $ENV{ASM} =~ /ml64/) &&
`ml64 2>&1` =~ /Version ([0-9]+)\./) {
$avx = ($1>=10) + ($1>=11);
}

if (!$avx && `$ENV{CC} -v 2>&1` =~ /((?:^clang|LLVM) version|.*based on LLVM) ([3-9]\.[0-9]+)/) {
$avx = ($2>=3.0) + ($2>3.0);
}
$avx = 2;

open OUT,"| \"$^X\" $xlate $flavour $output";
*STDOUT=*OUT;
@@ -419,10 +404,6 @@ $code.=<<___;
ChaCha20_ssse3:
.LChaCha20_ssse3:
___
$code.=<<___ if ($avx);
test \$`1<<(43-32)`,%r10d
jnz .LChaCha20_4xop # XOP is fastest even if we use 1/4
___
$code.=<<___;
cmp \$128,$len # we might throw away some data,
ja .LChaCha20_4x # but overall it won't be slower
@@ -1133,457 +1114,6 @@ $code.=<<___;
___
}

########################################################################
# XOP code path that handles all lengths.
if ($avx) {
# There is some "anomaly" observed depending on instructions' size or
# alignment. If you look closely at below code you'll notice that
# sometimes argument order varies. The order affects instruction
# encoding by making it larger, and such fiddling gives 5% performance
# improvement. This is on FX-4100...

my ($xb0,$xb1,$xb2,$xb3, $xd0,$xd1,$xd2,$xd3,
$xa0,$xa1,$xa2,$xa3, $xt0,$xt1,$xt2,$xt3)=map("%xmm$_",(0..15));
my @xx=($xa0,$xa1,$xa2,$xa3, $xb0,$xb1,$xb2,$xb3,
$xt0,$xt1,$xt2,$xt3, $xd0,$xd1,$xd2,$xd3);

sub XOP_lane_ROUND {
my ($a0,$b0,$c0,$d0)=@_;
my ($a1,$b1,$c1,$d1)=map(($_&~3)+(($_+1)&3),($a0,$b0,$c0,$d0));
my ($a2,$b2,$c2,$d2)=map(($_&~3)+(($_+1)&3),($a1,$b1,$c1,$d1));
my ($a3,$b3,$c3,$d3)=map(($_&~3)+(($_+1)&3),($a2,$b2,$c2,$d2));
my @x=map("\"$_\"",@xx);

(
"&vpaddd (@x[$a0],@x[$a0],@x[$b0])", # Q1
"&vpaddd (@x[$a1],@x[$a1],@x[$b1])", # Q2
"&vpaddd (@x[$a2],@x[$a2],@x[$b2])", # Q3
"&vpaddd (@x[$a3],@x[$a3],@x[$b3])", # Q4
"&vpxor (@x[$d0],@x[$a0],@x[$d0])",
"&vpxor (@x[$d1],@x[$a1],@x[$d1])",
"&vpxor (@x[$d2],@x[$a2],@x[$d2])",
"&vpxor (@x[$d3],@x[$a3],@x[$d3])",
"&vprotd (@x[$d0],@x[$d0],16)",
"&vprotd (@x[$d1],@x[$d1],16)",
"&vprotd (@x[$d2],@x[$d2],16)",
"&vprotd (@x[$d3],@x[$d3],16)",

"&vpaddd (@x[$c0],@x[$c0],@x[$d0])",
"&vpaddd (@x[$c1],@x[$c1],@x[$d1])",
"&vpaddd (@x[$c2],@x[$c2],@x[$d2])",
"&vpaddd (@x[$c3],@x[$c3],@x[$d3])",
"&vpxor (@x[$b0],@x[$c0],@x[$b0])",
"&vpxor (@x[$b1],@x[$c1],@x[$b1])",
"&vpxor (@x[$b2],@x[$b2],@x[$c2])", # flip
"&vpxor (@x[$b3],@x[$b3],@x[$c3])", # flip
"&vprotd (@x[$b0],@x[$b0],12)",
"&vprotd (@x[$b1],@x[$b1],12)",
"&vprotd (@x[$b2],@x[$b2],12)",
"&vprotd (@x[$b3],@x[$b3],12)",

"&vpaddd (@x[$a0],@x[$b0],@x[$a0])", # flip
"&vpaddd (@x[$a1],@x[$b1],@x[$a1])", # flip
"&vpaddd (@x[$a2],@x[$a2],@x[$b2])",
"&vpaddd (@x[$a3],@x[$a3],@x[$b3])",
"&vpxor (@x[$d0],@x[$a0],@x[$d0])",
"&vpxor (@x[$d1],@x[$a1],@x[$d1])",
"&vpxor (@x[$d2],@x[$a2],@x[$d2])",
"&vpxor (@x[$d3],@x[$a3],@x[$d3])",
"&vprotd (@x[$d0],@x[$d0],8)",
"&vprotd (@x[$d1],@x[$d1],8)",
"&vprotd (@x[$d2],@x[$d2],8)",
"&vprotd (@x[$d3],@x[$d3],8)",

"&vpaddd (@x[$c0],@x[$c0],@x[$d0])",
"&vpaddd (@x[$c1],@x[$c1],@x[$d1])",
"&vpaddd (@x[$c2],@x[$c2],@x[$d2])",
"&vpaddd (@x[$c3],@x[$c3],@x[$d3])",
"&vpxor (@x[$b0],@x[$c0],@x[$b0])",
"&vpxor (@x[$b1],@x[$c1],@x[$b1])",
"&vpxor (@x[$b2],@x[$b2],@x[$c2])", # flip
"&vpxor (@x[$b3],@x[$b3],@x[$c3])", # flip
"&vprotd (@x[$b0],@x[$b0],7)",
"&vprotd (@x[$b1],@x[$b1],7)",
"&vprotd (@x[$b2],@x[$b2],7)",
"&vprotd (@x[$b3],@x[$b3],7)"
);
}

my $xframe = $win64 ? 0xa0 : 0;

$code.=<<___;
.type ChaCha20_4xop,\@function,5
.align 32
ChaCha20_4xop:
.LChaCha20_4xop:
lea -0x78(%rsp),%r11
sub \$0x148+$xframe,%rsp
___
################ stack layout
# +0x00 SIMD equivalent of @x[8-12]
# ...
# +0x40 constant copy of key[0-2] smashed by lanes
# ...
# +0x100 SIMD counters (with nonce smashed by lanes)
# ...
# +0x140
$code.=<<___ if ($win64);
movaps %xmm6,-0x30(%r11)
movaps %xmm7,-0x20(%r11)
movaps %xmm8,-0x10(%r11)
movaps %xmm9,0x00(%r11)
movaps %xmm10,0x10(%r11)
movaps %xmm11,0x20(%r11)
movaps %xmm12,0x30(%r11)
movaps %xmm13,0x40(%r11)
movaps %xmm14,0x50(%r11)
movaps %xmm15,0x60(%r11)
___
$code.=<<___;
vzeroupper

vmovdqa .Lsigma(%rip),$xa3 # key[0]
vmovdqu ($key),$xb3 # key[1]
vmovdqu 16($key),$xt3 # key[2]
vmovdqu ($counter),$xd3 # key[3]
lea 0x100(%rsp),%rcx # size optimization

vpshufd \$0x00,$xa3,$xa0 # smash key by lanes...
vpshufd \$0x55,$xa3,$xa1
vmovdqa $xa0,0x40(%rsp) # ... and offload
vpshufd \$0xaa,$xa3,$xa2
vmovdqa $xa1,0x50(%rsp)
vpshufd \$0xff,$xa3,$xa3
vmovdqa $xa2,0x60(%rsp)
vmovdqa $xa3,0x70(%rsp)

vpshufd \$0x00,$xb3,$xb0
vpshufd \$0x55,$xb3,$xb1
vmovdqa $xb0,0x80-0x100(%rcx)
vpshufd \$0xaa,$xb3,$xb2
vmovdqa $xb1,0x90-0x100(%rcx)
vpshufd \$0xff,$xb3,$xb3
vmovdqa $xb2,0xa0-0x100(%rcx)
vmovdqa $xb3,0xb0-0x100(%rcx)

vpshufd \$0x00,$xt3,$xt0 # "$xc0"
vpshufd \$0x55,$xt3,$xt1 # "$xc1"
vmovdqa $xt0,0xc0-0x100(%rcx)
vpshufd \$0xaa,$xt3,$xt2 # "$xc2"
vmovdqa $xt1,0xd0-0x100(%rcx)
vpshufd \$0xff,$xt3,$xt3 # "$xc3"
vmovdqa $xt2,0xe0-0x100(%rcx)
vmovdqa $xt3,0xf0-0x100(%rcx)

vpshufd \$0x00,$xd3,$xd0
vpshufd \$0x55,$xd3,$xd1
vpaddd .Linc(%rip),$xd0,$xd0 # don't save counters yet
vpshufd \$0xaa,$xd3,$xd2
vmovdqa $xd1,0x110-0x100(%rcx)
vpshufd \$0xff,$xd3,$xd3
vmovdqa $xd2,0x120-0x100(%rcx)
vmovdqa $xd3,0x130-0x100(%rcx)

jmp .Loop_enter4xop

.align 32
.Loop_outer4xop:
vmovdqa 0x40(%rsp),$xa0 # re-load smashed key
vmovdqa 0x50(%rsp),$xa1
vmovdqa 0x60(%rsp),$xa2
vmovdqa 0x70(%rsp),$xa3
vmovdqa 0x80-0x100(%rcx),$xb0
vmovdqa 0x90-0x100(%rcx),$xb1
vmovdqa 0xa0-0x100(%rcx),$xb2
vmovdqa 0xb0-0x100(%rcx),$xb3
vmovdqa 0xc0-0x100(%rcx),$xt0 # "$xc0"
vmovdqa 0xd0-0x100(%rcx),$xt1 # "$xc1"
vmovdqa 0xe0-0x100(%rcx),$xt2 # "$xc2"
vmovdqa 0xf0-0x100(%rcx),$xt3 # "$xc3"
vmovdqa 0x100-0x100(%rcx),$xd0
vmovdqa 0x110-0x100(%rcx),$xd1
vmovdqa 0x120-0x100(%rcx),$xd2
vmovdqa 0x130-0x100(%rcx),$xd3
vpaddd .Lfour(%rip),$xd0,$xd0 # next SIMD counters

.Loop_enter4xop:
mov \$10,%eax
vmovdqa $xd0,0x100-0x100(%rcx) # save SIMD counters
jmp .Loop4xop

.align 32
.Loop4xop:
___
foreach (&XOP_lane_ROUND(0, 4, 8,12)) { eval; }
foreach (&XOP_lane_ROUND(0, 5,10,15)) { eval; }
$code.=<<___;
dec %eax
jnz .Loop4xop

vpaddd 0x40(%rsp),$xa0,$xa0 # accumulate key material
vpaddd 0x50(%rsp),$xa1,$xa1
vpaddd 0x60(%rsp),$xa2,$xa2
vpaddd 0x70(%rsp),$xa3,$xa3

vmovdqa $xt2,0x20(%rsp) # offload $xc2,3
vmovdqa $xt3,0x30(%rsp)

vpunpckldq $xa1,$xa0,$xt2 # "de-interlace" data
vpunpckldq $xa3,$xa2,$xt3
vpunpckhdq $xa1,$xa0,$xa0
vpunpckhdq $xa3,$xa2,$xa2
vpunpcklqdq $xt3,$xt2,$xa1 # "a0"
vpunpckhqdq $xt3,$xt2,$xt2 # "a1"
vpunpcklqdq $xa2,$xa0,$xa3 # "a2"
vpunpckhqdq $xa2,$xa0,$xa0 # "a3"
___
($xa0,$xa1,$xa2,$xa3,$xt2)=($xa1,$xt2,$xa3,$xa0,$xa2);
$code.=<<___;
vpaddd 0x80-0x100(%rcx),$xb0,$xb0
vpaddd 0x90-0x100(%rcx),$xb1,$xb1
vpaddd 0xa0-0x100(%rcx),$xb2,$xb2
vpaddd 0xb0-0x100(%rcx),$xb3,$xb3

vmovdqa $xa0,0x00(%rsp) # offload $xa0,1
vmovdqa $xa1,0x10(%rsp)
vmovdqa 0x20(%rsp),$xa0 # "xc2"
vmovdqa 0x30(%rsp),$xa1 # "xc3"

vpunpckldq $xb1,$xb0,$xt2
vpunpckldq $xb3,$xb2,$xt3
vpunpckhdq $xb1,$xb0,$xb0
vpunpckhdq $xb3,$xb2,$xb2
vpunpcklqdq $xt3,$xt2,$xb1 # "b0"
vpunpckhqdq $xt3,$xt2,$xt2 # "b1"
vpunpcklqdq $xb2,$xb0,$xb3 # "b2"
vpunpckhqdq $xb2,$xb0,$xb0 # "b3"
___
($xb0,$xb1,$xb2,$xb3,$xt2)=($xb1,$xt2,$xb3,$xb0,$xb2);
my ($xc0,$xc1,$xc2,$xc3)=($xt0,$xt1,$xa0,$xa1);
$code.=<<___;
vpaddd 0xc0-0x100(%rcx),$xc0,$xc0
vpaddd 0xd0-0x100(%rcx),$xc1,$xc1
vpaddd 0xe0-0x100(%rcx),$xc2,$xc2
vpaddd 0xf0-0x100(%rcx),$xc3,$xc3

vpunpckldq $xc1,$xc0,$xt2
vpunpckldq $xc3,$xc2,$xt3
vpunpckhdq $xc1,$xc0,$xc0
vpunpckhdq $xc3,$xc2,$xc2
vpunpcklqdq $xt3,$xt2,$xc1 # "c0"
vpunpckhqdq $xt3,$xt2,$xt2 # "c1"
vpunpcklqdq $xc2,$xc0,$xc3 # "c2"
vpunpckhqdq $xc2,$xc0,$xc0 # "c3"
___
($xc0,$xc1,$xc2,$xc3,$xt2)=($xc1,$xt2,$xc3,$xc0,$xc2);
$code.=<<___;
vpaddd 0x100-0x100(%rcx),$xd0,$xd0
vpaddd 0x110-0x100(%rcx),$xd1,$xd1
vpaddd 0x120-0x100(%rcx),$xd2,$xd2
vpaddd 0x130-0x100(%rcx),$xd3,$xd3

vpunpckldq $xd1,$xd0,$xt2
vpunpckldq $xd3,$xd2,$xt3
vpunpckhdq $xd1,$xd0,$xd0
vpunpckhdq $xd3,$xd2,$xd2
vpunpcklqdq $xt3,$xt2,$xd1 # "d0"
vpunpckhqdq $xt3,$xt2,$xt2 # "d1"
vpunpcklqdq $xd2,$xd0,$xd3 # "d2"
vpunpckhqdq $xd2,$xd0,$xd0 # "d3"
___
($xd0,$xd1,$xd2,$xd3,$xt2)=($xd1,$xt2,$xd3,$xd0,$xd2);
($xa0,$xa1)=($xt2,$xt3);
$code.=<<___;
vmovdqa 0x00(%rsp),$xa0 # restore $xa0,1
vmovdqa 0x10(%rsp),$xa1

cmp \$64*4,$len
jb .Ltail4xop

vpxor 0x00($inp),$xa0,$xa0 # xor with input
vpxor 0x10($inp),$xb0,$xb0
vpxor 0x20($inp),$xc0,$xc0
vpxor 0x30($inp),$xd0,$xd0
vpxor 0x40($inp),$xa1,$xa1
vpxor 0x50($inp),$xb1,$xb1
vpxor 0x60($inp),$xc1,$xc1
vpxor 0x70($inp),$xd1,$xd1
lea 0x80($inp),$inp # size optimization
vpxor 0x00($inp),$xa2,$xa2
vpxor 0x10($inp),$xb2,$xb2
vpxor 0x20($inp),$xc2,$xc2
vpxor 0x30($inp),$xd2,$xd2
vpxor 0x40($inp),$xa3,$xa3
vpxor 0x50($inp),$xb3,$xb3
vpxor 0x60($inp),$xc3,$xc3
vpxor 0x70($inp),$xd3,$xd3
lea 0x80($inp),$inp # inp+=64*4

vmovdqu $xa0,0x00($out)
vmovdqu $xb0,0x10($out)
vmovdqu $xc0,0x20($out)
vmovdqu $xd0,0x30($out)
vmovdqu $xa1,0x40($out)
vmovdqu $xb1,0x50($out)
vmovdqu $xc1,0x60($out)
vmovdqu $xd1,0x70($out)
lea 0x80($out),$out # size optimization
vmovdqu $xa2,0x00($out)
vmovdqu $xb2,0x10($out)
vmovdqu $xc2,0x20($out)
vmovdqu $xd2,0x30($out)
vmovdqu $xa3,0x40($out)
vmovdqu $xb3,0x50($out)
vmovdqu $xc3,0x60($out)
vmovdqu $xd3,0x70($out)
lea 0x80($out),$out # out+=64*4

sub \$64*4,$len
jnz .Loop_outer4xop

jmp .Ldone4xop

.align 32
.Ltail4xop:
cmp \$192,$len
jae .L192_or_more4xop
cmp \$128,$len
jae .L128_or_more4xop
cmp \$64,$len
jae .L64_or_more4xop

xor %r10,%r10
vmovdqa $xa0,0x00(%rsp)
vmovdqa $xb0,0x10(%rsp)
vmovdqa $xc0,0x20(%rsp)
vmovdqa $xd0,0x30(%rsp)
jmp .Loop_tail4xop

.align 32
.L64_or_more4xop:
vpxor 0x00($inp),$xa0,$xa0 # xor with input
vpxor 0x10($inp),$xb0,$xb0
vpxor 0x20($inp),$xc0,$xc0
vpxor 0x30($inp),$xd0,$xd0
vmovdqu $xa0,0x00($out)
vmovdqu $xb0,0x10($out)
vmovdqu $xc0,0x20($out)
vmovdqu $xd0,0x30($out)
je .Ldone4xop

lea 0x40($inp),$inp # inp+=64*1
vmovdqa $xa1,0x00(%rsp)
xor %r10,%r10
vmovdqa $xb1,0x10(%rsp)
lea 0x40($out),$out # out+=64*1
vmovdqa $xc1,0x20(%rsp)
sub \$64,$len # len-=64*1
vmovdqa $xd1,0x30(%rsp)
jmp .Loop_tail4xop

.align 32
.L128_or_more4xop:
vpxor 0x00($inp),$xa0,$xa0 # xor with input
vpxor 0x10($inp),$xb0,$xb0
vpxor 0x20($inp),$xc0,$xc0
vpxor 0x30($inp),$xd0,$xd0
vpxor 0x40($inp),$xa1,$xa1
vpxor 0x50($inp),$xb1,$xb1
vpxor 0x60($inp),$xc1,$xc1
vpxor 0x70($inp),$xd1,$xd1

vmovdqu $xa0,0x00($out)
vmovdqu $xb0,0x10($out)
vmovdqu $xc0,0x20($out)
vmovdqu $xd0,0x30($out)
vmovdqu $xa1,0x40($out)
vmovdqu $xb1,0x50($out)
vmovdqu $xc1,0x60($out)
vmovdqu $xd1,0x70($out)
je .Ldone4xop

lea 0x80($inp),$inp # inp+=64*2
vmovdqa $xa2,0x00(%rsp)
xor %r10,%r10
vmovdqa $xb2,0x10(%rsp)
lea 0x80($out),$out # out+=64*2
vmovdqa $xc2,0x20(%rsp)
sub \$128,$len # len-=64*2
vmovdqa $xd2,0x30(%rsp)
jmp .Loop_tail4xop

.align 32
.L192_or_more4xop:
vpxor 0x00($inp),$xa0,$xa0 # xor with input
vpxor 0x10($inp),$xb0,$xb0
vpxor 0x20($inp),$xc0,$xc0
vpxor 0x30($inp),$xd0,$xd0
vpxor 0x40($inp),$xa1,$xa1
vpxor 0x50($inp),$xb1,$xb1
vpxor 0x60($inp),$xc1,$xc1
vpxor 0x70($inp),$xd1,$xd1
lea 0x80($inp),$inp # size optimization
vpxor 0x00($inp),$xa2,$xa2
vpxor 0x10($inp),$xb2,$xb2
vpxor 0x20($inp),$xc2,$xc2
vpxor 0x30($inp),$xd2,$xd2

vmovdqu $xa0,0x00($out)
vmovdqu $xb0,0x10($out)
vmovdqu $xc0,0x20($out)
vmovdqu $xd0,0x30($out)
vmovdqu $xa1,0x40($out)
vmovdqu $xb1,0x50($out)
vmovdqu $xc1,0x60($out)
vmovdqu $xd1,0x70($out)
lea 0x80($out),$out # size optimization
vmovdqu $xa2,0x00($out)
vmovdqu $xb2,0x10($out)
vmovdqu $xc2,0x20($out)
vmovdqu $xd2,0x30($out)
je .Ldone4xop

lea 0x40($inp),$inp # inp+=64*3
vmovdqa $xa2,0x00(%rsp)
xor %r10,%r10
vmovdqa $xb2,0x10(%rsp)
lea 0x40($out),$out # out+=64*3
vmovdqa $xc2,0x20(%rsp)
sub \$192,$len # len-=64*3
vmovdqa $xd2,0x30(%rsp)

.Loop_tail4xop:
movzb ($inp,%r10),%eax
movzb (%rsp,%r10),%ecx
lea 1(%r10),%r10
xor %ecx,%eax
mov %al,-1($out,%r10)
dec $len
jnz .Loop_tail4xop

.Ldone4xop:
vzeroupper
___
$code.=<<___ if ($win64);
lea 0x140+0x30(%rsp),%r11
movaps -0x30(%r11),%xmm6
movaps -0x20(%r11),%xmm7
movaps -0x10(%r11),%xmm8
movaps 0x00(%r11),%xmm9
movaps 0x10(%r11),%xmm10
movaps 0x20(%r11),%xmm11
movaps 0x30(%r11),%xmm12
movaps 0x40(%r11),%xmm13
movaps 0x50(%r11),%xmm14
movaps 0x60(%r11),%xmm15
___
$code.=<<___;
add \$0x148+$xframe,%rsp
ret
.size ChaCha20_4xop,.-ChaCha20_4xop
___
}

########################################################################
# AVX2 code path
if ($avx>1) {


+44 -20   crypto/chacha/chacha_generic.c → crypto/chacha/chacha.c

@@ -21,7 +21,49 @@
#include <openssl/cpu.h>


#if defined(OPENSSL_WINDOWS) || (!defined(OPENSSL_X86_64) && !defined(OPENSSL_X86)) || !defined(__SSE2__)
#define U8TO32_LITTLE(p) \
(((uint32_t)((p)[0])) | ((uint32_t)((p)[1]) << 8) | \
((uint32_t)((p)[2]) << 16) | ((uint32_t)((p)[3]) << 24))

#if !defined(OPENSSL_NO_ASM) && \
(defined(OPENSSL_X86) || defined(OPENSSL_X86_64) || \
defined(OPENSSL_ARM) || defined(OPENSSL_AARCH64))

/* ChaCha20_ctr32 is defined in asm/chacha-*.pl. */
void ChaCha20_ctr32(uint8_t *out, const uint8_t *in, size_t in_len,
const uint32_t key[8], const uint32_t counter[4]);

void CRYPTO_chacha_20(uint8_t *out, const uint8_t *in, size_t in_len,
const uint8_t key[32], const uint8_t nonce[12],
uint32_t counter) {
uint32_t counter_nonce[4];
counter_nonce[0] = counter;
counter_nonce[1] = U8TO32_LITTLE(nonce + 0);
counter_nonce[2] = U8TO32_LITTLE(nonce + 4);
counter_nonce[3] = U8TO32_LITTLE(nonce + 8);

const uint32_t *key_ptr = (const uint32_t *)key;
#if !defined(OPENSSL_X86) && !defined(OPENSSL_X86_64)
/* The assembly expects the key to be four-byte aligned. */
uint32_t key_u32[8];
if ((((uintptr_t)key) & 3) != 0) {
key_u32[0] = U8TO32_LITTLE(key + 0);
key_u32[1] = U8TO32_LITTLE(key + 4);
key_u32[2] = U8TO32_LITTLE(key + 8);
key_u32[3] = U8TO32_LITTLE(key + 12);
key_u32[4] = U8TO32_LITTLE(key + 16);
key_u32[5] = U8TO32_LITTLE(key + 20);
key_u32[6] = U8TO32_LITTLE(key + 24);
key_u32[7] = U8TO32_LITTLE(key + 28);

key_ptr = key_u32;
}
#endif

ChaCha20_ctr32(out, in, in_len, key_ptr, counter_nonce);
}

#else

/* sigma contains the ChaCha constants, which happen to be an ASCII string. */
static const uint8_t sigma[16] = { 'e', 'x', 'p', 'a', 'n', 'd', ' ', '3',
@@ -40,10 +82,6 @@ static const uint8_t sigma[16] = { 'e', 'x', 'p', 'a', 'n', 'd', ' ', '3',
(p)[3] = (v >> 24) & 0xff; \
}

#define U8TO32_LITTLE(p) \
(((uint32_t)((p)[0])) | ((uint32_t)((p)[1]) << 8) | \
((uint32_t)((p)[2]) << 16) | ((uint32_t)((p)[3]) << 24))

/* QUARTERROUND updates a, b, c, d with a ChaCha "quarter" round. */
#define QUARTERROUND(a,b,c,d) \
x[a] = PLUS(x[a],x[b]); x[d] = ROTATE(XOR(x[d],x[a]),16); \
@@ -51,13 +89,6 @@ static const uint8_t sigma[16] = { 'e', 'x', 'p', 'a', 'n', 'd', ' ', '3',
x[a] = PLUS(x[a],x[b]); x[d] = ROTATE(XOR(x[d],x[a]), 8); \
x[c] = PLUS(x[c],x[d]); x[b] = ROTATE(XOR(x[b],x[c]), 7);

#if defined(OPENSSL_ARM) && !defined(OPENSSL_NO_ASM)
/* Defined in chacha_vec.c */
void CRYPTO_chacha_20_neon(uint8_t *out, const uint8_t *in, size_t in_len,
const uint8_t key[32], const uint8_t nonce[12],
uint32_t counter);
#endif

/* chacha_core performs 20 rounds of ChaCha on the input words in
* |input| and writes the 64 output bytes to |output|. */
static void chacha_core(uint8_t output[64], const uint32_t input[16]) {
@@ -91,13 +122,6 @@ void CRYPTO_chacha_20(uint8_t *out, const uint8_t *in, size_t in_len,
uint8_t buf[64];
size_t todo, i;

#if defined(OPENSSL_ARM) && !defined(OPENSSL_NO_ASM)
if (CRYPTO_is_NEON_capable()) {
CRYPTO_chacha_20_neon(out, in, in_len, key, nonce, counter);
return;
}
#endif

input[0] = U8TO32_LITTLE(sigma + 0);
input[1] = U8TO32_LITTLE(sigma + 4);
input[2] = U8TO32_LITTLE(sigma + 8);
@@ -137,4 +161,4 @@ void CRYPTO_chacha_20(uint8_t *out, const uint8_t *in, size_t in_len,
}
}

#endif /* OPENSSL_WINDOWS || !OPENSSL_X86_64 && !OPENSSL_X86 || !__SSE2__ */
#endif
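
Whichever backend the build selects above, callers see the same one-shot
interface from <openssl/chacha.h>. A minimal usage sketch (the all-zero key and
nonce are placeholders for illustration only):

```
#include <stdint.h>
#include <string.h>

#include <openssl/chacha.h>

int main(void) {
  static const uint8_t key[32] = {0};   /* placeholder key, not for real use */
  static const uint8_t nonce[12] = {0}; /* placeholder 96-bit nonce */
  uint8_t plaintext[64] = "example plaintext";
  uint8_t ciphertext[64], recovered[64];

  /* Encrypt: XOR the input with the keystream, starting at block counter 0. */
  CRYPTO_chacha_20(ciphertext, plaintext, sizeof(plaintext), key, nonce, 0);

  /* The same call decrypts, since ChaCha20 is an XOR stream cipher. */
  CRYPTO_chacha_20(recovered, ciphertext, sizeof(ciphertext), key, nonce, 0);

  return memcmp(recovered, plaintext, sizeof(plaintext)) != 0;
}
```

The key-copy in the non-x86 path above exists because the assembly expects a
four-byte-aligned key, so an unaligned key is first copied into aligned words.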

+0 -328   crypto/chacha/chacha_vec.c

@@ -1,328 +0,0 @@
/* Copyright (c) 2014, Google Inc.
*
* Permission to use, copy, modify, and/or distribute this software for any
* purpose with or without fee is hereby granted, provided that the above
* copyright notice and this permission notice appear in all copies.
*
* THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
* WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
* MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY
* SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
* WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION
* OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN
* CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. */

/* ====================================================================
*
* When updating this file, also update chacha_vec_arm.S
*
* ==================================================================== */


/* This implementation is by Ted Krovetz and was submitted to SUPERCOP and
* marked as public domain. It was been altered to allow for non-aligned inputs
* and to allow the block counter to be passed in specifically. */

#include <openssl/chacha.h>

#include "../internal.h"


#if defined(ASM_GEN) || \
!defined(OPENSSL_WINDOWS) && \
(defined(OPENSSL_X86_64) || defined(OPENSSL_X86)) && defined(__SSE2__)

#define CHACHA_RNDS 20 /* 8 (high speed), 20 (conservative), 12 (middle) */

/* Architecture-neutral way to specify 16-byte vector of ints */
typedef unsigned vec __attribute__((vector_size(16)));

/* This implementation is designed for Neon, SSE and AltiVec machines. The
* following specify how to do certain vector operations efficiently on
* each architecture, using intrinsics.
* This implementation supports parallel processing of multiple blocks,
* including potentially using general-purpose registers. */
#if __ARM_NEON__
#include <string.h>
#include <arm_neon.h>
#define GPR_TOO 1
#define VBPI 2
#define ONE (vec) vsetq_lane_u32(1, vdupq_n_u32(0), 0)
#define LOAD_ALIGNED(m) (vec)(*((vec *)(m)))
#define LOAD(m) ({ \
memcpy(alignment_buffer, m, 16); \
LOAD_ALIGNED(alignment_buffer); \
})
#define STORE(m, r) ({ \
(*((vec *)(alignment_buffer))) = (r); \
memcpy(m, alignment_buffer, 16); \
})
#define ROTV1(x) (vec) vextq_u32((uint32x4_t)x, (uint32x4_t)x, 1)
#define ROTV2(x) (vec) vextq_u32((uint32x4_t)x, (uint32x4_t)x, 2)
#define ROTV3(x) (vec) vextq_u32((uint32x4_t)x, (uint32x4_t)x, 3)
#define ROTW16(x) (vec) vrev32q_u16((uint16x8_t)x)
#if __clang__
#define ROTW7(x) (x << ((vec) {7, 7, 7, 7})) ^ (x >> ((vec) {25, 25, 25, 25}))
#define ROTW8(x) (x << ((vec) {8, 8, 8, 8})) ^ (x >> ((vec) {24, 24, 24, 24}))
#define ROTW12(x) \
(x << ((vec) {12, 12, 12, 12})) ^ (x >> ((vec) {20, 20, 20, 20}))
#else
#define ROTW7(x) \
(vec) vsriq_n_u32(vshlq_n_u32((uint32x4_t)x, 7), (uint32x4_t)x, 25)
#define ROTW8(x) \
(vec) vsriq_n_u32(vshlq_n_u32((uint32x4_t)x, 8), (uint32x4_t)x, 24)
#define ROTW12(x) \
(vec) vsriq_n_u32(vshlq_n_u32((uint32x4_t)x, 12), (uint32x4_t)x, 20)
#endif
#elif __SSE2__
#include <emmintrin.h>
#define GPR_TOO 0
#if __clang__
#define VBPI 4
#else
#define VBPI 3
#endif
#define ONE (vec) _mm_set_epi32(0, 0, 0, 1)
#define LOAD(m) (vec) _mm_loadu_si128((const __m128i *)(m))
#define LOAD_ALIGNED(m) (vec) _mm_load_si128((const __m128i *)(m))
#define STORE(m, r) _mm_storeu_si128((__m128i *)(m), (__m128i)(r))
#define ROTV1(x) (vec) _mm_shuffle_epi32((__m128i)x, _MM_SHUFFLE(0, 3, 2, 1))
#define ROTV2(x) (vec) _mm_shuffle_epi32((__m128i)x, _MM_SHUFFLE(1, 0, 3, 2))
#define ROTV3(x) (vec) _mm_shuffle_epi32((__m128i)x, _MM_SHUFFLE(2, 1, 0, 3))
#define ROTW7(x) \
(vec)(_mm_slli_epi32((__m128i)x, 7) ^ _mm_srli_epi32((__m128i)x, 25))
#define ROTW12(x) \
(vec)(_mm_slli_epi32((__m128i)x, 12) ^ _mm_srli_epi32((__m128i)x, 20))
#if __SSSE3__
#include <tmmintrin.h>
#define ROTW8(x) \
(vec) _mm_shuffle_epi8((__m128i)x, _mm_set_epi8(14, 13, 12, 15, 10, 9, 8, \
11, 6, 5, 4, 7, 2, 1, 0, 3))
#define ROTW16(x) \
(vec) _mm_shuffle_epi8((__m128i)x, _mm_set_epi8(13, 12, 15, 14, 9, 8, 11, \
10, 5, 4, 7, 6, 1, 0, 3, 2))
#else
#define ROTW8(x) \
(vec)(_mm_slli_epi32((__m128i)x, 8) ^ _mm_srli_epi32((__m128i)x, 24))
#define ROTW16(x) \
(vec)(_mm_slli_epi32((__m128i)x, 16) ^ _mm_srli_epi32((__m128i)x, 16))
#endif
#else
#error-- Implementation supports only machines with neon or SSE2
#endif

#ifndef REVV_BE
#define REVV_BE(x) (x)
#endif

#ifndef REVW_BE
#define REVW_BE(x) (x)
#endif

#define BPI (VBPI + GPR_TOO) /* Blocks computed per loop iteration */

#define DQROUND_VECTORS(a,b,c,d) \
a += b; d ^= a; d = ROTW16(d); \
c += d; b ^= c; b = ROTW12(b); \
a += b; d ^= a; d = ROTW8(d); \
c += d; b ^= c; b = ROTW7(b); \
b = ROTV1(b); c = ROTV2(c); d = ROTV3(d); \
a += b; d ^= a; d = ROTW16(d); \
c += d; b ^= c; b = ROTW12(b); \
a += b; d ^= a; d = ROTW8(d); \
c += d; b ^= c; b = ROTW7(b); \
b = ROTV3(b); c = ROTV2(c); d = ROTV1(d);

#define QROUND_WORDS(a,b,c,d) \
a = a+b; d ^= a; d = d<<16 | d>>16; \
c = c+d; b ^= c; b = b<<12 | b>>20; \
a = a+b; d ^= a; d = d<< 8 | d>>24; \
c = c+d; b ^= c; b = b<< 7 | b>>25;

#define WRITE_XOR(in, op, d, v0, v1, v2, v3) \
STORE(op + d + 0, LOAD(in + d + 0) ^ REVV_BE(v0)); \
STORE(op + d + 4, LOAD(in + d + 4) ^ REVV_BE(v1)); \
STORE(op + d + 8, LOAD(in + d + 8) ^ REVV_BE(v2)); \
STORE(op + d +12, LOAD(in + d +12) ^ REVV_BE(v3));

#if __ARM_NEON__
/* For ARM, we can't depend on NEON support, so this function is compiled with
* a different name, along with the generic code, and can be enabled at
* run-time. */
void CRYPTO_chacha_20_neon(
#else
void CRYPTO_chacha_20(
#endif
uint8_t *out,
const uint8_t *in,
size_t inlen,
const uint8_t key[32],
const uint8_t nonce[12],
uint32_t counter)
{
unsigned iters, i;
unsigned *op = (unsigned *)out;
const unsigned *ip = (const unsigned *)in;
const unsigned *kp = (const unsigned *)key;
#if defined(__ARM_NEON__)
uint32_t np[3];
alignas(16) uint8_t alignment_buffer[16];
#endif
vec s0, s1, s2, s3;
alignas(16) unsigned chacha_const[] =
{0x61707865,0x3320646E,0x79622D32,0x6B206574};
#if defined(__ARM_NEON__)
memcpy(np, nonce, 12);
#endif
s0 = LOAD_ALIGNED(chacha_const);
s1 = LOAD(&((const vec*)kp)[0]);
s2 = LOAD(&((const vec*)kp)[1]);
s3 = (vec){
counter,
((const uint32_t*)nonce)[0],
((const uint32_t*)nonce)[1],
((const uint32_t*)nonce)[2]
};

for (iters = 0; iters < inlen/(BPI*64); iters++)
{
#if GPR_TOO
register unsigned x0, x1, x2, x3, x4, x5, x6, x7, x8,
x9, x10, x11, x12, x13, x14, x15;
#endif
#if VBPI > 2
vec v8,v9,v10,v11;
#endif
#if VBPI > 3
vec v12,v13,v14,v15;
#endif

vec v0,v1,v2,v3,v4,v5,v6,v7;
v4 = v0 = s0; v5 = v1 = s1; v6 = v2 = s2; v3 = s3;
v7 = v3 + ONE;
#if VBPI > 2
v8 = v4; v9 = v5; v10 = v6;
v11 = v7 + ONE;
#endif
#if VBPI > 3
v12 = v8; v13 = v9; v14 = v10;
v15 = v11 + ONE;
#endif
#if GPR_TOO
x0 = chacha_const[0]; x1 = chacha_const[1];
x2 = chacha_const[2]; x3 = chacha_const[3];
x4 = kp[0]; x5 = kp[1]; x6 = kp[2]; x7 = kp[3];
x8 = kp[4]; x9 = kp[5]; x10 = kp[6]; x11 = kp[7];
x12 = counter+BPI*iters+(BPI-1); x13 = np[0];
x14 = np[1]; x15 = np[2];
#endif
for (i = CHACHA_RNDS/2; i; i--)
{
DQROUND_VECTORS(v0,v1,v2,v3)
DQROUND_VECTORS(v4,v5,v6,v7)
#if VBPI > 2
DQROUND_VECTORS(v8,v9,v10,v11)
#endif
#if VBPI > 3
DQROUND_VECTORS(v12,v13,v14,v15)
#endif
#if GPR_TOO
QROUND_WORDS( x0, x4, x8,x12)
QROUND_WORDS( x1, x5, x9,x13)
QROUND_WORDS( x2, x6,x10,x14)
QROUND_WORDS( x3, x7,x11,x15)
QROUND_WORDS( x0, x5,x10,x15)
QROUND_WORDS( x1, x6,x11,x12)
QROUND_WORDS( x2, x7, x8,x13)
QROUND_WORDS( x3, x4, x9,x14)
#endif
}

WRITE_XOR(ip, op, 0, v0+s0, v1+s1, v2+s2, v3+s3)
s3 += ONE;
WRITE_XOR(ip, op, 16, v4+s0, v5+s1, v6+s2, v7+s3)
s3 += ONE;
#if VBPI > 2
WRITE_XOR(ip, op, 32, v8+s0, v9+s1, v10+s2, v11+s3)
s3 += ONE;
#endif
#if VBPI > 3
WRITE_XOR(ip, op, 48, v12+s0, v13+s1, v14+s2, v15+s3)
s3 += ONE;
#endif
ip += VBPI*16;
op += VBPI*16;
#if GPR_TOO
op[0] = REVW_BE(REVW_BE(ip[0]) ^ (x0 + chacha_const[0]));
op[1] = REVW_BE(REVW_BE(ip[1]) ^ (x1 + chacha_const[1]));
op[2] = REVW_BE(REVW_BE(ip[2]) ^ (x2 + chacha_const[2]));
op[3] = REVW_BE(REVW_BE(ip[3]) ^ (x3 + chacha_const[3]));
op[4] = REVW_BE(REVW_BE(ip[4]) ^ (x4 + kp[0]));
op[5] = REVW_BE(REVW_BE(ip[5]) ^ (x5 + kp[1]));
op[6] = REVW_BE(REVW_BE(ip[6]) ^ (x6 + kp[2]));
op[7] = REVW_BE(REVW_BE(ip[7]) ^ (x7 + kp[3]));
op[8] = REVW_BE(REVW_BE(ip[8]) ^ (x8 + kp[4]));
op[9] = REVW_BE(REVW_BE(ip[9]) ^ (x9 + kp[5]));
op[10] = REVW_BE(REVW_BE(ip[10]) ^ (x10 + kp[6]));
op[11] = REVW_BE(REVW_BE(ip[11]) ^ (x11 + kp[7]));
op[12] = REVW_BE(REVW_BE(ip[12]) ^ (x12 + counter+BPI*iters+(BPI-1)));
op[13] = REVW_BE(REVW_BE(ip[13]) ^ (x13 + np[0]));
op[14] = REVW_BE(REVW_BE(ip[14]) ^ (x14 + np[1]));
op[15] = REVW_BE(REVW_BE(ip[15]) ^ (x15 + np[2]));
s3 += ONE;
ip += 16;
op += 16;
#endif
}

for (iters = inlen%(BPI*64)/64; iters != 0; iters--)
{
vec v0 = s0, v1 = s1, v2 = s2, v3 = s3;
for (i = CHACHA_RNDS/2; i; i--)
{
DQROUND_VECTORS(v0,v1,v2,v3);
}
WRITE_XOR(ip, op, 0, v0+s0, v1+s1, v2+s2, v3+s3)
s3 += ONE;
ip += 16;
op += 16;
}

inlen = inlen % 64;
if (inlen)
{
alignas(16) vec buf[4];
vec v0,v1,v2,v3;
v0 = s0; v1 = s1; v2 = s2; v3 = s3;
for (i = CHACHA_RNDS/2; i; i--)
{
DQROUND_VECTORS(v0,v1,v2,v3);
}

if (inlen >= 16)
{
STORE(op + 0, LOAD(ip + 0) ^ REVV_BE(v0 + s0));
if (inlen >= 32)
{
STORE(op + 4, LOAD(ip + 4) ^ REVV_BE(v1 + s1));
if (inlen >= 48)
{
STORE(op + 8, LOAD(ip + 8) ^
REVV_BE(v2 + s2));
buf[3] = REVV_BE(v3 + s3);
}
else
buf[2] = REVV_BE(v2 + s2);
}
else
buf[1] = REVV_BE(v1 + s1);
}
else
buf[0] = REVV_BE(v0 + s0);

for (i=inlen & ~15; i<inlen; i++)
((char *)op)[i] = ((const char *)ip)[i] ^ ((const char *)buf)[i];
}
}

#endif /* ASM_GEN || !OPENSSL_WINDOWS && (OPENSSL_X86_64 || OPENSSL_X86) && SSE2 */

+0 -1447   crypto/chacha/chacha_vec_arm.S (file diff suppressed because it is too large)


+0 -153   crypto/chacha/chacha_vec_arm_generate.go

@@ -1,153 +0,0 @@
// Copyright (c) 2014, Google Inc.
//
// Permission to use, copy, modify, and/or distribute this software for any
// purpose with or without fee is hereby granted, provided that the above
// copyright notice and this permission notice appear in all copies.
//
// THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
// WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
// MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY
// SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
// WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION
// OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN
// CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

// This package generates chacha_vec_arm.S from chacha_vec.c. Install the
// arm-linux-gnueabihf-gcc compiler as described in BUILDING.md. Then:
// `(cd crypto/chacha && go run chacha_vec_arm_generate.go)`.

package main

import (
"bufio"
"bytes"
"os"
"os/exec"
"strings"
)

const defaultCompiler = "/opt/gcc-linaro-4.9-2014.11-x86_64_arm-linux-gnueabihf/bin/arm-linux-gnueabihf-gcc"

func main() {
compiler := defaultCompiler
if len(os.Args) > 1 {
compiler = os.Args[1]
}

args := []string{
"-O3",
"-mcpu=cortex-a8",
"-mfpu=neon",
"-fpic",
"-DASM_GEN",
"-I", "../../include",
"-S", "chacha_vec.c",
"-o", "-",
}

output, err := os.OpenFile("chacha_vec_arm.S", os.O_CREATE|os.O_TRUNC|os.O_WRONLY, 0644)
if err != nil {
panic(err)
}
defer output.Close()

output.WriteString(preamble)
output.WriteString(compiler)
output.WriteString(" ")
output.WriteString(strings.Join(args, " "))
output.WriteString("\n\n#if !defined(OPENSSL_NO_ASM)\n")
output.WriteString("#if defined(__arm__)\n\n")

cmd := exec.Command(compiler, args...)
cmd.Stderr = os.Stderr
asm, err := cmd.StdoutPipe()
if err != nil {
panic(err)
}
if err := cmd.Start(); err != nil {
panic(err)
}

attr28 := []byte(".eabi_attribute 28,")
globalDirective := []byte(".global\t")
newLine := []byte("\n")
attr28Handled := false

scanner := bufio.NewScanner(asm)
for scanner.Scan() {
line := scanner.Bytes()

if bytes.Contains(line, attr28) {
output.WriteString(attr28Block)
attr28Handled = true
continue
}

output.Write(line)
output.Write(newLine)

if i := bytes.Index(line, globalDirective); i >= 0 {
output.Write(line[:i])
output.WriteString(".hidden\t")
output.Write(line[i+len(globalDirective):])
output.Write(newLine)
}
}

if err := scanner.Err(); err != nil {
panic(err)
}

if !attr28Handled {
panic("EABI attribute 28 not seen in processing")
}

if err := cmd.Wait(); err != nil {
panic(err)
}

output.WriteString(trailer)
}

const preamble = `# Copyright (c) 2014, Google Inc.
#
# Permission to use, copy, modify, and/or distribute this software for any
# purpose with or without fee is hereby granted, provided that the above
# copyright notice and this permission notice appear in all copies.
#
# THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
# WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
# MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY
# SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
# WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION
# OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN
# CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

# This file contains a pre-compiled version of chacha_vec.c for ARM. This is
# needed to support switching on NEON code at runtime. If the whole of OpenSSL
# were to be compiled with the needed flags to build chacha_vec.c, then it
# wouldn't be possible to run on non-NEON systems.
#
# This file was generated by chacha_vec_arm_generate.go using the following
# compiler command:
#
# `

const attr28Block = `
# EABI attribute 28 sets whether VFP register arguments were used to build this
# file. If object files are inconsistent on this point, the linker will refuse
# to link them. Thus we report whatever the compiler expects since we don't use
# VFP arguments.

#if defined(__ARM_PCS_VFP)
.eabi_attribute 28, 1
#else
.eabi_attribute 28, 0
#endif

`

const trailer = `
#endif /* __arm__ */
#endif /* !OPENSSL_NO_ASM */
`

+0 -1   util/generate_build_files.py

@@ -39,7 +39,6 @@ OS_ARCH_COMBOS = [
# perlasm system.
NON_PERL_FILES = {
('linux', 'arm'): [
'src/crypto/chacha/chacha_vec_arm.S',
'src/crypto/cpu-arm-asm.S',
'src/crypto/curve25519/asm/x25519-asm-arm.S',
'src/crypto/poly1305/poly1305_arm_asm.S',

