Add an optimized x86_64 vpaes ctr128_f and remove bsaes.

Brian Smith suggested applying vpaes-armv8's "2x" optimization to
vpaes-x86_64. The registers are a little tight (aarch64 has a whole 32
SIMD registers, while x86_64 only has 16), but it's doable with some
spills and makes vpaes much more competitive with bsaes.

At small- and medium-sized inputs, vpaes now matches bsaes. At large
inputs, it's a ~10% perf hit. bsaes is thus pulling much less weight.
Losing an entire AES implementation and having constant-time AES for
SSSE3 is attractive.

Some notes:

- The fact that these are older CPUs tempers the perf hit, but CPUs
  without AES-NI are still common enough to matter.

- This CL does regress CBC decrypt performance nontrivially (see
  below). If this matters, we can double-up CBC decryption too. CBC in
  TLS is legacy and already pays a costly Lucky13 mitigation.

- The difference between 1350 and 8192 bytes is likely bsaes AES-GCM
  paying for two slow (and variable-time!) aes_nohw_encrypt calls for
  EK0 and the trailing partial block. At larger inputs, those two calls
  are more amortized.

- To that end, bsaes would likely be much faster on AES-GCM with
  smarter use of bsaes. (Fold one-off calls above into bulk data.)
  Implementing this is a bit of a nuisance though, especially
  considering we don't wish to regress hwaes.

- I'd discarded the key conversion idea, but I think I did it wrong.
  Benchmarks from
  https://boringssl-review.googlesource.com/c/boringssl/+/33589 suggest
  converting to bsaes format on-demand for large ctr32 inputs should
  give the best of both worlds, but at the cost of an entire AES
  implementation relative to this CL.

- ARMv7 still depends on bsaes and has no vpaes. It also has 16 SIMD
  registers, so my plan is to translate it, with the same 2x
  optimization, and see how it compares. Hopefully that, or some
  combination of the above, will work for ARMv7.
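
The register pressure behind the "2x" approach can be made concrete with a
little arithmetic. The following is a minimal sketch in C, not code from this
CL: the counts are taken from the comments on _vpaes_encrypt_core_2x in the
patch, while the identifiers (kXmmTotal, second_instance_reg, spare_regs) are
illustrative names invented here.

```c
/* Register budget for stitching two _vpaes_encrypt_core instances on x86_64.
 * The counts follow the comments in the patch; the names are hypothetical. */
enum {
  kXmmTotal = 16,          /* XMM registers available on x86_64 */
  kWorkingPerInstance = 6, /* %xmm0-%xmm5 in the 1x core */
  kConstantsKept = 2       /* %xmm9 and %xmm10 stay in registers */
};

/* Maps a register number used by the first instance (0..5) to the register
 * the second instance uses: add 6 for %xmm2 and below, add 8 for %xmm3 and
 * up, which skips over %xmm9 and %xmm10. */
static int second_instance_reg(int first) {
  return first <= 2 ? first + 6 : first + 8;
}

/* Registers left for the caller once both instances and the two resident
 * constants are accounted for. */
static int spare_regs(void) {
  return kXmmTotal - 2 * kWorkingPerInstance - kConstantsKept;
}
```

Under these assumptions two registers remain, which is consistent with the
patch leaving %xmm14 and %xmm15 to the caller for the two counters.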

Sandy Bridge
bsaes (before):
Did 3144750 AES-128-GCM (16 bytes) seal operations in 5016000us (626943.8 ops/sec): 10.0 MB/s
Did 2053750 AES-128-GCM (256 bytes) seal operations in 5016000us (409439.8 ops/sec): 104.8 MB/s
Did 469000 AES-128-GCM (1350 bytes) seal operations in 5015000us (93519.4 ops/sec): 126.3 MB/s
Did 92500 AES-128-GCM (8192 bytes) seal operations in 5016000us (18441.0 ops/sec): 151.1 MB/s
Did 46750 AES-128-GCM (16384 bytes) seal operations in 5032000us (9290.5 ops/sec): 152.2 MB/s
vpaes-1x (for reference, not this CL):
Did 8684750 AES-128-GCM (16 bytes) seal operations in 5015000us (1731754.7 ops/sec): 27.7 MB/s [+177%]
Did 1731500 AES-128-GCM (256 bytes) seal operations in 5016000us (345195.4 ops/sec): 88.4 MB/s [-15.6%]
Did 346500 AES-128-GCM (1350 bytes) seal operations in 5016000us (69078.9 ops/sec): 93.3 MB/s [-26.1%]
Did 61250 AES-128-GCM (8192 bytes) seal operations in 5015000us (12213.4 ops/sec): 100.1 MB/s [-33.8%]
Did 32500 AES-128-GCM (16384 bytes) seal operations in 5031000us (6459.9 ops/sec): 105.8 MB/s [-30.5%]
vpaes-2x (this CL):
Did 8840000 AES-128-GCM (16 bytes) seal operations in 5015000us (1762711.9 ops/sec): 28.2 MB/s [+182%]
Did 2167750 AES-128-GCM (256 bytes) seal operations in 5016000us (432167.1 ops/sec): 110.6 MB/s [+5.5%]
Did 474000 AES-128-GCM (1350 bytes) seal operations in 5016000us (94497.6 ops/sec): 127.6 MB/s [+1.0%]
Did 81750 AES-128-GCM (8192 bytes) seal operations in 5015000us (16301.1 ops/sec): 133.5 MB/s [-11.6%]
Did 41750 AES-128-GCM (16384 bytes) seal operations in 5031000us (8298.5 ops/sec): 136.0 MB/s [-10.6%]

Penryn
bsaes (before):
Did 958000 AES-128-GCM (16 bytes) seal operations in 1000264us (957747.2 ops/sec): 15.3 MB/s
Did 420000 AES-128-GCM (256 bytes) seal operations in 1000480us (419798.5 ops/sec): 107.5 MB/s
Did 96000 AES-128-GCM (1350 bytes) seal operations in 1001083us (95896.1 ops/sec): 129.5 MB/s
Did 18000 AES-128-GCM (8192 bytes) seal operations in 1042491us (17266.3 ops/sec): 141.4 MB/s
Did 9482 AES-128-GCM (16384 bytes) seal operations in 1095703us (8653.8 ops/sec): 141.8 MB/s
Did 758000 AES-256-GCM (16 bytes) seal operations in 1000769us (757417.5 ops/sec): 12.1 MB/s
Did 359000 AES-256-GCM (256 bytes) seal operations in 1001993us (358285.9 ops/sec): 91.7 MB/s
Did 82000 AES-256-GCM (1350 bytes) seal operations in 1009583us (81221.7 ops/sec): 109.6 MB/s
Did 15000 AES-256-GCM (8192 bytes) seal operations in 1022294us (14672.9 ops/sec): 120.2 MB/s
Did 7884 AES-256-GCM (16384 bytes) seal operations in 1070934us (7361.8 ops/sec): 120.6 MB/s
vpaes-1x (for reference, not this CL):
Did 2030000 AES-128-GCM (16 bytes) seal operations in 1000227us (2029539.3 ops/sec): 32.5 MB/s [+112%]
Did 382000 AES-128-GCM (256 bytes) seal operations in 1001949us (381256.9 ops/sec): 97.6 MB/s [-9.2%]
Did 81000 AES-128-GCM (1350 bytes) seal operations in 1007297us (80413.2 ops/sec): 108.6 MB/s [-16.1%]
Did 14000 AES-128-GCM (8192 bytes) seal operations in 1031499us (13572.5 ops/sec): 111.2 MB/s [-21.4%]
Did 7008 AES-128-GCM (16384 bytes) seal operations in 1030706us (6799.2 ops/sec): 111.4 MB/s [-21.4%]
Did 1838000 AES-256-GCM (16 bytes) seal operations in 1000238us (1837562.7 ops/sec): 29.4 MB/s [+143%]
Did 321000 AES-256-GCM (256 bytes) seal operations in 1001666us (320466.1 ops/sec): 82.0 MB/s [-10.6%]
Did 67000 AES-256-GCM (1350 bytes) seal operations in 1010359us (66313.1 ops/sec): 89.5 MB/s [-18.3%]
Did 12000 AES-256-GCM (8192 bytes) seal operations in 1072706us (11186.7 ops/sec): 91.6 MB/s [-23.8%]
Did 5680 AES-256-GCM (16384 bytes) seal operations in 1009214us (5628.1 ops/sec): 92.2 MB/s [-23.5%]
vpaes-2x (this CL):
Did 2072000 AES-128-GCM (16 bytes) seal operations in 1000066us (2071863.3 ops/sec): 33.1 MB/s [+116%]
Did 432000 AES-128-GCM (256 bytes) seal operations in 1000732us (431684.0 ops/sec): 110.5 MB/s [+2.8%]
Did 92000 AES-128-GCM (1350 bytes) seal operations in 1000580us (91946.7 ops/sec): 124.1 MB/s [-4.2%]
Did 16000 AES-128-GCM (8192 bytes) seal operations in 1016422us (15741.5 ops/sec): 129.0 MB/s [-8.8%]
Did 8448 AES-128-GCM (16384 bytes) seal operations in 1073962us (7866.2 ops/sec): 128.9 MB/s [-9.1%]
Did 1865000 AES-256-GCM (16 bytes) seal operations in 1000043us (1864919.8 ops/sec): 29.8 MB/s [+146%]
Did 364000 AES-256-GCM (256 bytes) seal operations in 1001561us (363432.7 ops/sec): 93.0 MB/s [+1.4%]
Did 77000 AES-256-GCM (1350 bytes) seal operations in 1004123us (76683.8 ops/sec): 103.5 MB/s [-5.6%]
Did 14000 AES-256-GCM (8192 bytes) seal operations in 1071179us (13069.7 ops/sec): 107.1 MB/s [-10.9%]
Did 7008 AES-256-GCM (16384 bytes) seal operations in 1074125us (6524.4 ops/sec): 106.9 MB/s [-11.4%]

Penryn, CBC mode decryption
bsaes (before):
Did 159000 AES-128-CBC-SHA1 (16 bytes) open operations in 1001019us (158838.1 ops/sec): 2.5 MB/s
Did 114000 AES-128-CBC-SHA1 (256 bytes) open operations in 1006485us (113265.5 ops/sec): 29.0 MB/s
Did 65000 AES-128-CBC-SHA1 (1350 bytes) open operations in 1008441us (64455.9 ops/sec): 87.0 MB/s
Did 17000 AES-128-CBC-SHA1 (8192 bytes) open operations in 1005440us (16908.0 ops/sec): 138.5 MB/s
vpaes (after):
Did 167000 AES-128-CBC-SHA1 (16 bytes) open operations in 1003556us (166408.3 ops/sec): 2.7 MB/s [+8%]
Did 112000 AES-128-CBC-SHA1 (256 bytes) open operations in 1005673us (111368.2 ops/sec): 28.5 MB/s [-1.7%]
Did 56000 AES-128-CBC-SHA1 (1350 bytes) open operations in 1005647us (55685.5 ops/sec): 75.2 MB/s [-13.6%]
Did 13635 AES-128-CBC-SHA1 (8192 bytes) open operations in 1020486us (13361.3 ops/sec): 109.5 MB/s [-20.9%]

Bug: 256
Change-Id: I11ed773323ec7a5ee61080c9ed9ed4761849828a
Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/35364
Commit-Queue: David Benjamin <davidben@google.com>
Reviewed-by: Adam Langley <agl@google.com>
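
The bracketed deltas in the benchmark output above appear to be the relative
change in MB/s against the bsaes row for the same CPU, key size, and input
size. A minimal sketch of that arithmetic in C, assuming that reading
(percent_change is an illustrative name, not something defined by this CL):

```c
/* Relative change of `after` versus `before`, in percent. Both arguments are
 * throughputs in MB/s taken from the benchmark output. */
static double percent_change(double before, double after) {
  return (after - before) / before * 100.0;
}
```

For the Sandy Bridge 16-byte rows, percent_change(10.0, 28.2) gives about 182,
in line with the [+182%] annotation; for the 8192-byte rows,
percent_change(151.1, 133.5) gives about -11.6.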
2019-03-19 21:59:49 -05:00
commit 32ce6032ff
parent 5501a26915
4 changed files with 302 additions and 3234 deletions
--- a/crypto/fipsmodule/CMakeLists.txt
+++ b/crypto/fipsmodule/CMakeLists.txt
@@ -7,7 +7,6 @@ if(${ARCH} STREQUAL "x86_64")
    aesni-gcm-x86_64.${ASM_EXT}
    aesni-x86_64.${ASM_EXT}
    aes-x86_64.${ASM_EXT}
-    bsaes-x86_64.${ASM_EXT}
    ghash-ssse3-x86_64.${ASM_EXT}
    ghash-x86_64.${ASM_EXT}
    md5-x86_64.${ASM_EXT}
@@ -95,7 +94,6 @@ perlasm(armv4-mont.${ASM_EXT} bn/asm/armv4-mont.pl)
 perlasm(armv8-mont.${ASM_EXT} bn/asm/armv8-mont.pl)
 perlasm(bn-586.${ASM_EXT} bn/asm/bn-586.pl)
 perlasm(bsaes-armv7.${ASM_EXT} aes/asm/bsaes-armv7.pl)
-perlasm(bsaes-x86_64.${ASM_EXT} aes/asm/bsaes-x86_64.pl)
 perlasm(co-586.${ASM_EXT} bn/asm/co-586.pl)
 perlasm(ghash-armv4.${ASM_EXT} modes/asm/ghash-armv4.pl)
 perlasm(ghashp8-ppc.${ASM_EXT} modes/asm/ghashp8-ppc.pl)
--- a/crypto/fipsmodule/aes/asm/bsaes-x86_64.pl
+++ b/crypto/fipsmodule/aes/asm/bsaes-x86_64.pl
--- a/crypto/fipsmodule/aes/asm/vpaes-x86_64.pl
+++ b/crypto/fipsmodule/aes/asm/vpaes-x86_64.pl
@@ -175,6 +175,181 @@ _vpaes_encrypt_core:
 .cfi_endproc
 .size	_vpaes_encrypt_core,.-_vpaes_encrypt_core

+##
+##  _aes_encrypt_core_2x
+##
+##  AES-encrypt %xmm0 and %xmm6 in parallel.
+##
+##  Inputs:
+##     %xmm0 and %xmm6 = input
+##     %xmm9 and %xmm10 as in _vpaes_preheat
+##    (%rdx) = scheduled keys
+##
+##  Output in %xmm0 and %xmm6
+##  Clobbers  %xmm1-%xmm5, %xmm7, %xmm8, %xmm11-%xmm13, %r9, %r10, %r11, %rax
+##  Preserves %xmm14 and %xmm15
+##
+##  This function stitches two parallel instances of _vpaes_encrypt_core. x86_64
+##  provides 16 XMM registers. _vpaes_encrypt_core computes over six registers
+##  (%xmm0-%xmm5) and additionally uses seven registers with preloaded constants
+##  from _vpaes_preheat (%xmm9-%xmm15). This does not quite fit two instances,
+##  so we spill some of %xmm9 through %xmm15 back to memory. We keep %xmm9 and
+##  %xmm10 in registers as these values are used several times in a row. The
+##  remainder are read once per round and are spilled to memory. This leaves two
+##  registers preserved for the caller.
+##
+##  Thus, of the two _vpaes_encrypt_core instances, the first uses (%xmm0-%xmm5)
+##  as before. The second uses %xmm6-%xmm8,%xmm11-%xmm13. (Add 6 to %xmm2 and
+##  below. Add 8 to %xmm3 and up.) Instructions in the second instance are
+##  indented by one space.
+##
+##
+.type	_vpaes_encrypt_core_2x,\@abi-omnipotent
+.align 16
+_vpaes_encrypt_core_2x:
+.cfi_startproc
+	mov	%rdx,	%r9
+	mov	\$16,	%r11
+	mov	240(%rdx),%eax
+	movdqa	%xmm9,	%xmm1
+	 movdqa	%xmm9,	%xmm7
+	movdqa	.Lk_ipt(%rip), %xmm2	# iptlo
+	 movdqa	%xmm2,	%xmm8
+	pandn	%xmm0,	%xmm1
+	 pandn	%xmm6,	%xmm7
+	movdqu	(%r9),	%xmm5		# round0 key
+	 # Also use %xmm5 in the second instance.
+	psrld	\$4,	%xmm1
+	 psrld	\$4,	%xmm7
+	pand	%xmm9,	%xmm0
+	 pand	%xmm9,	%xmm6
+	pshufb	%xmm0,	%xmm2
+	 pshufb	%xmm6,	%xmm8
+	movdqa	.Lk_ipt+16(%rip), %xmm0	# ipthi
+	 movdqa	%xmm0,	%xmm6
+	pshufb	%xmm1,	%xmm0
+	 pshufb	%xmm7,	%xmm6
+	pxor	%xmm5,	%xmm2
+	 pxor	%xmm5,	%xmm8
+	add	\$16,	%r9
+	pxor	%xmm2,	%xmm0
+	 pxor	%xmm8,	%xmm6
+	lea	.Lk_mc_backward(%rip),%r10
+	jmp	.Lenc2x_entry
+
+.align 16
+.Lenc2x_loop:
+	# middle of middle round
+	movdqa  .Lk_sb1(%rip),	%xmm4		# 4 : sb1u
+	movdqa  .Lk_sb1+16(%rip),%xmm0		# 0 : sb1t
+	 movdqa	%xmm4,	%xmm12
+	 movdqa	%xmm0,	%xmm6
+	pshufb  %xmm2,	%xmm4			# 4 = sb1u
+	 pshufb	%xmm8,	%xmm12
+	pshufb  %xmm3,	%xmm0			# 0 = sb1t
+	 pshufb	%xmm11,	%xmm6
+	pxor	%xmm5,	%xmm4			# 4 = sb1u + k
+	 pxor	%xmm5,	%xmm12
+	movdqa  .Lk_sb2(%rip),	%xmm5		# 4 : sb2u
+	 movdqa	%xmm5,	%xmm13
+	pxor	%xmm4,	%xmm0			# 0 = A
+	 pxor	%xmm12,	%xmm6
+	movdqa	-0x40(%r11,%r10), %xmm1		# .Lk_mc_forward[]
+	 # Also use %xmm1 in the second instance.
+	pshufb	%xmm2,	%xmm5			# 4 = sb2u
+	 pshufb	%xmm8,	%xmm13
+	movdqa	(%r11,%r10), %xmm4		# .Lk_mc_backward[]
+	 # Also use %xmm4 in the second instance.
+	movdqa	.Lk_sb2+16(%rip), %xmm2		# 2 : sb2t
+	 movdqa	%xmm2,	%xmm8
+	pshufb	%xmm3,  %xmm2			# 2 = sb2t
+	 pshufb	%xmm11,	%xmm8
+	movdqa	%xmm0,  %xmm3			# 3 = A
+	 movdqa	%xmm6,	%xmm11
+	pxor	%xmm5,	%xmm2			# 2 = 2A
+	 pxor	%xmm13,	%xmm8
+	pshufb  %xmm1,  %xmm0			# 0 = B
+	 pshufb	%xmm1,	%xmm6
+	add	\$16,	%r9			# next key
+	pxor	%xmm2,  %xmm0			# 0 = 2A+B
+	 pxor	%xmm8,	%xmm6
+	pshufb	%xmm4,	%xmm3			# 3 = D
+	 pshufb	%xmm4,	%xmm11
+	add	\$16,	%r11			# next mc
+	pxor	%xmm0,	%xmm3			# 3 = 2A+B+D
+	 pxor	%xmm6,	%xmm11
+	pshufb  %xmm1,	%xmm0			# 0 = 2B+C
+	 pshufb	%xmm1,	%xmm6
+	and	\$0x30,	%r11			# ... mod 4
+	sub	\$1,%rax			# nr--
+	pxor	%xmm3,	%xmm0			# 0 = 2A+3B+C+D
+	 pxor	%xmm11,	%xmm6
+
+.Lenc2x_entry:
+	# top of round
+	movdqa  %xmm9, 	%xmm1	# 1 : i
+	 movdqa	%xmm9,	%xmm7
+	movdqa	.Lk_inv+16(%rip), %xmm5	# 2 : a/k
+	 movdqa	%xmm5,	%xmm13
+	pandn	%xmm0, 	%xmm1	# 1 = i<<4
+	 pandn	%xmm6,	%xmm7
+	psrld	\$4,   	%xmm1   # 1 = i
+	 psrld	\$4,	%xmm7
+	pand	%xmm9, 	%xmm0   # 0 = k
+	 pand	%xmm9,	%xmm6
+	pshufb  %xmm0,  %xmm5	# 2 = a/k
+	 pshufb	%xmm6,	%xmm13
+	movdqa	%xmm10,	%xmm3  	# 3 : 1/i
+	 movdqa	%xmm10,	%xmm11
+	pxor	%xmm1,	%xmm0	# 0 = j
+	 pxor	%xmm7,	%xmm6
+	pshufb  %xmm1, 	%xmm3  	# 3 = 1/i
+	 pshufb	%xmm7,	%xmm11
+	movdqa	%xmm10,	%xmm4  	# 4 : 1/j
+	 movdqa	%xmm10,	%xmm12
+	pxor	%xmm5, 	%xmm3  	# 3 = iak = 1/i + a/k
+	 pxor	%xmm13,	%xmm11
+	pshufb	%xmm0, 	%xmm4  	# 4 = 1/j
+	 pshufb	%xmm6,	%xmm12
+	movdqa	%xmm10,	%xmm2  	# 2 : 1/iak
+	 movdqa	%xmm10,	%xmm8
+	pxor	%xmm5, 	%xmm4  	# 4 = jak = 1/j + a/k
+	 pxor	%xmm13,	%xmm12
+	pshufb  %xmm3,	%xmm2  	# 2 = 1/iak
+	 pshufb	%xmm11,	%xmm8
+	movdqa	%xmm10, %xmm3   # 3 : 1/jak
+	 movdqa	%xmm10,	%xmm11
+	pxor	%xmm0, 	%xmm2  	# 2 = io
+	 pxor	%xmm6,	%xmm8
+	pshufb  %xmm4,  %xmm3   # 3 = 1/jak
+	 pshufb	%xmm12,	%xmm11
+	movdqu	(%r9),	%xmm5
+	 # Also use %xmm5 in the second instance.
+	pxor	%xmm1,  %xmm3   # 3 = jo
+	 pxor	%xmm7,	%xmm11
+	jnz	.Lenc2x_loop
+
+	# middle of last round
+	movdqa	-0x60(%r10), %xmm4	# 3 : sbou	.Lk_sbo
+	movdqa	-0x50(%r10), %xmm0	# 0 : sbot	.Lk_sbo+16
+	 movdqa	%xmm4,	%xmm12
+	 movdqa	%xmm0,	%xmm6
+	pshufb  %xmm2,  %xmm4	# 4 = sbou
+	 pshufb	%xmm8,	%xmm12
+	pxor	%xmm5,  %xmm4	# 4 = sb1u + k
+	 pxor	%xmm5,	%xmm12
+	pshufb  %xmm3,	%xmm0	# 0 = sb1t
+	 pshufb	%xmm11,	%xmm6
+	movdqa	0x40(%r11,%r10), %xmm1		# .Lk_sr[]
+	 # Also use %xmm1 in the second instance.
+	pxor	%xmm4,	%xmm0	# 0 = A
+	 pxor	%xmm12,	%xmm6
+	pshufb	%xmm1,	%xmm0
+	 pshufb	%xmm1,	%xmm6
+	ret
+.cfi_endproc
+.size	_vpaes_encrypt_core_2x,.-_vpaes_encrypt_core_2x
+
 ##
 ##  Decryption core
 ##
@@ -984,6 +1159,111 @@ $code.=<<___;
 .size	${PREFIX}_cbc_encrypt,.-${PREFIX}_cbc_encrypt
 ___
 }
+{
+my ($inp,$out,$blocks,$key,$ivp)=("%rdi","%rsi","%rdx","%rcx","%r8");
+# void vpaes_ctr32_encrypt_blocks(const uint8_t *inp, uint8_t *out,
+#                                 size_t blocks, const AES_KEY *key,
+#                                 const uint8_t ivp[16]);
+$code.=<<___;
+.globl	${PREFIX}_ctr32_encrypt_blocks
+.type	${PREFIX}_ctr32_encrypt_blocks,\@function,5
+.align	16
+${PREFIX}_ctr32_encrypt_blocks:
+.cfi_startproc
+	# _vpaes_encrypt_core and _vpaes_encrypt_core_2x expect the key in %rdx.
+	xchg	$key, $blocks
+___
+($blocks,$key)=($key,$blocks);
+$code.=<<___;
+	test	$blocks, $blocks
+	jz	.Lctr32_abort
+___
+$code.=<<___ if ($win64);
+	lea	-0xb8(%rsp),%rsp
+	movaps	%xmm6,0x10(%rsp)
+	movaps	%xmm7,0x20(%rsp)
+	movaps	%xmm8,0x30(%rsp)
+	movaps	%xmm9,0x40(%rsp)
+	movaps	%xmm10,0x50(%rsp)
+	movaps	%xmm11,0x60(%rsp)
+	movaps	%xmm12,0x70(%rsp)
+	movaps	%xmm13,0x80(%rsp)
+	movaps	%xmm14,0x90(%rsp)
+	movaps	%xmm15,0xa0(%rsp)
+.Lctr32_body:
+___
+$code.=<<___;
+	movdqu	($ivp), %xmm0		# Load IV.
+	movdqa	.Lctr_add_one(%rip), %xmm8
+	sub	$inp, $out		# This allows only incrementing $inp.
+	call	_vpaes_preheat
+	movdqa	%xmm0, %xmm6
+	pshufb	.Lrev_ctr(%rip), %xmm6
+
+	test	\$1, $blocks
+	jz	.Lctr32_prep_loop
+
+	# Handle one block so the remaining block count is even for
+	# _vpaes_encrypt_core_2x.
+	movdqu	($inp), %xmm7		# Load input.
+	call	_vpaes_encrypt_core
+	pxor	%xmm7, %xmm0
+	paddd	%xmm8, %xmm6
+	movdqu	%xmm0, ($out,$inp)
+	sub	\$1, $blocks
+	lea	16($inp), $inp
+	jz	.Lctr32_done
+
+.Lctr32_prep_loop:
+	# _vpaes_encrypt_core_2x leaves only %xmm14 and %xmm15 as spare
+	# registers. We maintain two byte-swapped counters in them.
+	movdqa	%xmm6, %xmm14
+	movdqa	%xmm6, %xmm15
+	paddd	%xmm8, %xmm15
+
+.Lctr32_loop:
+	movdqa	.Lrev_ctr(%rip), %xmm1	# Set up counters.
+	movdqa	%xmm14, %xmm0
+	movdqa	%xmm15, %xmm6
+	pshufb	%xmm1, %xmm0
+	pshufb	%xmm1, %xmm6
+	call	_vpaes_encrypt_core_2x
+	movdqu	($inp), %xmm1		# Load input.
+	movdqu	16($inp), %xmm2
+	movdqa	.Lctr_add_two(%rip), %xmm3
+	pxor	%xmm1, %xmm0		# XOR input.
+	pxor	%xmm2, %xmm6
+	paddd	%xmm3, %xmm14		# Increment counters.
+	paddd	%xmm3, %xmm15
+	movdqu	%xmm0, ($out,$inp)	# Write output.
+	movdqu	%xmm6, 16($out,$inp)
+	sub	\$2, $blocks		# Advance loop.
+	lea	32($inp), $inp
+	jnz	.Lctr32_loop
+
+.Lctr32_done:
+___
+$code.=<<___ if ($win64);
+	movaps	0x10(%rsp),%xmm6
+	movaps	0x20(%rsp),%xmm7
+	movaps	0x30(%rsp),%xmm8
+	movaps	0x40(%rsp),%xmm9
+	movaps	0x50(%rsp),%xmm10
+	movaps	0x60(%rsp),%xmm11
+	movaps	0x70(%rsp),%xmm12
+	movaps	0x80(%rsp),%xmm13
+	movaps	0x90(%rsp),%xmm14
+	movaps	0xa0(%rsp),%xmm15
+	lea	0xb8(%rsp),%rsp
+.Lctr32_epilogue:
+___
+$code.=<<___;
+.Lctr32_abort:
+	ret
+.cfi_endproc
+.size	${PREFIX}_ctr32_encrypt_blocks,.-${PREFIX}_ctr32_encrypt_blocks
+___
+}
 $code.=<<___;
 ##
 ##  _aes_preheat
@@ -1107,6 +1387,17 @@ _vpaes_consts:
 .Lk_dsbo:	# decryption sbox final output
 	.quad	0x1387EA537EF94000, 0xC7AA6DB9D4943E2D
 	.quad	0x12D7560F93441D00, 0xCA4B8159D8C58E9C
+
+# .Lrev_ctr is a permutation which byte-swaps the counter portion of the IV.
+.Lrev_ctr:
+	.quad	0x0706050403020100, 0x0c0d0e0f0b0a0908
+# .Lctr_add_* may be added to a byte-swapped xmm register to increment the
+# counter. The register must be byte-swapped again to form the actual input.
+.Lctr_add_one:
+	.quad	0x0000000000000000, 0x0000000100000000
+.Lctr_add_two:
+	.quad	0x0000000000000000, 0x0000000200000000
+
 .asciz	"Vector Permutation AES for x86_64/SSSE3, Mike Hamburg (Stanford University)"
 .align	64
 .size	_vpaes_consts,.-_vpaes_consts
@@ -1222,6 +1513,10 @@ se_handler:
 	.rva	.LSEH_end_${PREFIX}_cbc_encrypt
 	.rva	.LSEH_info_${PREFIX}_cbc_encrypt

+	.rva	.LSEH_begin_${PREFIX}_ctr32_encrypt_blocks
+	.rva	.LSEH_end_${PREFIX}_ctr32_encrypt_blocks
+	.rva	.LSEH_info_${PREFIX}_ctr32_encrypt_blocks
+
 .section	.xdata
 .align	8
 .LSEH_info_${PREFIX}_set_encrypt_key:
@@ -1244,6 +1539,10 @@ se_handler:
 	.byte	9,0,0,0
 	.rva	se_handler
 	.rva	.Lcbc_body,.Lcbc_epilogue		# HandlerData[]
+.LSEH_info_${PREFIX}_ctr32_encrypt_blocks:
+	.byte	9,0,0,0
+	.rva	se_handler
+	.rva	.Lctr32_body,.Lctr32_epilogue		# HandlerData[]
 ___
 }
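
For orientation, the control flow of the new ${PREFIX}_ctr32_encrypt_blocks
routine in the diff above can be sketched in C. This is a minimal sketch under
stated assumptions, not the patch's implementation: toy_block_encrypt is a
hypothetical stand-in for the AES block function (it is not a cipher), all
helper names are invented here, and the sketch keeps the counter big-endian in
memory, whereas the assembly keeps byte-swapped counters in %xmm14 and %xmm15
so that paddd can increment them.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one AES block encryption. NOT a real cipher. */
static void toy_block_encrypt(const uint8_t in[16], uint8_t out[16]) {
  for (int i = 0; i < 16; i++) {
    out[i] = (uint8_t)(in[i] ^ 0xa5);
  }
}

/* Adds n to the big-endian 32-bit counter in bytes 12..15, wrapping mod 2^32
 * with no carry into byte 11. */
static void ctr32_add(uint8_t block[16], uint32_t n) {
  uint32_t ctr = ((uint32_t)block[12] << 24) | ((uint32_t)block[13] << 16) |
                 ((uint32_t)block[14] << 8) | (uint32_t)block[15];
  ctr += n;
  block[12] = (uint8_t)(ctr >> 24);
  block[13] = (uint8_t)(ctr >> 16);
  block[14] = (uint8_t)(ctr >> 8);
  block[15] = (uint8_t)ctr;
}

/* Encrypts one counter block and XORs the keystream into one data block. */
static void xor_keystream(const uint8_t ctr[16], const uint8_t *in,
                          uint8_t *out) {
  uint8_t ks[16];
  toy_block_encrypt(ctr, ks);
  for (int i = 0; i < 16; i++) {
    out[i] = (uint8_t)(in[i] ^ ks[i]);
  }
}

static void ctr32_encrypt_blocks_sketch(const uint8_t *in, uint8_t *out,
                                        size_t blocks, const uint8_t iv[16]) {
  uint8_t c0[16], c1[16];
  if (blocks == 0) {
    return;
  }
  memcpy(c0, iv, 16);
  if (blocks & 1) {
    /* Handle one block first so the remaining count is even. */
    xor_keystream(c0, in, out);
    ctr32_add(c0, 1);
    in += 16;
    out += 16;
    blocks--;
  }
  /* Maintain two counters, one block apart, and step both by two. */
  memcpy(c1, c0, 16);
  ctr32_add(c1, 1);
  while (blocks != 0) {
    xor_keystream(c0, in, out);           /* the 2x core does these */
    xor_keystream(c1, in + 16, out + 16); /* two blocks in parallel */
    ctr32_add(c0, 2);
    ctr32_add(c1, 2);
    in += 32;
    out += 32;
    blocks -= 2;
  }
}
```

The odd-block prologue is what lets the main loop always consume two blocks,
which is the shape the 2x core requires.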

--- a/crypto/fipsmodule/aes/internal.h
+++ b/crypto/fipsmodule/aes/internal.h
@@ -35,15 +35,13 @@ OPENSSL_INLINE int hwaes_capable(void) {
 }

 #define VPAES
+#if defined(OPENSSL_X86_64)
+#define VPAES_CTR32
+#endif
 OPENSSL_INLINE int vpaes_capable(void) {
  return (OPENSSL_ia32cap_get()[1] & (1 << (41 - 32))) != 0;
 }

-#if defined(OPENSSL_X86_64)
-#define BSAES
-OPENSSL_INLINE int bsaes_capable(void) { return vpaes_capable(); }
-#endif  // X86_64
-
 #elif defined(OPENSSL_ARM) || defined(OPENSSL_AARCH64)
 #define HWAES