Use dec/jnz instead of loop in bn_add_words and bn_sub_words.

Imported from upstream's a78324d95bd4568ce2c3b34bfa1d6f14cddf92ef. I
think the "regression" part of that change is some tweak to BN_usub and
I guess the bn_*_words was to compensate for it, but we may as well
import it. Apparently the loop instruction is terrible.

Before:
Did 39871000 bn_add_words operations in 1000002us (39870920.3 ops/sec)
Did 38621750 bn_sub_words operations in 1000001us (38621711.4 ops/sec)

After:
Did 64012000 bn_add_words operations in 1000007us (64011551.9 ops/sec)
Did 81792250 bn_sub_words operations in 1000002us (81792086.4 ops/sec)

loop sets no flags (even doing the comparison to zero without ZF) while
dec sets all flags but CF, so Andres and I are assuming that because
this prevents Intel from microcoding it to dec/jnz, they otherwise can't
be bothered to add more circuitry since every compiler has internalized
by now to never use loop.

Change-Id: I3927cd1c7b707841bbe9963e3d4afd7ba9bd9b36
Reviewed-on: https://boringssl-review.googlesource.com/23344
Reviewed-by: Adam Langley <agl@google.com>
This commit is contained in:
David Benjamin 2017-11-22 11:08:45 -05:00 committed by Adam Langley
parent 2056d7290a
commit 02514002fd

View File

@ -202,7 +202,8 @@ BN_ULONG bn_add_words(BN_ULONG *rp, const BN_ULONG *ap, const BN_ULONG *bp,
" adcq (%5,%2,8),%0 \n"
" movq %0,(%3,%2,8) \n"
" lea 1(%2),%2 \n"
" loop 1b \n"
" dec %1 \n"
" jnz 1b \n"
" sbbq %0,%0 \n"
: "=&r"(ret), "+c"(n), "+r"(i)
: "r"(rp), "r"(ap), "r"(bp)
@ -229,7 +230,8 @@ BN_ULONG bn_sub_words(BN_ULONG *rp, const BN_ULONG *ap, const BN_ULONG *bp,
" sbbq (%5,%2,8),%0 \n"
" movq %0,(%3,%2,8) \n"
" lea 1(%2),%2 \n"
" loop 1b \n"
" dec %1 \n"
" jnz 1b \n"
" sbbq %0,%0 \n"
: "=&r"(ret), "+c"(n), "+r"(i)
: "r"(rp), "r"(ap), "r"(bp)