https://eprint.iacr.org/2017/1015.pdf describes a technique for speeding
up Montgomery reduction for Montgomery-friendly moduli. This adds an
implementation of that technique using the mulx instruction from the BMI2
extension (available since Haswell) and the adcx and adox instructions
from the ADX extension (available since Broadwell).
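
For readers unfamiliar with the idea, here is a minimal portable sketch of
word-serial Montgomery reduction for a modulus p with p == -1 (mod 2^w). For
such a modulus the constant mu = -p^-1 mod 2^w is 1, so the per-word
multiplication by mu disappears. This is only an illustration and not the code
added here: it assumes 32-bit limbs, a hypothetical two-limb example modulus
p = 2^61 - 1, made-up names (redc_friendly, redc_selfcheck), and plain C rather
than mulx/adcx/adox with 64-bit limbs. It is also not constant-time.

```c
#include <stdint.h>

#define N 2  /* modulus size in 32-bit limbs (illustrative assumption) */

/* Hypothetical example modulus p = 2^61 - 1, little-endian limbs.
 * Its lowest limb is 0xFFFFFFFF, so p == -1 (mod 2^32) and
 * mu = -p^-1 mod 2^32 is 1. */
static const uint32_t P[N] = { 0xFFFFFFFFu, 0x1FFFFFFFu };

/* Writes t * 2^(-32*N) mod p to r. Requires t < p * 2^(32*N). */
static void redc_friendly(uint32_t r[N], const uint32_t t_in[2 * N])
{
    uint32_t t[2 * N + 1];
    for (int i = 0; i < 2 * N; i++) t[i] = t_in[i];
    t[2 * N] = 0;  /* extra limb catches the final carry */

    for (int i = 0; i < N; i++) {
        /* Generic REDC computes q = t[i] * mu mod 2^32; here mu == 1. */
        uint32_t q = t[i];
        uint64_t carry = 0;
        for (int j = 0; j < N; j++) {  /* t += q * p * 2^(32*i) */
            uint64_t acc = (uint64_t)q * P[j] + t[i + j] + carry;
            t[i + j] = (uint32_t)acc;  /* t[i] becomes 0 at j == 0 */
            carry = acc >> 32;
        }
        for (int k = i + N; k <= 2 * N; k++) {  /* propagate the carry */
            uint64_t acc = (uint64_t)t[k] + carry;
            t[k] = (uint32_t)acc;
            carry = acc >> 32;
        }
    }

    /* t[N..2N] now holds a value below 2p; subtract p once if needed. */
    uint32_t d[N];
    uint64_t borrow = 0;
    for (int j = 0; j < N; j++) {
        uint64_t s = (uint64_t)t[N + j] - P[j] - borrow;
        d[j] = (uint32_t)s;
        borrow = (s >> 63) & 1;  /* set when the subtraction wrapped */
    }
    int use_diff = (t[2 * N] != 0) || (borrow == 0);
    for (int j = 0; j < N; j++) r[j] = use_diff ? d[j] : t[N + j];
}

/* Checks REDC(a) * 2^64 == a (mod p). Since 2^61 == 1 (mod p),
 * 2^64 == 8 (mod p), which keeps the check within 64-bit arithmetic. */
static int redc_selfcheck(uint64_t a)
{
    const uint64_t p = ((uint64_t)P[1] << 32) | P[0];
    const uint32_t t[2 * N] = { (uint32_t)a, (uint32_t)(a >> 32), 0, 0 };
    uint32_t r[N];
    redc_friendly(r, t);
    uint64_t res = ((uint64_t)r[1] << 32) | r[0];
    return res < p && (res * 8) % p == a % p;
}
```

In an implementation along the lines described above, the inner
multiply-accumulate loop is where mulx would supply the 64x64 products and
adcx/adox would drive two independent carry chains; the sketch keeps a single
carry variable for clarity.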