1
0
mirror of https://github.com/henrydcase/nobs.git synced 2024-11-22 15:18:57 +00:00
Commit Graph

13 Commits

Author SHA1 Message Date
7efbbf4745 cSIDH-511: (#26)
Implementation of Commutative Supersingular Isogeny Diffie Hellman,
based on "A faster way to CSIDH" paper (2018/782).

* For fast isogeny calculation, implementation converts a curve from
  Montgomery to Edwards. All calculations are done on Edwards curve
  and then converted back to Montgomery.
* As multiplication in a field Fp511 is most expensive operation
  the implementation contains multiple multiplications. It has
  most performant, assembly implementation which uses BMI2 and
  ADOX/ADCX instructions for modern CPUs. It also contains
  slower implementation which will run on older CPUs

* Benchmarks (Intel SkyLake):

  BenchmarkGeneratePrivate   	    6459	    172213 ns/op	       0 B/op	       0 allocs/op
  BenchmarkGenerateKeyPair   	      25	  45800356 ns/op	       0 B/op	       0 allocs/op
  BenchmarkValidate          	     297	   3915983 ns/op	       0 B/op	       0 allocs/op
  BenchmarkValidateRandom    	  184683	      6231 ns/op	       0 B/op	       0 allocs/op
  BenchmarkValidateGenerated 	      25	  48481306 ns/op	       0 B/op	       0 allocs/op
  BenchmarkDerive            	      19	  60928763 ns/op	       0 B/op	       0 allocs/op
  BenchmarkDeriveGenerated   	       8	 137342421 ns/op	       0 B/op	       0 allocs/op
  BenchmarkXMul              	    2311	    494267 ns/op	       1 B/op	       0 allocs/op
  BenchmarkXAdd              	 2396754	       501 ns/op	       0 B/op	       0 allocs/op
  BenchmarkXDbl              	 2072690	       571 ns/op	       0 B/op	       0 allocs/op
  BenchmarkIsom              	   78004	     15171 ns/op	       0 B/op	       0 allocs/op
  BenchmarkFp512Sub          	224635152	         5.33 ns/op	       0 B/op	       0 allocs/op
  BenchmarkFp512Mul          	246633255	         4.90 ns/op	       0 B/op	       0 allocs/op
  BenchmarkCSwap             	233228547	         5.10 ns/op	       0 B/op	       0 allocs/op
  BenchmarkAddRdc            	87348240	        12.6 ns/op	       0 B/op	       0 allocs/op
  BenchmarkSubRdc            	95112787	        11.7 ns/op	       0 B/op	       0 allocs/op
  BenchmarkModExpRdc         	   25436	     46878 ns/op	       0 B/op	       0 allocs/op
  BenchmarkMulBmiAsm         	19527573	        60.1 ns/op	       0 B/op	       0 allocs/op
  BenchmarkMulGeneric        	 7117650	       164 ns/op	       0 B/op	       0 allocs/op

* Go code has very similar performance when compared to C
  implementation.
  Results from sidh_torturer (4e2996e12d68364761064341cbe1d1b47efafe23)
  github.com:henrydcase/sidh-torture/csidh

  | TestName         |Go        | C        |
  |------------------|----------|----------|
  |TestSharedSecret  | 57.95774 | 57.91092 |
  |TestKeyGeneration | 62.23614 | 58.12980 |
  |TestSharedSecret  | 55.28988 | 57.23132 |
  |TestKeyGeneration | 61.68745 | 58.66396 |
  |TestSharedSecret  | 63.19408 | 58.64774 |
  |TestKeyGeneration | 62.34022 | 61.62539 |
  |TestSharedSecret  | 62.85453 | 68.74503 |
  |TestKeyGeneration | 52.58518 | 58.40115 |
  |TestSharedSecret  | 50.77081 | 61.91699 |
  |TestKeyGeneration | 59.91843 | 61.09266 |
  |TestSharedSecret  | 59.97962 | 62.98151 |
  |TestKeyGeneration | 64.57525 | 56.22863 |
  |TestSharedSecret  | 56.40521 | 55.77447 |
  |TestKeyGeneration | 67.85850 | 58.52604 |
  |TestSharedSecret  | 60.54290 | 65.14052 |
  |TestKeyGeneration | 65.45766 | 58.42823 |

  On average Go implementation is 2% faster.
2019-11-25 15:03:29 +00:00
b184944242 Nits for SIDH 2019-04-09 17:09:34 +01:00
e66cc99401 Improves comment 2019-02-19 14:44:11 +00:00
90f8cba329
SIDH: Update (#9)
* Change license to BSD-3

* SIDH: Multiple developlemnts
2018-12-03 23:07:01 +00:00
ea2ffa2d61 PERF: sidh-p503: Split sub and add into 2 uops instead of 3 (#8)
The performance improvement comes from the fact that on Skylake
"add mem, reg" splits into 2 uops - one arithmetic uop and another one
for loading a value from mem.
However, changing operand order to "add reg, mem" splits into 3 uops:
one for arithmetic op, one for load and one additional one for storing
the result back.
Using separated instruction for loading/storing helps to parallelize
execution (load/store and arithmetic instruction is done in parallel
if possible)

For details, see: https://www.agner.org/optimize/instruction_tables.pdf

New: BenchmarkFp503StrongReduce-4    300000000            5.57 ns/op
Old: BenchmarkFp503StrongReduce-4    200000000            8.60 ns/op

This just improves one function, but more functions can be improved
2018-11-18 20:57:29 +00:00
e9ddb6fb45
sidh/csidh: use SEE for performing CSWAP (#6)
* Makefile

* makefile: tools for profiling

* sidh: use SIMD for performing CSWAP

Loads data into 128-bit XMM registers and performs conditional swap.
This is probably less useful for SIDH, but will be useful for cSIDH
2018-10-29 15:41:09 +00:00
1e34845d00 complate rewrite for SIDH and SIKE. adds p503 (#5) 2018-10-25 15:22:28 +01:00
d6fc82531f Doc 2018-10-25 15:22:28 +01:00
b769c88767 Improves some comments and hardcodes precomputed value (#4)
* Improves some comments and hardcodes precomputed value

* Tests curve coefficients recovery
2018-10-25 15:22:28 +01:00
ddbd866ee5 additional comments 2018-07-31 20:21:32 +01:00
73c9938c59 Use ADCB instead of SBBL in checkLessThanThree238 2018-07-31 17:10:03 +01:00
105532aa09 sidh: move p751 implementation to p751 folder 2018-07-27 00:09:34 +01:00
a4d12ceaae adds SIKE and SIDH 2018-07-23 23:18:38 +01:00