1
0
mirror of https://github.com/henrydcase/nobs.git synced 2024-11-22 15:18:57 +00:00
Commit Graph

73 Commits

Author SHA1 Message Date
7c32db8dd7 sm3: use less operations for ff1 and gg1 2021-03-08 23:58:08 +00:00
8474981cfc
SHA-3: speedups (#47)
* add function for one-off calculation

* sha3: simplifies Read function

* sha3: remove if from Read
2020-10-03 23:27:08 +01:00
45bc1a75f6
add function for one-off calculation (#45) 2020-10-03 15:12:26 +01:00
adfaf1e58c
fix: ebx -> ecx (#46) 2020-10-03 15:11:52 +01:00
24408329a5 Use bits.RotateLeft64 whenever possible 2020-09-28 21:03:08 +01:00
0174e314a1
Update README.md 2020-08-29 02:13:24 +01:00
820906b7c7
sha3: optimizations and cleanup (#41)
* complate reset of the SHA-3 code. Affects mostly the code in sha3.go
* fixes a bug  which causes SHAKE implementation to crash
* implementation of Read()/Write() avoid unnecessary buffering as much
  as possible
* NOTE: at some point I've done separated implementation for SumXXX,
  functions, but after optimizing implementation of Read/Write/Sum, the
  gain wasn't that big

Current speed on Initial speed on i7-8665U@1.90

BenchmarkPermutationFunction 	 			 1592787	       736 ns/op	 271.90 MB/s	       0 B/op	       0 allocs/op
BenchmarkSha3Chunk_x01/SHA-3/224         	   98752	     11630 ns/op	 176.02 MB/s	       0 B/op	       0 allocs/op
BenchmarkSha3Chunk_x01/SHA-3/256         	   92508	     12447 ns/op	 164.46 MB/s	       0 B/op	       0 allocs/op
BenchmarkSha3Chunk_x01/SHA-3/384         	   76765	     15206 ns/op	 134.62 MB/s	       0 B/op	       0 allocs/op
BenchmarkSha3Chunk_x01/SHA-3/512         	   54333	     21932 ns/op	  93.33 MB/s	       0 B/op	       0 allocs/op
BenchmarkSha3Chunk_x16/SHA-3/224         	   10000	    102161 ns/op	 160.37 MB/s	       0 B/op	       0 allocs/op
BenchmarkSha3Chunk_x16/SHA-3/256         	   10000	    106531 ns/op	 153.80 MB/s	       0 B/op	       0 allocs/op
BenchmarkSha3Chunk_x16/SHA-3/384         	    8641	    137272 ns/op	 119.35 MB/s	       0 B/op	       0 allocs/op
BenchmarkSha3Chunk_x16/SHA-3/512         	    6340	    189124 ns/op	  86.63 MB/s	       0 B/op	       0 allocs/op
BenchmarkShake_x01/SHAKE-128             	  167062	      7149 ns/op	 188.83 MB/s	       0 B/op	       0 allocs/op
BenchmarkShake_x01/SHAKE-256             	  151982	      7748 ns/op	 174.24 MB/s	       0 B/op	       0 allocs/op
BenchmarkShake_x16/SHAKE-128             	   12963	     87770 ns/op	 186.67 MB/s	       0 B/op	       0 allocs/op
BenchmarkShake_x16/SHAKE-256             	   10000	    105554 ns/op	 155.22 MB/s	       0 B/op	       0 allocs/op
BenchmarkCShake/cSHAKE-128               	  109148	     10940 ns/op	 187.11 MB/s	       0 B/op	       0 allocs/op
BenchmarkCShake/cSHAKE-256               	   90324	     13211 ns/op	 154.94 MB/s	       0 B/op	       0 allocs/op
PASS
2020-08-29 02:12:49 +01:00
3a320e1714
Create FUNDING.yml 2020-08-28 23:32:08 +01:00
7dcb72bf74 remove shake 2020-08-25 22:39:56 +01:00
ffd7590213 improve comment
Initial speed on i7-8665U

> go test -bench=. -test.cpu=1
goos: linux
goarch: amd64
pkg: github.com/henrydcase/nobs/hash/sha3
BenchmarkPermutationFunction 	 1634836	       732 ns/op	 273.18 MB/s
BenchmarkSha3_512_MTU        	   78438	     15340 ns/op	  88.00 MB/s
BenchmarkSha3_384_MTU        	  108807	     11025 ns/op	 122.45 MB/s
BenchmarkSha3_256_MTU        	  136902	      8767 ns/op	 153.98 MB/s
BenchmarkSha3_224_MTU        	  143377	      8355 ns/op	 161.57 MB/s
BenchmarkShake128_MTU        	  163569	      7108 ns/op	 189.94 MB/s
BenchmarkShake256_MTU        	  156534	      7643 ns/op	 176.64 MB/s
BenchmarkShake256_16x        	   10000	    112109 ns/op	 146.14 MB/s
BenchmarkShake256_1MiB       	     204	   5877014 ns/op	 178.42 MB/s
BenchmarkSha3_512_1MiB       	     100	  10967026 ns/op	  95.61 MB/s
PASS
ok  	github.com/henrydcase/nobs/hash/sha3	13.855s
2020-08-25 17:09:40 +01:00
516ea4f5e8 cleanup 2020-08-25 12:32:22 +01:00
68ba33e34f sha3: remove s390 2020-08-25 12:11:08 +01:00
b56c355c8d
adds cycle count. fixes csidh which provides 128 not 512 bits of security (#38) 2020-08-25 11:22:53 +01:00
c30f61923a adds cycle count. fixes csidh which provides 128 not 512 bits of security 2020-08-25 11:21:11 +01:00
a02b9a77a0 mkem: add csidh 2020-08-25 11:19:07 +01:00
2500d74484 export more symbols from common 2020-05-16 22:37:41 +00:00
a152c09fd5
sike: move common (#33)
* makes common reusable
* exports some more symbols from common
* remove kem for a moment
2020-05-16 20:14:48 +00:00
55957bbf5e
sike: move common (#32)
* makes common reusable
* exports some more symbols from common
2020-05-16 18:51:34 +00:00
ab962715d5 Fixes cSIDH key generation when run in the loop 2020-05-14 11:53:23 +00:00
bc32024729
sidh: updates (#31) 2020-05-14 08:51:20 +00:00
f5a7daf2bb sidh: update to p434 2020-05-14 00:02:32 +00:00
91945fde1f csidh: cosmettic updates 2020-05-13 23:48:43 +00:00
7d891c7eb8
support go 1.14 (#29)
NOBS now supports Go 1.14 for
* x86-64
* ARM
2020-03-05 11:19:51 +00:00
d0692c81f0 Remove BS from README.md 2020-02-13 10:27:42 +00:00
48ea6a583a Remove BS from README.md 2020-02-13 10:27:18 +00:00
c5bff4fa11 Remove BS from README.md 2020-02-13 10:25:47 +00:00
2a73461591 remove crapy x448 2020-02-13 10:17:54 +00:00
7efbbf4745 cSIDH-511: (#26)
Implementation of Commutative Supersingular Isogeny Diffie Hellman,
based on "A faster way to CSIDH" paper (2018/782).

* For fast isogeny calculation, implementation converts a curve from
  Montgomery to Edwards. All calculations are done on Edwards curve
  and then converted back to Montgomery.
* As multiplication in a field Fp511 is most expensive operation
  the implementation contains multiple multiplications. It has
  most performant, assembly implementation which uses BMI2 and
  ADOX/ADCX instructions for modern CPUs. It also contains
  slower implementation which will run on older CPUs

* Benchmarks (Intel SkyLake):

  BenchmarkGeneratePrivate   	    6459	    172213 ns/op	       0 B/op	       0 allocs/op
  BenchmarkGenerateKeyPair   	      25	  45800356 ns/op	       0 B/op	       0 allocs/op
  BenchmarkValidate          	     297	   3915983 ns/op	       0 B/op	       0 allocs/op
  BenchmarkValidateRandom    	  184683	      6231 ns/op	       0 B/op	       0 allocs/op
  BenchmarkValidateGenerated 	      25	  48481306 ns/op	       0 B/op	       0 allocs/op
  BenchmarkDerive            	      19	  60928763 ns/op	       0 B/op	       0 allocs/op
  BenchmarkDeriveGenerated   	       8	 137342421 ns/op	       0 B/op	       0 allocs/op
  BenchmarkXMul              	    2311	    494267 ns/op	       1 B/op	       0 allocs/op
  BenchmarkXAdd              	 2396754	       501 ns/op	       0 B/op	       0 allocs/op
  BenchmarkXDbl              	 2072690	       571 ns/op	       0 B/op	       0 allocs/op
  BenchmarkIsom              	   78004	     15171 ns/op	       0 B/op	       0 allocs/op
  BenchmarkFp512Sub          	224635152	         5.33 ns/op	       0 B/op	       0 allocs/op
  BenchmarkFp512Mul          	246633255	         4.90 ns/op	       0 B/op	       0 allocs/op
  BenchmarkCSwap             	233228547	         5.10 ns/op	       0 B/op	       0 allocs/op
  BenchmarkAddRdc            	87348240	        12.6 ns/op	       0 B/op	       0 allocs/op
  BenchmarkSubRdc            	95112787	        11.7 ns/op	       0 B/op	       0 allocs/op
  BenchmarkModExpRdc         	   25436	     46878 ns/op	       0 B/op	       0 allocs/op
  BenchmarkMulBmiAsm         	19527573	        60.1 ns/op	       0 B/op	       0 allocs/op
  BenchmarkMulGeneric        	 7117650	       164 ns/op	       0 B/op	       0 allocs/op

* Go code has very similar performance when compared to C
  implementation.
  Results from sidh_torturer (4e2996e12d68364761064341cbe1d1b47efafe23)
  github.com:henrydcase/sidh-torture/csidh

  | TestName         |Go        | C        |
  |------------------|----------|----------|
  |TestSharedSecret  | 57.95774 | 57.91092 |
  |TestKeyGeneration | 62.23614 | 58.12980 |
  |TestSharedSecret  | 55.28988 | 57.23132 |
  |TestKeyGeneration | 61.68745 | 58.66396 |
  |TestSharedSecret  | 63.19408 | 58.64774 |
  |TestKeyGeneration | 62.34022 | 61.62539 |
  |TestSharedSecret  | 62.85453 | 68.74503 |
  |TestKeyGeneration | 52.58518 | 58.40115 |
  |TestSharedSecret  | 50.77081 | 61.91699 |
  |TestKeyGeneration | 59.91843 | 61.09266 |
  |TestSharedSecret  | 59.97962 | 62.98151 |
  |TestKeyGeneration | 64.57525 | 56.22863 |
  |TestSharedSecret  | 56.40521 | 55.77447 |
  |TestKeyGeneration | 67.85850 | 58.52604 |
  |TestSharedSecret  | 60.54290 | 65.14052 |
  |TestKeyGeneration | 65.45766 | 58.42823 |

  On average Go implementation is 2% faster.
2019-11-25 15:03:29 +00:00
15f6ee16b9 SHA-3: Fixes crash when cloning Shake state 2019-05-26 17:29:15 +01:00
9b3c0190b0 Updates P34 strategy calculation 2019-05-23 18:32:28 +01:00
7298b650cc
Adds go.mod (#21)
* Reset Makefile after adding go.mod
* Remove ``build`` directory
* Simiplifies makefile
* shake: Make xorIn copyOut platform specific
2019-05-15 18:03:35 +01:00
49bf0db8fd SHAKE: Don't use function pointers (#20)
* xorIn and copyOut function pointers cause input and output data
  to be moved to heap. This degrades performance of calling code.

* This change removes usage of those function pointers. We will always
  use unaligned implementation as it's faster (but may crash on some
  systems)

* Benchmark compares generic vs unaligned xorIn and copyOut

benchmark                          old ns/op     new ns/op     delta
BenchmarkPermutationFunction-4     463           815           +76.03%
BenchmarkShake128_MTU-4            4443          8180          +84.11%
BenchmarkShake256_MTU-4            4739          9060          +91.18%
BenchmarkShake256_16x-4            71886         132629        +84.50%
BenchmarkShake256_1MiB-4           3695138       6649012       +79.94%
BenchmarkCShake128_448_16x-4       21210         24611         +16.03%
BenchmarkCShake128_1MiB-4          3009342       3396496       +12.87%
BenchmarkCShake256_448_16x-4       26034         27785         +6.73%
BenchmarkCShake256_1MiB-4          3654713       3829404       +4.78%
2019-05-14 17:08:33 +01:00
e6439f96ab Adds cSHAKE with 0 alloc interface (#19) 2019-05-14 01:19:29 +01:00
6f9706df01
CTR-DRBG: Use hardware acceleration on X86 (#18)
benchmark              old ns/op     new ns/op     delta
BenchmarkInit-4        3403          397           -88.33%
BenchmarkRead-4        14535         1560          -89.27%
2019-04-09 23:50:21 +01:00
71624cdc4c Improvements to makefile 2019-04-09 17:30:30 +01:00
b184944242 Nits for SIDH 2019-04-09 17:09:34 +01:00
08f7315b64 DRBG: Speed improvements
* CTR-DRBG doesn't call "NewCipher" for block encryption
* Changes API of CTR-DRBG, so that read operation implementes io.Reader

Benchmark results:
----------------------
benchmark           old ns/op     new ns/op     delta
BenchmarkInit-4     1118          3579          +220.13%
BenchmarkRead-4     5343          14589         +173.05%

benchmark           old allocs     new allocs     delta
BenchmarkInit-4     15             0              -100.00%
BenchmarkRead-4     67             0              -100.00%

benchmark           old bytes     new bytes     delta
BenchmarkInit-4     1824          0             -100.00%
BenchmarkRead-4     9488          0             -100.00%
2019-04-09 14:37:59 +01:00
e66cc99401 Improves comment 2019-02-19 14:44:11 +00:00
b47a731959
Run tests on ARM64 (#11) 2019-02-16 21:29:20 +00:00
90f8cba329
SIDH: Update (#9)
* Change license to BSD-3

* SIDH: Multiple developlemnts
2018-12-03 23:07:01 +00:00
ea2ffa2d61 PERF: sidh-p503: Split sub and add into 2 uops instead of 3 (#8)
The performance improvement comes from the fact that on Skylake
"add mem, reg" splits into 2 uops - one arithmetic uop and another one
for loading a value from mem.
However, changing operand order to "add reg, mem" splits into 3 uops:
one for arithmetic op, one for load and one additional one for storing
the result back.
Using separated instruction for loading/storing helps to parallelize
execution (load/store and arithmetic instruction is done in parallel
if possible)

For details, see: https://www.agner.org/optimize/instruction_tables.pdf

New: BenchmarkFp503StrongReduce-4    300000000            5.57 ns/op
Old: BenchmarkFp503StrongReduce-4    200000000            8.60 ns/op

This just improves one function, but more functions can be improved
2018-11-18 20:57:29 +00:00
e9ddb6fb45
sidh/csidh: use SEE for performing CSWAP (#6)
* Makefile

* makefile: tools for profiling

* sidh: use SIMD for performing CSWAP

Loads data into 128-bit XMM registers and performs conditional swap.
This is probably less useful for SIDH, but will be useful for cSIDH
2018-10-29 15:41:09 +00:00
a456dc4dd9 readme: License 2018-10-25 15:22:28 +01:00
ae57368c7b License BS for sha3 2018-10-25 15:22:28 +01:00
00c16fe97e License bulshit 2018-10-25 15:22:28 +01:00
65bbafeef5 script used for calculating sliding window startegy in SIDH P34 2018-10-25 15:22:28 +01:00
0531c3479b Update README.md 2018-10-25 15:22:28 +01:00
1e34845d00 complate rewrite for SIDH and SIKE. adds p503 (#5) 2018-10-25 15:22:28 +01:00
d6fc82531f Doc 2018-10-25 15:22:28 +01:00
b769c88767 Improves some comments and hardcodes precomputed value (#4)
* Improves some comments and hardcodes precomputed value

* Tests curve coefficients recovery
2018-10-25 15:22:28 +01:00