Nelze vybrat více než 25 témat Téma musí začínat písmenem nebo číslem, může obsahovat pomlčky („-“) a může být dlouhé až 35 znaků.
 
 

632 řádky
21 KiB

  1. #include "asm-common.h"
  2. .arch armv8-a+crypto
  3. .text
  4. ///--------------------------------------------------------------------------
  5. /// Multiplication macros.
  6. // The good news is that we have a fancy instruction to do the
  7. // multiplications. The bad news is that it's not particularly well-
  8. // suited to the job.
  9. //
  10. // For one thing, it only does a 64-bit multiplication, so in general
  11. // we'll need to synthesize the full-width multiply by hand. For
  12. // another thing, it doesn't help with the reduction, so we have to
  13. // do that by hand too. And, finally, GCM has crazy bit ordering,
  14. // and the instruction does nothing useful for that at all.
  15. //
  16. // Focusing on that last problem first: the bits aren't in monotonic
  17. // significance order unless we permute them. Fortunately, ARM64 has
  18. // an instruction which will just permute the bits in each byte for
  19. // us, so we don't have to worry about this very much.
  20. //
  21. // Our main weapons, the `pmull' and `pmull2' instructions, work on
  22. // 64-bit operands, in half of a vector register, and produce 128-bit
  23. // results. But neither of them will multiply the high half of one
  24. // vector by the low half of a second one, so we have a problem,
  25. // which we solve by representing one of the operands redundantly:
  26. // rather than packing the 64-bit pieces together, we duplicate each
  27. // 64-bit piece across both halves of a register.
  28. //
  29. // The commentary for `mul128' is the most detailed. The other
  30. // macros assume that you've already read and understood that.
  31. .macro mul128
  32. // Enter with u and v in v0 and v1/v2 respectively, and 0 in v31;
  33. // leave with z = u v in v0. Clobbers v1--v6.
  34. // First for the double-precision multiplication. It's tempting to
  35. // use Karatsuba's identity here, but I suspect that loses more in
  36. // the shifting, bit-twiddling, and dependency chains that it gains
  37. // in saving a multiplication which otherwise pipelines well.
  38. // v0 = // (u_0; u_1)
  39. // v1/v2 = // (v_0; v_1)
  40. pmull2 v3.1q, v0.2d, v1.2d // u_1 v_0
  41. pmull v4.1q, v0.1d, v2.1d // u_0 v_1
  42. pmull2 v5.1q, v0.2d, v2.2d // (t_1; x_3) = u_1 v_1
  43. pmull v6.1q, v0.1d, v1.1d // (x_0; t_0) = u_0 v_0
  44. // Arrange the pieces to form a double-precision polynomial.
  45. eor v3.16b, v3.16b, v4.16b // (m_0; m_1) = u_0 v_1 + u_1 v_0
  46. vshr128 v4, v3, 64 // (m_1; 0)
  47. vshl128 v3, v3, 64 // (0; m_0)
  48. eor v1.16b, v5.16b, v4.16b // (x_2; x_3)
  49. eor v0.16b, v6.16b, v3.16b // (x_0; x_1)
  50. // And now the only remaining difficulty is that the result needs to
  51. // be reduced modulo p(t) = t^128 + t^7 + t^2 + t + 1. Let R = t^128
  52. // = t^7 + t^2 + t + 1 in our field. So far, we've calculated z_0
  53. // and z_1 such that z_0 + z_1 R = u v using the identity R = t^128:
  54. // now we must collapse the two halves of y together using the other
  55. // identity R = t^7 + t^2 + t + 1.
  56. //
  57. // We do this by working on y_2 and y_3 separately, so consider y_i
  58. // for i = 2 or 3. Certainly, y_i t^{64i} = y_i R t^{64(i-2) =
  59. // (t^7 + t^2 + t + 1) y_i t^{64(i-2)}, but we can't use that
  60. // directly without breaking up the 64-bit word structure. Instead,
  61. // we start by considering just y_i t^7 t^{64(i-2)}, which again
  62. // looks tricky. Now, split y_i = a_i + t^57 b_i, with deg a_i < 57;
  63. // then
  64. //
  65. // y_i t^7 t^{64(i-2)} = a_i t^7 t^{64(i-2)} + b_i t^{64(i-1)}
  66. //
  67. // We can similarly decompose y_i t^2 and y_i t into a pair of 64-bit
  68. // contributions to the t^{64(i-2)} and t^{64(i-1)} words, but the
  69. // splits are different. This is lovely, with one small snag: when
  70. // we do this to y_3, we end up with a contribution back into the
  71. // t^128 coefficient word. But notice that only the low seven bits
  72. // of this word are affected, so there's no knock-on contribution
  73. // into the t^64 word. Therefore, if we handle the high bits of each
  74. // word together, and then the low bits, everything will be fine.
  75. // First, shift the high bits down.
  76. ushr v2.2d, v1.2d, #63 // the b_i for t
  77. ushr v3.2d, v1.2d, #62 // the b_i for t^2
  78. ushr v4.2d, v1.2d, #57 // the b_i for t^7
  79. eor v2.16b, v2.16b, v3.16b // add them all together
  80. eor v2.16b, v2.16b, v4.16b
  81. vshr128 v3, v2, 64
  82. vshl128 v4, v2, 64
  83. eor v1.16b, v1.16b, v3.16b // contribution into high half
  84. eor v0.16b, v0.16b, v4.16b // and low half
  85. // And then shift the low bits up.
  86. shl v2.2d, v1.2d, #1
  87. shl v3.2d, v1.2d, #2
  88. shl v4.2d, v1.2d, #7
  89. eor v1.16b, v1.16b, v2.16b // unit and t contribs
  90. eor v3.16b, v3.16b, v4.16b // t^2 and t^7 contribs
  91. eor v0.16b, v0.16b, v1.16b // mix everything together
  92. eor v0.16b, v0.16b, v3.16b // ... and we're done
  93. .endm
  94. .macro mul64
  95. // Enter with u and v in the low halves of v0 and v1, respectively;
  96. // leave with z = u v in x2. Clobbers x2--x4.
  97. // The multiplication is thankfully easy.
  98. // v0 = // (u; ?)
  99. // v1 = // (v; ?)
  100. pmull v0.1q, v0.1d, v1.1d // u v
  101. // Now we must reduce. This is essentially the same as the 128-bit
  102. // case above, but mostly simpler because everything is smaller. The
  103. // polynomial this time is p(t) = t^64 + t^4 + t^3 + t + 1.
  104. // Before we get stuck in, transfer the product to general-purpose
  105. // registers.
  106. mov x3, v0.d[1]
  107. mov x2, v0.d[0]
  108. // First, shift the high bits down.
  109. eor x4, x3, x3, lsr #1 // pre-mix t^3 and t^4
  110. eor x3, x3, x3, lsr #63 // mix in t contribution
  111. eor x3, x3, x4, lsr #60 // shift and mix in t^3 and t^4
  112. // And then shift the low bits up.
  113. eor x3, x3, x3, lsl #1 // mix unit and t; pre-mix t^3, t^4
  114. eor x2, x2, x3 // fold them in
  115. eor x2, x2, x3, lsl #3 // and t^3 and t^4
  116. .endm
  117. .macro mul96
  118. // Enter with u in the least-significant 96 bits of v0, with zero in
  119. // the upper 32 bits, and with the least-significant 64 bits of v in
  120. // both halves of v1, and the upper 32 bits of v in the low 32 bits
  121. // of each half of v2, with zero in the upper 32 bits; and with zero
  122. // in v31. Yes, that's a bit hairy. Leave with the product u v in
  123. // the low 96 bits of v0, and /junk/ in the high 32 bits. Clobbers
  124. // v1--v6.
  125. // This is an inconvenient size. There's nothing for it but to do
  126. // four multiplications, as if for the 128-bit case. It's possible
  127. // that there's cruft in the top 32 bits of the input registers, so
  128. // shift both of them up by four bytes before we start. This will
  129. // mean that the high 64 bits of the result (from GCM's viewpoint)
  130. // will be zero.
  131. // v0 = // (u_0 + u_1 t^32; u_2)
  132. // v1 = // (v_0 + v_1 t^32; v_0 + v_1 t^32)
  133. // v2 = // (v_2; v_2)
  134. pmull2 v5.1q, v0.2d, v1.2d // u_2 (v_0 + v_1 t^32) t^32 = e_0
  135. pmull v4.1q, v0.1d, v2.1d // v_2 (u_0 + u_1 t^32) t^32 = e_1
  136. pmull2 v6.1q, v0.2d, v2.2d // u_2 v_2 = d = (d; 0)
  137. pmull v3.1q, v0.1d, v1.1d // u_0 v_0 + (u_0 v_1 + u_1 v_0) t^32
  138. // + u_1 v_1 t^64 = f
  139. // Extract the high and low halves of the 192-bit result. The answer
  140. // we want is d t^128 + e t^64 + f, where e = e_0 + e_1. The low 96
  141. // bits of the answer will end up in v0, with junk in the top 32
  142. // bits; the high 96 bits will end up in v1, which must have zero in
  143. // its top 32 bits.
  144. //
  145. // Here, bot(x) is the low 96 bits of a 192-bit quantity x, arranged
  146. // in the low 96 bits of a SIMD register, with junk in the top 32
  147. // bits; and top(x) is the high 96 bits, also arranged in the low 96
  148. // bits of a register, with /zero/ in the top 32 bits.
  149. eor v4.16b, v4.16b, v5.16b // e_0 + e_1 = e
  150. vshl128 v6, v6, 32 // top(d t^128)
  151. vshr128 v5, v4, 32 // top(e t^64)
  152. vshl128 v4, v4, 64 // bot(e t^64)
  153. vshr128 v1, v3, 96 // top(f)
  154. eor v6.16b, v6.16b, v5.16b // top(d t^128 + e t^64)
  155. eor v0.16b, v3.16b, v4.16b // bot([d t^128] + e t^64 + f)
  156. eor v1.16b, v1.16b, v6.16b // top(e t^64 + d t^128 + f)
  157. // Finally, the reduction. This is essentially the same as the
  158. // 128-bit case, except that the polynomial is p(t) = t^96 + t^10 +
  159. // t^9 + t^6 + 1. The degrees are larger but not enough to cause
  160. // trouble for the general approach. Unfortunately, we have to do
  161. // this in 32-bit pieces rather than 64.
  162. // First, shift the high bits down.
  163. ushr v2.4s, v1.4s, #26 // the b_i for t^6
  164. ushr v3.4s, v1.4s, #23 // the b_i for t^9
  165. ushr v4.4s, v1.4s, #22 // the b_i for t^10
  166. eor v2.16b, v2.16b, v3.16b // add them all together
  167. eor v2.16b, v2.16b, v4.16b
  168. vshr128 v3, v2, 64 // contribution for high half
  169. vshl128 v2, v2, 32 // contribution for low half
  170. eor v1.16b, v1.16b, v3.16b // apply to high half
  171. eor v0.16b, v0.16b, v2.16b // and low half
  172. // And then shift the low bits up.
  173. shl v2.4s, v1.4s, #6
  174. shl v3.4s, v1.4s, #9
  175. shl v4.4s, v1.4s, #10
  176. eor v1.16b, v1.16b, v2.16b // unit and t^6 contribs
  177. eor v3.16b, v3.16b, v4.16b // t^9 and t^10 contribs
  178. eor v0.16b, v0.16b, v1.16b // mix everything together
  179. eor v0.16b, v0.16b, v3.16b // ... and we're done
  180. .endm
  181. .macro mul192
  182. // Enter with u in v0 and the less-significant half of v1, with v
  183. // duplicated across both halves of v2/v3/v4, and with zero in v31.
  184. // Leave with the product u v in v0 and the bottom half of v1.
  185. // Clobbers v16--v25.
  186. // Start multiplying and accumulating pieces of product.
  187. // v0 = // (u_0; u_1)
  188. // v1 = // (u_2; ?)
  189. // v2 = // (v_0; v_0)
  190. // v3 = // (v_1; v_1)
  191. // v4 = // (v_2; v_2)
  192. pmull v16.1q, v0.1d, v2.1d // a = u_0 v_0
  193. pmull v19.1q, v0.1d, v3.1d // u_0 v_1
  194. pmull2 v21.1q, v0.2d, v2.2d // u_1 v_0
  195. pmull v17.1q, v0.1d, v4.1d // u_0 v_2
  196. pmull2 v22.1q, v0.2d, v3.2d // u_1 v_1
  197. pmull v23.1q, v1.1d, v2.1d // u_2 v_0
  198. eor v19.16b, v19.16b, v21.16b // b = u_0 v_1 + u_1 v_0
  199. pmull2 v20.1q, v0.2d, v4.2d // u_1 v_2
  200. pmull v24.1q, v1.1d, v3.1d // u_2 v_1
  201. eor v17.16b, v17.16b, v22.16b // u_0 v_2 + u_1 v_1
  202. pmull v18.1q, v1.1d, v4.1d // e = u_2 v_2
  203. eor v17.16b, v17.16b, v23.16b // c = u_0 v_2 + u_1 v_1 + u_2 v_1
  204. eor v20.16b, v20.16b, v24.16b // d = u_1 v_2 + u_2 v_1
  205. // Piece the product together.
  206. // v16 = // (a_0; a_1)
  207. // v19 = // (b_0; b_1)
  208. // v17 = // (c_0; c_1)
  209. // v20 = // (d_0; d_1)
  210. // v18 = // (e_0; e_1)
  211. vshl128 v21, v19, 64 // (0; b_0)
  212. ext v22.16b, v19.16b, v20.16b, #8 // (b_1; d_0)
  213. vshr128 v23, v20, 64 // (d_1; 0)
  214. eor v16.16b, v16.16b, v21.16b // (x_0; x_1)
  215. eor v17.16b, v17.16b, v22.16b // (x_2; x_3)
  216. eor v18.16b, v18.16b, v23.16b // (x_2; x_3)
  217. // Next, the reduction. Our polynomial this time is p(x) = t^192 +
  218. // t^7 + t^2 + t + 1. Yes, the magic numbers are the same as the
  219. // 128-bit case. I don't know why.
  220. // First, shift the high bits down.
  221. // v16 = // (y_0; y_1)
  222. // v17 = // (y_2; y_3)
  223. // v18 = // (y_4; y_5)
  224. mov v19.d[0], v17.d[1] // (y_3; ?)
  225. ushr v23.2d, v18.2d, #63 // hi b_i for t
  226. ushr d20, d19, #63 // lo b_i for t
  227. ushr v24.2d, v18.2d, #62 // hi b_i for t^2
  228. ushr d21, d19, #62 // lo b_i for t^2
  229. ushr v25.2d, v18.2d, #57 // hi b_i for t^7
  230. ushr d22, d19, #57 // lo b_i for t^7
  231. eor v23.16b, v23.16b, v24.16b // mix them all together
  232. eor v20.8b, v20.8b, v21.8b
  233. eor v23.16b, v23.16b, v25.16b
  234. eor v20.8b, v20.8b, v22.8b
  235. // Permute the high pieces while we fold in the b_i.
  236. eor v17.16b, v17.16b, v23.16b
  237. vshl128 v20, v20, 64
  238. mov v19.d[0], v18.d[1] // (y_5; ?)
  239. ext v18.16b, v17.16b, v18.16b, #8 // (y_3; y_4)
  240. eor v16.16b, v16.16b, v20.16b
  241. // And finally shift the low bits up.
  242. // v16 = // (y'_0; y'_1)
  243. // v17 = // (y'_2; ?)
  244. // v18 = // (y'_3; y'_4)
  245. // v19 = // (y'_5; ?)
  246. shl v20.2d, v18.2d, #1
  247. shl d23, d19, #1
  248. shl v21.2d, v18.2d, #2
  249. shl d24, d19, #2
  250. shl v22.2d, v18.2d, #7
  251. shl d25, d19, #7
  252. eor v18.16b, v18.16b, v20.16b // unit and t contribs
  253. eor v19.8b, v19.8b, v23.8b
  254. eor v21.16b, v21.16b, v22.16b // t^2 and t^7 contribs
  255. eor v24.8b, v24.8b, v25.8b
  256. eor v18.16b, v18.16b, v21.16b // all contribs
  257. eor v19.8b, v19.8b, v24.8b
  258. eor v0.16b, v16.16b, v18.16b // mix them into the low half
  259. eor v1.8b, v17.8b, v19.8b
  260. .endm
  261. .macro mul256
  262. // Enter with u in v0/v1, with v duplicated across both halves of
  263. // v2--v5, and with zero in v31. Leave with the product u v in
  264. // v0/v1. Clobbers ???.
  265. // Now it's starting to look worthwhile to do Karatsuba. Suppose
  266. // u = u_0 + u_1 B and v = v_0 + v_1 B. Then
  267. //
  268. // u v = (u_0 v_0) + (u_0 v_1 + u_1 v_0) B + (u_1 v_1) B^2
  269. //
  270. // Name these coefficients of B^i be a, b, and c, respectively, and
  271. // let r = u_0 + u_1 and s = v_0 + v_1. Then observe that
  272. //
  273. // q = r s = (u_0 + u_1) (v_0 + v_1)
  274. // = (u_0 v_0) + (u1 v_1) + (u_0 v_1 + u_1 v_0)
  275. // = a + d + c
  276. //
  277. // The first two terms we've already calculated; the last is the
  278. // remaining one we want. We'll set B = t^128. We know how to do
  279. // 128-bit multiplications already, and Karatsuba is too annoying
  280. // there, so there'll be 12 multiplications altogether, rather than
  281. // the 16 we'd have if we did this the naïve way.
  282. // v0 = // u_0 = (u_00; u_01)
  283. // v1 = // u_1 = (u_10; u_11)
  284. // v2 = // (v_00; v_00)
  285. // v3 = // (v_01; v_01)
  286. // v4 = // (v_10; v_10)
  287. // v5 = // (v_11; v_11)
  288. eor v28.16b, v0.16b, v1.16b // u_* = (u_00 + u_10; u_01 + u_11)
  289. eor v29.16b, v2.16b, v4.16b // v_*0 = v_00 + v_10
  290. eor v30.16b, v3.16b, v5.16b // v_*1 = v_01 + v_11
  291. // Start by building the cross product, q = u_* v_*.
  292. pmull v24.1q, v28.1d, v30.1d // u_*0 v_*1
  293. pmull2 v25.1q, v28.2d, v29.2d // u_*1 v_*0
  294. pmull v20.1q, v28.1d, v29.1d // u_*0 v_*0
  295. pmull2 v21.1q, v28.2d, v30.2d // u_*1 v_*1
  296. eor v24.16b, v24.16b, v25.16b // u_*0 v_*1 + u_*1 v_*0
  297. vshr128 v25, v24, 64
  298. vshl128 v24, v24, 64
  299. eor v20.16b, v20.16b, v24.16b // q_0
  300. eor v21.16b, v21.16b, v25.16b // q_1
  301. // Next, work on the low half, a = u_0 v_0
  302. pmull v24.1q, v0.1d, v3.1d // u_00 v_01
  303. pmull2 v25.1q, v0.2d, v2.2d // u_01 v_00
  304. pmull v16.1q, v0.1d, v2.1d // u_00 v_00
  305. pmull2 v17.1q, v0.2d, v3.2d // u_01 v_01
  306. eor v24.16b, v24.16b, v25.16b // u_00 v_01 + u_01 v_00
  307. vshr128 v25, v24, 64
  308. vshl128 v24, v24, 64
  309. eor v16.16b, v16.16b, v24.16b // a_0
  310. eor v17.16b, v17.16b, v25.16b // a_1
  311. // Mix the pieces we have so far.
  312. eor v20.16b, v20.16b, v16.16b
  313. eor v21.16b, v21.16b, v17.16b
  314. // Finally, work on the high half, c = u_1 v_1
  315. pmull v24.1q, v1.1d, v5.1d // u_10 v_11
  316. pmull2 v25.1q, v1.2d, v4.2d // u_11 v_10
  317. pmull v18.1q, v1.1d, v4.1d // u_10 v_10
  318. pmull2 v19.1q, v1.2d, v5.2d // u_11 v_11
  319. eor v24.16b, v24.16b, v25.16b // u_10 v_11 + u_11 v_10
  320. vshr128 v25, v24, 64
  321. vshl128 v24, v24, 64
  322. eor v18.16b, v18.16b, v24.16b // c_0
  323. eor v19.16b, v19.16b, v25.16b // c_1
  324. // Finish mixing the product together.
  325. eor v20.16b, v20.16b, v18.16b
  326. eor v21.16b, v21.16b, v19.16b
  327. eor v17.16b, v17.16b, v20.16b
  328. eor v18.16b, v18.16b, v21.16b
  329. // Now we must reduce. This is essentially the same as the 192-bit
  330. // case above, but more complicated because everything is bigger.
  331. // The polynomial this time is p(t) = t^256 + t^10 + t^5 + t^2 + 1.
  332. // v16 = // (y_0; y_1)
  333. // v17 = // (y_2; y_3)
  334. // v18 = // (y_4; y_5)
  335. // v19 = // (y_6; y_7)
  336. ushr v24.2d, v18.2d, #62 // (y_4; y_5) b_i for t^2
  337. ushr v25.2d, v19.2d, #62 // (y_6; y_7) b_i for t^2
  338. ushr v26.2d, v18.2d, #59 // (y_4; y_5) b_i for t^5
  339. ushr v27.2d, v19.2d, #59 // (y_6; y_7) b_i for t^5
  340. ushr v28.2d, v18.2d, #54 // (y_4; y_5) b_i for t^10
  341. ushr v29.2d, v19.2d, #54 // (y_6; y_7) b_i for t^10
  342. eor v24.16b, v24.16b, v26.16b // mix the contributions together
  343. eor v25.16b, v25.16b, v27.16b
  344. eor v24.16b, v24.16b, v28.16b
  345. eor v25.16b, v25.16b, v29.16b
  346. vshr128 v26, v25, 64 // slide contribs into position
  347. ext v25.16b, v24.16b, v25.16b, #8
  348. vshl128 v24, v24, 64
  349. eor v18.16b, v18.16b, v26.16b
  350. eor v17.16b, v17.16b, v25.16b
  351. eor v16.16b, v16.16b, v24.16b
  352. // And then shift the low bits up.
  353. // v16 = // (y'_0; y'_1)
  354. // v17 = // (y'_2; y'_3)
  355. // v18 = // (y'_4; y'_5)
  356. // v19 = // (y'_6; y'_7)
  357. shl v24.2d, v18.2d, #2 // (y'_4; y_5) a_i for t^2
  358. shl v25.2d, v19.2d, #2 // (y_6; y_7) a_i for t^2
  359. shl v26.2d, v18.2d, #5 // (y'_4; y_5) a_i for t^5
  360. shl v27.2d, v19.2d, #5 // (y_6; y_7) a_i for t^5
  361. shl v28.2d, v18.2d, #10 // (y'_4; y_5) a_i for t^10
  362. shl v29.2d, v19.2d, #10 // (y_6; y_7) a_i for t^10
  363. eor v18.16b, v18.16b, v24.16b // mix the contributions together
  364. eor v19.16b, v19.16b, v25.16b
  365. eor v26.16b, v26.16b, v28.16b
  366. eor v27.16b, v27.16b, v29.16b
  367. eor v18.16b, v18.16b, v26.16b
  368. eor v19.16b, v19.16b, v27.16b
  369. eor v0.16b, v16.16b, v18.16b
  370. eor v1.16b, v17.16b, v19.16b
  371. .endm
  372. ///--------------------------------------------------------------------------
  373. /// Main code.
  374. // There are a number of representations of field elements in this code and
  375. // it can be confusing.
  376. //
  377. // * The `external format' consists of a sequence of contiguous bytes in
  378. // memory called a `block'. The GCM spec explains how to interpret this
  379. // block as an element of a finite field. As discussed extensively, this
  380. // representation is very annoying for a number of reasons. On the other
  381. // hand, this code never actually deals with it directly.
  382. //
  383. // * The `register format' consists of one or more SIMD registers,
  384. // depending on the block size. The bits in each byte are reversed,
  385. // compared to the external format, which makes the polynomials
  386. // completely vanilla, unlike all of the other GCM implementations.
  387. //
  388. // * The `table format' is just like the `register format', only the two
  389. // halves of 128-bit SIMD register are the same, so we need twice as many
  390. // registers.
  391. //
  392. // * The `words' format consists of a sequence of bytes, as in the
  393. // `external format', but, according to the blockcipher in use, the bytes
  394. // within each 32-bit word may be reversed (`big-endian') or not
  395. // (`little-endian'). Accordingly, there are separate entry points for
  396. // each variant, identified with `b' or `l'.
  397. FUNC(gcm_mulk_128b_arm64_pmull)
  398. // On entry, x0 points to a 128-bit field element A in big-endian
  399. // words format; x1 points to a field-element K in table format. On
  400. // exit, A is updated with the product A K.
  401. ldr q0, [x0]
  402. ldp q1, q2, [x1]
  403. rev32 v0.16b, v0.16b
  404. vzero
  405. rbit v0.16b, v0.16b
  406. mul128
  407. rbit v0.16b, v0.16b
  408. rev32 v0.16b, v0.16b
  409. str q0, [x0]
  410. ret
  411. ENDFUNC
  412. FUNC(gcm_mulk_128l_arm64_pmull)
  413. // On entry, x0 points to a 128-bit field element A in little-endian
  414. // words format; x1 points to a field-element K in table format. On
  415. // exit, A is updated with the product A K.
  416. ldr q0, [x0]
  417. ldp q1, q2, [x1]
  418. vzero
  419. rbit v0.16b, v0.16b
  420. mul128
  421. rbit v0.16b, v0.16b
  422. str q0, [x0]
  423. ret
  424. ENDFUNC
  425. FUNC(gcm_mulk_64b_arm64_pmull)
  426. // On entry, x0 points to a 64-bit field element A in big-endian
  427. // words format; x1 points to a field-element K in table format. On
  428. // exit, A is updated with the product A K.
  429. ldr d0, [x0]
  430. ldr q1, [x1]
  431. rev32 v0.8b, v0.8b
  432. rbit v0.8b, v0.8b
  433. mul64
  434. rbit x2, x2
  435. ror x2, x2, #32
  436. str x2, [x0]
  437. ret
  438. ENDFUNC
  439. FUNC(gcm_mulk_64l_arm64_pmull)
  440. // On entry, x0 points to a 64-bit field element A in little-endian
  441. // words format; x1 points to a field-element K in table format. On
  442. // exit, A is updated with the product A K.
  443. ldr d0, [x0]
  444. ldr q1, [x1]
  445. rbit v0.8b, v0.8b
  446. mul64
  447. rbit x2, x2
  448. rev x2, x2
  449. str x2, [x0]
  450. ret
  451. ENDFUNC
  452. FUNC(gcm_mulk_96b_arm64_pmull)
  453. // On entry, x0 points to a 96-bit field element A in big-endian
  454. // words format; x1 points to a field-element K in table format. On
  455. // exit, A is updated with the product A K.
  456. ldr w2, [x0, #8]
  457. ldr d0, [x0, #0]
  458. mov v0.d[1], x2
  459. ldp q1, q2, [x1]
  460. rev32 v0.16b, v0.16b
  461. vzero
  462. rbit v0.16b, v0.16b
  463. mul96
  464. rbit v0.16b, v0.16b
  465. rev32 v0.16b, v0.16b
  466. mov w2, v0.s[2]
  467. str d0, [x0, #0]
  468. str w2, [x0, #8]
  469. ret
  470. ENDFUNC
  471. FUNC(gcm_mulk_96l_arm64_pmull)
  472. // On entry, x0 points to a 96-bit field element A in little-endian
  473. // words format; x1 points to a field-element K in table format. On
  474. // exit, A is updated with the product A K.
  475. ldr d0, [x0, #0]
  476. ldr w2, [x0, #8]
  477. mov v0.d[1], x2
  478. ldp q1, q2, [x1]
  479. rbit v0.16b, v0.16b
  480. vzero
  481. mul96
  482. rbit v0.16b, v0.16b
  483. mov w2, v0.s[2]
  484. str d0, [x0, #0]
  485. str w2, [x0, #8]
  486. ret
  487. ENDFUNC
  488. FUNC(gcm_mulk_192b_arm64_pmull)
  489. // On entry, x0 points to a 192-bit field element A in big-endian
  490. // words format; x1 points to a field-element K in table format. On
  491. // exit, A is updated with the product A K.
  492. ldr q0, [x0, #0]
  493. ldr d1, [x0, #16]
  494. ldp q2, q3, [x1, #0]
  495. ldr q4, [x1, #32]
  496. rev32 v0.16b, v0.16b
  497. rev32 v1.8b, v1.8b
  498. rbit v0.16b, v0.16b
  499. rbit v1.8b, v1.8b
  500. vzero
  501. mul192
  502. rev32 v0.16b, v0.16b
  503. rev32 v1.8b, v1.8b
  504. rbit v0.16b, v0.16b
  505. rbit v1.8b, v1.8b
  506. str q0, [x0, #0]
  507. str d1, [x0, #16]
  508. ret
  509. ENDFUNC
  510. FUNC(gcm_mulk_192l_arm64_pmull)
  511. // On entry, x0 points to a 192-bit field element A in little-endian
  512. // words format; x1 points to a field-element K in table format. On
  513. // exit, A is updated with the product A K.
  514. ldr q0, [x0, #0]
  515. ldr d1, [x0, #16]
  516. ldp q2, q3, [x1, #0]
  517. ldr q4, [x1, #32]
  518. rbit v0.16b, v0.16b
  519. rbit v1.8b, v1.8b
  520. vzero
  521. mul192
  522. rbit v0.16b, v0.16b
  523. rbit v1.8b, v1.8b
  524. str q0, [x0, #0]
  525. str d1, [x0, #16]
  526. ret
  527. ENDFUNC
  528. FUNC(gcm_mulk_256b_arm64_pmull)
  529. // On entry, x0 points to a 256-bit field element A in big-endian
  530. // words format; x1 points to a field-element K in table format. On
  531. // exit, A is updated with the product A K.
  532. ldp q0, q1, [x0]
  533. ldp q2, q3, [x1, #0]
  534. ldp q4, q5, [x1, #32]
  535. rev32 v0.16b, v0.16b
  536. rev32 v1.16b, v1.16b
  537. rbit v0.16b, v0.16b
  538. rbit v1.16b, v1.16b
  539. vzero
  540. mul256
  541. rev32 v0.16b, v0.16b
  542. rev32 v1.16b, v1.16b
  543. rbit v0.16b, v0.16b
  544. rbit v1.16b, v1.16b
  545. stp q0, q1, [x0]
  546. ret
  547. ENDFUNC
  548. FUNC(gcm_mulk_256l_arm64_pmull)
  549. // On entry, x0 points to a 256-bit field element A in little-endian
  550. // words format; x1 points to a field-element K in table format. On
  551. // exit, A is updated with the product A K.
  552. ldp q0, q1, [x0]
  553. ldp q2, q3, [x1, #0]
  554. ldp q4, q5, [x1, #32]
  555. rbit v0.16b, v0.16b
  556. rbit v1.16b, v1.16b
  557. vzero
  558. mul256
  559. rbit v0.16b, v0.16b
  560. rbit v1.16b, v1.16b
  561. stp q0, q1, [x0]
  562. ret
  563. ENDFUNC
  564. ///----- That's all, folks --------------------------------------------------