Add MQDSS AVX2 implementations (#288)

* Add AVX2 version of mqdss
* Fix duplicate consistency
2020-06-26 08:01:23 +02:00 · 2020-06-26 08:01:23 +02:00 · 8e27bd0915
--- a/crypto_sign/mqdss-48/META.yml
+++ b/crypto_sign/mqdss-48/META.yml
@@ -16,3 +16,12 @@ auxiliary-submitters:
 implementations:
    - name: clean
      version: https://github.com/joostrijneveld/MQDSS/commit/00608d7610262ff07b1834885d32bc3fd27ef5e1
+    - name: avx2
+      version: https://github.com/joostrijneveld/MQDSS/commit/00608d7610262ff07b1834885d32bc3fd27ef5e1
+      supported_platforms:
+          - architecture: x86_64
+            required_flags:
+                - avx2
+          - architecture: x86
+            required_flags:
+                - avx2
--- a/crypto_sign/mqdss-48/avx2/LICENSE
+++ b/crypto_sign/mqdss-48/avx2/LICENSE
@@ -0,0 +1,116 @@
+CC0 1.0 Universal
+
+Statement of Purpose
+
+The laws of most jurisdictions throughout the world automatically confer
+exclusive Copyright and Related Rights (defined below) upon the creator and
+subsequent owner(s) (each and all, an "owner") of an original work of
+authorship and/or a database (each, a "Work").
+
+Certain owners wish to permanently relinquish those rights to a Work for the
+purpose of contributing to a commons of creative, cultural and scientific
+works ("Commons") that the public can reliably and without fear of later
+claims of infringement build upon, modify, incorporate in other works, reuse
+and redistribute as freely as possible in any form whatsoever and for any
+purposes, including without limitation commercial purposes. These owners may
+contribute to the Commons to promote the ideal of a free culture and the
+further production of creative, cultural and scientific works, or to gain
+reputation or greater distribution for their Work in part through the use and
+efforts of others.
+
+For these and/or other purposes and motivations, and without any expectation
+of additional consideration or compensation, the person associating CC0 with a
+Work (the "Affirmer"), to the extent that he or she is an owner of Copyright
+and Related Rights in the Work, voluntarily elects to apply CC0 to the Work
+and publicly distribute the Work under its terms, with knowledge of his or her
+Copyright and Related Rights in the Work and the meaning and intended legal
+effect of CC0 on those rights.
+
+1. Copyright and Related Rights. A Work made available under CC0 may be
+protected by copyright and related or neighboring rights ("Copyright and
+Related Rights"). Copyright and Related Rights include, but are not limited
+to, the following:
+
+  i. the right to reproduce, adapt, distribute, perform, display, communicate,
+  and translate a Work;
+
+  ii. moral rights retained by the original author(s) and/or performer(s);
+
+  iii. publicity and privacy rights pertaining to a person's image or likeness
+  depicted in a Work;
+
+  iv. rights protecting against unfair competition in regards to a Work,
+  subject to the limitations in paragraph 4(a), below;
+
+  v. rights protecting the extraction, dissemination, use and reuse of data in
+  a Work;
+
+  vi. database rights (such as those arising under Directive 96/9/EC of the
+  European Parliament and of the Council of 11 March 1996 on the legal
+  protection of databases, and under any national implementation thereof,
+  including any amended or successor version of such directive); and
+
+  vii. other similar, equivalent or corresponding rights throughout the world
+  based on applicable law or treaty, and any national implementations thereof.
+
+2. Waiver. To the greatest extent permitted by, but not in contravention of,
+applicable law, Affirmer hereby overtly, fully, permanently, irrevocably and
+unconditionally waives, abandons, and surrenders all of Affirmer's Copyright
+and Related Rights and associated claims and causes of action, whether now
+known or unknown (including existing as well as future claims and causes of
+action), in the Work (i) in all territories worldwide, (ii) for the maximum
+duration provided by applicable law or treaty (including future time
+extensions), (iii) in any current or future medium and for any number of
+copies, and (iv) for any purpose whatsoever, including without limitation
+commercial, advertising or promotional purposes (the "Waiver"). Affirmer makes
+the Waiver for the benefit of each member of the public at large and to the
+detriment of Affirmer's heirs and successors, fully intending that such Waiver
+shall not be subject to revocation, rescission, cancellation, termination, or
+any other legal or equitable action to disrupt the quiet enjoyment of the Work
+by the public as contemplated by Affirmer's express Statement of Purpose.
+
+3. Public License Fallback. Should any part of the Waiver for any reason be
+judged legally invalid or ineffective under applicable law, then the Waiver
+shall be preserved to the maximum extent permitted taking into account
+Affirmer's express Statement of Purpose. In addition, to the extent the Waiver
+is so judged Affirmer hereby grants to each affected person a royalty-free,
+non transferable, non sublicensable, non exclusive, irrevocable and
+unconditional license to exercise Affirmer's Copyright and Related Rights in
+the Work (i) in all territories worldwide, (ii) for the maximum duration
+provided by applicable law or treaty (including future time extensions), (iii)
+in any current or future medium and for any number of copies, and (iv) for any
+purpose whatsoever, including without limitation commercial, advertising or
+promotional purposes (the "License"). The License shall be deemed effective as
+of the date CC0 was applied by Affirmer to the Work. Should any part of the
+License for any reason be judged legally invalid or ineffective under
+applicable law, such partial invalidity or ineffectiveness shall not
+invalidate the remainder of the License, and in such case Affirmer hereby
+affirms that he or she will not (i) exercise any of his or her remaining
+Copyright and Related Rights in the Work or (ii) assert any associated claims
+and causes of action with respect to the Work, in either case contrary to
+Affirmer's express Statement of Purpose.
+
+4. Limitations and Disclaimers.
+
+  a. No trademark or patent rights held by Affirmer are waived, abandoned,
+  surrendered, licensed or otherwise affected by this document.
+
+  b. Affirmer offers the Work as-is and makes no representations or warranties
+  of any kind concerning the Work, express, implied, statutory or otherwise,
+  including without limitation warranties of title, merchantability, fitness
+  for a particular purpose, non infringement, or the absence of latent or
+  other defects, accuracy, or the present or absence of errors, whether or not
+  discoverable, all to the greatest extent permissible under applicable law.
+
+  c. Affirmer disclaims responsibility for clearing rights of other persons
+  that may apply to the Work or any use thereof, including without limitation
+  any person's Copyright and Related Rights in the Work. Further, Affirmer
+  disclaims responsibility for obtaining any necessary consents, permissions
+  or other rights required for any use of the Work.
+
+  d. Affirmer understands and acknowledges that Creative Commons is not a
+  party to this document and has no duty or obligation with respect to this
+  CC0 or use of the Work.
+
+For more information, please see
+<http://creativecommons.org/publicdomain/zero/1.0/>
--- a/crypto_sign/mqdss-48/avx2/Makefile
+++ b/crypto_sign/mqdss-48/avx2/Makefile
@@ -0,0 +1,22 @@
+# This Makefile can be used with GNU Make or BSD Make
+
+LIB=libmqdss-48_avx2.a
+
+HEADERS = params.h gf31.h mq.h api.h
+OBJECTS =          gf31.o mq.o sign.o
+
+CFLAGS=-O3 -Wall -Wconversion -Wextra -Wpedantic -Wvla -Werror \
+	   -Wmissing-prototypes -Wredundant-decls -std=c99 -mavx2 \
+	   -I../../../common $(EXTRAFLAGS)
+
+all: $(LIB)
+
+%.o: %.c $(HEADERS)
+	$(CC) $(CFLAGS) -c -o $@ $<
+
+$(LIB): $(OBJECTS)
+	$(AR) -r $@ $(OBJECTS)
+
+clean:
+	$(RM) $(OBJECTS)
+	$(RM) $(LIB)
--- a/crypto_sign/mqdss-48/avx2/Makefile.Microsoft_nmake
+++ b/crypto_sign/mqdss-48/avx2/Makefile.Microsoft_nmake
@@ -0,0 +1,19 @@
+# This Makefile can be used with Microsoft Visual Studio's nmake using the command:
+#    nmake /f Makefile.Microsoft_nmake
+
+LIBRARY=libmqdss-48_avx2.lib
+OBJECTS=gf31.obj mq.obj sign.obj
+
+CFLAGS=/nologo /O2 /I ..\..\..\common /W4 /WX /arch:AVX2
+
+all: $(LIBRARY)
+
+# Make sure objects are recompiled if headers change.
+$(OBJECTS): *.h
+
+$(LIBRARY): $(OBJECTS)
+	LIB.EXE /NOLOGO /WX /OUT:$@ $**
+
+clean:
+    -DEL $(OBJECTS)
+    -DEL $(LIBRARY)
--- a/crypto_sign/mqdss-48/avx2/api.h
+++ b/crypto_sign/mqdss-48/avx2/api.h
@@ -0,0 +1,47 @@
+#ifndef PQCLEAN_MQDSS48_AVX2_API_H
+#define PQCLEAN_MQDSS48_AVX2_API_H
+
+#include <stddef.h>
+#include <stdint.h>
+
+#define PQCLEAN_MQDSS48_AVX2_CRYPTO_ALGNAME "MQDSS-48"
+
+#define PQCLEAN_MQDSS48_AVX2_CRYPTO_SECRETKEYBYTES 16
+#define PQCLEAN_MQDSS48_AVX2_CRYPTO_PUBLICKEYBYTES 46
+#define PQCLEAN_MQDSS48_AVX2_CRYPTO_BYTES 28400
+
+/*
+ * Generates an MQDSS key pair.
+ */
+int PQCLEAN_MQDSS48_AVX2_crypto_sign_keypair(
+    uint8_t *pk, uint8_t *sk);
+
+/**
+ * Returns an array containing a detached signature.
+ */
+int PQCLEAN_MQDSS48_AVX2_crypto_sign_signature(
+    uint8_t *sig, size_t *siglen,
+    const uint8_t *m, size_t mlen, const uint8_t *sk);
+
+/**
+ * Verifies a detached signature and message under a given public key.
+ */
+int PQCLEAN_MQDSS48_AVX2_crypto_sign_verify(
+    const uint8_t *sig, size_t siglen,
+    const uint8_t *m, size_t mlen, const uint8_t *pk);
+
+/**
+ * Returns an array containing the signature followed by the message.
+ */
+int PQCLEAN_MQDSS48_AVX2_crypto_sign(
+    uint8_t *sm, size_t *smlen,
+    const uint8_t *m, size_t mlen, const uint8_t *sk);
+
+/**
+ * Verifies a given signature-message pair under a given public key.
+ */
+int PQCLEAN_MQDSS48_AVX2_crypto_sign_open(
+    uint8_t *m, size_t *mlen,
+    const uint8_t *sm, size_t smlen, const uint8_t *pk);
+
+#endif
--- a/crypto_sign/mqdss-48/avx2/gf31.c
+++ b/crypto_sign/mqdss-48/avx2/gf31.c
@@ -0,0 +1,123 @@
+#include "params.h"
+#include "fips202.h"
+#include "gf31.h"
+#include <immintrin.h>
+#include <stdint.h>
+#include <string.h>
+
+/* Given a vector of N elements in the range [0, 31], this reduces the elements
+   to the range [0, 30] by mapping 31 to 0 (i.e. reduction mod 31) */
+void PQCLEAN_MQDSS48_AVX2_vgf31_unique(gf31 *out, gf31 *in) {
+    __m256i x;
+    __m256i _w31 = _mm256_set1_epi16(31);
+    int i;
+
+    for (i = 0; i < (N >> 4); ++i) {
+        x = _mm256_loadu_si256((__m256i const *) (in + 16 * i));
+        x = _mm256_xor_si256(x, _mm256_and_si256(_w31, _mm256_cmpeq_epi16(x, _w31)));
+        _mm256_storeu_si256((__m256i *)(out + i * 16), x);
+    }
+}
+
+/* This function acts on vectors with N (= 48) gf31 elements.
+   It performs one reduction step and guarantees output in [0, 30],
+   but requires input to be in [0, 32768). */
+void PQCLEAN_MQDSS48_AVX2_vgf31_shorten_unique(gf31 *out, gf31 *in) {
+    __m256i x;
+    __m256i _w2114 = _mm256_set1_epi32(2114 * 65536 + 2114);
+    __m256i _w31 = _mm256_set1_epi16(31);
+    int i;
+
+    for (i = 0; i < (N >> 4); ++i) {
+        x = _mm256_loadu_si256((__m256i const *) (in + 16 * i));
+        x = _mm256_sub_epi16(x, _mm256_mullo_epi16(_w31, _mm256_mulhi_epi16(x, _w2114)));
+        x = _mm256_xor_si256(x, _mm256_and_si256(_w31, _mm256_cmpeq_epi16(x, _w31)));
+        _mm256_storeu_si256((__m256i *)(out + i * 16), x);
+    }
+}
+
+/* Given a seed, samples len gf31 elements (in the range [0, 30]), and places
+   them in a vector of 16-bit elements */
+void PQCLEAN_MQDSS48_AVX2_gf31_nrand(gf31 *out, size_t len, const uint8_t *seed, size_t seedlen) {
+    size_t i = 0, j;
+    shake256ctx shakestate;
+    uint8_t shakeblock[SHAKE256_RATE];
+
+    shake256_absorb(&shakestate, seed, seedlen);
+
+    while (i < len) {
+        shake256_squeezeblocks(shakeblock, 1, &shakestate);
+        for (j = 0; j < SHAKE256_RATE && i < len; j++) {
+            if ((shakeblock[j] & 31) != 31) {
+                out[i] = (shakeblock[j] & 31);
+                i++;
+            }
+        }
+    }
+    shake256_ctx_release(&shakestate);
+}
+
+/* Given a seed, samples len gf31 elements, transposed into signed range,
+   i.e. in the range [-15, 15], and places them in an array of 8-bit integers.
+   This is used for the expansion of F, which wants packed elements. */
+void PQCLEAN_MQDSS48_AVX2_gf31_nrand_schar(signed char *out, size_t len, const uint8_t *seed, size_t seedlen) {
+    size_t i = 0, j;
+    shake256ctx shakestate;
+    uint8_t shakeblock[SHAKE256_RATE];
+
+    shake256_absorb(&shakestate, seed, seedlen);
+
+    while (i < len) {
+        shake256_squeezeblocks(shakeblock, 1, &shakestate);
+        for (j = 0; j < SHAKE256_RATE && i < len; j++) {
+            if ((shakeblock[j] & 31) != 31) {
+                out[i] = (signed char)((shakeblock[j] & 31) - 15);
+                i++;
+            }
+        }
+    }
+    shake256_ctx_release(&shakestate);
+
+}
+
+/* Unpacks an array of packed GF31 elements to one element per gf31.
+   Assumes that there is sufficient empty space available at the end of the
+   array to unpack. Can perform in-place. */
+void PQCLEAN_MQDSS48_AVX2_gf31_nunpack(gf31 *out, const uint8_t *in, size_t n) {
+    size_t i;
+    size_t j = ((n * 5) >> 3) - 1;
+    unsigned int d = 0;
+
+    for (i = n; i > 0; i--) {
+        out[i - 1] = (gf31)((in[j] >> d) & 31);
+        d += 5;
+        if (d > 8) {
+            d -= 8;
+            j--;
+            out[i - 1] = (gf31)(out[i - 1] ^ ((in[j] << (5 - d)) & 31));
+        }
+    }
+}
+
+/* Packs an array of GF31 elements from gf31's to concatenated 5-bit values.
+   Assumes that there is sufficient space available to pack.
+   Can perform in-place. */
+void PQCLEAN_MQDSS48_AVX2_gf31_npack(uint8_t *out, const gf31 *in, size_t n) {
+    unsigned int i = 0;
+    unsigned int j;
+    int d = 3;
+
+    /* There will be ceil(5n / 8) output blocks */
+    memset(out, 0, (size_t)((5 * n + 7) & ~7U) >> 3);
+
+    for (j = 0; j < n; j++) {
+        if (d < 0) {
+            d += 8;
+            out[i] = (uint8_t)((out[i] & (255 << (d - 3))) |
+                               ((in[j] >> (8 - d)) & ~(255 << (d - 3))));
+            i++;
+        }
+        out[i] = (uint8_t)((out[i] & ~(31 << d)) | ((in[j] << d) & (31 << d)));
+        d -= 5;
+    }
+}
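
Note on `vgf31_shorten_unique` in gf31.c: the reduction step computes `x - 31 * mulhi(x, 2114)` per 16-bit lane and then maps 31 to 0. The following is a minimal scalar sketch of one lane, useful for checking the arithmetic without AVX2. The function names are hypothetical and not part of this patch, and the sketch assumes the documented input range [0, 32768).

```c
#include <stdint.h>

/* Scalar model of one 16-bit lane (hypothetical helper, not in the patch). */
static uint16_t gf31_shorten_unique_lane(uint16_t x) {
    /* High 16 bits of x * 2114; this matches the signed mulhi for 0 <= x < 32768. */
    uint16_t q = (uint16_t)(((uint32_t)x * 2114u) >> 16);
    /* One reduction step: r is congruent to x mod 31 and lies in [0, 31]. */
    uint16_t r = (uint16_t)(x - 31u * q);
    /* Map the one non-canonical value 31 to 0 (the cmpeq/and/xor step). */
    uint16_t mask = (uint16_t)(r == 31u ? 0xFFFFu : 0u);
    return (uint16_t)(r ^ (31u & mask));
}

/* Returns 1 if the lane model agrees with x % 31 for every x in [0, 32768). */
static int gf31_lane_model_ok(void) {
    uint32_t x;
    for (x = 0; x < 32768u; x++) {
        if (gf31_shorten_unique_lane((uint16_t)x) != x % 31u) {
            return 0;
        }
    }
    return 1;
}
```

The constant 2114 is floor(65536 / 31), so the quotient estimate is at most one too small on this input range, which is why a single conditional fix-up of the value 31 is enough.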
--- a/crypto_sign/mqdss-48/avx2/gf31.h
+++ b/crypto_sign/mqdss-48/avx2/gf31.h
@@ -0,0 +1,36 @@
+#ifndef MQDSS_GF31_H
+#define MQDSS_GF31_H
+
+#include <stddef.h>
+#include <stdint.h>
+
+typedef unsigned short gf31;
+
+/* Given a vector of elements in the range [0, 31], this reduces the elements
+   to the range [0, 30] by mapping 31 to 0 (i.e. reduction mod 31) */
+void PQCLEAN_MQDSS48_AVX2_vgf31_unique(gf31 *out, gf31 *in);
+
+/* Given a vector of 16-bit integers in [0, 32768), this performs one reduction
+   step and reduces the elements to the range [0, 30] (i.e. reduction mod 31) */
+void PQCLEAN_MQDSS48_AVX2_vgf31_shorten_unique(gf31 *out, gf31 *in);
+
+/* Given a seed, samples len gf31 elements (in the range [0, 30]), and places
+   them in a vector of 16-bit elements */
+void PQCLEAN_MQDSS48_AVX2_gf31_nrand(gf31 *out, size_t len, const uint8_t *seed, size_t seedlen);
+
+/* Given a seed, samples len gf31 elements, transposed into signed range,
+   i.e. in the range [-15, 15], and places them in an array of 8-bit integers.
+   This is used for the expansion of F, which wants packed elements. */
+void PQCLEAN_MQDSS48_AVX2_gf31_nrand_schar(signed char *out, size_t len, const uint8_t *seed, size_t seedlen);
+
+/* Unpacks an array of packed GF31 elements to one element per gf31.
+   Assumes that there is sufficient empty space available at the end of the
+   array to unpack. Can perform in-place. */
+void PQCLEAN_MQDSS48_AVX2_gf31_nunpack(gf31 *out, const uint8_t *in, size_t n);
+
+/* Packs an array of GF31 elements from gf31's to concatenated 5-bit values.
+   Assumes that there is sufficient space available to pack.
+   Can perform in-place. */
+void PQCLEAN_MQDSS48_AVX2_gf31_npack(uint8_t *out, const gf31 *in, size_t n);
+
+#endif
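
Note on `gf31_npack` / `gf31_nunpack`: both move between one element per `gf31` and a dense stream of 5-bit values. The following is a minimal sketch of such a pack/unpack pair using a bit accumulator, filling each byte from its most significant bit. The names `pack5` and `unpack5` are hypothetical, and unlike the functions in the patch this sketch does not support in-place operation.

```c
#include <stddef.h>
#include <stdint.h>

/* Packs n 5-bit values into ceil(5n / 8) bytes, most significant bits first. */
static void pack5(uint8_t *out, const uint16_t *in, size_t n) {
    uint32_t acc = 0;   /* pending bits, right-aligned */
    unsigned bits = 0;  /* number of pending bits, always < 8 between elements */
    size_t o = 0, i;

    for (i = 0; i < n; i++) {
        acc = (acc << 5) | (uint32_t)(in[i] & 31u);
        bits += 5;
        if (bits >= 8) {
            bits -= 8;
            out[o++] = (uint8_t)(acc >> bits);
            acc &= (1u << bits) - 1u;
        }
    }
    if (bits > 0) {
        out[o] = (uint8_t)(acc << (8 - bits)); /* zero-pad the final byte */
    }
}

/* Inverse of pack5: reads n 5-bit values from the packed stream. */
static void unpack5(uint16_t *out, const uint8_t *in, size_t n) {
    uint32_t acc = 0;
    unsigned bits = 0;
    size_t p = 0, i;

    for (i = 0; i < n; i++) {
        if (bits < 5) {
            acc = (acc << 8) | in[p++];
            bits += 8;
        }
        bits -= 5;
        out[i] = (uint16_t)((acc >> bits) & 31u);
        acc &= (1u << bits) - 1u;
    }
}

/* Returns 1 if 48 elements survive a pack/unpack round trip through 30 bytes. */
static int pack5_roundtrip_ok(void) {
    uint16_t in[48], back[48];
    uint8_t packed[30];
    size_t i;

    for (i = 0; i < 48; i++) {
        in[i] = (uint16_t)((7 * i + 3) % 31);
    }
    pack5(packed, in, 48);
    unpack5(back, packed, 48);
    for (i = 0; i < 48; i++) {
        if (back[i] != in[i]) {
            return 0;
        }
    }
    return 1;
}
```

With N = 48 the packed form is exactly 30 bytes, so no padding byte is needed, which agrees with `NPACKED_BYTES` in params.h.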
--- a/crypto_sign/mqdss-48/avx2/mq.c
+++ b/crypto_sign/mqdss-48/avx2/mq.c
@@ -0,0 +1,251 @@
+#include "mq.h"
+#include "params.h"
+#include <immintrin.h>
+#include <stdio.h>
+
+static inline __m256i reduce_16(__m256i r, __m256i _w31, __m256i _w2114) {
+    __m256i exp = _mm256_mulhi_epi16(r, _w2114);
+    return _mm256_sub_epi16(r, _mm256_mullo_epi16(_w31, exp));
+}
+
+/* Computes all products x_i * x_j, returns in reduced form */
+inline static
+void generate_quadratic_terms( unsigned char *xij, const gf31 *x ) {
+    __m256i mask_2114 = _mm256_set1_epi16( 2114 );
+    __m256i mask_31 = _mm256_set1_epi16( 31 );
+    __m256i xi[4];
+    xi[0] = _mm256_loadu_si256((__m256i const *) (x));
+    xi[1] = _mm256_loadu_si256((__m256i const *) (x + 16));
+    xi[2] = _mm256_loadu_si256((__m256i const *) (x + 32));
+    xi[3] = _mm256_setzero_si256();
+
+    __m256i xixj[4];
+    xixj[0] = _mm256_setzero_si256();
+    xixj[1] = _mm256_setzero_si256();
+    xixj[2] = _mm256_setzero_si256();
+    xixj[3] = _mm256_setzero_si256();
+
+    int k = 0;
+    for (int i = 0; i < 32; i++) {
+        __m256i br_xi = _mm256_set1_epi16( (short)x[i] );
+        for (int j = 0; j <= (i >> 4); j++) {
+            xixj[j] = _mm256_mullo_epi16( xi[j], br_xi );
+            xixj[j] = reduce_16( xixj[j], mask_31, mask_2114 );
+        }
+
+        __m256i r = _mm256_packs_epi16(xixj[0], xixj[1]);
+        r = _mm256_permute4x64_epi64(r, 0xd8);  // 3,1,2,0
+        _mm256_storeu_si256( (__m256i *)( xij + k ), r );
+        k += i + 1;
+    }
+
+    for (int i = 32; i < N; i++) {
+        __m256i br_xi = _mm256_set1_epi16( (short)x[i] );
+        for (int j = 0; j <= (i >> 4); j++) {
+            xixj[j] = _mm256_mullo_epi16( xi[j], br_xi );
+            xixj[j] = reduce_16( xixj[j], mask_31, mask_2114 );
+        }
+
+        __m256i r0 = _mm256_packs_epi16(xixj[0], xixj[1]);
+        r0 = _mm256_permute4x64_epi64(r0, 0xd8);  // 3,1,2,0
+        _mm256_storeu_si256( (__m256i *)( xij + k ), r0 );
+        __m256i r1 = _mm256_packs_epi16(xixj[2], xixj[3]);
+        r1 = _mm256_permute4x64_epi64(r1, 0xd8);  // 3,1,2,0
+        _mm256_storeu_si256( (__m256i *)( xij + 32 + k ), r1 );
+        k += i + 1;
+    }
+}
+
+/* Computes all terms (x_i * y_j) + (x_j * y_i), returns in reduced form */
+inline static
+void generate_xiyj_p_xjyi_terms( unsigned char *xij, const gf31 *x, const gf31 *y ) {
+    __m256i mask_2114 = _mm256_set1_epi16( 2114 );
+    __m256i mask_31 = _mm256_set1_epi16( 31 );
+    __m256i xiyi[4];
+    xiyi[0] = _mm256_xor_si256(_mm256_loadu_si256((__m256i const *) (x)), _mm256_slli_si256( _mm256_loadu_si256((__m256i const *) (y)), 1 ));
+    xiyi[1] = _mm256_xor_si256(_mm256_loadu_si256((__m256i const *) (x + 16)), _mm256_slli_si256( _mm256_loadu_si256((__m256i const *) (y + 16)), 1 ));
+    xiyi[2] = _mm256_xor_si256(_mm256_loadu_si256((__m256i const *) (x + 32)), _mm256_slli_si256( _mm256_loadu_si256((__m256i const *) (y + 32)), 1 ));
+    xiyi[3] = _mm256_setzero_si256();
+
+    __m256i xixj[4];
+    xixj[0] = _mm256_setzero_si256();
+    xixj[1] = _mm256_setzero_si256();
+    xixj[2] = _mm256_setzero_si256();
+    xixj[3] = _mm256_setzero_si256();
+
+    int k = 0;
+    for (int i = 0; i < 32; i++) {
+        __m256i br_yixi = _mm256_set1_epi16( (short)((x[i] << 8)^y[i]) );
+        for (int j = 0; j <= (i >> 4); j++) {
+            xixj[j] = _mm256_maddubs_epi16( xiyi[j], br_yixi );
+            xixj[j] = reduce_16( xixj[j], mask_31, mask_2114 );
+        }
+
+        __m256i r = _mm256_packs_epi16(xixj[0], xixj[1]);
+        r = _mm256_permute4x64_epi64(r, 0xd8);  // 3,1,2,0
+        _mm256_storeu_si256( (__m256i *)( xij + k ), r );
+        k += i + 1;
+    }
+
+    for (int i = 32; i < N; i++) {
+        __m256i br_yixi = _mm256_set1_epi16( (short)((x[i] << 8)^y[i]) );
+        for (int j = 0; j <= (i >> 4); j++) {
+            xixj[j] = _mm256_maddubs_epi16( xiyi[j], br_yixi );
+            xixj[j] = reduce_16( xixj[j], mask_31, mask_2114 );
+        }
+
+        __m256i r0 = _mm256_packs_epi16(xixj[0], xixj[1]);
+        r0 = _mm256_permute4x64_epi64(r0, 0xd8);  // 3,1,2,0
+        _mm256_storeu_si256( (__m256i *)( xij + k ), r0 );
+        __m256i r1 = _mm256_packs_epi16(xixj[2], xixj[3]);
+        r1 = _mm256_permute4x64_epi64(r1, 0xd8);  // 3,1,2,0
+        _mm256_storeu_si256( (__m256i *)( xij + 32 + k ), r1 );
+        k += i + 1;
+    }
+}
+
+#define EVAL_YMM_0(xx) {\
+        __m128i tmp = _mm256_castsi256_si128(xx); \
+        for (int macro_i = 0; macro_i < 8; macro_i++) { \
+            __m256i _xi = _mm256_broadcastw_epi16(tmp); \
+            tmp = _mm_srli_si128(tmp, 2); \
+            for (int macro_j = 0; macro_j < (N/16); macro_j++) { \
+                __m256i coeff = _mm256_loadu_si256((__m256i const *) F); \
+                F += 32; \
+                yy[macro_j] = _mm256_add_epi16(yy[macro_j], _mm256_maddubs_epi16(_xi, coeff)); \
+            } \
+        } \
+    }
+
+#define EVAL_YMM_1(xx) {\
+        __m128i tmp = _mm256_extracti128_si256(xx, 1); \
+        for (int macro_i = 0; macro_i < 8; macro_i++) { \
+            __m256i _xi = _mm256_broadcastw_epi16(tmp); \
+            tmp = _mm_srli_si128(tmp, 2); \
+            for (int macro_j = 0; macro_j < (N/16); macro_j++) { \
+                __m256i coeff = _mm256_loadu_si256((__m256i const *) F); \
+                F += 32; \
+                yy[macro_j] = _mm256_add_epi16(yy[macro_j], _mm256_maddubs_epi16(_xi, coeff)); \
+            } \
+        } \
+    }
+
+#define REDUCE_(yy) { \
+        (yy)[0] = reduce_16((yy)[0], mask_reduce, mask_2114); \
+        (yy)[1] = reduce_16((yy)[1], mask_reduce, mask_2114); \
+        (yy)[2] = reduce_16((yy)[2], mask_reduce, mask_2114); \
+    }
+
+/* Evaluates the MQ function on a vector of N gf31 elements x (expected to be
+   in reduced 5-bit representation). Expects the coefficients in F to be in
+   signed representation (i.e. [-15, 15], packed bytewise).
+   Outputs M gf31 elements in unique 16-bit representation as fx. */
+void PQCLEAN_MQDSS48_AVX2_MQ(gf31 *fx, const gf31 *x, const signed char *F) {
+    __m256i mask_2114 = _mm256_set1_epi32(2114 * 65536 + 2114);
+    __m256i mask_reduce = _mm256_srli_epi16(_mm256_cmpeq_epi16(mask_2114, mask_2114), 11);
+
+    __m256i xi[4];
+    xi[0] = _mm256_loadu_si256((__m256i const *) (x));
+    xi[1] = _mm256_loadu_si256((__m256i const *) (x + 16));
+    xi[2] = _mm256_loadu_si256((__m256i const *) (x + 32));
+    xi[3] = _mm256_setzero_si256();
+
+    __m256i _zero = _mm256_setzero_si256();
+    xi[0] = _mm256_add_epi16(_mm256_and_si256(mask_reduce, _mm256_cmpgt_epi16(_zero, xi[0])), xi[0]);
+    xi[1] = _mm256_add_epi16(_mm256_and_si256(mask_reduce, _mm256_cmpgt_epi16(_zero, xi[1])), xi[1]);
+    xi[2] = _mm256_add_epi16(_mm256_and_si256(mask_reduce, _mm256_cmpgt_epi16(_zero, xi[2])), xi[2]);
+
+    __m256i x1 = _mm256_packs_epi16(xi[0], xi[1]);
+    x1 = _mm256_permute4x64_epi64(x1, 0xd8);  // 3,1,2,0
+    __m256i x2 = _mm256_packs_epi16(xi[2], xi[3]);
+    x2 = _mm256_permute4x64_epi64(x2, 0xd8);  // 3,1,2,0
+
+    __m256i yy[M / 16];
+    yy[0] = _zero;
+    yy[1] = _zero;
+    yy[2] = _zero;
+
+    EVAL_YMM_0(x1)
+    EVAL_YMM_1(x1)
+    EVAL_YMM_0(x2)
+    REDUCE_(yy)
+
+    __m256i xixj[38];
+    generate_quadratic_terms( (unsigned char *) xixj, x );
+    for (int i = 0 ; i < 36 ; i += 2) {
+        EVAL_YMM_0(xixj[i])
+        EVAL_YMM_1(xixj[i])
+        EVAL_YMM_0(xixj[i + 1])
+        EVAL_YMM_1(xixj[i + 1])
+        REDUCE_(yy)
+    }
+    EVAL_YMM_0(xixj[36]) {
+        __m128i tmp = _mm256_extracti128_si256(xixj[36], 1);
+        for (int i = 0; i < 4; i++) {
+            __m256i _xi = _mm256_broadcastw_epi16(tmp);
+            tmp = _mm_srli_si128(tmp, 2);
+            for (int j = 0; j < (N / 16); j++) {
+                __m256i coeff = _mm256_loadu_si256((__m256i const *) F);
+                F += 32;
+                yy[j] = _mm256_add_epi16(yy[j], _mm256_maddubs_epi16(_xi, coeff));
+            }
+        }
+    }
+    REDUCE_(yy)
+
+    yy[0] = _mm256_add_epi16(_mm256_and_si256(mask_reduce, _mm256_cmpgt_epi16(_mm256_setzero_si256(), yy[0])), yy[0]);
+    yy[1] = _mm256_add_epi16(_mm256_and_si256(mask_reduce, _mm256_cmpgt_epi16(_mm256_setzero_si256(), yy[1])), yy[1]);
+    yy[2] = _mm256_add_epi16(_mm256_and_si256(mask_reduce, _mm256_cmpgt_epi16(_mm256_setzero_si256(), yy[2])), yy[2]);
+
+    for (int i = 0; i < (N / 16); ++i) {
+        _mm256_storeu_si256((__m256i *)(fx + i * 16), yy[i]);
+    }
+}
+
+/* Evaluates the bilinear polar form of the MQ function (i.e. G) on two vectors
+   x and y of N gf31 elements (expected in reduced 5-bit representation). Expects
+   the coefficients in F to be in signed representation (i.e. [-15, 15], packed
+   bytewise). Outputs M gf31 elements in unique 16-bit representation as fx. */
+void PQCLEAN_MQDSS48_AVX2_G(gf31 *fx, const gf31 *x, const gf31 *y, const signed char *F) {
+    __m256i mask_2114 = _mm256_set1_epi32(2114 * 65536 + 2114);
+    __m256i mask_reduce = _mm256_srli_epi16(_mm256_cmpeq_epi16(mask_2114, mask_2114), 11);
+    __m256i _zero = _mm256_setzero_si256();
+
+    __m256i yy[(M / 16)];
+    yy[0] = _zero;
+    yy[1] = _zero;
+    yy[2] = _zero;
+
+    F += N * M;
+
+    __m256i xixj[38];
+    generate_xiyj_p_xjyi_terms( (unsigned char *) xixj, x, y );
+    for (int i = 0 ; i < 36 ; i += 2) {
+        EVAL_YMM_0(xixj[i])
+        EVAL_YMM_1(xixj[i])
+        EVAL_YMM_0(xixj[i + 1])
+        EVAL_YMM_1(xixj[i + 1])
+        REDUCE_(yy)
+    }
+    EVAL_YMM_0(xixj[36]) {
+        __m128i tmp = _mm256_extracti128_si256(xixj[36], 1);
+        for (int i = 0; i < 4; i++) {
+            __m256i _xi = _mm256_broadcastw_epi16(tmp);
+            tmp = _mm_srli_si128(tmp, 2);
+            for (int j = 0; j < (N / 16); j++) {
+                __m256i coeff = _mm256_loadu_si256((__m256i const *) F);
+                F += 32;
+                yy[j] = _mm256_add_epi16(yy[j], _mm256_maddubs_epi16(_xi, coeff));
+            }
+        }
+    }
+    REDUCE_(yy)
+
+    yy[0] = _mm256_add_epi16(_mm256_and_si256(mask_reduce, _mm256_cmpgt_epi16(_mm256_setzero_si256(), yy[0])), yy[0]);
+    yy[1] = _mm256_add_epi16(_mm256_and_si256(mask_reduce, _mm256_cmpgt_epi16(_mm256_setzero_si256(), yy[1])), yy[1]);
+    yy[2] = _mm256_add_epi16(_mm256_and_si256(mask_reduce, _mm256_cmpgt_epi16(_mm256_setzero_si256(), yy[2])), yy[2]);
+
+    for (int i = 0; i < (N / 16); ++i) {
+        _mm256_storeu_si256((__m256i *)(fx + i * 16), yy[i]);
+    }
+}
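
Note on the loop structure in `PQCLEAN_MQDSS48_AVX2_MQ`: the sequence of `EVAL_YMM_*` invocations appears to consume exactly `F_LEN` coefficients. The sketch below redoes that bookkeeping in plain C. It rests on one reading of the macros: each full `EVAL_YMM_*` broadcasts 8 words, each word carries two byte-sized terms, and each word loads N/16 = 3 vectors of 32 coefficients. All names are hypothetical.

```c
/* Parameters copied from params.h for this standalone sketch. */
enum { SKETCH_N = 48, SKETCH_M = 48 };

/* Terms handled by one full EVAL_YMM_0 or EVAL_YMM_1: 8 words, 2 terms each. */
static int terms_per_eval(void) {
    return 8 * 2;
}

/* Coefficient bytes loaded per broadcast word: (N / 16) loads of 32 bytes. */
static int f_bytes_per_word(void) {
    return (SKETCH_N / 16) * 32;
}

/* Number of monomials x_i * x_j with j <= i. */
static int quadratic_terms(void) {
    return (SKETCH_N * (SKETCH_N + 1)) / 2;
}

/* Terms consumed by MQ: the linear part, the main loop, and the tail. */
static int mq_terms_consumed(void) {
    int linear = 3 * terms_per_eval();               /* x1 low, x1 high, x2 low */
    int quad_loop = (36 / 2) * 4 * terms_per_eval(); /* i = 0, 2, ..., 34 */
    int quad_tail = terms_per_eval() + 4 * 2;        /* xixj[36]: low half + 4 words */
    return linear + quad_loop + quad_tail;
}

/* Coefficient bytes consumed by MQ, at two terms per broadcast word. */
static int mq_f_bytes_consumed(void) {
    return (mq_terms_consumed() / 2) * f_bytes_per_word();
}
```

Under this reading the totals are 48 linear plus 1176 quadratic terms, and 1224 * 48 = 58752 coefficient bytes. `PQCLEAN_MQDSS48_AVX2_G` skips the linear part up front with `F += N * M`, so it would consume the remaining 1176 * 48 bytes.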
--- a/crypto_sign/mqdss-48/avx2/mq.h
+++ b/crypto_sign/mqdss-48/avx2/mq.h
@@ -0,0 +1,18 @@
+#ifndef MQDSS_MQ_H
+#define MQDSS_MQ_H
+
+#include "gf31.h"
+
+/* Evaluates the MQ function on a vector of N gf31 elements x (expected to be
+   in reduced 5-bit representation). Expects the coefficients in F to be in
+   signed representation (i.e. [-15, 15], packed bytewise).
+   Outputs M gf31 elements in unique 16-bit representation as fx. */
+void PQCLEAN_MQDSS48_AVX2_MQ(gf31 *fx, const gf31 *x, const signed char *F);
+
+/* Evaluates the bilinear polar form of the MQ function (i.e. G) on two vectors
+   x and y of N gf31 elements (expected in reduced 5-bit representation). Expects
+   the coefficients in F to be in signed representation (i.e. [-15, 15], packed
+   bytewise). Outputs M gf31 elements in unique 16-bit representation as fx. */
+void PQCLEAN_MQDSS48_AVX2_G(gf31 *fx, const gf31 *x, const gf31 *y, const signed char *F);
+
+#endif
--- a/crypto_sign/mqdss-48/avx2/params.h
+++ b/crypto_sign/mqdss-48/avx2/params.h
@@ -0,0 +1,25 @@
+#ifndef MQDSS_PARAMS_H
+#define MQDSS_PARAMS_H
+
+#define N 48
+#define M N
+#define F_LEN (M * (((N * (N + 1)) >> 1) + N)) /* Number of elements in F */
+
+#define ROUNDS 184
+
+/* Number of bytes that N, M and F_LEN elements require when packed into a byte
+   array, 5-bit elements packed continuously. */
+/* Assumes N and M to be multiples of 8 */
+#define NPACKED_BYTES ((N * 5) >> 3)
+#define MPACKED_BYTES ((M * 5) >> 3)
+#define FPACKED_BYTES ((F_LEN * 5) >> 3)
+
+#define HASH_BYTES 32
+#define SEED_BYTES 16
+#define PK_BYTES (SEED_BYTES + MPACKED_BYTES)
+#define SK_BYTES SEED_BYTES
+
+// R, sigma_0, ROUNDS * (t1, r{0,1}, e1, c, rho)
+#define SIG_LEN (2 * HASH_BYTES + ROUNDS * (2*NPACKED_BYTES + MPACKED_BYTES + HASH_BYTES + HASH_BYTES))
+
+#endif
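
Note on params.h: the derived sizes can be cross-checked against the constants exported in api.h (16, 46 and 28400 bytes). The following standalone sketch recomputes them from the same formulas. The `sketch_` names are hypothetical, and the base parameters are copied by hand from this patch.

```c
/* Base parameters as defined in params.h. */
enum {
    SK_N = 48,
    SK_M = 48,
    SK_ROUNDS = 184,
    SK_HASH_BYTES = 32,
    SK_SEED_BYTES = 16
};

/* Bytes needed for n 5-bit elements; assumes n is a multiple of 8. */
static long sketch_packed_bytes(long n) {
    return (n * 5) >> 3;
}

/* Number of coefficients in F: M * (N(N+1)/2 + N). */
static long sketch_f_len(void) {
    return (long)SK_M * ((((long)SK_N * (SK_N + 1)) >> 1) + SK_N);
}

/* Public key: seed for F followed by the packed image of the secret. */
static long sketch_pk_bytes(void) {
    return SK_SEED_BYTES + sketch_packed_bytes(SK_M);
}

/* Signature: R, sigma_0, then ROUNDS * (t1, r{0,1}, e1, c, rho). */
static long sketch_sig_len(void) {
    long per_round = 2 * sketch_packed_bytes(SK_N) + sketch_packed_bytes(SK_M)
                     + SK_HASH_BYTES + SK_HASH_BYTES;
    return 2 * SK_HASH_BYTES + SK_ROUNDS * per_round;
}
```

Each round contributes 60 + 30 + 64 = 154 bytes, so the signature is 64 + 184 * 154 = 28400 bytes, matching `PQCLEAN_MQDSS48_AVX2_CRYPTO_BYTES`.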
--- a/crypto_sign/mqdss-48/avx2/sign.c
+++ b/crypto_sign/mqdss-48/avx2/sign.c
@@ -0,0 +1,389 @@
+#include <stddef.h>
+#include <stdint.h>
+#include <string.h>
+
+#include "api.h"
+#include "fips202.h"
+#include "gf31.h"
+#include "mq.h"
+#include "params.h"
+#include "randombytes.h"
+
+/* Takes an array of len bytes and computes a hash digest.
+   This is used as a hash function in the Fiat-Shamir transform. */
+static void H(unsigned char *out, const unsigned char *in, const size_t len) {
+    shake256(out, HASH_BYTES, in, len);
+}
+
+/* Takes two arrays of N packed elements and an array of M packed elements,
+   and computes a HASH_BYTES commitment. */
+static void com_0(unsigned char *c,
+                  const unsigned char *rho,
+                  const unsigned char *inn, const unsigned char *inn2,
+                  const unsigned char *inm) {
+    unsigned char buffer[HASH_BYTES + 2 * NPACKED_BYTES + MPACKED_BYTES];
+    memcpy(buffer, rho, HASH_BYTES);
+    memcpy(buffer + HASH_BYTES, inn, NPACKED_BYTES);
+    memcpy(buffer + HASH_BYTES + NPACKED_BYTES, inn2, NPACKED_BYTES);
+    memcpy(buffer + HASH_BYTES + 2 * NPACKED_BYTES, inm, MPACKED_BYTES);
+    shake256(c, HASH_BYTES, buffer, HASH_BYTES + 2 * NPACKED_BYTES + MPACKED_BYTES);
+}
+
+/* Takes an array of N packed elements and an array of M packed elements,
+   and computes a HASH_BYTES commitment. */
+static void com_1(unsigned char *c,
+                  const unsigned char *rho,
+                  const unsigned char *inn, const unsigned char *inm) {
+    unsigned char buffer[HASH_BYTES + NPACKED_BYTES + MPACKED_BYTES];
+    memcpy(buffer, rho, HASH_BYTES);
+    memcpy(buffer + HASH_BYTES, inn, NPACKED_BYTES);
+    memcpy(buffer + HASH_BYTES + NPACKED_BYTES, inm, MPACKED_BYTES);
+    shake256(c, HASH_BYTES, buffer, HASH_BYTES + NPACKED_BYTES + MPACKED_BYTES);
+}
+
+/*
+ * Generates an MQDSS key pair.
+ */
+int PQCLEAN_MQDSS48_AVX2_crypto_sign_keypair(uint8_t *pk, uint8_t *sk) {
+    signed char F[F_LEN];
+    unsigned char skbuf[SEED_BYTES * 2];
+    gf31 sk_gf31[N];
+    gf31 pk_gf31[M];
+
+    // Expand sk to obtain a seed for F and the secret input s.
+    // We also expand to obtain a value for sampling r0, t0 and e0 during
+    //  signature generation, but that is not relevant here.
+    randombytes(sk, SEED_BYTES);
+    shake256(skbuf, SEED_BYTES * 2, sk, SEED_BYTES);
+
+    memcpy(pk, skbuf, SEED_BYTES);
+    PQCLEAN_MQDSS48_AVX2_gf31_nrand_schar(F, F_LEN, pk, SEED_BYTES);
+    PQCLEAN_MQDSS48_AVX2_gf31_nrand(sk_gf31, N, skbuf + SEED_BYTES, SEED_BYTES);
+    PQCLEAN_MQDSS48_AVX2_MQ(pk_gf31, sk_gf31, F);
+    PQCLEAN_MQDSS48_AVX2_vgf31_unique(pk_gf31, pk_gf31);
+    PQCLEAN_MQDSS48_AVX2_gf31_npack(pk + SEED_BYTES, pk_gf31, M);
+
+    return 0;
+}
+
+/**
+ * Returns an array containing a detached signature.
+ */
+int PQCLEAN_MQDSS48_AVX2_crypto_sign_signature(
+    uint8_t *sig, size_t *siglen,
+    const uint8_t *m, size_t mlen, const uint8_t *sk) {
+
+    signed char F[F_LEN];
+    unsigned char skbuf[SEED_BYTES * 4];
+    gf31 pk_gf31[M];
+    unsigned char pk[SEED_BYTES + MPACKED_BYTES];
+    // Concatenated for convenient hashing.
+    unsigned char D_sigma0_h0_sigma1[HASH_BYTES * 3 + ROUNDS * (NPACKED_BYTES + MPACKED_BYTES)];
+    unsigned char *D = D_sigma0_h0_sigma1;
+    unsigned char *sigma0 = D_sigma0_h0_sigma1 + HASH_BYTES;
+    unsigned char *h0 = D_sigma0_h0_sigma1 + 2 * HASH_BYTES;
+    unsigned char *t1packed = D_sigma0_h0_sigma1 + 3 * HASH_BYTES;
+    unsigned char *e1packed = D_sigma0_h0_sigma1 + 3 * HASH_BYTES + ROUNDS * NPACKED_BYTES;
+    shake256ctx shakestate;
+    unsigned char shakeblock[SHAKE256_RATE];
+    unsigned char h1[((ROUNDS + 7) & ~7) >> 3];
+    unsigned char rnd_seed[HASH_BYTES + SEED_BYTES];
+    unsigned char rho[2 * ROUNDS * HASH_BYTES];
+    unsigned char *rho0 = rho;
+    unsigned char *rho1 = rho + ROUNDS * HASH_BYTES;
+    gf31 sk_gf31[N];
+    gf31 rnd[(2 * N + M) * ROUNDS];  // Concatenated for easy RNG.
+    gf31 *r0 = rnd;
+    gf31 *t0 = rnd + N * ROUNDS;
+    gf31 *e0 = rnd + 2 * N * ROUNDS;
+    gf31 r1[N * ROUNDS];
+    gf31 t1[N * ROUNDS];
+    gf31 e1[M * ROUNDS];
+    gf31 gx[M * ROUNDS];
+    unsigned char packbuf0[NPACKED_BYTES];
+    unsigned char packbuf1[NPACKED_BYTES];
+    unsigned char packbuf2[MPACKED_BYTES];
+    unsigned char c[HASH_BYTES * ROUNDS * 2];
+    gf31 alpha;
+    int alpha_count = 0;
+    int b;
+    int i, j;
+    shake256incctx state;
+
+    shake256(skbuf, SEED_BYTES * 4, sk, SEED_BYTES);
+
+    PQCLEAN_MQDSS48_AVX2_gf31_nrand_schar(F, F_LEN, skbuf, SEED_BYTES);
+
+    shake256_inc_init(&state);
+    shake256_inc_absorb(&state, sk, SEED_BYTES);
+    shake256_inc_absorb(&state, m, mlen);
+    shake256_inc_finalize(&state);
+    shake256_inc_squeeze(sig, HASH_BYTES, &state); // Compute R.
+    shake256_inc_ctx_release(&state);
+
+    memcpy(pk, skbuf, SEED_BYTES);
+    PQCLEAN_MQDSS48_AVX2_gf31_nrand(sk_gf31, N, skbuf + SEED_BYTES, SEED_BYTES);
+    PQCLEAN_MQDSS48_AVX2_MQ(pk_gf31, sk_gf31, F);
+    PQCLEAN_MQDSS48_AVX2_vgf31_unique(pk_gf31, pk_gf31);
+    PQCLEAN_MQDSS48_AVX2_gf31_npack(pk + SEED_BYTES, pk_gf31, M);
+
+    shake256_inc_init(&state);
+    shake256_inc_absorb(&state, pk, PK_BYTES);
+    shake256_inc_absorb(&state, sig, HASH_BYTES);
+    shake256_inc_absorb(&state, m, mlen);
+    shake256_inc_finalize(&state);
+    shake256_inc_squeeze(D, HASH_BYTES, &state);
+    shake256_inc_ctx_release(&state);
+
+    sig += HASH_BYTES;  // Compensate for prefixed R.
+
+    memcpy(rnd_seed, skbuf + 2 * SEED_BYTES, SEED_BYTES);
+    memcpy(rnd_seed + SEED_BYTES, D, HASH_BYTES);
+    shake256(rho, 2 * ROUNDS * HASH_BYTES, rnd_seed, SEED_BYTES + HASH_BYTES);
+
+    memcpy(rnd_seed, skbuf + 3 * SEED_BYTES, SEED_BYTES);
+    memcpy(rnd_seed + SEED_BYTES, D, HASH_BYTES);
+    PQCLEAN_MQDSS48_AVX2_gf31_nrand(rnd, (2 * N + M) * ROUNDS, rnd_seed, SEED_BYTES + HASH_BYTES);
+
+    for (i = 0; i < ROUNDS; i++) {
+        for (j = 0; j < N; j++) {
+            r1[j + i * N] = (gf31)(31 + sk_gf31[j] - r0[j + i * N]);
+        }
+        PQCLEAN_MQDSS48_AVX2_G(gx + i * M, t0 + i * N, r1 + i * N, F);
+    }
+    for (i = 0; i < ROUNDS * M; i++) {
+        gx[i] = (gf31)(gx[i] + e0[i]);
+    }
+    for (i = 0; i < ROUNDS; i++) {
+        PQCLEAN_MQDSS48_AVX2_gf31_npack(packbuf0, r0 + i * N, N);
+        PQCLEAN_MQDSS48_AVX2_gf31_npack(packbuf1, t0 + i * N, N);
+        PQCLEAN_MQDSS48_AVX2_gf31_npack(packbuf2, e0 + i * M, M);
+        com_0(c + HASH_BYTES * (2 * i + 0), rho0 + i * HASH_BYTES, packbuf0, packbuf1, packbuf2);
+        PQCLEAN_MQDSS48_AVX2_vgf31_shorten_unique(r1 + i * N, r1 + i * N);
+        PQCLEAN_MQDSS48_AVX2_vgf31_shorten_unique(gx + i * M, gx + i * M);
+        PQCLEAN_MQDSS48_AVX2_gf31_npack(packbuf0, r1 + i * N, N);
+        PQCLEAN_MQDSS48_AVX2_gf31_npack(packbuf1, gx + i * M, M);
+        com_1(c + HASH_BYTES * (2 * i + 1), rho1 + i * HASH_BYTES, packbuf0, packbuf1);
+    }
+
+    H(sigma0, c, HASH_BYTES * ROUNDS * 2);  // Compute sigma_0.
+    shake256_absorb(&shakestate, D_sigma0_h0_sigma1, 2 * HASH_BYTES);
+    shake256_squeezeblocks(shakeblock, 1, &shakestate);
+
+    memcpy(h0, shakeblock, HASH_BYTES);
+
+    memcpy(sig, sigma0, HASH_BYTES);
+    sig += HASH_BYTES;  // Compensate for sigma_0.
+
+    for (i = 0; i < ROUNDS; i++) {
+        do {
+            alpha = shakeblock[alpha_count] & 31;
+            alpha_count++;
+            if (alpha_count == SHAKE256_RATE) {
+                alpha_count = 0;
+                shake256_squeezeblocks(shakeblock, 1, &shakestate);
+            }
+        } while (alpha == 31);
+        for (j = 0; j < N; j++) {
+            t1[i * N + j] = (gf31)(alpha * r0[j + i * N] - t0[j + i * N] + 31);
+        }
+        PQCLEAN_MQDSS48_AVX2_MQ(e1 + i * M, r0 + i * N, F);
+        for (j = 0; j < M; j++) {
+            e1[i * M + j] = (gf31)(alpha * e1[j + i * M] - e0[j + i * M] + 31);
+        }
+        PQCLEAN_MQDSS48_AVX2_vgf31_shorten_unique(t1 + i * N, t1 + i * N);
+        PQCLEAN_MQDSS48_AVX2_vgf31_shorten_unique(e1 + i * M, e1 + i * M);
+    }
+    shake256_ctx_release(&shakestate);
+
+    PQCLEAN_MQDSS48_AVX2_gf31_npack(t1packed, t1, N * ROUNDS);
+    PQCLEAN_MQDSS48_AVX2_gf31_npack(e1packed, e1, M * ROUNDS);
+
+    memcpy(sig, t1packed, NPACKED_BYTES * ROUNDS);
+    sig += NPACKED_BYTES * ROUNDS;
+    memcpy(sig, e1packed, MPACKED_BYTES * ROUNDS);
+    sig += MPACKED_BYTES * ROUNDS;
+
+    shake256(h1, ((ROUNDS + 7) & ~7) >> 3, D_sigma0_h0_sigma1, 3 * HASH_BYTES + ROUNDS * (NPACKED_BYTES + MPACKED_BYTES));
+
+    for (i = 0; i < ROUNDS; i++) {
+        b = (h1[(i >> 3)] >> (i & 7)) & 1;
+        if (b == 0) {
+            PQCLEAN_MQDSS48_AVX2_gf31_npack(sig, r0 + i * N, N);
+        } else if (b == 1) {
+            PQCLEAN_MQDSS48_AVX2_gf31_npack(sig, r1 + i * N, N);
+        }
+        memcpy(sig + NPACKED_BYTES, c + HASH_BYTES * (2 * i + (1 - b)), HASH_BYTES);
+        memcpy(sig + NPACKED_BYTES + HASH_BYTES, rho + (i + b * ROUNDS) * HASH_BYTES, HASH_BYTES);
+        sig += NPACKED_BYTES + 2 * HASH_BYTES;
+    }
+
+    *siglen = SIG_LEN;
+    return 0;
+}
+
+/**
+ * Verifies a detached signature and message under a given public key.
+ */
+int PQCLEAN_MQDSS48_AVX2_crypto_sign_verify(
+    const uint8_t *sig, size_t siglen,
+    const uint8_t *m, size_t mlen, const uint8_t *pk) {
+
+    gf31 r[N];
+    gf31 t[N];
+    gf31 e[M];
+    signed char F[F_LEN];
+    gf31 pk_gf31[M];
+    // Concatenated for convenient hashing.
+    unsigned char D_sigma0_h0_sigma1[HASH_BYTES * 3 + ROUNDS * (NPACKED_BYTES + MPACKED_BYTES)];
+    unsigned char *D = D_sigma0_h0_sigma1;
+    unsigned char *sigma0 = D_sigma0_h0_sigma1 + HASH_BYTES;
+    unsigned char *h0 = D_sigma0_h0_sigma1 + 2 * HASH_BYTES;
+    unsigned char *t1packed = D_sigma0_h0_sigma1 + 3 * HASH_BYTES;
+    unsigned char *e1packed = D_sigma0_h0_sigma1 + 3 * HASH_BYTES + ROUNDS * NPACKED_BYTES;
+    unsigned char h1[((ROUNDS + 7) & ~7) >> 3];
+    unsigned char c[HASH_BYTES * ROUNDS * 2];
+    memset(c, 0, HASH_BYTES * 2);
+    gf31 x[N];
+    gf31 y[M];
+    gf31 z[M];
+    unsigned char packbuf0[NPACKED_BYTES];
+    unsigned char packbuf1[MPACKED_BYTES];
+    shake256ctx shakestate;
+    unsigned char shakeblock[SHAKE256_RATE];
+    int i, j;
+    gf31 alpha;
+    int alpha_count = 0;
+    int b;
+    shake256incctx state;
+
+    if (siglen != SIG_LEN) {
+        return -1;
+    }
+
+    shake256_inc_init(&state);
+    shake256_inc_absorb(&state, pk, PK_BYTES);
+    shake256_inc_absorb(&state, sig, HASH_BYTES);
+    shake256_inc_absorb(&state, m, mlen);
+    shake256_inc_finalize(&state);
+    shake256_inc_squeeze(D, HASH_BYTES, &state);
+    shake256_inc_ctx_release(&state);
+
+    sig += HASH_BYTES;
+
+    PQCLEAN_MQDSS48_AVX2_gf31_nrand_schar(F, F_LEN, pk, SEED_BYTES);
+    pk += SEED_BYTES;
+    PQCLEAN_MQDSS48_AVX2_gf31_nunpack(pk_gf31, pk, M);
+
+    memcpy(sigma0, sig, HASH_BYTES);
+
+    shake256_absorb(&shakestate, D_sigma0_h0_sigma1, 2 * HASH_BYTES);
+    shake256_squeezeblocks(shakeblock, 1, &shakestate);
+
+    memcpy(h0, shakeblock, HASH_BYTES);
+
+    sig += HASH_BYTES;
+
+    memcpy(t1packed, sig, ROUNDS * NPACKED_BYTES);
+    sig += ROUNDS * NPACKED_BYTES;
+    memcpy(e1packed, sig, ROUNDS * MPACKED_BYTES);
+    sig += ROUNDS * MPACKED_BYTES;
+
+    shake256(h1, ((ROUNDS + 7) & ~7) >> 3, D_sigma0_h0_sigma1, 3 * HASH_BYTES + ROUNDS * (NPACKED_BYTES + MPACKED_BYTES));
+
+    for (i = 0; i < ROUNDS; i++) {
+        do {
+            alpha = shakeblock[alpha_count] & 31;
+            alpha_count++;
+            if (alpha_count == SHAKE256_RATE) {
+                alpha_count = 0;
+                shake256_squeezeblocks(shakeblock, 1, &shakestate);
+            }
+        } while (alpha == 31);
+        b = (h1[(i >> 3)] >> (i & 7)) & 1;
+
+        PQCLEAN_MQDSS48_AVX2_gf31_nunpack(r, sig, N);
+        PQCLEAN_MQDSS48_AVX2_gf31_nunpack(t, t1packed + NPACKED_BYTES * i, N);
+        PQCLEAN_MQDSS48_AVX2_gf31_nunpack(e, e1packed + MPACKED_BYTES * i, M);
+
+        if (b == 0) {
+            PQCLEAN_MQDSS48_AVX2_MQ(y, r, F);
+            for (j = 0; j < N; j++) {
+                x[j] = (gf31)(alpha * r[j] - t[j] + 31);
+            }
+            for (j = 0; j < M; j++) {
+                y[j] = (gf31)(alpha * y[j] - e[j] + 31);
+            }
+            PQCLEAN_MQDSS48_AVX2_vgf31_shorten_unique(x, x);
+            PQCLEAN_MQDSS48_AVX2_vgf31_shorten_unique(y, y);
+            PQCLEAN_MQDSS48_AVX2_gf31_npack(packbuf0, x, N);
+            PQCLEAN_MQDSS48_AVX2_gf31_npack(packbuf1, y, M);
+            com_0(c + HASH_BYTES * (2 * i + 0), sig + HASH_BYTES + NPACKED_BYTES, sig, packbuf0, packbuf1);
+        } else {
+            PQCLEAN_MQDSS48_AVX2_MQ(y, r, F);
+            PQCLEAN_MQDSS48_AVX2_G(z, t, r, F);
+            for (j = 0; j < M; j++) {
+                y[j] = (gf31)(alpha * (31 + pk_gf31[j] - y[j]) - z[j] - e[j] + 62);
+            }
+            PQCLEAN_MQDSS48_AVX2_vgf31_shorten_unique(y, y);
+            PQCLEAN_MQDSS48_AVX2_gf31_npack(packbuf0, y, M);
+            com_1(c + HASH_BYTES * (2 * i + 1), sig + HASH_BYTES + NPACKED_BYTES, sig, packbuf0);
+        }
+        memcpy(c + HASH_BYTES * (2 * i + (1 - b)), sig + NPACKED_BYTES, HASH_BYTES);
+        sig += NPACKED_BYTES + 2 * HASH_BYTES;
+    }
+    shake256_ctx_release(&shakestate);
+
+    H(c, c, HASH_BYTES * ROUNDS * 2);
+    if (memcmp(c, sigma0, HASH_BYTES) != 0) {
+        return -1;
+    }
+
+    return 0;
+}
+
+/**
+ * Returns an array containing the signature followed by the message.
+ */
+int PQCLEAN_MQDSS48_AVX2_crypto_sign(
+    uint8_t *sm, size_t *smlen,
+    const uint8_t *m, size_t mlen, const uint8_t *sk) {
+    size_t siglen;
+
+    PQCLEAN_MQDSS48_AVX2_crypto_sign_signature(
+        sm, &siglen, m, mlen, sk);
+
+    memmove(sm + SIG_LEN, m, mlen);
+    *smlen = siglen + mlen;
+
+    return 0;
+}
+
+/**
+ * Verifies a given signature-message pair under a given public key.
+ */
+int PQCLEAN_MQDSS48_AVX2_crypto_sign_open(
+    uint8_t *m, size_t *mlen,
+    const uint8_t *sm, size_t smlen, const uint8_t *pk) {
+    /* The API caller does not necessarily know what size a signature should be
+       but MQDSS signatures are always exactly SIG_LEN. */
+    if (smlen < SIG_LEN) {
+        memset(m, 0, smlen);
+        *mlen = 0;
+        return -1;
+    }
+
+    *mlen = smlen - SIG_LEN;
+
+    if (PQCLEAN_MQDSS48_AVX2_crypto_sign_verify(
+                sm, SIG_LEN, sm + SIG_LEN, *mlen, pk)) {
+        memset(m, 0, smlen);
+        *mlen = 0;
+        return -1;
+    }
+
+    /* If verification was successful, move the message to the right place. */
+    memmove(m, sm + SIG_LEN, *mlen);
+
+    return 0;
+}
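
Review note on the signing and verification loops above: both derive a per-round value alpha by masking a squeezed byte to 5 bits and rejecting 31, and both read the per-round challenge bit b from the packed string h1. The following is a minimal standalone sketch of those two steps, not the code from this commit. The helper names `challenge_bit`, `challenge_bytes` and `sample_alpha` are hypothetical, and the sketch reads from a fixed buffer and returns -1 when it runs out, whereas the diff squeezes a fresh SHAKE-256 block instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: i-th challenge bit of a packed bit string,
   least-significant bit first within each byte (as in h1[i >> 3] >> (i & 7)). */
static int challenge_bit(const uint8_t *h1, size_t i) {
    return (h1[i >> 3] >> (i & 7)) & 1;
}

/* Hypothetical helper: bytes needed for `rounds` bits, i.e. ceil(rounds / 8),
   written the same way as the h1 buffer size in the diff. */
static size_t challenge_bytes(size_t rounds) {
    return ((rounds + 7) & ~(size_t)7) >> 3;
}

/* Hypothetical helper: rejection-samples one value in [0, 30] from a byte
   buffer. Each byte is masked to 5 bits and the value 31 is skipped, so the
   accepted values stay uniform. Returns -1 when the buffer is exhausted
   (an assumption of this sketch; the diff refills from SHAKE-256 instead). */
static int sample_alpha(const uint8_t *buf, size_t len, size_t *pos) {
    assert(buf != NULL && pos != NULL);
    while (*pos < len) {
        int v = buf[(*pos)++] & 31;
        if (v != 31) {
            return v;
        }
    }
    return -1;
}
```

Masking to 5 bits and rejecting a single value keeps the acceptance rate at 31/32 per byte, which is why the loops above rarely need more than one extra byte per round.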
--- a/crypto_sign/mqdss-48/clean/sign.c
+++ b/crypto_sign/mqdss-48/clean/sign.c
@ -1,4 +1,3 @@
-#include <assert.h>
 #include <stddef.h>
 #include <stdint.h>
 #include <string.h>
--- a/crypto_sign/mqdss-64/META.yml
+++ b/crypto_sign/mqdss-64/META.yml
@ -16,3 +16,12 @@ auxiliary-submitters:
 implementations:
    - name: clean
      version: https://github.com/joostrijneveld/MQDSS/commit/00608d7610262ff07b1834885d32bc3fd27ef5e1
+    - name: avx2
+      version: https://github.com/joostrijneveld/MQDSS/commit/00608d7610262ff07b1834885d32bc3fd27ef5e1
+      supported_platforms:
+          - architecture: x86_64
+            required_flags:
+                - avx2
+          - architecture: x86
+            required_flags:
+                - avx2
--- a/crypto_sign/mqdss-64/avx2/LICENSE
+++ b/crypto_sign/mqdss-64/avx2/LICENSE
@ -0,0 +1,116 @@
+CC0 1.0 Universal
+
+Statement of Purpose
+
+The laws of most jurisdictions throughout the world automatically confer
+exclusive Copyright and Related Rights (defined below) upon the creator and
+subsequent owner(s) (each and all, an "owner") of an original work of
+authorship and/or a database (each, a "Work").
+
+Certain owners wish to permanently relinquish those rights to a Work for the
+purpose of contributing to a commons of creative, cultural and scientific
+works ("Commons") that the public can reliably and without fear of later
+claims of infringement build upon, modify, incorporate in other works, reuse
+and redistribute as freely as possible in any form whatsoever and for any
+purposes, including without limitation commercial purposes. These owners may
+contribute to the Commons to promote the ideal of a free culture and the
+further production of creative, cultural and scientific works, or to gain
+reputation or greater distribution for their Work in part through the use and
+efforts of others.
+
+For these and/or other purposes and motivations, and without any expectation
+of additional consideration or compensation, the person associating CC0 with a
+Work (the "Affirmer"), to the extent that he or she is an owner of Copyright
+and Related Rights in the Work, voluntarily elects to apply CC0 to the Work
+and publicly distribute the Work under its terms, with knowledge of his or her
+Copyright and Related Rights in the Work and the meaning and intended legal
+effect of CC0 on those rights.
+
+1. Copyright and Related Rights. A Work made available under CC0 may be
+protected by copyright and related or neighboring rights ("Copyright and
+Related Rights"). Copyright and Related Rights include, but are not limited
+to, the following:
+
+  i. the right to reproduce, adapt, distribute, perform, display, communicate,
+  and translate a Work;
+
+  ii. moral rights retained by the original author(s) and/or performer(s);
+
+  iii. publicity and privacy rights pertaining to a person's image or likeness
+  depicted in a Work;
+
+  iv. rights protecting against unfair competition in regards to a Work,
+  subject to the limitations in paragraph 4(a), below;
+
+  v. rights protecting the extraction, dissemination, use and reuse of data in
+  a Work;
+
+  vi. database rights (such as those arising under Directive 96/9/EC of the
+  European Parliament and of the Council of 11 March 1996 on the legal
+  protection of databases, and under any national implementation thereof,
+  including any amended or successor version of such directive); and
+
+  vii. other similar, equivalent or corresponding rights throughout the world
+  based on applicable law or treaty, and any national implementations thereof.
+
+2. Waiver. To the greatest extent permitted by, but not in contravention of,
+applicable law, Affirmer hereby overtly, fully, permanently, irrevocably and
+unconditionally waives, abandons, and surrenders all of Affirmer's Copyright
+and Related Rights and associated claims and causes of action, whether now
+known or unknown (including existing as well as future claims and causes of
+action), in the Work (i) in all territories worldwide, (ii) for the maximum
+duration provided by applicable law or treaty (including future time
+extensions), (iii) in any current or future medium and for any number of
+copies, and (iv) for any purpose whatsoever, including without limitation
+commercial, advertising or promotional purposes (the "Waiver"). Affirmer makes
+the Waiver for the benefit of each member of the public at large and to the
+detriment of Affirmer's heirs and successors, fully intending that such Waiver
+shall not be subject to revocation, rescission, cancellation, termination, or
+any other legal or equitable action to disrupt the quiet enjoyment of the Work
+by the public as contemplated by Affirmer's express Statement of Purpose.
+
+3. Public License Fallback. Should any part of the Waiver for any reason be
+judged legally invalid or ineffective under applicable law, then the Waiver
+shall be preserved to the maximum extent permitted taking into account
+Affirmer's express Statement of Purpose. In addition, to the extent the Waiver
+is so judged Affirmer hereby grants to each affected person a royalty-free,
+non transferable, non sublicensable, non exclusive, irrevocable and
+unconditional license to exercise Affirmer's Copyright and Related Rights in
+the Work (i) in all territories worldwide, (ii) for the maximum duration
+provided by applicable law or treaty (including future time extensions), (iii)
+in any current or future medium and for any number of copies, and (iv) for any
+purpose whatsoever, including without limitation commercial, advertising or
+promotional purposes (the "License"). The License shall be deemed effective as
+of the date CC0 was applied by Affirmer to the Work. Should any part of the
+License for any reason be judged legally invalid or ineffective under
+applicable law, such partial invalidity or ineffectiveness shall not
+invalidate the remainder of the License, and in such case Affirmer hereby
+affirms that he or she will not (i) exercise any of his or her remaining
+Copyright and Related Rights in the Work or (ii) assert any associated claims
+and causes of action with respect to the Work, in either case contrary to
+Affirmer's express Statement of Purpose.
+
+4. Limitations and Disclaimers.
+
+  a. No trademark or patent rights held by Affirmer are waived, abandoned,
+  surrendered, licensed or otherwise affected by this document.
+
+  b. Affirmer offers the Work as-is and makes no representations or warranties
+  of any kind concerning the Work, express, implied, statutory or otherwise,
+  including without limitation warranties of title, merchantability, fitness
+  for a particular purpose, non infringement, or the absence of latent or
+  other defects, accuracy, or the present or absence of errors, whether or not
+  discoverable, all to the greatest extent permissible under applicable law.
+
+  c. Affirmer disclaims responsibility for clearing rights of other persons
+  that may apply to the Work or any use thereof, including without limitation
+  any person's Copyright and Related Rights in the Work. Further, Affirmer
+  disclaims responsibility for obtaining any necessary consents, permissions
+  or other rights required for any use of the Work.
+
+  d. Affirmer understands and acknowledges that Creative Commons is not a
+  party to this document and has no duty or obligation with respect to this
+  CC0 or use of the Work.
+
+For more information, please see
+<http://creativecommons.org/publicdomain/zero/1.0/>
--- a/crypto_sign/mqdss-64/avx2/Makefile
+++ b/crypto_sign/mqdss-64/avx2/Makefile
@ -0,0 +1,22 @@
+# This Makefile can be used with GNU Make or BSD Make
+
+LIB=libmqdss-64_avx2.a
+
+HEADERS = params.h gf31.h mq.h api.h
+OBJECTS =          gf31.o mq.o sign.o
+
+CFLAGS=-O3 -Wall -Wconversion -Wextra -Wpedantic -Wvla -Werror \
+	   -Wmissing-prototypes -Wredundant-decls -std=c99 -mavx2 \
+	   -I../../../common $(EXTRAFLAGS)
+
+all: $(LIB)
+
+%.o: %.c $(HEADERS)
+	$(CC) $(CFLAGS) -c -o $@ $<
+
+$(LIB): $(OBJECTS)
+	$(AR) -r $@ $(OBJECTS)
+
+clean:
+	$(RM) $(OBJECTS)
+	$(RM) $(LIB)
--- a/crypto_sign/mqdss-64/avx2/Makefile.Microsoft_nmake
+++ b/crypto_sign/mqdss-64/avx2/Makefile.Microsoft_nmake
@ -0,0 +1,19 @@
+# This Makefile can be used with Microsoft Visual Studio's nmake using the command:
+#    nmake /f Makefile.Microsoft_nmake
+
+LIBRARY=libmqdss-64_avx2.lib
+OBJECTS=gf31.obj mq.obj sign.obj
+
+CFLAGS=/nologo /O2 /I ..\..\..\common /W4 /WX /arch:AVX2
+
+all: $(LIBRARY)
+
+# Make sure objects are recompiled if headers change.
+$(OBJECTS): *.h
+
+$(LIBRARY): $(OBJECTS)
+	LIB.EXE /NOLOGO /WX /OUT:$@ $**
+
+clean:
+    -DEL $(OBJECTS)
+    -DEL $(LIBRARY)
--- a/crypto_sign/mqdss-64/avx2/api.h
+++ b/crypto_sign/mqdss-64/avx2/api.h
@ -0,0 +1,47 @@
+#ifndef PQCLEAN_MQDSS64_AVX2_API_H
+#define PQCLEAN_MQDSS64_AVX2_API_H
+
+#include <stddef.h>
+#include <stdint.h>
+
+#define PQCLEAN_MQDSS64_AVX2_CRYPTO_ALGNAME "MQDSS-64"
+
+#define PQCLEAN_MQDSS64_AVX2_CRYPTO_SECRETKEYBYTES 24
+#define PQCLEAN_MQDSS64_AVX2_CRYPTO_PUBLICKEYBYTES 64
+#define PQCLEAN_MQDSS64_AVX2_CRYPTO_BYTES 59928
+
+/*
+ * Generates an MQDSS key pair.
+ */
+int PQCLEAN_MQDSS64_AVX2_crypto_sign_keypair(
+    uint8_t *pk, uint8_t *sk);
+
+/**
+ * Returns an array containing a detached signature.
+ */
+int PQCLEAN_MQDSS64_AVX2_crypto_sign_signature(
+    uint8_t *sig, size_t *siglen,
+    const uint8_t *m, size_t mlen, const uint8_t *sk);
+
+/**
+ * Verifies a detached signature and message under a given public key.
+ */
+int PQCLEAN_MQDSS64_AVX2_crypto_sign_verify(
+    const uint8_t *sig, size_t siglen,
+    const uint8_t *m, size_t mlen, const uint8_t *pk);
+
+/**
+ * Returns an array containing the signature followed by the message.
+ */
+int PQCLEAN_MQDSS64_AVX2_crypto_sign(
+    uint8_t *sm, size_t *smlen,
+    const uint8_t *m, size_t mlen, const uint8_t *sk);
+
+/**
+ * Verifies a given signature-message pair under a given public key.
+ */
+int PQCLEAN_MQDSS64_AVX2_crypto_sign_open(
+    uint8_t *m, size_t *mlen,
+    const uint8_t *sm, size_t smlen, const uint8_t *pk);
+
+#endif
--- a/crypto_sign/mqdss-64/avx2/gf31.c
+++ b/crypto_sign/mqdss-64/avx2/gf31.c
@ -0,0 +1,128 @@
+#include "params.h"
+#include "fips202.h"
+#include "gf31.h"
+#include <assert.h>
+#include <immintrin.h>
+#include <stdint.h>
+#include <string.h>
+
+/* Given a vector of N elements in the range [0, 31], this reduces the elements
+   to the range [0, 30] by mapping 31 to 0 (i.e. reduction mod 31) */
+void PQCLEAN_MQDSS64_AVX2_vgf31_unique(gf31 *out, gf31 *in) {
+    __m256i x;
+    __m256i _w31 = _mm256_set1_epi16(31);
+    int i;
+
+    for (i = 0; i < (N >> 4); ++i) {
+        x = _mm256_loadu_si256((__m256i const *) (in + 16 * i));
+        x = _mm256_xor_si256(x, _mm256_and_si256(_w31, _mm256_cmpeq_epi16(x, _w31)));
+        _mm256_storeu_si256((__m256i *)(out + i * 16), x);
+    }
+}
+
+/* This function acts on vectors with 64 gf31 elements.
+It performs one reduction step and guarantees output in [0, 30],
+but requires input to be in [0, 32768). */
+void PQCLEAN_MQDSS64_AVX2_vgf31_shorten_unique(gf31 *out, gf31 *in) {
+    __m256i x;
+    __m256i _w2114 = _mm256_set1_epi32(2114 * 65536 + 2114);
+    __m256i _w31 = _mm256_set1_epi16(31);
+    int i;
+
+    for (i = 0; i < (N >> 4); ++i) {
+        x = _mm256_loadu_si256((__m256i const *) (in + 16 * i));
+        x = _mm256_sub_epi16(x, _mm256_mullo_epi16(_w31, _mm256_mulhi_epi16(x, _w2114)));
+        x = _mm256_xor_si256(x, _mm256_and_si256(_w31, _mm256_cmpeq_epi16(x, _w31)));
+        _mm256_storeu_si256((__m256i *)(out + i * 16), x);
+    }
+}
+
+/* Given a seed, samples len gf31 elements (in the range [0, 30]), and places
+   them in a vector of 16-bit elements */
+void PQCLEAN_MQDSS64_AVX2_gf31_nrand(gf31 *out, size_t len, const uint8_t *seed, size_t seedlen) {
+    size_t i = 0, j;
+    shake256ctx shakestate;
+    uint8_t shakeblock[SHAKE256_RATE];
+
+    shake256_absorb(&shakestate, seed, seedlen);
+
+    while (i < len) {
+        shake256_squeezeblocks(shakeblock, 1, &shakestate);
+        for (j = 0; j < SHAKE256_RATE && i < len; j++) {
+            if ((shakeblock[j] & 31) != 31) {
+                out[i] = (shakeblock[j] & 31);
+                i++;
+            }
+        }
+    }
+    shake256_ctx_release(&shakestate);
+}
+
+/* Given a seed, samples len gf31 elements, transposed into signed range,
+   i.e. in the range [-15, 15], and places them in an array of 8-bit integers.
+   This is used for the expansion of F, which wants packed elements. */
+void PQCLEAN_MQDSS64_AVX2_gf31_nrand_schar(signed char *out, size_t len, const uint8_t *seed, size_t seedlen) {
+    size_t i = 0, j;
+    shake256ctx shakestate;
+    uint8_t shakeblock[SHAKE256_RATE];
+
+    shake256_absorb(&shakestate, seed, seedlen);
+
+    while (i < len) {
+        shake256_squeezeblocks(shakeblock, 1, &shakestate);
+        for (j = 0; j < SHAKE256_RATE && i < len; j++) {
+            if ((shakeblock[j] & 31) != 31) {
+                out[i] = (signed char)((shakeblock[j] & 31) - 15);
+                i++;
+            }
+        }
+    }
+    shake256_ctx_release(&shakestate);
+
+}
+
+/* Unpacks an array of packed GF31 elements to one element per gf31.
+   Assumes that there is sufficient empty space available at the end of the
+   array to unpack. Can perform in-place. */
+void PQCLEAN_MQDSS64_AVX2_gf31_nunpack(gf31 *out, const uint8_t *in, size_t n) {
+    size_t i;
+    size_t j = ((n * 5) >> 3) - 1;
+    unsigned int d = 0;
+
+    for (i = n; i > 0; i--) {
+        out[i - 1] = (gf31)((in[j] >> d) & 31);
+        d += 5;
+        if (d > 8) {
+            d -= 8;
+            j--;
+            out[i - 1] = (gf31)(out[i - 1] ^ ((in[j] << (5 - d)) & 31));
+        }
+    }
+}
+
+/* Packs an array of GF31 elements from gf31's to concatenated 5-bit values.
+   Assumes that there is sufficient space available to pack.
+   Can perform in-place. */
+void PQCLEAN_MQDSS64_AVX2_gf31_npack(uint8_t *out, const gf31 *in, size_t n) {
+    unsigned int i = 0;
+    unsigned int j;
+    int d = 3;
+
+    for (j = 0; j < n; j++) {
+        assert(in[j] < 31);
+    }
+
+    /* There will be ceil(5n / 8) output bytes */
+    memset(out, 0, ((5 * n + 7) & ~(size_t)7) >> 3);
+
+    for (j = 0; j < n; j++) {
+        if (d < 0) {
+            d += 8;
+            out[i] = (uint8_t)((out[i] & (255 << (d - 3))) |
+                               ((in[j] >> (8 - d)) & ~(255 << (d - 3))));
+            i++;
+        }
+        out[i] = (uint8_t)((out[i] & ~(31 << d)) | ((in[j] << d) & (31 << d)));
+        d -= 5;
+    }
+}
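
Review note on the reduction used in vgf31_unique and vgf31_shorten_unique above: the vector code estimates x / 31 with a multiply-high by 2114 (which is floor(65536 / 31)), subtracts 31 times that estimate, and then maps the leftover representative 31 to 0. The following scalar sketch models one 16-bit lane under that reading. The function names are hypothetical and it is not the AVX2 code from this commit. Because 65536 - 31 * 2114 = 2, the estimate is short by at most one for inputs below 32768, so the first step lands in [0, 31].

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one lane of the single reduction step:
   q = (x * 2114) >> 16 approximates x / 31, and x - 31 * q lies in [0, 31]. */
static uint16_t gf31_shorten(uint16_t x) {
    assert(x < 32768);  /* same input range the gf31.c comment requires */
    uint16_t q = (uint16_t)(((uint32_t)x * 2114u) >> 16);
    return (uint16_t)(x - 31u * q);
}

/* Maps the non-unique representative 31 to 0 without branching, mirroring
   the xor / and / cmpeq pattern in the vector code. */
static uint16_t gf31_unique(uint16_t x) {
    uint16_t mask = (uint16_t) - (x == 31);  /* 0xFFFF if x == 31, else 0 */
    return (uint16_t)(x ^ (31u & mask));
}

static uint16_t gf31_shorten_unique(uint16_t x) {
    return gf31_unique(gf31_shorten(x));
}

/* Exhaustive check over [0, 32768): returns 1 if the model equals x % 31. */
static int gf31_model_agrees(void) {
    for (uint32_t x = 0; x < 32768; x++) {
        if (gf31_shorten_unique((uint16_t)x) != x % 31) {
            return 0;
        }
    }
    return 1;
}
```

The upper bound of 32768 matters for the vector version because `_mm256_mulhi_epi16` is a signed multiply, so lanes with the top bit set would be treated as negative.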
--- a/crypto_sign/mqdss-64/avx2/gf31.h
+++ b/crypto_sign/mqdss-64/avx2/gf31.h
@ -0,0 +1,36 @@
+#ifndef MQDSS_GF31_H
+#define MQDSS_GF31_H
+
+#include <stddef.h>
+#include <stdint.h>
+
+typedef unsigned short gf31;
+
+/* Given a vector of elements in the range [0, 31], this reduces the elements
+   to the range [0, 30] by mapping 31 to 0 (i.e. reduction mod 31) */
+void PQCLEAN_MQDSS64_AVX2_vgf31_unique(gf31 *out, gf31 *in);
+
+/* Given a vector of 16-bit integers in the range [0, 32768), this reduces
+   the elements to the range [0, 30] (i.e. reduction mod 31) */
+void PQCLEAN_MQDSS64_AVX2_vgf31_shorten_unique(gf31 *out, gf31 *in);
+
+/* Given a seed, samples len gf31 elements (in the range [0, 30]), and places
+   them in a vector of 16-bit elements */
+void PQCLEAN_MQDSS64_AVX2_gf31_nrand(gf31 *out, size_t len, const uint8_t *seed, size_t seedlen);
+
+/* Given a seed, samples len gf31 elements, transposed into signed range,
+   i.e. in the range [-15, 15], and places them in an array of 8-bit integers.
+   This is used for the expansion of F, which wants packed elements. */
+void PQCLEAN_MQDSS64_AVX2_gf31_nrand_schar(signed char *out, size_t len, const uint8_t *seed, size_t seedlen);
+
+/* Unpacks an array of packed GF31 elements to one element per gf31.
+   Assumes that there is sufficient empty space available at the end of the
+   array to unpack. Can perform in-place. */
+void PQCLEAN_MQDSS64_AVX2_gf31_nunpack(gf31 *out, const uint8_t *in, size_t n);
+
+/* Packs an array of GF31 elements from gf31's to concatenated 5-bit values.
+   Assumes that there is sufficient space available to pack.
+   Can perform in-place. */
+void PQCLEAN_MQDSS64_AVX2_gf31_npack(uint8_t *out, const gf31 *in, size_t n);
+
+#endif
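
Review note on gf31_npack and gf31_nunpack declared above: elements are stored as concatenated 5-bit values, so n elements occupy ceil(5n / 8) bytes. The following sketch is a deliberately simple bit-by-bit reference, not the implementation from this commit. The names `pack5` and `unpack5` are hypothetical, and the most-significant-bit-first layout is an assumption based on a hand trace of gf31_npack for n a multiple of 8; it has not been checked against the library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reference packer: writes n values in [0, 30] as 5-bit fields,
   most significant bit first, into ceil(5n / 8) bytes. */
static void pack5(uint8_t *out, const uint16_t *in, size_t n) {
    memset(out, 0, (5 * n + 7) / 8);
    for (size_t j = 0; j < n; j++) {
        assert(in[j] < 31);
        for (size_t b = 0; b < 5; b++) {
            size_t bit = 5 * j + b;                      /* position in the bit stream */
            unsigned v = ((unsigned)in[j] >> (4 - b)) & 1u;  /* top bit first */
            out[bit >> 3] = (uint8_t)(out[bit >> 3] | (v << (7 - (bit & 7))));
        }
    }
}

/* Hypothetical reference unpacker: inverse of pack5. */
static void unpack5(uint16_t *out, const uint8_t *in, size_t n) {
    for (size_t j = 0; j < n; j++) {
        unsigned v = 0;
        for (size_t b = 0; b < 5; b++) {
            size_t bit = 5 * j + b;
            v = (v << 1) | (((unsigned)in[bit >> 3] >> (7 - (bit & 7))) & 1u);
        }
        out[j] = (uint16_t)v;
    }
}
```

Packing eight elements fills exactly five bytes, which is why a vector of 64 elements packs into 40 bytes with no padding.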
--- a/crypto_sign/mqdss-64/avx2/mq.c
+++ b/crypto_sign/mqdss-64/avx2/mq.c
@ -0,0 +1,239 @@
+#include "mq.h"
+#include "params.h"
+#include <immintrin.h>
+#include <stdio.h>
+
+static inline __m256i reduce_16(__m256i r, __m256i _w31, __m256i _w2114) {
+    __m256i exp = _mm256_mulhi_epi16(r, _w2114);
+    return _mm256_sub_epi16(r, _mm256_mullo_epi16(_w31, exp));
+}
+
+/* Computes all products x_i * x_j, returns in reduced form */
+inline static
+void generate_quadratic_terms( unsigned char *xij, const gf31 *x ) {
+    __m256i mask_2114 = _mm256_set1_epi16( 2114 );
+    __m256i mask_31 = _mm256_set1_epi16( 31 );
+    __m256i xi[4];
+    xi[0] = _mm256_loadu_si256((__m256i const *) (x));
+    xi[1] = _mm256_loadu_si256((__m256i const *) (x + 16));
+    xi[2] = _mm256_loadu_si256((__m256i const *) (x + 32));
+    xi[3] = _mm256_loadu_si256((__m256i const *) (x + 48));
+
+    __m256i xixj[4];
+    xixj[0] = _mm256_setzero_si256();
+    xixj[1] = _mm256_setzero_si256();
+    xixj[2] = _mm256_setzero_si256();
+    xixj[3] = _mm256_setzero_si256();
+
+    int k = 0;
+    for (int i = 0; i < 32; i++) {
+        __m256i br_xi = _mm256_set1_epi16( (short)x[i] );
+        for (int j = 0; j <= (i >> 4); j++) {
+            xixj[j] = _mm256_mullo_epi16( xi[j], br_xi );
+            xixj[j] = reduce_16( xixj[j], mask_31, mask_2114 );
+        }
+
+        __m256i r = _mm256_packs_epi16(xixj[0], xixj[1]);
+        r = _mm256_permute4x64_epi64(r, 0xd8);  // 3,1,2,0
+        _mm256_storeu_si256( (__m256i *)( xij + k ), r );
+        k += i + 1;
+    }
+
+    for (int i = 32; i < N; i++) {
+        __m256i br_xi = _mm256_set1_epi16( (short)x[i] );
+        for (int j = 0; j <= (i >> 4); j++) {
+            xixj[j] = _mm256_mullo_epi16( xi[j], br_xi );
+            xixj[j] = reduce_16( xixj[j], mask_31, mask_2114 );
+        }
+
+        __m256i r0 = _mm256_packs_epi16(xixj[0], xixj[1]);
+        r0 = _mm256_permute4x64_epi64(r0, 0xd8);  // 3,1,2,0
+        _mm256_storeu_si256( (__m256i *)( xij + k ), r0 );
+        __m256i r1 = _mm256_packs_epi16(xixj[2], xixj[3]);
+        r1 = _mm256_permute4x64_epi64(r1, 0xd8);  // 3,1,2,0
+        _mm256_storeu_si256( (__m256i *)( xij + 32 + k ), r1 );
+        k += i + 1;
+    }
+}
+
+/* Computes all terms (x_i * y_j) + (x_j * y_i), returns in reduced form */
+inline static
+void generate_xiyj_p_xjyi_terms( unsigned char *xij, const gf31 *x, const gf31 *y ) {
+    __m256i mask_2114 = _mm256_set1_epi16( 2114 );
+    __m256i mask_31 = _mm256_set1_epi16( 31 );
+    __m256i xiyi[4];
+    xiyi[0] = _mm256_xor_si256(_mm256_loadu_si256((__m256i const *) (x)), _mm256_slli_si256( _mm256_loadu_si256((__m256i const *) (y)), 1 ));
+    xiyi[1] = _mm256_xor_si256(_mm256_loadu_si256((__m256i const *) (x + 16)), _mm256_slli_si256( _mm256_loadu_si256((__m256i const *) (y + 16)), 1 ));
+    xiyi[2] = _mm256_xor_si256(_mm256_loadu_si256((__m256i const *) (x + 32)), _mm256_slli_si256( _mm256_loadu_si256((__m256i const *) (y + 32)), 1 ));
+    xiyi[3] = _mm256_xor_si256(_mm256_loadu_si256((__m256i const *) (x + 48)), _mm256_slli_si256( _mm256_loadu_si256((__m256i const *) (y + 48)), 1 ));
+
+    __m256i xixj[4];
+    xixj[0] = _mm256_setzero_si256();
+    xixj[1] = _mm256_setzero_si256();
+    xixj[2] = _mm256_setzero_si256();
+    xixj[3] = _mm256_setzero_si256();
+
+    int k = 0;
+    for (int i = 0; i < 32; i++) {
+        __m256i br_yixi = _mm256_set1_epi16( (short)((x[i] << 8)^y[i]) );
+        for (int j = 0; j <= (i >> 4); j++) {
+            xixj[j] = _mm256_maddubs_epi16( xiyi[j], br_yixi );
+            xixj[j] = reduce_16( xixj[j], mask_31, mask_2114 );
+        }
+
+        __m256i r = _mm256_packs_epi16(xixj[0], xixj[1]);
+        r = _mm256_permute4x64_epi64(r, 0xd8);  // 3,1,2,0
+        _mm256_storeu_si256( (__m256i *)( xij + k ), r );
+        k += i + 1;
+    }
+
+    for (int i = 32; i < N; i++) {
+        __m256i br_yixi = _mm256_set1_epi16( (short)((x[i] << 8)^y[i]) );
+        for (int j = 0; j <= (i >> 4); j++) {
+            xixj[j] = _mm256_maddubs_epi16( xiyi[j], br_yixi );
+            xixj[j] = reduce_16( xixj[j], mask_31, mask_2114 );
+        }
+
+        __m256i r0 = _mm256_packs_epi16(xixj[0], xixj[1]);
+        r0 = _mm256_permute4x64_epi64(r0, 0xd8);  // 3,1,2,0
+        _mm256_storeu_si256( (__m256i *)( xij + k ), r0 );
+        __m256i r1 = _mm256_packs_epi16(xixj[2], xixj[3]);
+        r1 = _mm256_permute4x64_epi64(r1, 0xd8);  // 3,1,2,0
+        _mm256_storeu_si256( (__m256i *)( xij + 32 + k ), r1 );
+        k += i + 1;
+    }
+}
+
+#define EVAL_YMM_0(xx) {\
+        __m128i tmp = _mm256_castsi256_si128(xx); \
+        for (int macro_i = 0; macro_i < 8; macro_i++) { \
+            __m256i _xi = _mm256_broadcastw_epi16(tmp); \
+            tmp = _mm_srli_si128(tmp, 2); \
+            for (int macro_j = 0; macro_j < (N/16); macro_j++) { \
+                __m256i coeff = _mm256_loadu_si256((__m256i const *) F); \
+                F += 32; \
+                yy[macro_j] = _mm256_add_epi16(yy[macro_j], _mm256_maddubs_epi16(_xi, coeff)); \
+            } \
+        } \
+    }
+
+#define EVAL_YMM_1(xx) {\
+        __m128i tmp = _mm256_extracti128_si256(xx, 1); \
+        for (int macro_i = 0; macro_i < 8; macro_i++) { \
+            __m256i _xi = _mm256_broadcastw_epi16(tmp); \
+            tmp = _mm_srli_si128(tmp, 2); \
+            for (int macro_j = 0; macro_j < (N/16); macro_j++) { \
+                __m256i coeff = _mm256_loadu_si256((__m256i const *) F); \
+                F += 32; \
+                yy[macro_j] = _mm256_add_epi16(yy[macro_j], _mm256_maddubs_epi16(_xi, coeff)); \
+            } \
+        } \
+    }
+
+#define REDUCE_(yy) { \
+        (yy)[0] = reduce_16((yy)[0], mask_reduce, mask_2114); \
+        (yy)[1] = reduce_16((yy)[1], mask_reduce, mask_2114); \
+        (yy)[2] = reduce_16((yy)[2], mask_reduce, mask_2114); \
+        (yy)[3] = reduce_16((yy)[3], mask_reduce, mask_2114); \
+    }
+
+
+/* Evaluates the MQ function on a vector of N gf31 elements x (expected to be
+   in reduced 5-bit representation). Expects the coefficients in F to be in
+   signed representation (i.e. [-15, 15], packed bytewise).
+   Outputs M gf31 elements in unique 16-bit representation as fx. */
+void PQCLEAN_MQDSS64_AVX2_MQ(gf31 *fx, const gf31 *x, const signed char *F) {
+    __m256i mask_2114 = _mm256_set1_epi32(2114 * 65536 + 2114);
+    __m256i mask_reduce = _mm256_srli_epi16(_mm256_cmpeq_epi16(mask_2114, mask_2114), 11);
+
+    __m256i xi[4];
+    xi[0] = _mm256_loadu_si256((__m256i const *) (x));
+    xi[1] = _mm256_loadu_si256((__m256i const *) (x + 16));
+    xi[2] = _mm256_loadu_si256((__m256i const *) (x + 32));
+    xi[3] = _mm256_loadu_si256((__m256i const *) (x + 48));
+
+    __m256i _zero = _mm256_setzero_si256();
+    xi[0] = _mm256_add_epi16(_mm256_and_si256(mask_reduce, _mm256_cmpgt_epi16(_zero, xi[0])), xi[0]);
+    xi[1] = _mm256_add_epi16(_mm256_and_si256(mask_reduce, _mm256_cmpgt_epi16(_zero, xi[1])), xi[1]);
+    xi[2] = _mm256_add_epi16(_mm256_and_si256(mask_reduce, _mm256_cmpgt_epi16(_zero, xi[2])), xi[2]);
+    xi[3] = _mm256_add_epi16(_mm256_and_si256(mask_reduce, _mm256_cmpgt_epi16(_zero, xi[3])), xi[3]);
+
+    __m256i x1 = _mm256_packs_epi16(xi[0], xi[1]);
+    x1 = _mm256_permute4x64_epi64(x1, 0xd8);  // 3,1,2,0
+    __m256i x2 = _mm256_packs_epi16(xi[2], xi[3]);
+    x2 = _mm256_permute4x64_epi64(x2, 0xd8);  // 3,1,2,0
+
+    __m256i yy[M / 16];
+    yy[0] = _zero;
+    yy[1] = _zero;
+    yy[2] = _zero;
+    yy[3] = _zero;
+
+    EVAL_YMM_0(x1)
+    EVAL_YMM_1(x1)
+    EVAL_YMM_0(x2)
+    EVAL_YMM_1(x2)
+    REDUCE_(yy)
+
+    __m256i xixj[65];
+    generate_quadratic_terms( (unsigned char *) xixj, x );
+    for (int i = 0 ; i < 64 ; i += 2) {
+        EVAL_YMM_0(xixj[i])
+        EVAL_YMM_1(xixj[i])
+        EVAL_YMM_0(xixj[i + 1])
+        EVAL_YMM_1(xixj[i + 1])
+        REDUCE_(yy)
+    }
+    EVAL_YMM_0(xixj[64])
+    EVAL_YMM_1(xixj[64])
+    REDUCE_(yy)
+
+    yy[0] = _mm256_add_epi16(_mm256_and_si256(mask_reduce, _mm256_cmpgt_epi16(_mm256_setzero_si256(), yy[0])), yy[0]);
+    yy[1] = _mm256_add_epi16(_mm256_and_si256(mask_reduce, _mm256_cmpgt_epi16(_mm256_setzero_si256(), yy[1])), yy[1]);
+    yy[2] = _mm256_add_epi16(_mm256_and_si256(mask_reduce, _mm256_cmpgt_epi16(_mm256_setzero_si256(), yy[2])), yy[2]);
+    yy[3] = _mm256_add_epi16(_mm256_and_si256(mask_reduce, _mm256_cmpgt_epi16(_mm256_setzero_si256(), yy[3])), yy[3]);
+
+    for (int i = 0; i < (M / 16); ++i) {
+        _mm256_storeu_si256((__m256i *)(fx + i * 16), yy[i]);
+    }
+}
+
+/* Evaluates the bilinear polar form of the MQ function (i.e. G) on two vectors
+   of N gf31 elements x and y (expected to be in reduced 5-bit representation).
+   Expects the coefficients in F to be in signed representation (i.e. [-15, 15],
+   packed bytewise). Outputs M gf31 elements in unique 16-bit representation as fx. */
+void PQCLEAN_MQDSS64_AVX2_G(gf31 *fx, const gf31 *x, const gf31 *y, const signed char *F) {
+    __m256i mask_2114 = _mm256_set1_epi32(2114 * 65536 + 2114);
+    __m256i mask_reduce = _mm256_srli_epi16(_mm256_cmpeq_epi16(mask_2114, mask_2114), 11);
+    __m256i _zero = _mm256_setzero_si256();
+
+    __m256i yy[(M / 16)];
+    yy[0] = _zero;
+    yy[1] = _zero;
+    yy[2] = _zero;
+    yy[3] = _zero;
+
+    F += N * M;
+
+    __m256i xixj[65];
+    generate_xiyj_p_xjyi_terms( (unsigned char *) xixj, x, y );
+    for (int i = 0 ; i < 64 ; i += 2) {
+        EVAL_YMM_0(xixj[i])
+        EVAL_YMM_1(xixj[i])
+        EVAL_YMM_0(xixj[i + 1])
+        EVAL_YMM_1(xixj[i + 1])
+        REDUCE_(yy)
+    }
+    EVAL_YMM_0(xixj[64])
+    EVAL_YMM_1(xixj[64])
+    REDUCE_(yy)
+
+    yy[0] = _mm256_add_epi16(_mm256_and_si256(mask_reduce, _mm256_cmpgt_epi16(_mm256_setzero_si256(), yy[0])), yy[0]);
+    yy[1] = _mm256_add_epi16(_mm256_and_si256(mask_reduce, _mm256_cmpgt_epi16(_mm256_setzero_si256(), yy[1])), yy[1]);
+    yy[2] = _mm256_add_epi16(_mm256_and_si256(mask_reduce, _mm256_cmpgt_epi16(_mm256_setzero_si256(), yy[2])), yy[2]);
+    yy[3] = _mm256_add_epi16(_mm256_and_si256(mask_reduce, _mm256_cmpgt_epi16(_mm256_setzero_si256(), yy[3])), yy[3]);
+
+    for (int i = 0; i < (M / 16); ++i) {
+        _mm256_storeu_si256((__m256i *)(fx + i * 16), yy[i]);
+    }
+}
--- a/crypto_sign/mqdss-64/avx2/mq.h
+++ b/crypto_sign/mqdss-64/avx2/mq.h
@ -0,0 +1,18 @@
+#ifndef MQDSS_MQ_H
+#define MQDSS_MQ_H
+
+#include "gf31.h"
+
+/* Evaluates the MQ function on a vector of N gf31 elements x (expected to be
+   in reduced 5-bit representation). Expects the coefficients in F to be in
+   signed representation (i.e. [-15, 15], packed bytewise).
+   Outputs M gf31 elements in unique 16-bit representation as fx. */
+void PQCLEAN_MQDSS64_AVX2_MQ(gf31 *fx, const gf31 *x, const signed char *F);
+
+/* Evaluates the bilinear polar form of the MQ function (i.e. G) on two vectors
+   of N gf31 elements x and y (expected to be in reduced 5-bit representation).
+   Expects the coefficients in F to be in signed representation (i.e. [-15, 15],
+   packed bytewise). Outputs M gf31 elements in unique 16-bit representation as fx. */
+void PQCLEAN_MQDSS64_AVX2_G(gf31 *fx, const gf31 *x, const gf31 *y, const signed char *F);
+
+#endif
--- a/crypto_sign/mqdss-64/avx2/params.h
+++ b/crypto_sign/mqdss-64/avx2/params.h
@ -0,0 +1,25 @@
+#ifndef MQDSS_PARAMS_H
+#define MQDSS_PARAMS_H
+
+#define N 64
+#define M N
+#define F_LEN (M * (((N * (N + 1)) >> 1) + N)) /* Number of elements in F */
+
+#define ROUNDS 277
+
+/* Number of bytes that N, M and F_LEN elements require when packed into a byte
+   array, 5-bit elements packed contiguously. */
+/* Assumes N and M to be multiples of 8 */
+#define NPACKED_BYTES ((N * 5) >> 3)
+#define MPACKED_BYTES ((M * 5) >> 3)
+#define FPACKED_BYTES ((F_LEN * 5) >> 3)
+
+#define HASH_BYTES 48
+#define SEED_BYTES 24
+#define PK_BYTES (SEED_BYTES + MPACKED_BYTES)
+#define SK_BYTES SEED_BYTES
+
+// R, sigma_0, ROUNDS * (t1, r{0,1}, e1, c, rho)
+#define SIG_LEN (2 * HASH_BYTES + ROUNDS * (2*NPACKED_BYTES + MPACKED_BYTES + HASH_BYTES + HASH_BYTES))
+
+#endif
--- a/crypto_sign/mqdss-64/avx2/sign.c
+++ b/crypto_sign/mqdss-64/avx2/sign.c
@ -0,0 +1,389 @@
+#include <stddef.h>
+#include <stdint.h>
+#include <string.h>
+
+#include "api.h"
+#include "fips202.h"
+#include "gf31.h"
+#include "mq.h"
+#include "params.h"
+#include "randombytes.h"
+
+/* Takes an array of len bytes and computes a hash digest.
+   This is used as a hash function in the Fiat-Shamir transform. */
+static void H(unsigned char *out, const unsigned char *in, const size_t len) {
+    shake256(out, HASH_BYTES, in, len);
+}
+
+/* Takes two arrays of N packed elements and an array of M packed elements,
+   and computes a HASH_BYTES commitment. */
+static void com_0(unsigned char *c,
+                  const unsigned char *rho,
+                  const unsigned char *inn, const unsigned char *inn2,
+                  const unsigned char *inm) {
+    unsigned char buffer[HASH_BYTES + 2 * NPACKED_BYTES + MPACKED_BYTES];
+    memcpy(buffer, rho, HASH_BYTES);
+    memcpy(buffer + HASH_BYTES, inn, NPACKED_BYTES);
+    memcpy(buffer + HASH_BYTES + NPACKED_BYTES, inn2, NPACKED_BYTES);
+    memcpy(buffer + HASH_BYTES + 2 * NPACKED_BYTES, inm, MPACKED_BYTES);
+    shake256(c, HASH_BYTES, buffer, HASH_BYTES + 2 * NPACKED_BYTES + MPACKED_BYTES);
+}
+
+/* Takes an array of N packed elements and an array of M packed elements,
+   and computes a HASH_BYTES commitment. */
+static void com_1(unsigned char *c,
+                  const unsigned char *rho,
+                  const unsigned char *inn, const unsigned char *inm) {
+    unsigned char buffer[HASH_BYTES + NPACKED_BYTES + MPACKED_BYTES];
+    memcpy(buffer, rho, HASH_BYTES);
+    memcpy(buffer + HASH_BYTES, inn, NPACKED_BYTES);
+    memcpy(buffer + HASH_BYTES + NPACKED_BYTES, inm, MPACKED_BYTES);
+    shake256(c, HASH_BYTES, buffer, HASH_BYTES + NPACKED_BYTES + MPACKED_BYTES);
+}
+
+/*
+ * Generates an MQDSS key pair.
+ */
+int PQCLEAN_MQDSS64_AVX2_crypto_sign_keypair(uint8_t *pk, uint8_t *sk) {
+    signed char F[F_LEN];
+    unsigned char skbuf[SEED_BYTES * 2];
+    gf31 sk_gf31[N];
+    gf31 pk_gf31[M];
+
+    // Expand sk to obtain a seed for F and the secret input s.
+    // We also expand to obtain a value for sampling r0, t0 and e0 during
+    //  signature generation, but that is not relevant here.
+    randombytes(sk, SEED_BYTES);
+    shake256(skbuf, SEED_BYTES * 2, sk, SEED_BYTES);
+
+    memcpy(pk, skbuf, SEED_BYTES);
+    PQCLEAN_MQDSS64_AVX2_gf31_nrand_schar(F, F_LEN, pk, SEED_BYTES);
+    PQCLEAN_MQDSS64_AVX2_gf31_nrand(sk_gf31, N, skbuf + SEED_BYTES, SEED_BYTES);
+    PQCLEAN_MQDSS64_AVX2_MQ(pk_gf31, sk_gf31, F);
+    PQCLEAN_MQDSS64_AVX2_vgf31_unique(pk_gf31, pk_gf31);
+    PQCLEAN_MQDSS64_AVX2_gf31_npack(pk + SEED_BYTES, pk_gf31, M);
+
+    return 0;
+}
+
+/**
+ * Writes a detached signature of m to sig and its length (SIG_LEN) to siglen.
+ */
+int PQCLEAN_MQDSS64_AVX2_crypto_sign_signature(
+    uint8_t *sig, size_t *siglen,
+    const uint8_t *m, size_t mlen, const uint8_t *sk) {
+
+    signed char F[F_LEN];
+    unsigned char skbuf[SEED_BYTES * 4];
+    gf31 pk_gf31[M];
+    unsigned char pk[SEED_BYTES + MPACKED_BYTES];
+    // Concatenated for convenient hashing.
+    unsigned char D_sigma0_h0_sigma1[HASH_BYTES * 3 + ROUNDS * (NPACKED_BYTES + MPACKED_BYTES)];
+    unsigned char *D = D_sigma0_h0_sigma1;
+    unsigned char *sigma0 = D_sigma0_h0_sigma1 + HASH_BYTES;
+    unsigned char *h0 = D_sigma0_h0_sigma1 + 2 * HASH_BYTES;
+    unsigned char *t1packed = D_sigma0_h0_sigma1 + 3 * HASH_BYTES;
+    unsigned char *e1packed = D_sigma0_h0_sigma1 + 3 * HASH_BYTES + ROUNDS * NPACKED_BYTES;
+    shake256ctx shakestate;
+    unsigned char shakeblock[SHAKE256_RATE];
+    unsigned char h1[((ROUNDS + 7) & ~7) >> 3];
+    unsigned char rnd_seed[HASH_BYTES + SEED_BYTES];
+    unsigned char rho[2 * ROUNDS * HASH_BYTES];
+    unsigned char *rho0 = rho;
+    unsigned char *rho1 = rho + ROUNDS * HASH_BYTES;
+    gf31 sk_gf31[N];
+    gf31 rnd[(2 * N + M) * ROUNDS];  // Concatenated for easy RNG.
+    gf31 *r0 = rnd;
+    gf31 *t0 = rnd + N * ROUNDS;
+    gf31 *e0 = rnd + 2 * N * ROUNDS;
+    gf31 r1[N * ROUNDS];
+    gf31 t1[N * ROUNDS];
+    gf31 e1[M * ROUNDS];
+    gf31 gx[M * ROUNDS];
+    unsigned char packbuf0[NPACKED_BYTES];
+    unsigned char packbuf1[NPACKED_BYTES];
+    unsigned char packbuf2[MPACKED_BYTES];
+    unsigned char c[HASH_BYTES * ROUNDS * 2];
+    gf31 alpha;
+    int alpha_count = 0;
+    int b;
+    int i, j;
+    shake256incctx state;
+
+    shake256(skbuf, SEED_BYTES * 4, sk, SEED_BYTES);
+
+    PQCLEAN_MQDSS64_AVX2_gf31_nrand_schar(F, F_LEN, skbuf, SEED_BYTES);
+
+    shake256_inc_init(&state);
+    shake256_inc_absorb(&state, sk, SEED_BYTES);
+    shake256_inc_absorb(&state, m, mlen);
+    shake256_inc_finalize(&state);
+    shake256_inc_squeeze(sig, HASH_BYTES, &state); // Compute R.
+    shake256_inc_ctx_release(&state);
+
+    memcpy(pk, skbuf, SEED_BYTES);
+    PQCLEAN_MQDSS64_AVX2_gf31_nrand(sk_gf31, N, skbuf + SEED_BYTES, SEED_BYTES);
+    PQCLEAN_MQDSS64_AVX2_MQ(pk_gf31, sk_gf31, F);
+    PQCLEAN_MQDSS64_AVX2_vgf31_unique(pk_gf31, pk_gf31);
+    PQCLEAN_MQDSS64_AVX2_gf31_npack(pk + SEED_BYTES, pk_gf31, M);
+
+    shake256_inc_init(&state);
+    shake256_inc_absorb(&state, pk, PK_BYTES);
+    shake256_inc_absorb(&state, sig, HASH_BYTES);
+    shake256_inc_absorb(&state, m, mlen);
+    shake256_inc_finalize(&state);
+    shake256_inc_squeeze(D, HASH_BYTES, &state);
+    shake256_inc_ctx_release(&state);
+
+    sig += HASH_BYTES;  // Compensate for prefixed R.
+
+    memcpy(rnd_seed, skbuf + 2 * SEED_BYTES, SEED_BYTES);
+    memcpy(rnd_seed + SEED_BYTES, D, HASH_BYTES);
+    shake256(rho, 2 * ROUNDS * HASH_BYTES, rnd_seed, SEED_BYTES + HASH_BYTES);
+
+    memcpy(rnd_seed, skbuf + 3 * SEED_BYTES, SEED_BYTES);
+    memcpy(rnd_seed + SEED_BYTES, D, HASH_BYTES);
+    PQCLEAN_MQDSS64_AVX2_gf31_nrand(rnd, (2 * N + M) * ROUNDS, rnd_seed, SEED_BYTES + HASH_BYTES);
+
+    for (i = 0; i < ROUNDS; i++) {
+        for (j = 0; j < N; j++) {
+            r1[j + i * N] = (gf31)(31 + sk_gf31[j] - r0[j + i * N]);
+        }
+        PQCLEAN_MQDSS64_AVX2_G(gx + i * M, t0 + i * N, r1 + i * N, F);
+    }
+    for (i = 0; i < ROUNDS * M; i++) {
+        gx[i] = (gf31)(gx[i] + e0[i]);
+    }
+    for (i = 0; i < ROUNDS; i++) {
+        PQCLEAN_MQDSS64_AVX2_gf31_npack(packbuf0, r0 + i * N, N);
+        PQCLEAN_MQDSS64_AVX2_gf31_npack(packbuf1, t0 + i * N, N);
+        PQCLEAN_MQDSS64_AVX2_gf31_npack(packbuf2, e0 + i * M, M);
+        com_0(c + HASH_BYTES * (2 * i + 0), rho0 + i * HASH_BYTES, packbuf0, packbuf1, packbuf2);
+        PQCLEAN_MQDSS64_AVX2_vgf31_shorten_unique(r1 + i * N, r1 + i * N);
+        PQCLEAN_MQDSS64_AVX2_vgf31_shorten_unique(gx + i * M, gx + i * M);
+        PQCLEAN_MQDSS64_AVX2_gf31_npack(packbuf0, r1 + i * N, N);
+        PQCLEAN_MQDSS64_AVX2_gf31_npack(packbuf1, gx + i * M, M);
+        com_1(c + HASH_BYTES * (2 * i + 1), rho1 + i * HASH_BYTES, packbuf0, packbuf1);
+    }
+
+    H(sigma0, c, HASH_BYTES * ROUNDS * 2);  // Compute sigma_0.
+    shake256_absorb(&shakestate, D_sigma0_h0_sigma1, 2 * HASH_BYTES);
+    shake256_squeezeblocks(shakeblock, 1, &shakestate);
+
+    memcpy(h0, shakeblock, HASH_BYTES);
+
+    memcpy(sig, sigma0, HASH_BYTES);
+    sig += HASH_BYTES;  // Compensate for sigma_0.
+
+    for (i = 0; i < ROUNDS; i++) {
+        do {
+            alpha = shakeblock[alpha_count] & 31;
+            alpha_count++;
+            if (alpha_count == SHAKE256_RATE) {
+                alpha_count = 0;
+                shake256_squeezeblocks(shakeblock, 1, &shakestate);
+            }
+        } while (alpha == 31);
+        for (j = 0; j < N; j++) {
+            t1[i * N + j] = (gf31)(alpha * r0[j + i * N] - t0[j + i * N] + 31);
+        }
+        PQCLEAN_MQDSS64_AVX2_MQ(e1 + i * M, r0 + i * N, F);
+        for (j = 0; j < M; j++) {
+            e1[i * M + j] = (gf31)(alpha * e1[j + i * M] - e0[j + i * M] + 31);
+        }
+        PQCLEAN_MQDSS64_AVX2_vgf31_shorten_unique(t1 + i * N, t1 + i * N);
+        PQCLEAN_MQDSS64_AVX2_vgf31_shorten_unique(e1 + i * M, e1 + i * M);
+    }
+    shake256_ctx_release(&shakestate);
+
+    PQCLEAN_MQDSS64_AVX2_gf31_npack(t1packed, t1, N * ROUNDS);
+    PQCLEAN_MQDSS64_AVX2_gf31_npack(e1packed, e1, M * ROUNDS);
+
+    memcpy(sig, t1packed, NPACKED_BYTES * ROUNDS);
+    sig += NPACKED_BYTES * ROUNDS;
+    memcpy(sig, e1packed, MPACKED_BYTES * ROUNDS);
+    sig += MPACKED_BYTES * ROUNDS;
+
+    shake256(h1, ((ROUNDS + 7) & ~7) >> 3, D_sigma0_h0_sigma1, 3 * HASH_BYTES + ROUNDS * (NPACKED_BYTES + MPACKED_BYTES));
+
+    for (i = 0; i < ROUNDS; i++) {
+        b = (h1[(i >> 3)] >> (i & 7)) & 1;
+        if (b == 0) {
+            PQCLEAN_MQDSS64_AVX2_gf31_npack(sig, r0 + i * N, N);
+        } else if (b == 1) {
+            PQCLEAN_MQDSS64_AVX2_gf31_npack(sig, r1 + i * N, N);
+        }
+        memcpy(sig + NPACKED_BYTES, c + HASH_BYTES * (2 * i + (1 - b)), HASH_BYTES);
+        memcpy(sig + NPACKED_BYTES + HASH_BYTES, rho + (i + b * ROUNDS) * HASH_BYTES, HASH_BYTES);
+        sig += NPACKED_BYTES + 2 * HASH_BYTES;
+    }
+
+    *siglen = SIG_LEN;
+    return 0;
+}
+
+/**
+ * Verifies a detached signature and message under a given public key.
+ */
+int PQCLEAN_MQDSS64_AVX2_crypto_sign_verify(
+    const uint8_t *sig, size_t siglen,
+    const uint8_t *m, size_t mlen, const uint8_t *pk) {
+
+    gf31 r[N];
+    gf31 t[N];
+    gf31 e[M];
+    signed char F[F_LEN];
+    gf31 pk_gf31[M];
+    // Concatenated for convenient hashing.
+    unsigned char D_sigma0_h0_sigma1[HASH_BYTES * 3 + ROUNDS * (NPACKED_BYTES + MPACKED_BYTES)];
+    unsigned char *D = D_sigma0_h0_sigma1;
+    unsigned char *sigma0 = D_sigma0_h0_sigma1 + HASH_BYTES;
+    unsigned char *h0 = D_sigma0_h0_sigma1 + 2 * HASH_BYTES;
+    unsigned char *t1packed = D_sigma0_h0_sigma1 + 3 * HASH_BYTES;
+    unsigned char *e1packed = D_sigma0_h0_sigma1 + 3 * HASH_BYTES + ROUNDS * NPACKED_BYTES;
+    unsigned char h1[((ROUNDS + 7) & ~7) >> 3];
+    unsigned char c[HASH_BYTES * ROUNDS * 2];
+    memset(c, 0, HASH_BYTES * 2);
+    gf31 x[N];
+    gf31 y[M];
+    gf31 z[M];
+    unsigned char packbuf0[NPACKED_BYTES];
+    unsigned char packbuf1[MPACKED_BYTES];
+    shake256ctx shakestate;
+    unsigned char shakeblock[SHAKE256_RATE];
+    int i, j;
+    gf31 alpha;
+    int alpha_count = 0;
+    int b;
+    shake256incctx state;
+
+    if (siglen != SIG_LEN) {
+        return -1;
+    }
+
+    shake256_inc_init(&state);
+    shake256_inc_absorb(&state, pk, PK_BYTES);
+    shake256_inc_absorb(&state, sig, HASH_BYTES);
+    shake256_inc_absorb(&state, m, mlen);
+    shake256_inc_finalize(&state);
+    shake256_inc_squeeze(D, HASH_BYTES, &state);
+    shake256_inc_ctx_release(&state);
+
+    sig += HASH_BYTES;
+
+    PQCLEAN_MQDSS64_AVX2_gf31_nrand_schar(F, F_LEN, pk, SEED_BYTES);
+    pk += SEED_BYTES;
+    PQCLEAN_MQDSS64_AVX2_gf31_nunpack(pk_gf31, pk, M);
+
+    memcpy(sigma0, sig, HASH_BYTES);
+
+    shake256_absorb(&shakestate, D_sigma0_h0_sigma1, 2 * HASH_BYTES);
+    shake256_squeezeblocks(shakeblock, 1, &shakestate);
+
+    memcpy(h0, shakeblock, HASH_BYTES);
+
+    sig += HASH_BYTES;
+
+    memcpy(t1packed, sig, ROUNDS * NPACKED_BYTES);
+    sig += ROUNDS * NPACKED_BYTES;
+    memcpy(e1packed, sig, ROUNDS * MPACKED_BYTES);
+    sig += ROUNDS * MPACKED_BYTES;
+
+    shake256(h1, ((ROUNDS + 7) & ~7) >> 3, D_sigma0_h0_sigma1, 3 * HASH_BYTES + ROUNDS * (NPACKED_BYTES + MPACKED_BYTES));
+
+    for (i = 0; i < ROUNDS; i++) {
+        do {
+            alpha = shakeblock[alpha_count] & 31;
+            alpha_count++;
+            if (alpha_count == SHAKE256_RATE) {
+                alpha_count = 0;
+                shake256_squeezeblocks(shakeblock, 1, &shakestate);
+            }
+        } while (alpha == 31);
+        b = (h1[(i >> 3)] >> (i & 7)) & 1;
+
+        PQCLEAN_MQDSS64_AVX2_gf31_nunpack(r, sig, N);
+        PQCLEAN_MQDSS64_AVX2_gf31_nunpack(t, t1packed + NPACKED_BYTES * i, N);
+        PQCLEAN_MQDSS64_AVX2_gf31_nunpack(e, e1packed + MPACKED_BYTES * i, M);
+
+        if (b == 0) {
+            PQCLEAN_MQDSS64_AVX2_MQ(y, r, F);
+            for (j = 0; j < N; j++) {
+                x[j] = (gf31)(alpha * r[j] - t[j] + 31);
+            }
+            for (j = 0; j < M; j++) {
+                y[j] = (gf31)(alpha * y[j] - e[j] + 31);
+            }
+            PQCLEAN_MQDSS64_AVX2_vgf31_shorten_unique(x, x);
+            PQCLEAN_MQDSS64_AVX2_vgf31_shorten_unique(y, y);
+            PQCLEAN_MQDSS64_AVX2_gf31_npack(packbuf0, x, N);
+            PQCLEAN_MQDSS64_AVX2_gf31_npack(packbuf1, y, M);
+            com_0(c + HASH_BYTES * (2 * i + 0), sig + HASH_BYTES + NPACKED_BYTES, sig, packbuf0, packbuf1);
+        } else {
+            PQCLEAN_MQDSS64_AVX2_MQ(y, r, F);
+            PQCLEAN_MQDSS64_AVX2_G(z, t, r, F);
+            for (j = 0; j < M; j++) {
+                y[j] = (gf31)(alpha * (31 + pk_gf31[j] - y[j]) - z[j] - e[j] + 62);
+            }
+            PQCLEAN_MQDSS64_AVX2_vgf31_shorten_unique(y, y);
+            PQCLEAN_MQDSS64_AVX2_gf31_npack(packbuf0, y, M);
+            com_1(c + HASH_BYTES * (2 * i + 1), sig + HASH_BYTES + NPACKED_BYTES, sig, packbuf0);
+        }
+        memcpy(c + HASH_BYTES * (2 * i + (1 - b)), sig + NPACKED_BYTES, HASH_BYTES);
+        sig += NPACKED_BYTES + 2 * HASH_BYTES;
+    }
+    shake256_ctx_release(&shakestate);
+
+    H(c, c, HASH_BYTES * ROUNDS * 2);
+    if (memcmp(c, sigma0, HASH_BYTES) != 0) {
+        return -1;
+    }
+
+    return 0;
+}
+
+/**
+ * Writes the signature followed by the message to sm and the total length to smlen.
+ */
+int PQCLEAN_MQDSS64_AVX2_crypto_sign(
+    uint8_t *sm, size_t *smlen,
+    const uint8_t *m, size_t mlen, const uint8_t *sk) {
+    size_t siglen;
+
+    PQCLEAN_MQDSS64_AVX2_crypto_sign_signature(
+        sm, &siglen, m, mlen, sk);
+
+    memmove(sm + SIG_LEN, m, mlen);
+    *smlen = siglen + mlen;
+
+    return 0;
+}
+
+/**
+ * Verifies a given signature-message pair under a given public key.
+ */
+int PQCLEAN_MQDSS64_AVX2_crypto_sign_open(
+    uint8_t *m, size_t *mlen,
+    const uint8_t *sm, size_t smlen, const uint8_t *pk) {
+    /* The API caller does not necessarily know what size a signature should be
+       but MQDSS signatures are always exactly SIG_LEN. */
+    if (smlen < SIG_LEN) {
+        memset(m, 0, smlen);
+        *mlen = 0;
+        return -1;
+    }
+
+    *mlen = smlen - SIG_LEN;
+
+    if (PQCLEAN_MQDSS64_AVX2_crypto_sign_verify(
+                sm, SIG_LEN, sm + SIG_LEN, *mlen, pk)) {
+        memset(m, 0, smlen);
+        *mlen = 0;
+        return -1;
+    }
+
+    /* If verification was successful, move the message to the right place. */
+    memmove(m, sm + SIG_LEN, *mlen);
+
+    return 0;
+}
--- a/crypto_sign/mqdss-64/clean/sign.c
+++ b/crypto_sign/mqdss-64/clean/sign.c
@ -1,4 +1,3 @@
-#include <assert.h>
 #include <stddef.h>
 #include <stdint.h>
 #include <string.h>
--- a/test/duplicate_consistency/mqdss-48_clean.yml
+++ b/test/duplicate_consistency/mqdss-48_clean.yml
@ -0,0 +1,20 @@
+consistency_checks:
+- source:
+    scheme: mqdss-48
+    implementation: avx2
+  files:
+    - api.h
+    - mq.h
+    - LICENSE
+    - mq.h
+    - sign.c
+    - params.h
+- source:
+    scheme: mqdss-64
+    implementation: clean
+  files:
+    - gf31.c
+    - gf31.h
+    - LICENSE
+    - mq.c
+    - mq.h
--- a/test/duplicate_consistency/mqdss-64_clean.yml
+++ b/test/duplicate_consistency/mqdss-64_clean.yml
@ -9,3 +9,14 @@ consistency_checks:
    - mq.c
    - mq.h
    - sign.c
+- source:
+    scheme: mqdss-64
+    implementation: avx2
+  files:
+    - api.h
+    - mq.h
+    - LICENSE
+    - mq.h
+    - sign.c
+    - params.h
+
--- a/test/test_testvectors.py
+++ b/test/test_testvectors.py
@ -40,6 +40,7 @@ def test_testvectors(implementation, impl_path, test_dir, init, destr):
            implementation.name,
            '.exe' if os.name == 'nt' else ''
        ))],
+        print_output=False,
    ).replace('\r', '')
    assert(implementation.scheme.metadata()['testvectors-sha256'].lower()
           == hashlib.sha256(out.encode('utf-8')).hexdigest().lower())