formatting, moves constant values to consts.go, etc.
* Code is much slower than x86 specialized implementation
* Cross checked on ARMv8 (32-bit)