The lazy-initialisation of BN_MONT_CTX was serialising all threads, as noted by
Daniel Sands and co at Sandia. This was to handle the case that 2 or more
threads race to lazy-init the same context, but stunted all scalability in the
case where 2 or more threads are doing unrelated things! We favour the latter
case by punishing the former. The init work gets done by each thread that finds
the context to be uninitialised, and we then lock the "set" logic after that
work is done - the winning thread's work gets used, the losing threads throw
away what they've done.
(Imported from upstream's bf43446835bfd3f9abf1898a99ae20f2285320f3)
Initial fork from f2d678e6e89b6508147086610e985d4e8416e867 (1.0.2 beta).
(This change contains substantial changes from the original and
effectively starts a new history.)