[curves] Goldilocks optimizations

Mon Oct 27 19:16:09 PDT 2014

FWIW Mike wrote a good explanation of Goldilocks optimizations:

http://www.ietf.org/mail-archive/web/cfrg/current/msg05373.html

The linebreaks in that link are bad, so I copied the text below:

====

> On Oct 23, 2014, at 1:52 PM, Phillip Hallam-Baker <phill at hallambaker.com> wrote:
>
> Now if you are arguing that 512 will also break stride on a 512 bit architecture then that is an important data point that by the same logic argues for looking at a curve that does fit and maybe Golilocks is actually the one.
>
> But before conceding that point I would like to see more of an explanation of the optimization being used and how much it delivers.
>

I really ought to write this up in a proper paper.  But here’s a sketch.

First of all, many modern implementations use reduced-radix
arithmetic.  So curve25519-donna uses 5x51-bit words, and
ed448-goldilocks uses 8x56-bit words, or 16x28-bit words on 32-bit
platforms.  Let’s consider Goldilocks on 64-bit.

The multiply routine can accumulate several 112-bit products in a
128-bit register pair without carrying.  This is important because
carry handling slows you down even on x86-64 and arm-scalar.  It’s
even more important with vector units such as NEON, because carry
handling is much slower when you have to exchange elements between
vectors.  The 448-bit size is chosen so that the whole multiply
routine can run on a 32-bit platform (like NEON) without propagating
any carries until the end.  For a 64-bit platform, there’s more
headroom and fewer multiplies in that routine, so the next prime up in
the family (2^480 - 2^240 - 1, “Ed480-Ridinghood”) could be used
instead.  But I wanted to ensure that NEON would be fast, so I went
with 448.  Conveniently, at this size the working state for a multiply
fits exactly in the NEON register file.

Reduced radix also speeds up addition and subtraction, which can now
be vectorized.  With too many additions or subtractions, you’ll need
to reduce to avoid overflow (conservatively, of course, since you
can’t usually check in constant-time software).  But in the “hot”
routines like point additions, most numbers only go through at most
one addition or subtraction before being multiplied.  So they don’t
need to be reduced if the multiply routine has a little extra slack.
For Goldilocks, this is true on 64 bit.  On 32-bit, there’s enough
slack for an addition in each multiplicand, but not for two
subtractions unless the implementation uses signed numbers (which it
currently doesn’t).  So it’s on the edge with a small performance
penalty for reductions, and could probably be tuned a bit more.

OK, so that’s one reason that the prime is less than 512 bits: it’s
designed to be represented internally in 512 bits, but some of those
bits are headroom for carries.  It’s probably the biggest prime that
can do this on a 32-bit machine.  (The biggest such “Crandall” prime
is DJB's 2^414-17.)

The other main design point is the “golden Solinas” form p = t^2 - t -
1, where t = 2^224.  I’m calling it “golden” (and Goldilocks) because
t is the golden ratio mod p.  This is a Solinas trinomial, the
sparsest kind of prime except for Mersenne primes like 2^521-1.  The
NIST curves taught us a lesson about Solinas primes: they can be fast,
but the coefficients are constraining.  Since most of the NIST primes
have lots of 32-bit aligned coefficients, they suck on 64-bit machines
and prevent you from reducing the radix to eg 28 bits.

The Goldilocks prime mitigates that problem by having only one
coefficient, right in the middle.  So if you are representing your
numbers in an even number of words, reduction is really simple.  This
makes any word size favorable if it is equal to (full radix) or
slightly less than (reduced radix) a divisor of 448/2: [1, 2, 4, 7, 8,
14, 16, 28, 32, 56, 112, 224].  So while it doesn’t work perfectly on
every word size, it should work well on full-radix 16, full- or
reduced-radix 32, or full-radix 64-bit machines.  Many of the options
involve 2^n limbs in reduced-radix mode, which vectorizes really well.

In short, because it’s a Solinas trinomial, p has the second most
efficient reduction algorithm after Mersenne primes.  You don’t get
quite as many choices of efficient word layout as with a Crandall
prime 2^big - small (which has no internal coefficient), but the
golden-ratio structure is pretty much the best you can get on a
Solinas prime.

There are only a few options for bit lengths of primes of this shape.
The crypto-sized ones are 216, 322, 416, 448, 450 and 480.  Note also
that 2^448 - p ~ sqrt(p).  So security issues such as biases from not
being really close to a power of 2 are always less than 1/sqrt(p),
which is your security target anyway.  Likewise, the curve’s order
will look basically the same as for a Crandall prime.

There’s one more advantage to the golden-ratio structure.  The
Goldilocks prime works very well with Karatsuba multiplication.
Recall that Karatsuba computes (a + bt) (c + dt) = ac + bdt^2 +
((a+b)(c+d)-ac-bd)t, where t is either a polynomial term or a multiple
of the word size.  So it saves one multiplication at the cost of
several additions.  This is usually combined with reduced-radix math
so that you don’t have to handle carries in (a+b) and (c+d).

Let t = 2^224, so that t^2 = t+1 mod the Goldilocks prime.  Reducing
the above, (a+bt)(c+dt) == ac + bd + ((a+b)(c+d)-ac)t.  That is,
reduction mod p makes the Karatsuba formula slightly simpler instead
of more complex.

The products above — ac etc. — will be wider than t, and so will also
need to be reduced.  So it’s also important that this formula
multiplied by t is also simple:

(a+bt)(c+dt)t = (a+b)(c+d)(1+t) + bdt - ac.

The actual math code uses variants of these formulas depending on the
platform, for better locality or pipelining or whatever, but the basic
idea is the same.  Since a and b are 4x56-bit numbers, they can be
added with a single 4-lane vectorized add; likewise for c and d.  All
in all, this Karatsuba technique may or may not be profitable on
AVX-512, but it is definitely profitable on x86-64 and NEON.

So for example, on a 64-bit machine, the Goldilocks field multiply
uses approximately:
* 2x4x64-bit vector adds
* 5 widening multiplies and 43 multiply-accumulates
* 5 128-bit additions and 4 128-bit subtractions
* Two final 64-bit for a total of about 54 ADD/SUB and 52 ADC/SBB.
I’m not looking at asm though, so I probably missed some.
* 10 radix adjustments (mask and shift a 128-bit variable).

The multiplies are easy to interleave at different place values,
because carries don’t propagate.  This improves performance noticeably
on some platforms.

Without Karatsuba, you’d need at least 64 multiply-accumulates.  Or
with packed math at this size, you’d need 49 multiplies (48
accumulates), but the accumulates would be wider, 192 bits, and thus
much slower.

A Goldilocks add takes:
* 2x4x64-bit vector adds
It should be done in a few cycles, maybe less if it’s inline.

By comparison, a 512-bit packed multiply (Comba like in MS NUMS) uses:
* 1 widening multiply
* 3 64x64->128-bit multiply-accumulates
* 60 64x64->192-bit multiply-accumulates, if I’ve calculated right
* 8 small multiply-accumulates (64x64->128 bits) for reduction
* at least one, maybe two rounds of carry propagation after the reduction

This comes out to order of 80 ADD and 140 ADC instructions.  Remember
that MUL isn’t that much more expensive than ADC on recent Intel
cores, and even on AMD it’s only twice as expensive.  It’s still
possible to interleave the multiplications, but it’s trickier because
of carry propagation.

Some platforms have faster carry propagation, including ARM scalar
(UMAAL) and the upcoming Intel Broadwell (MULX/ADCX/ADOX).  But
usually vector units don’t have this, and ARM's NEON vector unit is a
huge win — like triple the performance — if you can use it
efficiently.

A packed NUMSp512 add takes:
* one add
* 7 adds with carry in a chain
* a multplication for the reduction
* probably two rounds of carry propagation

This sort of carry propagation is most of why NUMSp512 takes about
twice as long as Goldilocks.

Overall, these optimizations give a curve which is 64 bits longer than
NUMS p384 with <10% worse performance, and 64 bits shorter than NUMS
p512 with about double the performance.  So if you mock up an
“efficiency” metric which trades off bit length for speed, Goldilocks
scores very well.  Its main competitors in this region would be its
480-bit cousin (but not on 32-bit), or 2^521-1, which I still think is
pretty good despite your criticism.

There are other families of fast primes, as outlined in my email on
prime performance, but at a size of “between about 384 bits and about
512 bits”, Goldilocks is one of only a few really good ones.

Cheers,
— Mike