Michael Hamburg mike at shiftleft.org
Tue Jan 20 15:24:21 PST 2015

```> On Jan 20, 2015, at 12:42 PM, Trevor Perrin <trevp at trevp.net> wrote:
>
> On Mon, Jan 19, 2015 at 8:09 PM, Mike Hamburg <mike at shiftleft.org> wrote:
>>
>> The point of the Goldilocks prime is that Karatsuba is a win at that size,
>> and Karatsuba is a pain on full-radix.  So even though Ed448 is suitably
>> aligned for a full-radix implementation on 32 bits, that would take 14^2 =
>> 196 muls vs 16^2 * 3/4 = 192 for Karatsuba.  The extra adds for Karatsuba
>> cost less than the carry handling for full-radix unless you have some sort
>> of accelerator, and I don't think UMAAL is enough of an accelerator to make
>> up that difference. 64-bit is similar except the coefficients aren't
>> aligned.  So it's almost always a win to use Karatsuba and reduced-radix.
>
> Thanks for the analysis.  It seems the crux of your argument is that
> Karatsuba (and other fancy multiplications) favors reduced-radix on
> many of the platforms where Longa was arguing for full-radix?

Well that, and also that the advantage of full-radix diminishes as the primes get larger.  So supporting full-radix is important for ~256-bit, and this strategy is worth trying even on Intel, but for larger primes it becomes less important. It’s still worth considering in design, though.

> If you have time it might be educational to explain why Karatsuba

You have to add the inputs to Karatsuba.  If you’re working with a full-radix number there will be carries.  In a straightforward, completely full-radix case the sum is necessarily longer by one word, which negates much of the advantage.

You could instead break eg a 380-bit number into two 190-bit halves, and while you’d need a carry chain to add those halves you could at least avoid overflow.  But then you'd still have to be careful with sign extensions when computing the product, because the intermediate subtractions in Karatsuba can borrow.  On the other hand, in the usual reduced-radix implementation the subtractions can’t borrow.  This is because if you don’t carry between digits, the operation is equivalent to one on polynomials, and when multiplying two polynomials with positive digits the result has positive digits.  This is still true if you include the reduction step, so long as your prime has all negative coefficients except for the leading one (eg, 2^big-small, or 2^448-2^224-1).

>> This analysis doesn't support Longa's point, since it shows cases where
>> reduced radix is very fast, even faster than full-radix on a
>> comparable-sized prime on a machine where that's fast.
>
> But if a 448-ish prime was chosen to take into account full-radix
> (maybe not as close to a power of 2 as Goldilocks?), that might change
> this analysis?

No, I would expect it to outperform eg 2^444-17 (from Ben Harris’ Pareto list) almost everywhere.

> Even if Goldilocks has similar performance to other 448-bit primes on
> platforms that favor full-radix, that's a weakening of the Goldilocks
> argument that it's getting "extra" security margin over 384 very
> cheaply: it would be getting this "extra" margin, on these platforms,
> at full price.

Could be.  A back-of-the-envelope calculation suggests that 2^379-19 could match Goldilocks in efficiency (n^lg3-wise) on Tegra2, but the implementation strategies are very different so you’d need to test it.  It’s generally true that Goldilocks is especially good on platforms that already have fast reduced-radix arithmetic, like Haswell and ARM NEON.  Maybe if MSR would submit Ted37919 or Ed384-mers to SUPERCOP, we could find out how they compare on platforms which don’t favor reduced-radix like Atom and K10 and Bulldozer; and which disfavor it like Tegra2.

Overall, Goldilocks may not be most efficient prime on any one platform (except maybe NEON), but it has good performance across many platforms, and I have put in quite a bit of work to demonstrate this.  I appreciate Longa et al’s apparent point (I say “apparent” because I’m not them and I wasn’t at RWC) that primes should be chosen with this property.  Their proposed 2^379-19 may work well across platforms, but this needs to be tested.

Cheers,
— Mike

```