[curves] E-521

Mon Oct 27 18:46:15 PDT 2014

> On Oct 27, 2014, at 6:24 PM, Trevor Perrin <trevp at trevp.net> wrote:
> 
> On Mon, Oct 27, 2014 at 4:42 PM, Michael Hamburg <mike at shiftleft.org> wrote:
>> I just checked in a (still experimental!) mod to the Goldilocks code which enables p521.  Compile with:
> [...]
>>        ecdh: 803kcy
> 
> Cool, added:
> 
> https://docs.google.com/a/trevp.net/spreadsheet/ccc?key=0Aiexaz_YjIpddFJuWlNZaDBvVTRFSjVYZDdjakxoRkE&usp=sharing#gid=0
> 
> 
> I know you ruled out Ed480-Ridinghood for performance on 32-bit
> systems (I guess 8x60-bit limbs works on 64-bit, but 16x30 isn't
> enough headroom on 32-bit?)
> 
> How do you expect E-521 to perform on 32-bit? (18x29?)
> 
> Trevor

Yeah, so 16x30 won’t work on 32-bit without carry propagation, because there are less than 4 bits of headroom (2^64 / 2^60, but the input isn’t perfectly reduced), but the multiplication and reduction produces coefficients as large as 38.

> ((x^16-1)/(x-1)).numerator()^2 % (x^16 - x^8 - 1)
24*x^15 + 26*x^14 + 28*x^13 + 30*x^12 + 32*x^11 + 34*x^10 + 36*x^9 + 38*x^8 + 16*x^7 + 17*x^6 + 18*x^5 + 19*x^4 + 20*x^3 + 21*x^2 + 22*x + 23

It might be worth trying anyway to see how slow it would be.  Another option is 20x24, which is going to be ~50% slower (20^2 *3/4 instead of 16^2 *3/4 multiplies) but doesn’t have carry handling problems.

For 2^521-1 as 18x29, the biggest coefficient in that reduction is 35, and there are 6 bits of headroom.  So it works, but it requires a reduction before every multiply.  I’m not sure what the best multiply algorithm would be between Karatsuba, 3-way Chung Hasan (like on 64-bit) or the new Granger-Scott formula (which I should try on 64-bit, maybe it’s tuneable).  It probably would work pretty well in NEON, but I don’t think it would be quite as efficient as Goldilocks.

Another option would be 20x26, and deal with the extra bit somehow.

Sorry I don’t have better numbers for you right now.

Cheers,
— Mike