[curves] The great debate over point formats

Mike Hamburg mike at shiftleft.org
Fri Jan 31 00:07:44 PST 2014

```On Jan 30, 2014, at 10:45 PM, Robert Ransom <rransom.8774 at gmail.com> wrote:

> On 1/30/14, Diego Aranha <dfaranha at gmail.com> wrote:
>
>> Your comments are very relevant, but let me justify some of our design
>> choices. We picked field sizes similar to NIST curves, trying to provide
>> something closer to drop-in replacements.
>
> A true drop-in replacement for one of the NSA curves would be a
> small-parameter Edwards curve over the same field, satisfying the
> ‘SafeCurves’ criteria, with a=1 and non-square d, such that:
>
> * the isogenous Montgomery form (*not* the isomorphic one; see the
> ‘Isogeny patterns among Edwards curves’ thread) is isomorphic to a
> short-Weierstrass curve with a=-3; and
>
> * if the field order is congruent to 3 mod 4, the isogenous Edwards
> curve a'=-1, d'=d-1 has non-square d/a (so square d).
>
> Existing implementations of the NSA curves could simply swap out their
> short-Weierstrass b to start using the new curves; over time,
> implementations could be updated to use faster, safer formulas on the
> new curves.

We could start with x^2 + y^2 = 1 - 14666 x^2 y^2 mod 2^192-2^64-1.  The isogenous curve — y^2 = x^3 + 58666*x^2 + x — is isomorphic to y^2 = x^3 - 3*x + 6047900113480193987160910265022055632294672911518856488260.

>> only vector or hardware implementations, but also the fast integer
>> multipliers already available to software implementations in many
>> platforms. Of course, these could require specialized assembly-language
>> multipliers for optimal performance. You can find some brief notes below.
>
> You seem to be assuming that (a) the implementor can use (an
> equivalent of) the Intel ADC instruction, and (b) ADC is fast.  (a)
> does not hold for C implementations; according to the Ed25519 paper,
> (b) does not hold for whichever of Intel's AMD64 chips was newest at
> that time.

You can access ADC by using __uint128_t on clang and gcc (and possibly on other platforms).  It's still faster in assembly or intrinsics, mostly due to the register allocator barfing on EC numerics code, but it still pretty much works in C.

ADC is also passably fast on most processors, though it works better on AMD than Intel I've heard.  At 256 bits, it's not necessarily worth having extra limbs to reduce the number of ADC instructions.  At 384 bits, it may be worth going to extra limbs, and also Karatsuba may be profitable.  By 448 bits, you almost definitely want both reduced-radix and Karatsuba.  I think 2^521-1 probably wants 9x58-bit limbs and 3-way Karatsuba, but I haven't tuned my implementation yet.

Diego, have you implemented arithmetic mod the primes in your paper?  Do you know whether they're fast or not, and with what implementations, and maybe even on what platforms, or are you speculating?

Cheers,
-- Mike

```