[curves] Microsoft ECCLib for "NUMS" curves

Thu Jul 3 16:00:08 PDT 2014

Trevor Perrin writes:
> Microsoft's field prime is 2^256 - 189 instead of 2^255 - 19.
> Seems pretty similar.  Is there any inherent efficiency difference for
> scalar multiplication?

Yes, although the details depend on the CPU.

As a review, the old-fashioned approach to multiprecision software is to
use the largest possible radix 2^b compatible with the CPU's multiplier;
i.e., b = r/2 if the multiplier produces r-bit outputs. The inner loop
is then multiply followed by an r-bit add _with carry_. This is okay on
CPUs with particularly fast carries but it's a big performance problem
on most CPUs.

The modern approach is more flexible, allowing b smaller than r/2. This
has the huge advantage that you can skip many carries. Specifically, you
can add as many as 2^(r-2b) products without overflow. The exact impact
depends on k, the number of input limbs; if 2^(r-2b) is large enough
compared to k then the number of carries drops to only about 2k. Of
course, reducing b also means reducing the size of integers that you can
handle for any particular value of k, but on most CPUs this is amply
justified by the reduced number of carries.

What's particularly interesting is the impact of reducing mod 2^(kb)-c.
The old-fashioned approach here is integer multiplication followed by
reduction; the reduction involves about k carries, for a total of about
3k carries. But you can do three times better than this _if_ c is small
enough. The idea is simply to delay the carries until after the
reductions mod 2^(kb)-c. This means accumulating the 2b-bit products
into just k outputs, including the obvious multiplications by c, and
then doing the carries. The size of c has a huge impact on how large b
can be before this overflows.

As a concrete example, here's how large integers can be for various
approaches with k=9 and r=64:

   * 288 bits: The old-fashioned approach takes b=32 and requires a ton
     of carries.

   * 270 bits: The modern approach without merged reduction takes b=30
     and uses about 3k carries.

   * 252 bits: The modern approach with merged reduction for c=19 takes
     b=28 and uses about k carries.

   * 234 bits: The modern approach with merged reduction for c=189 takes
     b=26 and uses about k carries.

The picture is actually smoother than this, for several reasons:

   * With the modern approach it's often good to use signed limbs rather
     than unsigned limbs: i.e., to take limbs between -2^(b-1) and
     2^(b-1) rather than between 0 and 2^b. Products are then between
     -2^(2b-2) and 2^(2b-2) rather than between 0 and 2^2b. The
     disadvantage of this approach is that on most CPUs signed carries
     are more expensive than unsigned carries, but with the modern
     approach there aren't many carries in the first place.

   * With the modern approach it's often good to use a non-integer
     radix. For example, 2^255-19 fits into 9 signed limbs with merged
     reduction with a non-integer radix 2^(85/3). For 2^256-189 the same
     approach overflows most of the outputs and needs many more carries.

   * Higher-level computations often involve ab+cd, (a+b)c, etc. If
     there's a spare bit then the carries can be merged.

By now there have been several Curve25519 implementation papers for
various platforms. Most of them have taken advantage of the smallness of
19 and would lose speed for 2^256-189.

Of course, beyond the performance loss, 2^256-189 comes with annoyances
such as needing _more_ than 32 bytes for the usual compressed encoding
of a curve point (x,y). What's the corresponding advantage supposed to
be? I checked the corresponding paper and found

   * primes 3 mod 4 being praised as allowing square roots to be
     computed "efficiently, and in constant-time", and

   * 256 bits being praised as gaining "half a bit of ECDLP security".

The first argument is not an advantage: it also applies to 2^255-19 and
other primes 5 mod 8, as shown in the Ed25519 paper and software. As for
the second argument, maybe the MSR people believe that Moore's law will
eventually reach Curve25519's quantitative security level (and that we
won't have switched to post-quantum cryptography before then), but are
they seriously arguing that a sane response is to switch to something
that's breakable with 1.4x or 2x the effort?

---Dan