[curves] Curve448 implementation questions

Fri Nov 13 11:49:40 PST 2015

> On Nov 13, 2015, at 6:09 AM, Watson Ladd <watsonbladd at gmail.com> wrote:
> 
> 
> On Nov 13, 2015 8:44 AM, "Nicholas Wilson" <nicholas at nicholaswilson.me.uk <mailto:nicholas at nicholaswilson.me.uk>> wrote:
> >
> > Hi,
> >
> > I have some questions about implementing Curve448, having implemented
> > it for mbedTLS in a pull request here:
> > https://github.com/ARMmbed/mbedtls/pull/348 <https://github.com/ARMmbed/mbedtls/pull/348>
> >
> > I've been following the discussion around the IRTF standard, but I'm
> > still not quite sure what the recommended behaviour is for validating
> > public points.
> >
> > In the latest standard, regarding public u-values:
> >
> > "When receiving such an array, implementations of X25519 (but not
> > X448) MUST mask the most-significant bit in the final byte."
> > (https://tools.ietf.org/html/draft-irtf-cfrg-curves-11#section-4.2 <https://tools.ietf.org/html/draft-irtf-cfrg-curves-11#section-4.2>
> >
> > This suggests, but doesn't state, that implementations shouldn't do
> > any masking for Curve448, but should instead just reduce the public
> > value mod P448 (or issue an error if it's not in canonical form,
> > probably my preferred implementation choice). Is that correct?
> 
> Yes. But note that Curve448 is designed to work correctly with all inputs. There are issues with TLS because of requirements on contributory behavior.
> 
> 

Right.  The X448 spec says not to error out except on small-order points (i.e. ones that give you 0 as output).  Whether the small-order points are a problem or not depends on your protocol; for some protocols it may be safe to ignore the error return and use the 0 output.
> > Secondly, I have a question about the implementation of the arithmetic
> > itself. I've had a hunt for Mike's various papers and presentations on
> > Ed448-Goldilocks, and I think I understand the rationale for the
> > choice of prime. What I can't find though is a simple do-this-do-that
> > guide for implementers, like NIST publishes for their primes.
> 
> NIST guide is hard to make constant time. I'm not Mike, but as I recall the division is into 56 bit limbs to avoid carries and permit vectorised additions. The formula isn't that hard to work out from there: it's a matter of adding the top pieces to the bottom in the right places. Safety analysis is a bit trickier.
> 
> The big benefit is you avoid carries and dependent instruction chains.
> 
Right.  If you already have an engine that does all the NIST primes in a modular way (eg in hardware), then you can use the NIST approach, but otherwise reduced radix is the way to go.
> > In the end, I've come up with my own modular reduction formula after
> > playing around with the equations, and it seems to be reasonable
> > (three 448-sized additions), but I wondered if I'm missing any tricks
> > that should make it even quicker.
> >
> > I'm aware that for 32-bit machines, you can split the numbers up into
> > 32-bit limbs and write q=2^32, p = q^14 - q^7 - 1 and come up with a
> > big complex formula to explictly do your reduction in terms of the
> > limb values.
> >
> > I'm more interested in optimising for 64-bit though, so I've chosen to
> > split the input into four "limbs" of 2^224 and do a few bignum
> > additions on those limbs, which will use the bignum library's 64-bit
> > CPU operations, and may come out quicker. The code is certainly
> > simpler, and I don't think anyone deploying Curve448 really cares
> > about performance as long as it's "good enough".
> >
> > Let N = A0 + 2^448 A1, let A1 = B0 + 2^224 B1. Then N (or N+P) mod P
> > is A0+A1+B1+(B0+B1)*2^224. This works on paper, and produces the
> > expected output on the test vectors from the IRTF spec.
> >
> > So my question is whether this is the expected reduction formula, or
> > whether there's some method which is simpler still.
> 
This is the expected reduction formula.  However, I believe it’s easier to implement Ed448 using reduced radix, where you store 8, 56-bit limbs as 8, 64-bit words with some headroom, and only do the exact reduction at the very end.  Maybe this isn’t true in your case because you already have a bignum engine, but it might be worth taking a look.

The simplest multiplication formula I know is the one found in  http://sourceforge.net/p/ed448goldilocks/code/ci/x448/tree/x448.c <http://sourceforge.net/p/ed448goldilocks/code/ci/x448/tree/x448.c> though it’s macro’d up so that it works for both 32- and 64-bit limbs.  You use an array of 8, 128-bit accumulators. Instead of multiplying AxB and then reducing, at each step you multiply A by the i’th digit of B and add to the accumulators, and then you multiply A by 2^56 and reduce in place.

The in-place multiplication is done up to logical shifting.  That is, if you have A = A0, A1, A2, A3, A4, A5, A6, A7, then A*2^56 = (A7, A0, A1, A2, A3+A7, A4, A5, A6): it’s a rotation of A with one addition.  But there’s no need to actually rotate A by one position: it’s better to just adjust your indexing.

At the end you propagate the carries in the accumulators.

The same algorithm can be used with product-scanning, where you first compute accum[0], then accum[1], and so on.  This might work better on x86 since it doesn’t actually have eight 128-bit registers.  But I think the compiler will treat the two as almost the same anyway, at least if the loops are unrolled and optimizations are on.  The usual difference between product-scanning and operand-scanning is the carry propagation, but here they’re only propagated at the end.  The other difference is just the order you do the multiplications in, but the compiler will completely reorder them anyway.

P448 is also designed to support a relatively simple Karatsuba multiplication algorithm, which should take about 20% less time.  The version at http://sourceforge.net/p/ed448goldilocks/code/ci/decaf/tree/src/p448/arch_ref64/p448.c <http://sourceforge.net/p/ed448goldilocks/code/ci/decaf/tree/src/p448/arch_ref64/p448.c>, p448_mul, is the closest thing I have to a plan of record for that one.  But of course, the “schoolbook” method is simpler.  Note also that in this version, c is __restrict__ because we start writing results to it before we’re done with a and b.

The simpler multiplication algorithm in x448.c is not especially easy to turn into a dedicated squaring algorithm.  But if you want the performance increase of a dedicated squarer, then you probably want Karatsuba too.
> > Thanks for your help,
> > Nick
> 

Cheers,
— Mike
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://moderncrypto.org/mail-archive/curves/attachments/20151113/d535e66c/attachment.html>