On Nov 13, 2015 8:44 AM, "Nicholas Wilson" <<a href="mailto:nicholas@nicholaswilson.me.uk">nicholas@nicholaswilson.me.uk</a>> wrote: > > Hi, > > I have some questions about implementing Curve448, having implemented > it for mbedTLS in a pull request here: > <a href="https://github.com/ARMmbed/mbedtls/pull/348">https://github.com/ARMmbed/mbedtls/pull/348</a> > > I've been following the discussion around the IRTF standard, but I'm > still not quite sure what the recommended behaviour is for validating > public points. > > In the latest standard, regarding public u-values: > > "When receiving such an array, implementations of X25519 (but not > X448) MUST mask the most-significant bit in the final byte." > (<a href="https://tools.ietf.org/html/draft-irtf-cfrg-curves-11#section-4.2">https://tools.ietf.org/html/draft-irtf-cfrg-curves-11#section-4.2</a> > > This suggests, but doesn't state, that implementations shouldn't do > any masking for Curve448, but should instead just reduce the public > value mod P448 (or issue an error if it's not in canonical form, > probably my preferred implementation choice). Is that correct? Yes. But note that Curve448 is designed to work correctly with all inputs. There are issues with TLS because of requirements on contributory behavior. > > Secondly, I have a question about the implementation of the arithmetic > itself. I've had a hunt for Mike's various papers and presentations on > Ed448-Goldilocks, and I think I understand the rationale for the > choice of prime. What I can't find though is a simple do-this-do-that > guide for implementers, like NIST publishes for their primes. NIST guide is hard to make constant time. I'm not Mike, but as I recall the division is into 56 bit limbs to avoid carries and permit vectorised additions. The formula isn't that hard to work out from there: it's a matter of adding the top pieces to the bottom in the right places. Safety analysis is a bit trickier. The big benefit is you avoid carries and dependent instruction chains. > > In the end, I've come up with my own modular reduction formula after > playing around with the equations, and it seems to be reasonable > (three 448-sized additions), but I wondered if I'm missing any tricks > that should make it even quicker. > > I'm aware that for 32-bit machines, you can split the numbers up into > 32-bit limbs and write q=2^32, p = q^14 - q^7 - 1 and come up with a > big complex formula to explictly do your reduction in terms of the > limb values. > > I'm more interested in optimising for 64-bit though, so I've chosen to > split the input into four "limbs" of 2^224 and do a few bignum > additions on those limbs, which will use the bignum library's 64-bit > CPU operations, and may come out quicker. The code is certainly > simpler, and I don't think anyone deploying Curve448 really cares > about performance as long as it's "good enough". > > Let N = A0 + 2^448 A1, let A1 = B0 + 2^224 B1. Then N (or N+P) mod P > is A0+A1+B1+(B0+B1)*2^224. This works on paper, and produces the > expected output on the test vectors from the IRTF spec. > > So my question is whether this is the expected reduction formula, or > whether there's some method which is simpler still. > > Thanks for your help, > Nick > > ---- > Nicholas Wilson: <a href="mailto:nicholas@nicholaswilson.me.uk">nicholas@nicholaswilson.me.uk</a> > _______________________________________________ > Curves mailing list > <a href="mailto:Curves@moderncrypto.org">Curves@moderncrypto.org</a> > <a href="https://moderncrypto.org/mailman/listinfo/curves">https://moderncrypto.org/mailman/listinfo/curves</a>