[curves] Implementation

Michael Hamburg mike at shiftleft.org
Wed Feb 12 13:50:28 PST 2014


To be clear, the requirements from our legal department are:
* I need to ask them before releasing code, and
* I need to comply with US export control law, which (as Legal interprets it) requires preventing IPs associated with a few specific countries from downloading the software.

I don’t believe that GitHub can filter by country (please tell me if that’s changed), but SourceForge or some other service might be able to.

Anyway, I haven’t gotten the ball rolling yet, so it might take a while to get permission, especially since the software isn’t done yet.  But I’ll see what I can do.

Cheers,
— Mike

On Feb 12, 2014, at 12:26 PM, Michael Hamburg <mike at shiftleft.org> wrote:

> Hi Trevor,
> 
> Not a github initially, because of Rambus legal and export control and all that.  I’ll see if I can set up something more private and get back to you.
> 
> Cheers,
> — Mike
> 
> On Feb 12, 2014, at 11:22 AM, Trevor Perrin <trevp at trevp.net> wrote:
> 
>> Could we expect a github?  I'd love to see this!
>> 
>> Trevor
>> 
>> 
>> On Tue, Feb 11, 2014 at 12:31 AM, Mike Hamburg <mike at shiftleft.org> wrote:
>>> Hello curves,
>>> 
>>> I've been working on implementation for the new curves, and I'd like to
>>> report status and some formulas and issues I found.
>>> 
>>> I'm aiming for a fairly generic C/intrinsics implementation which should
>>> support any curves with minimal extra effort, but I'm starting with
>>> Ed448-Goldilocks because it's mine.  I have Haswell and Sandy Bridge test
>>> machines.  I also have a vectorless Cortex A9, but it doesn't work yet
>>> because I'm using 64x64->128-bit multiply intrinsics.  Here's what I've
>>> found so far.
>>> 
>>> If you have any suggestions on the formulas or algorithms, I'd definitely
>>> appreciate it.
>>> 
>>> Field arithmetic:
>>> * Karatsuba is beneficial for Ed448.
>>> * Radix 2^56 in a 64-bit limb, 8 limbs.
>>> * M ~ 153cy on Sandy Bridge, 125cy on Haswell
>>> * square ~ 0.75M
>>> * small fixed mul ~ 0.25M
>>> * add/sub (unreduced) ~ 0.04M, a little cheaper on Haswell because of AVX
>>> 
>>> I'm using the 1/sqrt(x) point encoding for now, basically because I already
>>> have code for that from an earlier project.  I'm not yet counting the time
>>> to serialize and deserialize field elements, which is maybe 100 cyles at
>>> most (counting the full reduce / checking that input is fully reduced).  I'm
>>> not yet counting hashing or RNG times.
>>> 
>>> My earlier email about 1/sqrt(x) was slightly off: it encodes even points on
>>> the curve, but odd points on the twist.
>>> 
>>> I haven't tried blind+EGCD for inverses or Legendre symbol checks.  It might
>>> well be a win.  One inverse square root is 56k Sandy cycles (I don't
>>> remember the Haswell number).
>>> 
>>> Full Montgomery ladder:
>>> * Decompress.
>>> * Constant-time ladder by 448-bit scalar.  The scalar should be even for
>>> security.  It actually could be 447 bits.
>>> * Recompress.  Reject points on the twist.  This is basically free, but
>>> important because they can't be encoded with the 1/sqrt(x) encoding.
>>> 
>>> This takes about 571kcy on Haswell, and 688kcy on Sandy, corrected for
>>> TurboBoost.
>>> 
>>> I'm using the formula from the thread on efficient laddering with the
>>> isomorphic curve, but twisted.  Let (xd,zd) be the point to de doubled, and
>>> (xa,za) be the point to be added.
>>>   A = (xd+zd)
>>>   B = (xd-zd)
>>>   DA = (xa-za)*A
>>>   BC = (xa+za)*B
>>> 
>>>   oxa = (DA+BC)^2
>>>   oza = (DA-BC)^2 * xbase
>>> 
>>>   AA = A^2
>>>   BB = B^2
>>>   AAod = AA*(1-d)
>>>   E = AA-BB
>>> 
>>>   oxd = AAod*BB
>>>   ozd = E*(AAod-E)
>>> 
>>>   return (oxd,ozd,oxa,oza)
>>> 
>>> Except I'm actually using zbase instead of xbase, because of the 1/sqrt(x)
>>> format.
>>> 
>>> Twisted Edwards (a=-1) windowed algorithm:
>>> * Assumes that cofactor is canceled somehow.
>>> * Recode scalar in signed form, because it's easy and I'm lazy.
>>> * Compute 8 odd multiples of P.
>>> * Constant-time add/sub chain with a 4-bit window, 448 bits.  Could be 446
>>> bits, except that 446 isn't divisible by 4.
>>> * No compress or decompress.
>>> 
>>> This takes slightly less time than the Montgomery ladder, some 530kcy on
>>> Haswell and 636 kcy on Sandy.  A 5-bit window makes things maybe 1-2%
>>> faster, but uses extra complexity and memory so I didn't think it was
>>> worthwhile.
>>> 
>>> I'm using readdition coordinates:
>>> "Projective half-niels" for the tables, ((y-x)/2 : (y+x)/2 : dxy : 1) * z.
>>> "Lazy extended coordinates" for the accumulator, (x : y : z : t : u) where
>>> xy = tuz.
>>> 
>>> I might replace the lazy extended coordinates with Hisil et al's lookahead
>>> extended-or-not coordinates, which use less memory but require more care.
>>> 
>>> Full constant-time scalarmul using twisted Edwards:
>>> * Decompress points, rejecting those on the twist.
>>> * Isogenize to the twisted curve, canceling the cofactor.
>>> * Above windowed algorithm.
>>> * Isogenize back to the main curve, effectively multiplying by 4.
>>> * Recompress.
>>> 
>>> This takes slightly longer than the Montgomery ladder: something like 633kcy
>>> on Haswell and 750kcy on Sandy.  So Edwards or twisted Edwards is best for
>>> points you've already got in projective form, and Montgomery is best for
>>> ECDH.  Unsurprising.
>>> 
>>> The total executable code size to test and bench the arithmetic and curve
>>> routines is currently around 41k under clang -O4 -fPIC.  That'll get bigger
>>> once there are precomputed tables.
>>> 
>>> Formulas:
>>> I'm making use of the "inverse square root trick":
>>> 
>>> def trick(a,b,i):
>>>   # assumes p==3 mod 4; similar trick exists for 1 mod 4
>>>   # returns sqrt(+-a/b), 1/i, is_square(a/b)
>>>   # assumes a,b,i are nonzero
>>>   ai = a*i
>>>   abi = b*ai
>>>   s = 1/sqrt(+- abi*i) # using a powering ladder
>>>   output sqrt(+-a/b) = s*ai
>>>   s2abi = s^2*abi
>>>   issquare = s2abi * i # = Legendre symbol
>>>   if you care about the result of 1/i when a/b is nonsquare:
>>>       output 1/i = s2abi*issquare
>>>   else:
>>>       output 1/i = s2abi
>>> 
>>> You can tweak the trick to change the Legendre symbol of the output
>>> according to some other variable as well; this depends on the residue of p
>>> mod 8.
>>> 
>>> The formula I'm using for point compression with Montgomery form is:
>>> 
>>> Let P1 + P2 = P3 and (u1,v1) = P1 etc.  Then
>>>   4*v1*v2*u3 = (u1*u2-1)^2 - u3^2*(u1-u2)^2
>>> 
>>> To compute the numerator of the RHS, do:
>>>   sa = (z2*z1 - x2*x1) * z3
>>>   sb = (x2*z1 - z2*x1) * x3
>>>   numerator = (sa + sb) * (sa - sb)
>>> This is good enough to get the Legendre symbol.  It shouldn't be too hard to
>>> convert this into a formula with some other sign bit using the inverse
>>> square root trick.
>>> 
>>> This is on an untwisted (B=1) curve, but the same "ought" to be true of
>>> 4*B*v1*v2*u3 on a twisted one.
>>> 
>>> To serialize an Edwards point, we have to deal with the fact that the
>>> isomorphic curve you'd get from Wikipedia is twisted, because it sets B =
>>> 4/(1-d) which isn't square, at least when p==3 mod 4.  So I'm negating x to
>>> get to the curve:
>>>   4y^2/(d-1) = x^3 + 2(d+1)/(d-1) * x^2 + x
>>> where you can then scale y by sqrt(4/(d-1)) to get the standard curve.
>>> 
>>> To deserialize an Edwards point, compute
>>>   denominator = (u+1)^2 * (d-1) + 4u
>>>   x = 2 sqrt(u/denominator)
>>>   y = (1+u)/(1-u)
>>> using the inverse square root trick.  This lands you on E_(1,d), because it
>>> scales the x-coordinate to get rid of the twisting that the obvious
>>> decompression would give you.
>>> 
>>> You have to check if u=0 or u=1.  The latter isn't on the curve, but you
>>> have to make sure it doesn't slip past the check due to the zero divide.
>>> The former works in the 1/sqrt(x) encoding without any checks.
>>> 
>>> To do:
>>> I'm planning to use WNAF for variable time scalar mul, WNAF for signature
>>> verification, and a precomputed signed comb for key generation and Schnorr
>>> signing.
>>> 
>>> I'm experimenting with the best way to implement Elligator.  I currently
>>> only have the map to the curve done, and I might change the signs.  My
>>> implementation maps directly to affine using the inverse square root trick.
>>> I'll report the formula once I'm done messing around with it.
>>> 
>>> And of course there's API packaging, testing on ARM, etc.
>>> 
>>> Cheers,
>>> -- Mike
>>> 
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> Curves mailing list
>>> Curves at moderncrypto.org
>>> https://moderncrypto.org/mailman/listinfo/curves
>>> 
> 
> _______________________________________________
> Curves mailing list
> Curves at moderncrypto.org
> https://moderncrypto.org/mailman/listinfo/curves



More information about the Curves mailing list