[curves] Implementation
Michael Hamburg
mike at shiftleft.org
Fri Feb 21 14:58:31 PST 2014
Hi all,
Rambus has agreed to release the code! It’s on Sourceforge (for export control reasons) at
https://sourceforge.net/p/ed448goldilocks/code/ci/master/tree/
The code is still experimental, and I haven’t picked a hash yet (suggestions?). So you can do keygen and ECDH, except that the output of ECDH isn’t hashed. The base operations for sign and verify are there too, but I haven’t implemented those operations yet.
I should be able to update the repository without future export control issues.
Cheers,
— Mike
On Feb 12, 2014, at 1:50 PM, Michael Hamburg <mike at shiftleft.org> wrote:
> To be clear, the requirements from our legal department are:
> * I need to ask them before releasing code, and
> * I need to comply with US export control law, which (as Legal interprets it) requires preventing IPs associated with a few specific countries from downloading the software.
>
> I don’t believe that GitHub can filter by country (please tell me if that’s changed), but SourceForge or some other service might be able to.
>
> Anyway, I haven’t gotten the ball rolling yet, so it might take a while to get permission, especially since the software isn’t done yet. But I’ll see what I can do.
>
> Cheers,
> — Mike
>
> On Feb 12, 2014, at 12:26 PM, Michael Hamburg <mike at shiftleft.org> wrote:
>
>> Hi Trevor,
>>
>> Not a github initially, because of Rambus legal and export control and all that. I’ll see if I can set up something more private and get back to you.
>>
>> Cheers,
>> — Mike
>>
>> On Feb 12, 2014, at 11:22 AM, Trevor Perrin <trevp at trevp.net> wrote:
>>
>>> Could we expect a github? I'd love to see this!
>>>
>>> Trevor
>>>
>>>
>>> On Tue, Feb 11, 2014 at 12:31 AM, Mike Hamburg <mike at shiftleft.org> wrote:
>>>> Hello curves,
>>>>
>>>> I've been working on implementation for the new curves, and I'd like to
>>>> report status and some formulas and issues I found.
>>>>
>>>> I'm aiming for a fairly generic C/intrinsics implementation which should
>>>> support any curves with minimal extra effort, but I'm starting with
>>>> Ed448-Goldilocks because it's mine. I have Haswell and Sandy Bridge test
>>>> machines. I also have a vectorless Cortex A9, but it doesn't work yet
>>>> because I'm using 64x64->128-bit multiply intrinsics. Here's what I've
>>>> found so far.
>>>>
>>>> If you have any suggestions on the formulas or algorithms, I'd definitely
>>>> appreciate it.
>>>>
>>>> Field arithmetic:
>>>> * Karatsuba is beneficial for Ed448.
>>>> * Radix 2^56 in a 64-bit limb, 8 limbs.
>>>> * M ~ 153cy on Sandy Bridge, 125cy on Haswell
>>>> * square ~ 0.75M
>>>> * small fixed mul ~ 0.25M
>>>> * add/sub (unreduced) ~ 0.04M, a little cheaper on Haswell because of AVX
>>>>
>>>> I'm using the 1/sqrt(x) point encoding for now, basically because I already
>>>> have code for that from an earlier project. I'm not yet counting the time
>>>> to serialize and deserialize field elements, which is maybe 100 cyles at
>>>> most (counting the full reduce / checking that input is fully reduced). I'm
>>>> not yet counting hashing or RNG times.
>>>>
>>>> My earlier email about 1/sqrt(x) was slightly off: it encodes even points on
>>>> the curve, but odd points on the twist.
>>>>
>>>> I haven't tried blind+EGCD for inverses or Legendre symbol checks. It might
>>>> well be a win. One inverse square root is 56k Sandy cycles (I don't
>>>> remember the Haswell number).
>>>>
>>>> Full Montgomery ladder:
>>>> * Decompress.
>>>> * Constant-time ladder by 448-bit scalar. The scalar should be even for
>>>> security. It actually could be 447 bits.
>>>> * Recompress. Reject points on the twist. This is basically free, but
>>>> important because they can't be encoded with the 1/sqrt(x) encoding.
>>>>
>>>> This takes about 571kcy on Haswell, and 688kcy on Sandy, corrected for
>>>> TurboBoost.
>>>>
>>>> I'm using the formula from the thread on efficient laddering with the
>>>> isomorphic curve, but twisted. Let (xd,zd) be the point to de doubled, and
>>>> (xa,za) be the point to be added.
>>>> A = (xd+zd)
>>>> B = (xd-zd)
>>>> DA = (xa-za)*A
>>>> BC = (xa+za)*B
>>>>
>>>> oxa = (DA+BC)^2
>>>> oza = (DA-BC)^2 * xbase
>>>>
>>>> AA = A^2
>>>> BB = B^2
>>>> AAod = AA*(1-d)
>>>> E = AA-BB
>>>>
>>>> oxd = AAod*BB
>>>> ozd = E*(AAod-E)
>>>>
>>>> return (oxd,ozd,oxa,oza)
>>>>
>>>> Except I'm actually using zbase instead of xbase, because of the 1/sqrt(x)
>>>> format.
>>>>
>>>> Twisted Edwards (a=-1) windowed algorithm:
>>>> * Assumes that cofactor is canceled somehow.
>>>> * Recode scalar in signed form, because it's easy and I'm lazy.
>>>> * Compute 8 odd multiples of P.
>>>> * Constant-time add/sub chain with a 4-bit window, 448 bits. Could be 446
>>>> bits, except that 446 isn't divisible by 4.
>>>> * No compress or decompress.
>>>>
>>>> This takes slightly less time than the Montgomery ladder, some 530kcy on
>>>> Haswell and 636 kcy on Sandy. A 5-bit window makes things maybe 1-2%
>>>> faster, but uses extra complexity and memory so I didn't think it was
>>>> worthwhile.
>>>>
>>>> I'm using readdition coordinates:
>>>> "Projective half-niels" for the tables, ((y-x)/2 : (y+x)/2 : dxy : 1) * z.
>>>> "Lazy extended coordinates" for the accumulator, (x : y : z : t : u) where
>>>> xy = tuz.
>>>>
>>>> I might replace the lazy extended coordinates with Hisil et al's lookahead
>>>> extended-or-not coordinates, which use less memory but require more care.
>>>>
>>>> Full constant-time scalarmul using twisted Edwards:
>>>> * Decompress points, rejecting those on the twist.
>>>> * Isogenize to the twisted curve, canceling the cofactor.
>>>> * Above windowed algorithm.
>>>> * Isogenize back to the main curve, effectively multiplying by 4.
>>>> * Recompress.
>>>>
>>>> This takes slightly longer than the Montgomery ladder: something like 633kcy
>>>> on Haswell and 750kcy on Sandy. So Edwards or twisted Edwards is best for
>>>> points you've already got in projective form, and Montgomery is best for
>>>> ECDH. Unsurprising.
>>>>
>>>> The total executable code size to test and bench the arithmetic and curve
>>>> routines is currently around 41k under clang -O4 -fPIC. That'll get bigger
>>>> once there are precomputed tables.
>>>>
>>>> Formulas:
>>>> I'm making use of the "inverse square root trick":
>>>>
>>>> def trick(a,b,i):
>>>> # assumes p==3 mod 4; similar trick exists for 1 mod 4
>>>> # returns sqrt(+-a/b), 1/i, is_square(a/b)
>>>> # assumes a,b,i are nonzero
>>>> ai = a*i
>>>> abi = b*ai
>>>> s = 1/sqrt(+- abi*i) # using a powering ladder
>>>> output sqrt(+-a/b) = s*ai
>>>> s2abi = s^2*abi
>>>> issquare = s2abi * i # = Legendre symbol
>>>> if you care about the result of 1/i when a/b is nonsquare:
>>>> output 1/i = s2abi*issquare
>>>> else:
>>>> output 1/i = s2abi
>>>>
>>>> You can tweak the trick to change the Legendre symbol of the output
>>>> according to some other variable as well; this depends on the residue of p
>>>> mod 8.
>>>>
>>>> The formula I'm using for point compression with Montgomery form is:
>>>>
>>>> Let P1 + P2 = P3 and (u1,v1) = P1 etc. Then
>>>> 4*v1*v2*u3 = (u1*u2-1)^2 - u3^2*(u1-u2)^2
>>>>
>>>> To compute the numerator of the RHS, do:
>>>> sa = (z2*z1 - x2*x1) * z3
>>>> sb = (x2*z1 - z2*x1) * x3
>>>> numerator = (sa + sb) * (sa - sb)
>>>> This is good enough to get the Legendre symbol. It shouldn't be too hard to
>>>> convert this into a formula with some other sign bit using the inverse
>>>> square root trick.
>>>>
>>>> This is on an untwisted (B=1) curve, but the same "ought" to be true of
>>>> 4*B*v1*v2*u3 on a twisted one.
>>>>
>>>> To serialize an Edwards point, we have to deal with the fact that the
>>>> isomorphic curve you'd get from Wikipedia is twisted, because it sets B =
>>>> 4/(1-d) which isn't square, at least when p==3 mod 4. So I'm negating x to
>>>> get to the curve:
>>>> 4y^2/(d-1) = x^3 + 2(d+1)/(d-1) * x^2 + x
>>>> where you can then scale y by sqrt(4/(d-1)) to get the standard curve.
>>>>
>>>> To deserialize an Edwards point, compute
>>>> denominator = (u+1)^2 * (d-1) + 4u
>>>> x = 2 sqrt(u/denominator)
>>>> y = (1+u)/(1-u)
>>>> using the inverse square root trick. This lands you on E_(1,d), because it
>>>> scales the x-coordinate to get rid of the twisting that the obvious
>>>> decompression would give you.
>>>>
>>>> You have to check if u=0 or u=1. The latter isn't on the curve, but you
>>>> have to make sure it doesn't slip past the check due to the zero divide.
>>>> The former works in the 1/sqrt(x) encoding without any checks.
>>>>
>>>> To do:
>>>> I'm planning to use WNAF for variable time scalar mul, WNAF for signature
>>>> verification, and a precomputed signed comb for key generation and Schnorr
>>>> signing.
>>>>
>>>> I'm experimenting with the best way to implement Elligator. I currently
>>>> only have the map to the curve done, and I might change the signs. My
>>>> implementation maps directly to affine using the inverse square root trick.
>>>> I'll report the formula once I'm done messing around with it.
>>>>
>>>> And of course there's API packaging, testing on ARM, etc.
>>>>
>>>> Cheers,
>>>> -- Mike
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Curves mailing list
>>>> Curves at moderncrypto.org
>>>> https://moderncrypto.org/mailman/listinfo/curves
>>>>
>>
>> _______________________________________________
>> Curves mailing list
>> Curves at moderncrypto.org
>> https://moderncrypto.org/mailman/listinfo/curves
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://moderncrypto.org/mail-archive/curves/attachments/20140221/01fd4516/attachment.html>
More information about the Curves
mailing list