[curves] Granger-Moss primes for 32 bit implementations

Mon Sep 11 10:37:21 PDT 2017

On Fri, Sep 8, 2017 at 9:55 PM, Mike Hamburg <mike at shiftleft.org> wrote:

> I might possibly have written an ECC implementation that documents: “The
> following operations are not implemented for NIST P224 because I didn’t
> want to compute square roots: …”
>
> At least with only 2^24+1, you don’t have to implement Sutherland’s
> algorithm.  But the cost of Tonelli-Shanks is nontrivial, since it can
> still take hundreds of multiplications.
>

I implemented it and added counting. For the first 10,000 residues the
average # of multiplies for the iterative portion is 183. 22 of those would
have to be done anyway if p were 3 mod 4.
So 161 extra field multiplications for square roots. Taking the square root
of u/v requires an additional 24 squares and 9 multiplies.
I at least figured out how to do equality checks in band without having to
reduce the value completely. My insight was to set the first component to t
+ t/2, and then do a parallel carry. If the value is 0, all the carries
will be done, and the you check that all components are the same.

I'll try to get some speed #'s but it's at least nice to know that
Tonelli-Shanks isn't too difficult to implement (at least in Haskell), and
that the # of multiplies isn't insurmountable.

For example, my phi 17 prime has p-1 = 2^22*q, 431 bits, and 136 mulitplies
per field multiply or square. Comparing 24 vs 22, you should expect 133
extra field multiplies per square root.
The Ed41417 prime paper lists 144 multiplies per square. Saving 8
mulitplies per field multiplication, with ~8 field multiplies per point
addition is 64 multiplies per point, saving a field multiply just over
every 2 point additions.
It's easy to see that over a scalar multiply, you could theoretically make
it up.

I've never done CUDA/OpenCL, but these should all vectorize nicely. It will
be a worthwhile reason to learn vector programming.

>
> Anyway, I’ll be interested to know what performance you get.
>
> > On Sep 8, 2017, at 9:21 PM, Kyle Butt <kyle at iteratee.net> wrote:
> >
> > TLDR: How much do square roots matter for ECC primes?
> >
> > I've been working on finding Granger Moss primes where t fits in a 32
> bit integer.
> > Along the way, I worked out tighter bounds for the min and max values
> after reduction.
> >
> > In the paper, they aren't explicit about how much extra room is needed
> to handle additions for curve operations. For edwards curves, you need to
> account for a factor of 6. This changes their formula
> > ceil(log_2(m / 2)) + 2k + 5 <= 2w
> > to:
> > ceil(log_2(3 * m) + 2k + 5 <= 2w
> >
> > for m + 1 = 17 and 19, this requires k < 26.5
> >
> > Similarly for their reduction algorithm, they work in bits, but I did
> the math on the actual bounds.
> >
> > Assuming z_i = 2^63 - 1, after 1 reduction the max value is 2^(63 - l) -
> 1 + (t - c)
> > t - c = (b - 1) * c, the max value carried from z_(i+1), assuming c < b,
> after 2 reductions the max value is:
> > 2^(63 - 2*l) - 1 + c + (t - c).  = 2^(63 - 2*l) + t - 1. The first c
> occurs because t/b = c.
> > The minimum value is easier to calculate. In this case, the worst case
> carry is 0.
> > so the min value is 2^(63 - 2 * l).
> >
> > Upon multiplying these values are subtracted. If their difference is
> less than 2^27.5, then we can adjust the formula above to:
> >
> > ceil(log_2(3 * m) + 2k + 4 <= 2w
> >
> > This does place a greater constraint on l, but it yields larger primes.
> >
> > because we know that the result of the product takes 1 less bit.
> > This allows us to construct larger primes for m + 1 = 17, 19.
> > For m+1 = 19 we can construct a prime with 483 bits, specifically:
> > p = phi 18 t, where t = (2^3 - 1) * (2^24); b = 2^24; l = 24
> >
> > Unfortunately, these primes are not fast square root primes, by
> construction. It should be clear that p % (2^24) = 1. How much does this
> matter? S is 24 in this case, so Tonelli Shanks should be reasonably fast.
> Equality checking is also a little tricky due to the redundancy, but a
> subtraction single round of modular carries can be done in parallel. Then
> you can verify that all the limbs are the same value.
> >
> > I found an edwards curve that satisfies the safecurves criteria, the
> constant is unusual but rigid:
> > It is the least d in absolute value such that x^2 + y^2 = 1 +
> (d/b)x^2*y^2 has cofactor {4, 8} and so does its twist. The reason for
> choosing such a d is that coefficient multiplication requires reduction,
> which means that even for small constants, they must be in the montgomery
> domain, taking up at least 2 limbs. But for a constant d/b it can be a
> single limb.
> >
> > the least such d is: -56904
> > for that curve the trace is: 180587655261632913936542935860
> 4192390835234313183868353172046570215054410
> >
> > I prototyped the code in Haskell, for quick turnaround and so that I
> have something where I can print the intermediate values as test vectors
> for other implementations. I can tidy it up and share it if there's
> interest. It's not written for speed but for verification.
> >
> > I plan to try and implement a cuda or opencl version, to try and take
> advantage of the parallel nature of the primes.
> >
> > Thoughts or suggestions? Would it be worthwhile to write any of this up
> in a more formal setting?
> >
> > I think at a minimum borrowing the half bit by computing tighter bounds
> is pretty cool.
> >
> > Kyle
> > _______________________________________________
> > Curves mailing list
> > Curves at moderncrypto.org
> > https://moderncrypto.org/mailman/listinfo/curves
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://moderncrypto.org/mail-archive/curves/attachments/20170911/9ea34121/attachment.html>