[curves] Granger-Moss primes for 32 bit implementations

Kyle Butt kyle at iteratee.net
Wed Sep 20 18:19:35 PDT 2017

On Mon, Sep 11, 2017 at 10:37 AM, Kyle Butt <kyle at iteratee.net> wrote:

>
>
> On Fri, Sep 8, 2017 at 9:55 PM, Mike Hamburg <mike at shiftleft.org> wrote:
>
>> I might possibly have written an ECC implementation that documents: “The
>> following operations are not implemented for NIST P224 because I didn’t
>> want to compute square roots: …”
>>
>> At least with only 2^24+1, you don’t have to implement Sutherland’s
>> algorithm.  But the cost of Tonelli-Shanks is nontrivial, since it can
>> still take hundreds of multiplications.
>>
>
> I implemented it and added counting. For the first 10,000 residues the
> average # of multiplies for the iterative portion is 183. 22 of those would
> have to be done anyway if p were 3 mod 4.
> So 161 extra field multiplications for square roots. Taking the square
> root of u/v requires an additional 24 squares and 9 multiplies.
> I at least figured out how to do equality checks in band without having to
> reduce the value completely. My insight was to set the first component to t
> + t/2, and then do a parallel carry. If the value is 0, all the carries
> will be done, and the you check that all components are the same.
>
> I'll try to get some speed #'s but it's at least nice to know that
> Tonelli-Shanks isn't too difficult to implement (at least in Haskell), and
> that the # of multiplies isn't insurmountable.
>
> For example, my phi 17 prime has p-1 = 2^22*q, 431 bits, and 136
> mulitplies per field multiply or square. Comparing 24 vs 22, you should
> expect 133 extra field multiplies per square root.
> The Ed41417 prime paper lists 144 multiplies per square. Saving 8
> mulitplies per field multiplication, with ~8 field multiplies per point
> addition is 64 multiplies per point, saving a field multiply just over
> It's easy to see that over a scalar multiply, you could theoretically make
> it up.
>
> I've never done CUDA/OpenCL, but these should all vectorize nicely. It
> will be a worthwhile reason to learn vector programming.
>

So the half-cyclic convolution is a pain to vectorize, and the compiler
certainly won't do it for you.

I implemented multiply and square for avx2. On my workstation at work I get
163 cycles/multiply and 121/square. Which isn't competitive with the state
of the art, but it's close.

If you wanted differential power resistance, they may be competitive,
because they get it for (almost) free. (Instead of setting the accumulator
to 0, you broadcast a random value 0 t-1)

Someone with more vectorizing experience may be able to push them further,
but that's a lot of work.

If anyone is interested in the code, I can clean it up and post it.

>
>> Anyway, I’ll be interested to know what performance you get.
>>
>> > On Sep 8, 2017, at 9:21 PM, Kyle Butt <kyle at iteratee.net> wrote:
>> >
>> > TLDR: How much do square roots matter for ECC primes?
>> >
>> > I've been working on finding Granger Moss primes where t fits in a 32
>> bit integer.
>> > Along the way, I worked out tighter bounds for the min and max values
>> after reduction.
>> >
>> > In the paper, they aren't explicit about how much extra room is needed
>> to handle additions for curve operations. For edwards curves, you need to
>> account for a factor of 6. This changes their formula
>> > ceil(log_2(m / 2)) + 2k + 5 <= 2w
>> > to:
>> > ceil(log_2(3 * m) + 2k + 5 <= 2w
>> >
>> > for m + 1 = 17 and 19, this requires k < 26.5
>> >
>> > Similarly for their reduction algorithm, they work in bits, but I did
>> the math on the actual bounds.
>> >
>> > Assuming z_i = 2^63 - 1, after 1 reduction the max value is 2^(63 - l)
>> - 1 + (t - c)
>> > t - c = (b - 1) * c, the max value carried from z_(i+1), assuming c <
>> b, after 2 reductions the max value is:
>> > 2^(63 - 2*l) - 1 + c + (t - c).  = 2^(63 - 2*l) + t - 1. The first c
>> occurs because t/b = c.
>> > The minimum value is easier to calculate. In this case, the worst case
>> carry is 0.
>> > so the min value is 2^(63 - 2 * l).
>> >
>> > Upon multiplying these values are subtracted. If their difference is
>> less than 2^27.5, then we can adjust the formula above to:
>> >
>> > ceil(log_2(3 * m) + 2k + 4 <= 2w
>> >
>> > This does place a greater constraint on l, but it yields larger primes.
>> >
>> > because we know that the result of the product takes 1 less bit.
>> > This allows us to construct larger primes for m + 1 = 17, 19.
>> > For m+1 = 19 we can construct a prime with 483 bits, specifically:
>> > p = phi 18 t, where t = (2^3 - 1) * (2^24); b = 2^24; l = 24
>> >
>> > Unfortunately, these primes are not fast square root primes, by
>> construction. It should be clear that p % (2^24) = 1. How much does this
>> matter? S is 24 in this case, so Tonelli Shanks should be reasonably fast.
>> Equality checking is also a little tricky due to the redundancy, but a
>> subtraction single round of modular carries can be done in parallel. Then
>> you can verify that all the limbs are the same value.
>> >
>> > I found an edwards curve that satisfies the safecurves criteria, the
>> constant is unusual but rigid:
>> > It is the least d in absolute value such that x^2 + y^2 = 1 +
>> (d/b)x^2*y^2 has cofactor {4, 8} and so does its twist. The reason for
>> choosing such a d is that coefficient multiplication requires reduction,
>> which means that even for small constants, they must be in the montgomery
>> domain, taking up at least 2 limbs. But for a constant d/b it can be a
>> single limb.
>> >
>> > the least such d is: -56904
>> > for that curve the trace is: 180587655261632913936542935860
>> 4192390835234313183868353172046570215054410
>> >
>> > I prototyped the code in Haskell, for quick turnaround and so that I
>> have something where I can print the intermediate values as test vectors
>> for other implementations. I can tidy it up and share it if there's
>> interest. It's not written for speed but for verification.
>> >
>> > I plan to try and implement a cuda or opencl version, to try and take
>> advantage of the parallel nature of the primes.
>> >
>> > Thoughts or suggestions? Would it be worthwhile to write any of this up
>> in a more formal setting?
>> >
>> > I think at a minimum borrowing the half bit by computing tighter bounds
>> is pretty cool.
>> >
>> > Kyle
>> > _______________________________________________
>> > Curves mailing list
>> > Curves at moderncrypto.org
>> > https://moderncrypto.org/mailman/listinfo/curves
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://moderncrypto.org/mail-archive/curves/attachments/20170920/6db1caa8/attachment.html>