[curves] Granger-Moss primes for 32 bit implementations

Kyle Butt kyle at iteratee.net
Wed Sep 27 00:36:48 PDT 2017


On Wed, Sep 20, 2017 at 6:19 PM, Kyle Butt <kyle at iteratee.net> wrote:

>
>
> On Mon, Sep 11, 2017 at 10:37 AM, Kyle Butt <kyle at iteratee.net> wrote:
>
>>
>>
>> On Fri, Sep 8, 2017 at 9:55 PM, Mike Hamburg <mike at shiftleft.org> wrote:
>>
>>> I might possibly have written an ECC implementation that documents: “The
>>> following operations are not implemented for NIST P224 because I didn’t
>>> want to compute square roots: …”
>>>
>>> At least with only 2^24+1, you don’t have to implement Sutherland’s
>>> algorithm.  But the cost of Tonelli-Shanks is nontrivial, since it can
>>> still take hundreds of multiplications.
>>>
>>
>> I implemented it and added counting. For the first 10,000 residues the
>> average # of multiplies for the iterative portion is 183. 22 of those would
>> have to be done anyway if p were 3 mod 4.
>> So 161 extra field multiplications for square roots. Taking the square
>> root of u/v requires an additional 24 squares and 9 multiplies.
>> I at least figured out how to do equality checks in band without having
>> to reduce the value completely. My insight was to set the first component
>> to t + t/2, and then do a parallel carry. If the value is 0, all the
>> carries will be done, and then you check that all components are the same.
>>
>> I'll try to get some speed #'s but it's at least nice to know that
>> Tonelli-Shanks isn't too difficult to implement (at least in Haskell), and
>> that the # of multiplies isn't insurmountable.
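
For reference, the shape of that Tonelli-Shanks code, favoring clarity over
speed the way the prototype does (names are mine; z must be a known
quadratic non-residue mod p, and a a nonzero residue):

import Data.Bits (shiftR, testBit)

-- square-and-multiply modular exponentiation
powMod :: Integer -> Integer -> Integer -> Integer
powMod b e p = go b e 1
  where
    go _ 0 acc = acc
    go x n acc = go (x * x `mod` p) (n `shiftR` 1)
                    (if testBit n 0 then acc * x `mod` p else acc)

-- find r with r*r == a (mod p); the squarings in the loop are the
-- "iterative portion" whose multiplies I counted
tonelli :: Integer -> Integer -> Integer -> Integer
tonelli z a p =
    go s (powMod z q p) (powMod a q p) (powMod a ((q + 1) `div` 2) p)
  where
    -- write p - 1 = q * 2^s with q odd
    (q, s) = until (odd . fst) (\(n, e) -> (n `div` 2, e + 1)) (p - 1, 0)
    go _ _ 1 r = r  -- t == 1: r is the square root
    go m c t r =
      let i = fromIntegral . length $
                takeWhile (/= 1) (iterate (\x -> x * x `mod` p) t)
          b = powMod c (2 ^ (m - i - 1)) p
      in go i (b * b `mod` p) (t * b * b `mod` p) (r * b `mod` p)
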
>>
>> For example, my phi 17 prime has p-1 = 2^22*q, 431 bits, and 136
>> multiplies per field multiply or square. Comparing S = 24 vs 22, you
>> should expect 133 extra field multiplies per square root.
>> The Ed41417 prime paper lists 144 multiplies per square. Saving 8
>> multiplies per field multiplication, at ~8 field multiplies per point
>> addition, is 64 multiplies saved per point, i.e. one whole field
>> multiply saved just over every 2 point additions.
>> It's easy to see that over a full scalar multiply, you could
>> theoretically make up the square-root cost.
>>
>> I've never done CUDA/OpenCL, but these should all vectorize nicely. It
>> will be a worthwhile reason to learn vector programming.
>>
>
> So the half-cyclic convolution is a pain to vectorize, and the compiler
> certainly won't do it for you.
>
> I implemented multiply and square for AVX2. On my workstation at work I
> get 163 cycles/multiply and 121 cycles/square, which isn't competitive
> with the state of the art, but it's close.
>
> If you wanted differential power resistance, they may be competitive,
> because they get it for (almost) free. (Instead of setting the
> accumulator to 0, you broadcast a random value in [0, t-1].)
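
(Why the broadcast works, in a line of Haskell: a constant limb vector is
a representation of zero. The name is mine.)

-- (v, v, ..., v) evaluates to v * (1 + t + ... + t^(m-1)) = v * p == 0
-- (mod p), so seeding the accumulator limbs with a random v in [0, t-1]
-- adds zero while randomizing every intermediate value
blindAcc :: Int -> Integer -> [Integer]
blindAcc limbs v = replicate limbs v
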
>
> Someone with more vectorizing experience may be able to push them further,
> but that's a lot of work.
>
> If anyone is interested in the code, I can clean it up and post it.
>

I figured I'd write a last follow-up on some of the work I'd done before
leaving it.

I realized that you could use division to reduce the coefficients by t
instead of Montgomery-style reduction. For an arbitrary t that would be
expensive, but for t = (2^26 - 2) it's relatively cheap: shift right by
26, let that become a carry to the next coefficient, and add back 2 times
it as an error term. You have to do this twice because the terms are
bigger than t^2. You can find several t's that are close enough to a
power of 2 to use this method of reduction. (2^26 - 2) is really nice in
that 2 has Hamming weight 1.
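
A minimal sketch of that step in Haskell (names are mine; divByT is the
"do it twice" version):

import Data.Bits (shiftR, (.&.), bit)

-- one division step for t = 2^26 - 2:
-- x = q*2^26 + r = q*t + (2*q + r), so keep r + 2*q and carry q
divStep :: Integer -> (Integer, Integer)
divStep x = (r + 2 * q, q)
  where
    q = x `shiftR` 26
    r = x .&. (bit 26 - 1)

-- applied twice, because the terms are bigger than t^2
divByT :: Integer -> (Integer, Integer)  -- (reduced coefficient, carry)
divByT x =
  let (x1, q1) = divStep x
      (x2, q2) = divStep x1
  in (x2, q1 + q2)
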

p = phi 19 t is 3 mod 4, for cheap square roots.
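
(For concreteness, the prime shape, in the convention I'm using in this
message; for prime m this is the m-th cyclotomic polynomial at t:)

-- phi m t = 1 + t + t^2 + ... + t^(m-1)
phi :: Integer -> Integer -> Integer
phi m t = (t ^ m - 1) `div` (t - 1)
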

If you reduce this way, the reduction is exact, which is nice for 2 reasons:
Equality checking: subtract, then reduce once; all limbs will be the same
if the values were equal (sketched below).
Bits: because the reduction is exact, you can use a slightly larger t
value. t can be slightly larger than 27 bits: for instance with
t = 2^27 + 9, p = phi 19 t is 486 bits. The 9 requires an extra addition,
but not an extra shift, because you have to compute the carry anyway.
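
A sketch of that equality check, using plain divMod in place of the
shift-based step so it reads for any t (names are mine):

-- subtract limb-wise, run one exact carry pass (the carry out of the top
-- limb wraps to the bottom, since t^m == 1 mod phi m t), and equal inputs
-- leave every limb identical: with limbs in [0, t), the representations
-- of zero are exactly the constant vectors (c, c, ..., c)
eqModP :: Integer -> [Integer] -> [Integer] -> Bool
eqModP t xs ys = allSame (carryOnce (zipWith (-) xs ys))
  where
    allSame zs = all (== head zs) zs
    carryOnce zs =
      let (qs, rs) = unzip (map (`divMod` t) zs)
      in zipWith (+) rs (last qs : init qs)
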



I did something else as well. I realized you could view the Granger-Moss
construction as a Karatsuba level, and tried t = (2^112 - 8) with
p = phi 5 t.
I implemented a reduced-carry version. You get the first quotient for
free, because shifting right by 112 is just the upper half of the product
(sketched below).

I was able to get multiply plus reduction to around 190 cycles that way.
Reduction was 26 of those cycles, and it auto-vectorized nicely.
I just used Intel intrinsics for mulx and add-with-carry, no vector
intrinsics.
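
The limb-level structure, as a verification-style sketch rather than the
vectorized code (t^5 == 1 mod p folds the top of the convolution back
around; one reduction pass shown, the remaining carries are small):

import Data.Bits (shiftR, (.&.), bit)

-- p = phi 5 t with t = 2^112 - 8; values are 5 base-t limbs
mulMod5 :: [Integer] -> [Integer] -> [Integer]
mulMod5 xs ys =
  let conv k = sum [ (xs !! i) * (ys !! j)
                   | i <- [0 .. 4], j <- [0 .. 4], (i + j) `mod` 5 == k ]
      zs = map conv [0 .. 4]
      -- the first quotient is free: z `shiftR` 112 is just the upper
      -- half of the double-width product
      split z = (z .&. (bit 112 - 1), z `shiftR` 112)
      (rs, qs) = unzip (map split zs)
      -- z = q*2^112 + r = q*t + (r + 8*q): add 8*q back as the error
      -- term, and carry q to the next limb cyclically
  in zipWith3 (\r q qIn -> r + 8 * q + qIn) rs qs (last qs : init qs)
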




I may try the 32-bit version on Skylake to see how much more I can get
out of AVX-512. The majority of the multiplies can fill a 512-bit vector,
and lane masking helps with the ones that can't. It also has embedded
broadcast, which should help out a lot. The way I got performance from
the cyclic convolution was to tilt it so that half of each operand was a
broadcast. That, along with the increased register count, should lower
the number of spills.

The carry propagation runs in 3 passes instead of 5, and AVX-512 has a
64-bit arithmetic right shift and 64-to-32-bit packing, which finish the
reduction.

I may also try NEON. The same 4-way vectorization that I did for AVX2
splits into 2-way vectorization, with long multiply-accumulate.

>
>
>>
>>> Anyway, I’ll be interested to know what performance you get.
>>>
>>> > On Sep 8, 2017, at 9:21 PM, Kyle Butt <kyle at iteratee.net> wrote:
>>> >
>>> > TLDR: How much do square roots matter for ECC primes?
>>> >
>>> > I've been working on finding Granger-Moss primes where t fits in a
>>> > 32-bit integer.
>>> > Along the way, I worked out tighter bounds for the min and max values
>>> after reduction.
>>> >
>>> > In the paper, they aren't explicit about how much extra room is needed
>>> > to handle additions for curve operations. For Edwards curves, you need
>>> > to account for a factor of 6. This changes their formula
>>> > ceil(log_2(m / 2)) + 2k + 5 <= 2w
>>> > to:
>>> > ceil(log_2(3 * m)) + 2k + 5 <= 2w
>>> >
>>> > for m + 1 = 17 and 19, this requires k < 26.5
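
(A quick check of that bound, with a helper name of my own:)

-- solve ceil(log_2(3 * m)) + 2k + 5 <= 2w for k
kBound :: Integer -> Integer -> Double
kBound m w =
  fromIntegral (2 * w - 5 - ceiling (logBase 2 (3 * fromIntegral m))) / 2

-- kBound 16 32 == 26.5 and kBound 18 32 == 26.5, matching k < 26.5
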
>>> >
>>> > Similarly for their reduction algorithm: they work in bits, but I did
>>> > the math on the actual bounds.
>>> >
>>> > Assuming z_i = 2^63 - 1, after 1 reduction the max value is
>>> > 2^(63 - l) - 1 + (t - c).
>>> > t - c = (b - 1) * c is the max value carried from z_(i+1). Assuming
>>> > c < b, after 2 reductions the max value is:
>>> > 2^(63 - 2*l) - 1 + c + (t - c) = 2^(63 - 2*l) + t - 1. The first c
>>> > occurs because t/b = c.
>>> > The minimum value is easier to calculate: the worst-case carry is 0,
>>> > so the min value is 2^(63 - 2*l).
>>> >
>>> > Upon multiplying, these values are subtracted. If their difference is
>>> > less than 2^27.5, then, because the result of the product takes 1 less
>>> > bit, we can adjust the formula above to:
>>> >
>>> > ceil(log_2(3 * m)) + 2k + 4 <= 2w
>>> >
>>> > This does place a greater constraint on l, but it yields larger primes
>>> > for m + 1 = 17, 19.
>>> > For m+1 = 19 we can construct a prime with 483 bits, specifically:
>>> > p = phi 18 t, where t = (2^3 - 1) * (2^24); b = 2^24; l = 24
>>> >
>>> > Unfortunately, these primes are not fast square root primes, by
>>> > construction. It should be clear that p % (2^24) = 1. How much does this
>>> > matter? S is 24 in this case, so Tonelli-Shanks should be reasonably
>>> > fast. Equality checking is also a little tricky due to the redundancy,
>>> > but after a subtraction, a single round of modular carries can be done
>>> > in parallel. Then you can verify that all the limbs are the same value.
>>> >
>>> > I found an Edwards curve that satisfies the safecurves criteria; the
>>> > constant is unusual but rigid:
>>> > it is the least d in absolute value such that x^2 + y^2 = 1 +
>>> > (d/b)*x^2*y^2 has cofactor in {4, 8}, and so does its twist. The reason
>>> > for choosing such a d is that coefficient multiplication requires
>>> > reduction, which means that even small constants must live in the
>>> > Montgomery domain, taking up at least 2 limbs. But a constant of the
>>> > form d/b can be a single limb.
>>> >
>>> > the least such d is: -56904
>>> > for that curve the trace is:
>>> > 1805876552616329139365429358604192390835234313183868353172046570215054410
>>> >
>>> > I prototyped the code in Haskell, for quick turnaround and so that I
>>> have something where I can print the intermediate values as test vectors
>>> for other implementations. I can tidy it up and share it if there's
>>> interest. It's not written for speed but for verification.
>>> >
>>> > I plan to try to implement a CUDA or OpenCL version, to take
>>> > advantage of the parallel nature of the primes.
>>> >
>>> > Thoughts or suggestions? Would it be worthwhile to write any of this
>>> up in a more formal setting?
>>> >
>>> > I think at a minimum borrowing the half bit by computing tighter
>>> bounds is pretty cool.
>>> >
>>> > Kyle