<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Sep 11, 2017 at 10:37 AM, Kyle Butt <span dir="ltr"><<a href="mailto:kyle@iteratee.net" target="_blank">kyle@iteratee.net</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="auto"><div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote"><span class="">On Fri, Sep 8, 2017 at 9:55 PM, Mike Hamburg <span dir="ltr"><<a href="mailto:mike@shiftleft.org" target="_blank">mike@shiftleft.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">I might possibly have written an ECC implementation that documents: “The following operations are not implemented for NIST P224 because I didn’t want to compute square roots: …”<br> <br> At least with only 2^24+1, you don’t have to implement Sutherland’s algorithm. But the cost of Tonelli-Shanks is nontrivial, since it can still take hundreds of multiplications.<br></blockquote><div><br></div></span><div>I implemented it and added counting. For the first 10,000 residues the average # of multiplies for the iterative portion is 183. 22 of those would have to be done anyway if p were 3 mod 4.</div><div>So 161 extra field multiplications for square roots. Taking the square root of u/v requires an additional 24 squares and 9 multiplies.</div><div>I at least figured out how to do equality checks in band without having to reduce the value completely. My insight was to set the first component to t + t/2, and then do a parallel carry. If the value is 0, all the carries will be done, and the you check that all components are the same.</div><div><br></div><div>I'll try to get some speed #'s but it's at least nice to know that Tonelli-Shanks isn't too difficult to implement (at least in Haskell), and that the # of multiplies isn't insurmountable.</div><div><br></div><div>For example, my phi 17 prime has p-1 = 2^22*q, 431 bits, and 136 mulitplies per field multiply or square. Comparing 24 vs 22, you should expect 133 extra field multiplies per square root.</div><div>The Ed41417 prime paper lists 144 multiplies per square. Saving 8 mulitplies per field multiplication, with ~8 field multiplies per point addition is 64 multiplies per point, saving a field multiply just over every 2 point additions.</div><div>It's easy to see that over a scalar multiply, you could theoretically make it up.</div><div><br></div><div>I've never done CUDA/OpenCL, but these should all vectorize nicely. It will be a worthwhile reason to learn vector programming.</div></div></div></div></div></blockquote><div><br></div><div>So the half-cyclic convolution is a pain to vectorize, and the compiler certainly won't do it for you.</div><div><br></div><div>I implemented multiply and square for avx2. On my workstation at work I get 163 cycles/multiply and 121/square. Which isn't competitive with the state of the art, but it's close.</div><div><br></div><div>If you wanted differential power resistance, they may be competitive, because they get it for (almost) free. (Instead of setting the accumulator to 0, you broadcast a random value 0 t-1)</div><div><br></div><div>Someone with more vectorizing experience may be able to push them further, but that's a lot of work.</div><div><br></div><div>If anyone is interested in the code, I can clean it up and post it.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="auto"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div><div class="h5"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> <br> Anyway, I’ll be interested to know what performance you get.<br> <div><div class="m_-3453921142322497152m_6492298648672501180h5"><br> > On Sep 8, 2017, at 9:21 PM, Kyle Butt <<a href="mailto:kyle@iteratee.net" target="_blank">kyle@iteratee.net</a>> wrote:<br> ><br> > TLDR: How much do square roots matter for ECC primes?<br> ><br> > I've been working on finding Granger Moss primes where t fits in a 32 bit integer.<br> > Along the way, I worked out tighter bounds for the min and max values after reduction.<br> ><br> > In the paper, they aren't explicit about how much extra room is needed to handle additions for curve operations. For edwards curves, you need to account for a factor of 6. This changes their formula<br> > ceil(log_2(m / 2)) + 2k + 5 <= 2w<br> > to:<br> > ceil(log_2(3 * m) + 2k + 5 <= 2w<br> ><br> > for m + 1 = 17 and 19, this requires k < 26.5<br> ><br> > Similarly for their reduction algorithm, they work in bits, but I did the math on the actual bounds.<br> ><br> > Assuming z_i = 2^63 - 1, after 1 reduction the max value is 2^(63 - l) - 1 + (t - c)<br> > t - c = (b - 1) * c, the max value carried from z_(i+1), assuming c < b, after 2 reductions the max value is:<br> > 2^(63 - 2*l) - 1 + c + (t - c). = 2^(63 - 2*l) + t - 1. The first c occurs because t/b = c.<br> > The minimum value is easier to calculate. In this case, the worst case carry is 0.<br> > so the min value is 2^(63 - 2 * l).<br> ><br> > Upon multiplying these values are subtracted. If their difference is less than 2^27.5, then we can adjust the formula above to:<br> ><br> > ceil(log_2(3 * m) + 2k + 4 <= 2w<br> ><br> > This does place a greater constraint on l, but it yields larger primes.<br> ><br> > because we know that the result of the product takes 1 less bit.<br> > This allows us to construct larger primes for m + 1 = 17, 19.<br> > For m+1 = 19 we can construct a prime with 483 bits, specifically:<br> > p = phi 18 t, where t = (2^3 - 1) * (2^24); b = 2^24; l = 24<br> ><br> > Unfortunately, these primes are not fast square root primes, by construction. It should be clear that p % (2^24) = 1. How much does this matter? S is 24 in this case, so Tonelli Shanks should be reasonably fast. Equality checking is also a little tricky due to the redundancy, but a subtraction single round of modular carries can be done in parallel. Then you can verify that all the limbs are the same value.<br> ><br> > I found an edwards curve that satisfies the safecurves criteria, the constant is unusual but rigid:<br> > It is the least d in absolute value such that x^2 + y^2 = 1 + (d/b)x^2*y^2 has cofactor {4, 8} and so does its twist. The reason for choosing such a d is that coefficient multiplication requires reduction, which means that even for small constants, they must be in the montgomery domain, taking up at least 2 limbs. But for a constant d/b it can be a single limb.<br> ><br> > the least such d is: -56904<br> > for that curve the trace is: 180587655261632913936542935860<wbr>419239083523431318386835317204<wbr>6570215054410<br> ><br> > I prototyped the code in Haskell, for quick turnaround and so that I have something where I can print the intermediate values as test vectors for other implementations. I can tidy it up and share it if there's interest. It's not written for speed but for verification.<br> ><br> > I plan to try and implement a cuda or opencl version, to try and take advantage of the parallel nature of the primes.<br> ><br> > Thoughts or suggestions? Would it be worthwhile to write any of this up in a more formal setting?<br> ><br> > I think at a minimum borrowing the half bit by computing tighter bounds is pretty cool.<br> ><br> > Kyle<br> </div></div>> ______________________________<wbr>_________________<br> > Curves mailing list<br> > <a href="mailto:Curves@moderncrypto.org" target="_blank">Curves@moderncrypto.org</a><br> > <a href="https://moderncrypto.org/mailman/listinfo/curves" rel="noreferrer" target="_blank">https://moderncrypto.org/mailm<wbr>an/listinfo/curves</a><br> <br> </blockquote></div></div></div><br></div></div></div> </blockquote></div><br></div></div>