<div dir="ltr"><div class="gmail_default" style="font-family:verdana,sans-serif">Hi Peter, Mike,</div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">Thank you both for your comments and valuable suggestions.</div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif"> >The first aspect is that the Curve25519 implementation uses secretly<br> >indexed memory access which is a possible source for timing attacks.<br> >State-of-the-art Curve25519 implementations avoid this by using<br> >constant-time conditional swaps.<br> >Similar statements apply to the table lookup in the fixed-basepoint<br> >scalar multiplication, but those would be much more expensive to<br> >protect.</div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">The constant time implementation of field operations added approximately 20% to the timings. Applying recommendations regarding indexing has a way bigger impact on the overall performance. I am trying to find a compromised solution based on blinding that can justify usage of indexing. <br>The library currently implements randomization of the starting point as well as using (a+b)*P + B instead of a*P where b is random and B = -b*P. </div><div class="gmail_default" style="font-family:verdana,sans-serif">Another method that can be built into pre-calculated tables is to replace base point P with Q = (1/r mod BPO)*P and then use (a*r mod BPO)*Q instead of a*P. The combination of these two methods is another option: ((a-b)*r mod BPO)*Q + S, where S = -b*Q.<br>For embedded applications such as IOT, the randomized Q tables can be generated and programmed into devices during manufacturing and random blinding b generated during power ups.</div><div class="gmail_default" style="font-family:verdana,sans-serif"><br>Is this is good enough approach? Does it need to be updated time to time? Please note that I am a software engineer and relying on crypto experts like you to evaluate the security level and come up with recommendations.</div><div class="gmail_default" style="font-family:verdana,sans-serif"><br> >The second thing is: It's great to hear about new speed records!<br> >Are you planning to support the SUPERCOP API ...</div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">I am planning to support SUPERCOP API once I am happy with the library.<br> <br> >My understanding is that the technique that you call folding is only<br> >efficient for signing; the verification part needs a precomputation<br> >which is just way too expensive: it needs 384 point doublings! You will<br> >probably answer that the ed25519_Verify_Init needs to be done just once<br> >for many verifications, but then a huge expanded public key is sitting<br> >around in memory and if I'm not totally mistaken, a sliding-window<br> >method would still be faster. </div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">The folding technique benefits Signing and Key Gen as well as Verify(partially) for ed25519. Currently curve25519-DH public key calculation also uses folding but I am going to switch back to Montgomery ladder. There is no pre-calculation for these since the tables are part of the code.<br>For the verify operation, the pre-calculation needs 191D and 11A (verify_init function). I do not see how you came up with 384D. The total (pre+post) is 255D and 2W+11 A's where W is the bit size of each fold. There are some potential saving on the 2W part due to 8-folds of base point but I have not figure it out yet. </div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">@Mike, <br>I checked Pippenger's algorithm and did not see any similarity. I am going to have a look at other references. Thanks for providing this comprehensive list. </div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif"> >3) Using multiple tables (so the inner loop is double; add; add; add) lets you reduce the number of doubles without making the tables too wide.</div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">Can you elaborate on this? Based on the pre-calculations, it is double-add(P)-add(Q). Are you considering 8 vs 4 combs here?</div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">Best,<br>Mehdi.</div><div class="gmail_default" style="font-family:verdana,sans-serif"></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Jun 30, 2015 at 11:41 AM, Michael Hamburg <span dir="ltr"><<a href="mailto:mike@shiftleft.org" target="_blank">mike@shiftleft.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style><div>I should add to this that I haven’t actually seen a complete comb implementation for Curve25519 before, though there might be one somewhere. I think Donna (including Floodyberry’s Donna fork) uses the older windowed technique. So if you make a nicely tuned implementation, you will be the first. I’m working on one as well, but it uses Decaf so it won’t be directly compatible with curve25519.</div><div><br></div><div>To tune your comb implementation, here are a few ideas:</div><div><br></div><div>1) As Peter said, you should probably use constant-time lookups for signing, though it isn’t necessary for verification. This will limit you to a smaller table.</div><div><br></div><div>2) Signed binary uses half the table space vs unsigned binary, and also is half as expensive if you’re using constant-time lookups. You can convert a scalar to signed binary by adding a certain constant (2^width-1 IIRC) and halving mod the curve order.</div><div><br></div><div>3) Using multiple tables (so the inner loop is double; add; add; add) lets you reduce the number of doubles without making the tables too wide.</div><div><br></div><div>4) For any given size, there's a tradeoff between more tables and bigger tables. You can optimize that tradeoff with a simple script or spreadsheet.</div><div><br></div><div>Cheers,</div><div>— Mike</div><br><div><blockquote type="cite"><div>On Jun 30, 2015, at 10:56 AM, Mike Hamburg <<a href="mailto:mike@shiftleft.org" target="_blank">mike@shiftleft.org</a>> wrote:</div><br><div> <div text="#000000" bgcolor="#FFFFFF"> Hi Mehdi,<br> <br> I think your "folding" trick has been invented and re-invented many times. I came up with it in 2012, and thought it should be called "combs". I Googled the term, and found that Pippenger invented it in 1979. Lim and Lee reinvented it in 1994. Hankerson-Menezes-Vanstone improved it in 2004 (by using multiple tables), as did Hedabou (by using signed all-bits-set binary); Feng et al published a different version in 2006 (signed-MSB-set). The first Ed25519 implementation used a less efficient, non-combed variant of the same technique.<br> <br> Some of the variants are patented, too, but I think your variant is similar to Lim-Lee and so should probably be out of patent by now.<br> <br> I think the best variant these days is to combine the 2004 techniques: multiple combs, signed-all-bits-set. It's simpler than signed-MSB-set, and I think it's the fastest once you have the table. The table is slightly slower to compute than signed-MSB-set, which makes that algorithm better in a few limited circumstances, but SMSBS is also patented (by Microsoft) and SABS is unpatented to the best of my knowledge.<br> <br> You're right that fixed-point techniques can be useful for verification and not just signing. For example, in a vehicle-to-vehicle setting, you will see many signed messages from each car going the same direction as you. Verification latency is important, so precomputing a small table on another core may be worthwhile if memory permits. A small table can be worthwhile even for only 2 signatures, and a larger one for 3.<br> <br> -- Mike<br> <br> <div>On 06/30/2015 10:26 AM, Peter Schwabe wrote:<br> </div> <blockquote type="cite"> <pre>Mehdi Sotoodeh <a href="mailto:mehdisotoodeh@gmail.com" target="_blank"><mehdisotoodeh@gmail.com></a> wrote: Dear Mehdi, </pre> <blockquote type="cite"> <pre>I would like to introduce a remarkable implementation of x25519 and ed25519 library. The sources are hosted at: <a href="https://github.com/msotoodeh/curve25519" target="_blank">https://github.com/msotoodeh/curve25519</a> The code is experimental but rather stable. It is compact, portable and uses simple design logic. On the security front, it employs several measures for side-channel security. </pre> </blockquote> <pre>I only took a quick look at the software, but two things immediately caught my eye. The first aspect is that the Curve25519 implementation uses secretly indexed memory access which is a possible source for timing attacks. State-of-the-art Curve25519 implementations avoid this by using constant-time conditional swaps. Similar statements apply to the table lookup in the fixed-basepoint scalar multiplication, but those would be much more expensive to protect. </pre> <blockquote type="cite"> <pre>But the most remarkable feature is speed. This library sets new speed records. It uses a new technique I call it FOLDING for achieving this goal. FOLDING chops the scalar multiplier into n pieces (or folds) and operates on the folds simultaneously reducing number of point operations by a factor of 4 or 8. For example, ed25519 signature takes 31 point doubling and 31 point additions. </pre> </blockquote> <pre>The second thing is: It's great to hear about new speed records! Are you planning to support the SUPERCOP API and submit to eBATS so that the software can be publicly benchmarked on a large bunch of computers? My understanding is that the technique that you call folding is only efficient for signing; the verification part needs a precomputation which is just way too expensive: it needs 384 point doublings! You will probably answer that the ed25519_Verify_Init needs to be done just once for many verifications, but then a huge expanded public key is sitting around in memory and if I'm not totally mistaken, a sliding-window method would still be faster. Best regards, Peter </pre> <br> <fieldset></fieldset> <br> <pre>_______________________________________________ Curves mailing list <a href="mailto:Curves@moderncrypto.org" target="_blank">Curves@moderncrypto.org</a> <a href="https://moderncrypto.org/mailman/listinfo/curves" target="_blank">https://moderncrypto.org/mailman/listinfo/curves</a> </pre> </blockquote> <br> </div> _______________________________________________<br>Curves mailing list<br><a href="mailto:Curves@moderncrypto.org" target="_blank">Curves@moderncrypto.org</a><br><a href="https://moderncrypto.org/mailman/listinfo/curves" target="_blank">https://moderncrypto.org/mailman/listinfo/curves</a><br></div></blockquote></div><br></div></blockquote></div><br></div>