<div dir="ltr"><div class="gmail_default" style="font-family:verdana,sans-serif">Hi Peter, Mike,</div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">Thank you both for your comments and valuable suggestions.</div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif"> >The first aspect is that the Curve25519 implementation uses secretly<br> >indexed memory access which is a possible source for timing attacks.<br> >State-of-the-art Curve25519 implementations avoid this by using<br> >constant-time conditional swaps.<br> >Similar statements apply to the table lookup in the fixed-basepoint<br> >scalar multiplication, but those would be much more expensive to<br> >protect.</div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">The constant-time implementation of the field operations added approximately 20% to the timings. Applying your recommendation to avoid secretly indexed table lookups would have a much bigger impact on overall performance. I am trying to find a compromise based on blinding that can justify keeping the indexed lookups. <br>The library currently implements randomization of the starting point, as well as using (a+b)*P + B instead of a*P, where b is random and B = -b*P. </div><div class="gmail_default" style="font-family:verdana,sans-serif">Another method, which can be built into the pre-calculated tables, is to replace the base point P with Q = (1/r mod BPO)*P and then compute (a*r mod BPO)*Q instead of a*P. 
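As a quick sanity check on the algebra of these two blinding methods (a toy model only, not the library's curve code: the additive group of integers mod a prime q stands in for the curve group, and q plays the role of BPO):

```python
# Toy check of the blinding identities: "scalar multiplication" a*P is
# ordinary modular multiplication in the additive group of integers mod
# a prime q, with q standing in for BPO (the base-point order).
import secrets

q = 2**127 - 1                      # any prime group order works for the model
P = secrets.randbelow(q - 1) + 1    # stand-in "base point"
a = secrets.randbelow(q)            # secret scalar

# Method 1: compute (a+b)*P + B instead of a*P, with b random and B = -b*P.
b = secrets.randbelow(q)
B = (-b * P) % q
assert (((a + b) % q) * P + B) % q == (a * P) % q

# Method 2: replace P by Q = (1/r mod q)*P, then compute (a*r mod q)*Q.
r = secrets.randbelow(q - 1) + 1
Q = (pow(r, -1, q) * P) % q         # pow(r, -1, q): modular inverse (Python 3.8+)
assert (((a * r) % q) * Q) % q == (a * P) % q
```

Both assertions hold for any random choices; on the curve the same identities hold with BPO as the order of the base point.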
The combination of these two methods is another option: ((a-b)*r mod BPO)*Q + S, where S = -b*Q.<br>For embedded applications such as IoT, the randomized Q tables can be generated and programmed into devices during manufacturing, and the random blinding value b can be generated at power-up.</div><div class="gmail_default" style="font-family:verdana,sans-serif"><br>Is this a good enough approach? Does it need to be refreshed from time to time? Please note that I am a software engineer and am relying on crypto experts like you to evaluate the security level and come up with recommendations.</div><div class="gmail_default" style="font-family:verdana,sans-serif"><br> >The second thing is: It's great to hear about new speed records!<br> >Are you planning to support the SUPERCOP API ...</div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">I am planning to support the SUPERCOP API once I am happy with the library.<br> <br> >My understanding is that the technique that you call folding is only<br> >efficient for signing; the verification part needs a precomputation<br> >which is just way too expensive: it needs 384 point doublings! You will<br> >probably answer that the ed25519_Verify_Init needs to be done just once<br> >for many verifications, but then a huge expanded public key is sitting<br> >around in memory and if I'm not totally mistaken, a sliding-window<br> >method would still be faster. </div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">The folding technique benefits signing and key generation, and partially benefits verification, for ed25519. Currently the curve25519-DH public-key calculation also uses folding, but I am going to switch back to the Montgomery ladder. 
There is no pre-calculation for these since the tables are part of the code.<br>For the verify operation, the pre-calculation needs 191 doublings and 11 additions (the verify_init function); I do not see how you came up with 384 doublings. The total (pre + post) is 255 doublings and 2W+11 additions, where W is the bit size of each fold. There are some potential savings on the 2W part due to the 8-fold tables of the base point, but I have not figured that out yet. </div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">@Mike, <br>I checked Pippenger's algorithm and did not see any similarity. I am going to have a look at the other references. Thanks for providing this comprehensive list. </div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif"> >3) Using multiple tables (so the inner loop is double; add; add; add) lets you reduce the number of doubles without making the tables too wide.</div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">Can you elaborate on this? Based on my pre-calculations, the inner loop is double-add(P)-add(Q). 
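For concreteness, here is my mental model of a single-table fold loop (a sketch over a toy group — integers mod a prime under addition, standing in for the curve — not the library's code; the 4-fold, W = 64 parameters are only illustrative):

```python
# Sketch of "folding"/comb scalar multiplication: split a 256-bit scalar
# into 4 folds of W = 64 bits, precompute all 16 combinations of the
# folds' shifted base points, then run W iterations of double + add.
# The additive group of integers mod a prime q stands in for the curve
# group, so "double" is *2 and "add" is + here.
W = 64
q = 2**127 - 1      # toy prime group order
P = 7               # toy "base point"

# T[m] = sum over set bits i of m of 2^(i*W) * P  (T[0] = 0, a dummy entry)
T = [sum(2**(i * W) * P for i in range(4) if (m >> i) & 1) % q
     for m in range(16)]

def fold_mul(a):
    """Compute a*P (mod q) with W doublings and W table additions."""
    folds = [(a >> (i * W)) & (2**W - 1) for i in range(4)]
    acc = 0
    for bit in range(W - 1, -1, -1):                      # MSB first
        acc = (2 * acc) % q                               # one doubling
        m = sum(((folds[i] >> bit) & 1) << i for i in range(4))
        acc = (acc + T[m]) % q                            # one table add
    return acc

a = 0xABCDEF0123456789ABCDEF0123456789
assert fold_mul(a) == (a * P) % q
```

With 4 folds of a 256-bit scalar this is W = 64 iterations (63 effective doublings, since the accumulator starts at zero); with 8 folds W drops to 32, in line with the 31 doublings and 31 additions I quoted for signing.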
Are you considering 8 vs 4 combs here?</div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">Best,<br>Mehdi.</div><div class="gmail_default" style="font-family:verdana,sans-serif"></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Jun 30, 2015 at 11:41 AM, Michael Hamburg <span dir="ltr"><<a href="mailto:mike@shiftleft.org" target="_blank">mike@shiftleft.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style><div>I should add to this that I haven’t actually seen a complete comb implementation for Curve25519 before, though there might be one somewhere. I think Donna (including Floodyberry’s Donna fork) uses the older windowed technique. So if you make a nicely tuned implementation, you will be the first. I’m working on one as well, but it uses Decaf so it won’t be directly compatible with curve25519.</div><div><br></div><div>To tune your comb implementation, here are a few ideas:</div><div><br></div><div>1) As Peter said, you should probably use constant-time lookups for signing, though it isn’t necessary for verification. This will limit you to a smaller table.</div><div><br></div><div>2) Signed binary uses half the table space vs unsigned binary, and also is half as expensive if you’re using constant-time lookups. You can convert a scalar to signed binary by adding a certain constant (2^width-1 IIRC) and halving mod the curve order.</div><div><br></div><div>3) Using multiple tables (so the inner loop is double; add; add; add) lets you reduce the number of doubles without making the tables too wide.</div><div><br></div><div>4) For any given size, there's a tradeoff between more tables and bigger tables. 
You can optimize that tradeoff with a simple script or spreadsheet.</div><div><br></div><div>Cheers,</div><div>— Mike</div><br><div><blockquote type="cite"><div>On Jun 30, 2015, at 10:56 AM, Mike Hamburg <<a href="mailto:mike@shiftleft.org" target="_blank">mike@shiftleft.org</a>> wrote:</div><br><div>
<div text="#000000" bgcolor="#FFFFFF">
Hi Mehdi,<br>
<br>
I think your "folding" trick has been invented and re-invented many
times. I came up with it in 2012, and thought it should be called
"combs". I Googled the term, and found that Pippenger invented it
in 1979. Lim and Lee reinvented it in 1994.
Hankerson-Menezes-Vanstone improved it in 2004 (by using multiple
tables), as did Hedabou (by using signed all-bits-set binary); Feng
et al published a different version in 2006 (signed-MSB-set). The
first Ed25519 implementation used a less efficient, non-combed
variant of the same technique.<br>
<br>
Some of the variants are patented, too, but I think your variant is
similar to Lim-Lee and so should probably be out of patent by now.<br>
<br>
I think the best variant these days is to combine the 2004
techniques: multiple combs, signed-all-bits-set. It's simpler than
signed-MSB-set, and I think it's the fastest once you have the
table. The table is slightly slower to compute than signed-MSB-set,
which makes that algorithm better in a few limited circumstances,
but SMSBS is also patented (by Microsoft) and SABS is unpatented to
the best of my knowledge.<br>
<br>
You're right that fixed-point techniques can be useful for
verification and not just signing. For example, in a
vehicle-to-vehicle setting, you will see many signed messages from
each car going the same direction as you. Verification latency is
important, so precomputing a small table on another core may be
worthwhile if memory permits. A small table can be worthwhile even
for only 2 signatures, and a larger one for 3.<br>
<br>
-- Mike<br>
<br>
<div>On 06/30/2015 10:26 AM, Peter Schwabe
wrote:<br>
</div>
<blockquote type="cite">
<pre>Mehdi Sotoodeh <a href="mailto:mehdisotoodeh@gmail.com" target="_blank"><mehdisotoodeh@gmail.com></a> wrote:
Dear Mehdi,
</pre>
<blockquote type="cite">
<pre>I would like to introduce a remarkable implementation of x25519 and ed25519
library. The sources are hosted at: <a href="https://github.com/msotoodeh/curve25519" target="_blank">https://github.com/msotoodeh/curve25519</a>
The code is experimental but rather stable. It is compact, portable
and uses simple design logic. On the security front, it employs
several measures for side-channel security.
</pre>
</blockquote>
<pre>I only took a quick look at the software, but two things immediately
caught my eye.
The first aspect is that the Curve25519 implementation uses secretly
indexed memory access which is a possible source for timing attacks.
State-of-the-art Curve25519 implementations avoid this by using
constant-time conditional swaps.
Similar statements apply to the table lookup in the fixed-basepoint
scalar multiplication, but those would be much more expensive to
protect.
</pre>
<blockquote type="cite">
<pre>But the most remarkable feature is speed. This library sets new speed
records. It uses a new technique I call it FOLDING for achieving this goal.
FOLDING chops the scalar multiplier into n pieces (or folds) and operates
on the folds simultaneously reducing number of point operations by a factor
of 4 or 8. For example, ed25519 signature takes 31 point doubling and 31
point additions.
</pre>
</blockquote>
<pre>The second thing is: It's great to hear about new speed records!
Are you planning to support the SUPERCOP API and submit to eBATS so that
the software can be publicly benchmarked on a large bunch of computers?
My understanding is that the technique that you call folding is only
efficient for signing; the verification part needs a precomputation
which is just way too expensive: it needs 384 point doublings! You will
probably answer that the ed25519_Verify_Init needs to be done just once
for many verifications, but then a huge expanded public key is sitting
around in memory and if I'm not totally mistaken, a sliding-window
method would still be faster.
Best regards,
Peter
</pre>
<br>
<fieldset></fieldset>
<br>
<pre>_______________________________________________
Curves mailing list
<a href="mailto:Curves@moderncrypto.org" target="_blank">Curves@moderncrypto.org</a>
<a href="https://moderncrypto.org/mailman/listinfo/curves" target="_blank">https://moderncrypto.org/mailman/listinfo/curves</a>
</pre>
</blockquote>
<br>
</div>
</div></blockquote></div><br></div></blockquote></div><br></div>