[curves] FourQ

Wed Sep 16 03:21:19 PDT 2015

Trevor Perrin writes:
> I don't understand your criticism

People who are being encouraged to venture beyond conservative ECC
should be given an accurate picture of what the speedup actually is.

Certainly there _is_ a speedup. This isn't news; see, e.g., the Kummer
paper and the literature cited there. The problem is that the FourQ
paper quantitatively _exaggerates_ the FourQ speedup. Consider, for
example, the following statement from the paper:

   When considering the results for the DH key exchange, FourQ performs
   1.8--3.5 times faster than Curve25519.

The ratios here come from Table 5, dividing the "ephem. DH" numbers
(what they mean is one-time DH: fixed-base time + variable-base time)
between

   * the "FourQ" column and
   * the best number in the "Curve25519" columns.

For example, the "3.5" claim comes from the Haswell row of the table,
where the FourQ paper reports

   * "92" thousand Haswell cycles for FourQ, from 59000 variable-base
     plus 33000 fixed-base, and

   * "324" thousand Haswell cycles for Curve25519, from 162000
     variable-base plus 162000 fixed-base.

It's true that 324/92 is 3.5. What's wildly inaccurate is the claim that
Curve25519 requires 162000 Haswell cycles for fixed-base scalar
multiplication. This claim was already massively disproven by Andrew
Moon's published software years ago, which the authors fail to cite or
compare to even though I specifically told them months ago that they
were obliged to compare to it. They do cite Tung Chou's newer software,
which was also publicly available months before the first version of
this paper and also massively disproves the same claim, but they fail to
benchmark that software on Haswell.

Most applications don't care about fixed-base speed. That's why, for
many years, nobody bothered writing fast fixed-base Curve25519 code,
even though everybody knew that huge speedups were possible. Eventually
people _did_ write fast fixed-base Curve25519 code, and any application
that cares about fixed-base performance will simply use that code. The
FourQ paper's pretense that these speedups don't exist---in particular,
its failure to compare to this code---really isn't defensible.

The "Pineview" and "Kaveri" figures are corrupted in the same way, and
on top of that are missing an important disclaimer that the previous
literature didn't bother optimizing for these platforms. To some extent
this is also the Haswell situation---for example, the 162000 cycles are
from the Westmere-optimized code in the Ed25519 paper---although Kummer
did target Haswell, and the gap between Ivy Bridge and Haswell isn't as
large as the gap to other CPUs.

The reason that this is important is that the literature clearly shows
that proper ECC optimization is CPU-specific. Anybody can take some
obscure CPU M that hasn't been seriously targeted before and produce
speedups on M. This is helpful for people who want the best performance
on M, it can be useful information regarding M optimization techniques,
and it can predict that some primitives will have an advantage on M---
but reporting "speedup from non-M-optimized implementation of X to
M-optimized implementation of Y" as if it were "speedup from X to Y" is
unreasonable.

Exaggerations are, unfortunately, pervasive in the performance picture
painted in the FourQ paper.

---Dan