blueprint at crypto.tw
Wed Sep 30 06:16:45 PDT 2015
Sandy2x takes 156076 Haswell cycles for an X25519 shared-secret
computation, which is very close to its Ivy Bridge cycle count. Note,
however, that the non-vectorized implementation from the Ed25519
paper performs much better on Haswell than on Ivy Bridge:
161648 cycles versus 182708 cycles.
Armando Faz-Hernández and Julio López have a Latincrypt paper
this year about an X25519 implementation targeting Haswell.
They claim 1565xx Haswell cycles for shared-secret computation.
They use a 4-way vectorized multiplier to perform 2 field
multiplications/squarings at the same time. I think a better
approach would be to find 4 independent multiplications/squarings
in the formula and vectorize across them, but I haven't tried.
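To make the "vectorize across 4 independent multiplications" idea concrete, here is a rough Python model of it. It is only a sketch, not the Sandy2x code or the Faz-Hernández/López code. It assumes the usual 10-limb radix-2^25.5 representation and a transposed layout in which each 4-lane vector holds the same limb of four different operands. The names vpmuludq4, vpaddq4 and mul4 are made up for illustration, and the final reduction uses Python big integers where a real implementation would do vectorized carries.

```python
P = 2**255 - 19

# Bit offsets and widths of the 10 limbs in radix 2^25.5 (26, 25, 26, 25, ...).
OFFSETS = [0, 26, 51, 77, 102, 128, 153, 179, 204, 230]
WIDTHS = [26, 25] * 5
MASK32 = 2**32 - 1
MASK64 = 2**64 - 1


def to_limbs(x):
    """Split an integer 0 <= x < 2^255 into 10 limbs."""
    return [(x >> off) & ((1 << w) - 1) for off, w in zip(OFFSETS, WIDTHS)]


def from_limbs(h):
    """Recombine (possibly oversized) limbs and reduce mod P."""
    return sum(v << off for v, off in zip(h, OFFSETS)) % P


def vpmuludq4(a, b):
    """Model of a 4-lane vpmuludq: per 64-bit lane, multiply the low 32 bits."""
    return [((x & MASK32) * (y & MASK32)) & MASK64 for x, y in zip(a, b)]


def vpaddq4(a, b):
    """Model of a 4-lane 64-bit add (wraps mod 2^64)."""
    return [(x + y) & MASK64 for x, y in zip(a, b)]


def mul4(fs, gs):
    """Four independent multiplications mod P, one per vector lane."""
    # Transposed layout: F[i] holds limb i of all four left operands.
    F = [list(v) for v in zip(*(to_limbs(f) for f in fs))]
    G = [list(v) for v in zip(*(to_limbs(g) for g in gs))]
    G19 = [[19 * x for x in v] for v in G]  # for terms that wrap past 2^255
    F2 = [[2 * x for x in v] for v in F]    # for odd-limb times odd-limb terms
    H = [[0] * 4 for _ in range(10)]
    for i in range(10):
        for j in range(10):
            a = F2[i] if (i % 2 == 1 and j % 2 == 1) else F[i]
            b = G19[j] if i + j >= 10 else G[j]
            k = (i + j) % 10
            # One vector multiply advances all four multiplications at once.
            H[k] = vpaddq4(H[k], vpmuludq4(a, b))
    # Un-transpose; big-integer reduction stands in for the carry chain.
    return [from_limbs(h) for h in zip(*H)]


print(mul4([2, 3, 4, 5], [6, 7, 8, 9]))  # [12, 21, 32, 45]
```

Every input to the modeled multiply stays below 2^32 and each accumulated sum stays below 2^64, so no lane overflows. The open question is whether the X25519 ladder step really offers enough groups of four independent multiplications/squarings to keep all lanes busy.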
On Tue, Sep 29, 2015 at 10:29 PM, Trevor Perrin <trevp at trevp.net> wrote:
> Tung Chou's "Sandy2x" code for 25519 on Sandy Bridge and Ivy Bridge is
> around 10-20% faster than other implementations:
> Speedup is attributed to using the 2-way 32x32->64 vectorized
> multiplier (vpmuludq) instead of the 64x64->128 serialized multiplier.
> The paper doesn't say whether this strategy also pays off on Haswell
> (which seems to be lagging in 25519 performance?):