mike at shiftleft.org
Wed Sep 30 01:44:54 PDT 2015
Interestingly, the new Intel optimization manual says that Skylake can issue (V)PMULUDQ twice per cycle instead of once, presumably even for 4-way (256-bit) multiplications. So “Sandy2x” and other designs with similar strategies (including for Goldilocks) should be competitive on Skylake as well.
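For anyone not familiar with the instruction, here is a minimal scalar sketch in C of what a single 128-bit vpmuludq computes: each of the two 64-bit lanes multiplies the low 32 bits of its inputs and keeps the full 64-bit product. The names vec2x64 and pmuludq_model are hypothetical and exist only for this illustration; they are not taken from Sandy2x or from any library, and real code would use the instruction itself rather than a loop.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit vector register holding two 64-bit lanes. */
typedef struct { uint64_t lane[2]; } vec2x64;

/* Scalar model of the 2-way 32x32->64 multiply: in each lane, the upper
   32 bits of both inputs are ignored and the low halves are multiplied
   into a full 64-bit result. */
static vec2x64 pmuludq_model(vec2x64 a, vec2x64 b) {
    vec2x64 r;
    for (int i = 0; i < 2; i++) {
        uint64_t lo_a = a.lane[i] & 0xffffffffu;  /* low 32 bits of a's lane */
        uint64_t lo_b = b.lane[i] & 0xffffffffu;  /* low 32 bits of b's lane */
        r.lane[i] = lo_a * lo_b;                  /* cannot overflow 64 bits */
    }
    return r;
}
```

The 256-bit form does the same thing over four lanes, which is the 4-way case mentioned above. Because the products never exceed 64 bits, field elements have to be split into limbs small enough that sums of products still fit in a lane.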
> On Sep 29, 2015, at 10:29 PM, Trevor Perrin <trevp at trevp.net> wrote:
> Tung Chou's "Sandy2x" code for 25519 on Sandy Bridge and Ivy Bridge is
> around 10-20% faster than other implementations:
> Speedup is attributed to using the 2-way 32x32->64 vectorized
> multiplier (vpmuludq) instead of the 64x64->128 serialized multiplier.
> The paper doesn't say whether this strategy also pays off on Haswell
> (which seems to be lagging in 25519 performance?):