[noise] ChaCha speed update

D. J. Bernstein djb at cr.yp.to
Tue Jul 25 15:56:58 PDT 2017


A year ago I wrote:
> Concretely, several generations of Intel chips have run 12-round
> ChaCha12-256 at practically the same speed as 12-round AES-192 (with a
> similar security margin), even though AES-192 has "hardware support", a
> smaller key, a smaller block size, and smaller data limits. For example:
> 
>    * Both ciphers are ~1.7 cycles/byte on Westmere (introduced 2010).
>    * Both ciphers are ~1.5 cycles/byte on Ivy Bridge (introduced 2012).
>    * Both ciphers are ~0.8 cycles/byte on Skylake (introduced 2015).
> 
> ChaCha20-256 is slower than ChaCha12-256 but this is entirely because it
> has a much larger security margin. For reasons explained below, I
> wouldn't be surprised to see ChaCha20-256 running _faster_ than AES-256
> on future Intel chips.

Romain Dolbeau has now submitted benchmarks for an Intel Skylake with
AVX-512:

   * 0.48 cycles/byte: 12-round ChaCha12-256.
   * 0.48 cycles/byte: 12-round Salsa20/12-256.
   * 0.69 cycles/byte: 20-round ChaCha20-256.
   * 0.69 cycles/byte: 20-round Salsa20-256.
   * 0.87 cycles/byte: 14-round AES-256.

https://bench.cr.yp.to/results-stream.html#amd64-manny1024
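
To read these figures as relative speed or as throughput, the arithmetic
is a pair of divisions. The sketch below is only an illustration: the
function names are hypothetical, and any clock frequency passed in is an
assumption, since the benchmark numbers above are in cycles/byte and say
nothing about clock speed.

```c
/* Illustrative helpers; names are hypothetical, not from any library. */

/* Bytes per second for a cipher costing `cpb` cycles/byte on a core
   running at `hz` cycles per second (`hz` is an assumed input). */
static double bytes_per_second(double cpb, double hz)
{
    return hz / cpb;
}

/* How many times faster a cipher at `fast_cpb` cycles/byte is than
   one at `slow_cpb` cycles/byte on the same machine. */
static double speedup(double slow_cpb, double fast_cpb)
{
    return slow_cpb / fast_cpb;
}
```

With the numbers above, speedup(0.87, 0.69) is about 1.26 for ChaCha20
over AES-256, and speedup(0.87, 0.48) is about 1.81 for ChaCha12 over
AES-256.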

Thanks to Intel for giving up on AES and joining the monoculture! :-)
The code is C code from Dolbeau, using AVX-512 intrinsics such as
_mm512_rol_epi32() (a 32-bit rotate applied to each of 16 lanes).
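
To show why a vector rotate matters here, the following is a minimal
scalar sketch of the ChaCha quarter-round, which uses nothing but 32-bit
additions, xors, and rotations. This is not Dolbeau's code. The names
rotl32 and chacha_quarterround are hypothetical. As an assumption about
how a vectorized version maps onto this, each scalar operation would
become one 512-bit instruction working on 16 independent blocks at once,
with _mm512_rol_epi32() in place of rotl32.

```c
#include <stdint.h>

/* Rotate a 32-bit word left by n bits; valid for 0 < n < 32. */
static uint32_t rotl32(uint32_t x, int n)
{
    return (x << n) | (x >> (32 - n));
}

/* One ChaCha quarter-round on four 32-bit state words.
   Only add, xor, and rotate are needed. */
static void chacha_quarterround(uint32_t *a, uint32_t *b,
                                uint32_t *c, uint32_t *d)
{
    *a += *b; *d ^= *a; *d = rotl32(*d, 16);
    *c += *d; *b ^= *c; *b = rotl32(*b, 12);
    *a += *b; *d ^= *a; *d = rotl32(*d, 8);
    *c += *d; *b ^= *c; *b = rotl32(*b, 7);
}
```

Without a rotate instruction, each rotl32 costs two shifts and an or,
so a single-instruction vector rotate removes a noticeable share of the
work in every round.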

Meanwhile there's a burst of papers this month from people struggling
with the security limitations of AES:

   https://eprint.iacr.org/2017/697
   https://eprint.iacr.org/2017/702
   https://eprint.iacr.org/2017/708

Admittedly, the performance comparison is currently the other way around
on AMD Ryzen, which is basically a 128-bit machine (256-bit instructions
take two operations) with two AES units. But the gap will close. Mixing
integer operations with vector operations will speed up Salsa and ChaCha
on these chips, as on NEON. More importantly, subsequent AMD chips, just
like Intel chips, will be pressured to improve performance of
general-purpose vector instructions.

---Dan

P.S. In news that's not unrelated, the soon-to-be-online "NTRU Prime"
software includes int32-array-sorting software that, on modern Intel
CPUs, solidly beats Intel's "performance" library.

