[curves] E-521

Michael Hamburg mike at shiftleft.org
Mon Oct 27 16:42:02 PDT 2014


I just checked in a (still experimental!) mod to the Goldilocks code which enables p521.  Compile with:

make clean all test bench FIELD=p521 CC=clang XCFLAGS=-DGOLDI_FIELD_BITS=521 ARCH=arch_x86_64_r12

Probably requires AVX2.

Current Haswell cycle count:
	keygen: 270kcy
	ecdh: 803kcy
	sign: 283kcy
	verif: 907kcy

This is with the same high-level algorithms as the other Goldilocks benchmarks: constant-time, no variable lookups, compressed points, includes the hashing.  HT and turbo off.
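
Why turbo matters here, as a rough sketch: assuming the benchmark reads a fixed-rate counter such as the TSC (an assumption; this post does not say how cycles are counted), the counter ticks at the nominal clock while the core may run faster under turbo, so the reading undercounts real core cycles. The helper name and the sample reading below are hypothetical. The 3.4 and 3.9 GHz figures are the ones Samuel quotes for the i7 4770 in the thread below, and the sketch assumes the core stays at full turbo for the whole run.

```python
def turbo_corrected_cycles(counter_cycles, nominal_ghz, turbo_ghz):
    """Scale a fixed-rate counter reading up to actual core cycles.

    Assumes the core ran at turbo_ghz for the entire measurement.
    """
    return counter_cycles * turbo_ghz / nominal_ghz


# Hypothetical reading, chosen so the result lands near the ~893000
# figure quoted in the thread below; it is not taken from any paper.
reading = 778_513
corrected = turbo_corrected_cycles(reading, nominal_ghz=3.4, turbo_ghz=3.9)
print(round(corrected))
```

With turbo off, as in the numbers above, the two clocks match and the scale factor is 1.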

So E-521 can be almost as fast as Granger and Scott said (or maybe faster, who knows), but it’s tricky to get there.

Cheers,
— Mike



> On Oct 23, 2014, at 6:38 PM, Michael Hamburg <mike at shiftleft.org> wrote:
> 
>> 
>> On Oct 23, 2014, at 10:05 AM, Trevor Perrin <trevp at trevp.net> wrote:
>> 
>> On Thu, Oct 23, 2014 at 5:04 AM, Samuel Neves <sneves at dei.uc.pt> wrote:
>>> 
>>> The Haswell cycle counts mentioned in the paper do not take Turbo Boost into account, and therefore are lower than the
>>> real number; taking into account that the Core i7 4770 chip was used (3.4 to 3.9 GHz overclocking), the Haswell cycle
>>> count should be ~893000.  I have been able to get this slightly down to ~884000.
>>> 
>>> On Sandy Bridge, I get somewhat better timings than reported by DJB: ~1030000 cycles.
>> 
>> Thanks! Updated [1].
>> 
>> By that scoring, Mike's Goldilocks implementation retains the
>> "relative efficiency" crown.  But the E-521 numbers are without ASM
>> optimization.  And their 9 limbs / 58-bit radix seems impressive
>> (Goldilocks uses 8 limbs / 56-bit radix).
>> 
>> So this seems pretty close, I wonder what a better-optimized 521 could do...
>> 
>> 
>> Trevor
>> 
>> 
>> [1] https://docs.google.com/a/trevp.net/spreadsheet/ccc?key=0Aiexaz_YjIpddFJuWlNZaDBvVTRFSjVYZDdjakxoRkE&usp=sharing#gid=0
> 
> The Goldilocks code is almost ready to support E-521.  As a warmup non-Ed448 curve, I took preliminary benchmarks for Ed480-Ridinghood.  From one benchmark run (not SUPERCOP, etc):
>        Goldilocks: 178kcy keygen, 536kcy ecdh
>        Ridinghood: 193kcy keygen, 617kcy ecdh
> Difference = +8%, +15%.
> 
> The +15% reflects some sections which aren’t optimized yet, along the lines of if (EDWARDS_D > 0) { do something slow; } or if (Mike hasn’t calculated the carry handling limits yet) { reduce just to be safe; }
> 
> I also have a 521-bit multiplier which takes 145 Haswell cycles in preliminary benchmarks.  Like Granger-Scott, it uses 9 limbs of 58 bits each.  It’s still using 3-way Chung-Hasan, so it does more multiplies and fewer adds than the Granger-Scott technique.  Its speed advantage, if it actually has one, is probably from tighter tuning.  But if the 145-cycle figure is accurate, it might be about as fast as what Granger and Scott quoted (but measured properly, with TurboBoost off).
> 
> — Mike
> _______________________________________________
> Curves mailing list
> Curves at moderncrypto.org
> https://moderncrypto.org/mailman/listinfo/curves
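
For readers unfamiliar with the "9 limbs of 58 bits" layout mentioned in the quoted messages, here is a minimal sketch of how such a representation works for the prime 2^521 - 1. It uses a plain schoolbook product, not the Chung-Hasan or Granger-Scott techniques discussed above, and it is neither constant-time nor representative of the Goldilocks code. All names are hypothetical, and the final carry/reduction step leans on Python big integers rather than doing real limb-level carry handling.

```python
P521 = 2**521 - 1
LIMBS, RADIX_BITS = 9, 58          # 9 * 58 = 522 bits, one more than needed
MASK = (1 << RADIX_BITS) - 1


def to_limbs(x):
    """Split an integer below 2^522 into 9 little-endian 58-bit limbs."""
    return [(x >> (RADIX_BITS * i)) & MASK for i in range(LIMBS)]


def from_limbs(limbs):
    """Recombine limbs (which may exceed 58 bits) into one integer."""
    return sum(limb << (RADIX_BITS * i) for i, limb in enumerate(limbs))


def mul_mod_p521(a, b):
    """Schoolbook limb product with wraparound folding.

    A partial product at position i + j >= 9 carries weight
    2^(58 * 9) = 2^522 = 2 * 2^521, which is congruent to 2 mod p,
    so it folds back to position i + j - 9 with a factor of 2.
    """
    acc = [0] * LIMBS
    for i in range(LIMBS):
        for j in range(LIMBS):
            k, term = i + j, a[i] * b[j]
            if k >= LIMBS:
                k -= LIMBS
                term *= 2
            acc[k] += term
    # Shortcut: carry and reduce with big integers instead of per-limb carries.
    return to_limbs(from_limbs(acc) % P521)


x = 3**300 % P521
y = 7**180 % P521
assert from_limbs(mul_mod_p521(to_limbs(x), to_limbs(y))) == (x * y) % P521
```

The factor-of-2 fold is the price of using 522 bits of limb space for a 521-bit prime; by comparison, 8 limbs of 56 bits cover exactly 448 bits.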


