[curves] Balancing reduced-radix and full-radix performance for extra-strength primes

Mon Jan 19 15:08:19 PST 2015

Thanks Trevor!

> On Jan 19, 2015, at 1:47 PM, Trevor Perrin <trevp at trevp.net> wrote:
> 
> At Real World Crypto, Patrick Longa discussed the question of choosing
> efficient field primes for extra-strength curves.  He suggested such
> primes should be chosen to strike a balance between efficiency in
> full-radix and reduced-radix implementations (i.e. "saturated" and
> "unsaturated" arithmetic):
> 
> http://www.realworldcrypto.com/rwc2015/program-2/RWC-2015-Longa.pdf?attredirects=0
> 
> 2^379 - 19 was given as an example.
> 
> Most of the discussion of extra-strength primes I've seen assumes
> reduced-radix implementations, so this is an interesting twist.  His
> argument for considering full-radix was:
> 
> 1) Full-radix is more efficient on several platforms ("AMD, "Intel
> Atom, Intel Quark, ARM w/o Neon, microcontrollers"), whereas
> reduced-radix is most advantageous on more advanced processors ("Intel
> desktop/server", "ARM with NEON”).

This makes sense.  Some points

* Intel’s desktop/server parts have a really fast MUL and slowish ADC compared to everything else, so they favor reduced-radix computations which have more MULs and fewer ADCs.

* Anything running on a vector unit (NEON) won’t have access to ADC at all, so it will suck on full-radix.

* Medium-to-high-end ARM parts (M4 or higher) have an instruction UMAAL to accelerate full-radix arithmetic, though it still isn’t as fast as NEON.  They also have an instruction UMLAL which accelerates reduced-radix arithmetic.

* Low-end ARMs have UMLAL but not UMAAL.  Some might not even have that.  Parts with UMLAL and not UMAAL will favor reduced-radix.

* Other platforms may be in-between, if they don’t have acceleration for either.  A possible exception is RISC architectures with no carry flag and no magic accum register (eg RISC-V), which tend to perform terribly on full-radix code.

* Superscalar cores can multi-issue ADDs for the reduced-radix add/sub routine, but can’t multi-issue ADC.

* The advantage of full-radix decreases as the number of limbs increases, because it’s usually n^2 vs (n+1)^2.  That’s why for Curve25519 favors full-radix.

So the message that different processors favor different organizations is correct, and it sort of correlates with CPU size, but not perfectly.

> 2) Full-radix may be safer and easier to implement, since
> reduced-radix requires "Bound analysis" to prevent inadvertent word
> spilling, thus is "error prone, errors are more difficult to catch”.

Maybe.  With full-radix it can be annoying to get the carries right, though it’s easier if the size is not exactly a multiple of 64 (eg 2^379 instead of 2^384) because then there’s headroom.

> I'm not qualified to assess (1).  But if it's true that full-radix
> implementations are faster on more primitive platforms that seems
> significant, since optimizing for these is probably more important
> than achieving maximum speed on advanced processors.

I agree.

> Anyways, I'm curious what other people think of this approach, and of
> 2^379 - 19 in particular.

Seems fine to me.  The prime 2^398 - 21 did bug me a little bit for this reason.

> The slides give performance numbers for a "Ted37919" curve based on
> this prime on Sandy Bridge, which I added to the spreadsheet here:

> https://docs.google.com/a/trevp.net/spreadsheet/ccc?key=0Aiexaz_YjIpddFJuWlNZaDBvVTRFSjVYZDdjakxoRkE&usp=sharing#gid=0 <https://docs.google.com/a/trevp.net/spreadsheet/ccc?key=0Aiexaz_YjIpddFJuWlNZaDBvVTRFSjVYZDdjakxoRkE&usp=sharing#gid=0>

NB those numbers do not include point decompression.

Cheers,
— Mike
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://moderncrypto.org/mail-archive/curves/attachments/20150119/1b4ff5f7/attachment.html>