[curves] Balancing reduced-radix and full-radix performance for extra-strength primes
mike at shiftleft.org
Mon Jan 19 15:08:19 PST 2015
> On Jan 19, 2015, at 1:47 PM, Trevor Perrin <trevp at trevp.net> wrote:
> At Real World Crypto, Patrick Longa discussed the question of choosing
> efficient field primes for extra-strength curves. He suggested such
> primes should be chosen to strike a balance between efficiency in
> full-radix and reduced-radix implementations (i.e. "saturated" and
> "unsaturated" arithmetic):
> 2^379 - 19 was given as an example.
> Most of the discussion of extra-strength primes I've seen assumes
> reduced-radix implementations, so this is an interesting twist. His
> argument for considering full-radix was:
> 1) Full-radix is more efficient on several platforms ("AMD, "Intel
> Atom, Intel Quark, ARM w/o Neon, microcontrollers"), whereas
> reduced-radix is most advantageous on more advanced processors ("Intel
> desktop/server", "ARM with NEON”).
This makes sense. Some points
* Intel’s desktop/server parts have a really fast MUL and slowish ADC compared to everything else, so they favor reduced-radix computations which have more MULs and fewer ADCs.
* Anything running on a vector unit (NEON) won’t have access to ADC at all, so it will suck on full-radix.
* Medium-to-high-end ARM parts (M4 or higher) have an instruction UMAAL to accelerate full-radix arithmetic, though it still isn’t as fast as NEON. They also have an instruction UMLAL which accelerates reduced-radix arithmetic.
* Low-end ARMs have UMLAL but not UMAAL. Some might not even have that. Parts with UMLAL and not UMAAL will favor reduced-radix.
* Other platforms may be in-between, if they don’t have acceleration for either. A possible exception is RISC architectures with no carry flag and no magic accum register (eg RISC-V), which tend to perform terribly on full-radix code.
* Superscalar cores can multi-issue ADDs for the reduced-radix add/sub routine, but can’t multi-issue ADC.
* The advantage of full-radix decreases as the number of limbs increases, because it’s usually n^2 vs (n+1)^2. That’s why for Curve25519 favors full-radix.
So the message that different processors favor different organizations is correct, and it sort of correlates with CPU size, but not perfectly.
> 2) Full-radix may be safer and easier to implement, since
> reduced-radix requires "Bound analysis" to prevent inadvertent word
> spilling, thus is "error prone, errors are more difficult to catch”.
Maybe. With full-radix it can be annoying to get the carries right, though it’s easier if the size is not exactly a multiple of 64 (eg 2^379 instead of 2^384) because then there’s headroom.
> I'm not qualified to assess (1). But if it's true that full-radix
> implementations are faster on more primitive platforms that seems
> significant, since optimizing for these is probably more important
> than achieving maximum speed on advanced processors.
> Anyways, I'm curious what other people think of this approach, and of
> 2^379 - 19 in particular.
Seems fine to me. The prime 2^398 - 21 did bug me a little bit for this reason.
> The slides give performance numbers for a "Ted37919" curve based on
> this prime on Sandy Bridge, which I added to the spreadsheet here:
> https://docs.google.com/a/trevp.net/spreadsheet/ccc?key=0Aiexaz_YjIpddFJuWlNZaDBvVTRFSjVYZDdjakxoRkE&usp=sharing#gid=0 <https://docs.google.com/a/trevp.net/spreadsheet/ccc?key=0Aiexaz_YjIpddFJuWlNZaDBvVTRFSjVYZDdjakxoRkE&usp=sharing#gid=0>
NB those numbers do not include point decompression.
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the Curves