[curves] The great debate over point formats

Samuel Neves sneves at dei.uc.pt
Fri Jan 31 01:00:38 PST 2014


On 31-01-2014 06:45, Robert Ransom wrote:
>> Additionally, we considered not
>> > only vector or hardware implementations, but also the fast integer
>> > multipliers already available to software implementations in many
>> > platforms. Of course, these could require specialized assembly-language
>> > multipliers for optimal performance. You can find some brief notes below.
> You seem to be assuming that (a) the implementor can use (an
> equivalent of) the Intel ADC instruction, and (b) ADC is fast.  (a)
> does not hold for C implementations; according to the Ed25519 paper,
> (b) does not hold for whichever of Intel's AMD64 chips was newest at
> that time.


C implementations can either use __uint128_t or use a GMP-style
longlong.h, which contains a handful of useful inline assembly macros
for a number of architectures. This is not as good as intrinsics, but
it's better than ignoring the carry flag.

AMD's K10 has long been the gold standard when it comes to integer
arithmetic. It can perform 3 ADCs per cycle, with a latency of 1 cycle.
Intel * Bridge is not so bad, at 2 cycles of latency (1 if the second
operand is 0, i.e. ADC R, 0). You can see in eBATS that the speed
difference between amd64-51 and amd64-64 is less than 1% on Sandy Bridge
(amd64-51 is 50% worse on K10), so it's not as big of a deal as it was
with previous microarchitectures.

One of the most annoying things about ADC is that the MUL instruction
ruins the flags, which ends up resulting in long dependency chains and
therefore higher latency. Haswell's response to this was the
introduction of MULX, which is a MUL that does not affect the flags.
This lets you hide the cost of some ADC behind the latency of MULX.
Well-tuned Haswell code can beat the K10.

Intel is also adding 2 "carry" flags in the future, so that you can hide
latencies even further. On the other hand, there will also be 512-bit
vector units capable of performing 16 32x32->64 multiplications per
cycle, so perhaps taking vectorization ease into account is warranted.




More information about the Curves mailing list