[curves] Distribution-ready optimized code

Wed Mar 18 22:53:50 PDT 2015

Suppose you have some amazing new CPU-specific code for your favorite field, curve, key exchange, or whatever. How do
you distribute it in a way that minimizes the effort users need to integrate it into their own applications (presumably
in C or via some FFI)?

As I see it, there are 4 possible approaches:

1. Distribute the assembly. This is the obvious reply, and arguably the best. Nevertheless, this option leaves something
to be desired:
 - ABIs / calling conventions vary between operating systems and/or languages, e.g., SysV ABI vs Windows ABI. This
requires either preprocessor usage or some sort of trampoline (e.g., https://github.com/floodyberry/asm-opt) to adjust
parameters to the implemented convention.
 - Syntaxes also vary, e.g., Intel vs AT&T x86 syntax, Plan9 assembler syntax, etc. This requires either a single
assembler that works with all syntaxes, or distributing multiple versions of the same function.

2. Heavy preprocessor use / code generator. This is the OpenSSL approach, using Perl scripts to output suitable assembly
for the relevant platform. Crypto++ does something similar, but abuses the C preprocessor for this instead. This
approach is not too bad, but it easily makes the code unreadable when supporting multiple instruction sets, platforms,
or other optional features, and it may require fluency in some otherwise unnecessary language.

3. Use compiler intrinsics. This is not always practical, since some instructions do not have suitable compiler
intrinsics to take advantage of. When it is, however, it is still problematic for anything more than prototyping:
performance is wildly dependent on the compiler, version, and switches used, and in some cases the compiler does not
even support the intrinsics. This is OK when the user controls the toolchain, but that is not always the case.

4. Use a "smart" assembler. This is an assembler that is slightly higher level, and acts as a middle ground between
options 1-2 and 3. Besides automatic register allocation, such tools may also easily accommodate syntax and ABI if
necessary. Examples of what I'm thinking here are qhasm (http://cr.yp.to/qhasm.html) or PeachPy
(https://bitbucket.org/MDukhan/peachpy). I like this approach, but the current tools are prototypes at best, and
therefore are not exactly suitable for distribution in their current state.
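
To make the concern in option 3 concrete, here is a minimal sketch in C of a 64x64->128-bit multiply, the kind of
primitive field arithmetic leans on. The function names (mul64, mul64_portable, mul64_selfcheck) are hypothetical, made
up for this illustration. It assumes GCC/Clang expose the unsigned __int128 extension (signalled by __SIZEOF_INT128__)
and that MSVC on x64 provides the _umul128 intrinsic; everything else falls back to plain C. Which path gets compiled,
and how fast it ends up, depends entirely on the user's toolchain.

```c
#include <assert.h>
#include <stdint.h>

#if defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>   /* MSVC x64 only: declares _umul128 */
#endif

/* Portable fallback: schoolbook multiplication on 32-bit halves. */
static void mul64_portable(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t a0 = (uint32_t)a, a1 = a >> 32;
    uint64_t b0 = (uint32_t)b, b1 = b >> 32;
    uint64_t p00 = a0 * b0, p01 = a0 * b1, p10 = a1 * b0, p11 = a1 * b1;

    /* Sum of the three terms that land in bits 32..63; fits in 64 bits. */
    uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10;

    *lo = (mid << 32) | (uint32_t)p00;
    *hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}

/* Picks whatever the compiler offers; the three paths compute the same value. */
static void mul64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
#if defined(__SIZEOF_INT128__)
    /* GCC/Clang extension on 64-bit targets. */
    unsigned __int128 p = (unsigned __int128)a * b;
    *hi = (uint64_t)(p >> 64);
    *lo = (uint64_t)p;
#elif defined(_MSC_VER) && defined(_M_X64)
    /* MSVC intrinsic: returns the low half, stores the high half. */
    unsigned __int64 h;
    *lo = _umul128(a, b, &h);
    *hi = h;
#else
    mul64_portable(a, b, hi, lo);
#endif
}

/* Checks that the selected path agrees with the portable fallback. */
static void mul64_selfcheck(uint64_t a, uint64_t b)
{
    uint64_t h1, l1, h2, l2;
    mul64(a, b, &h1, &l1);
    mul64_portable(a, b, &h2, &l2);
    assert(h1 == h2 && l1 == l2);
}
```

Even in this tiny case the distributor has to maintain three code paths and cannot promise which one a given user will
get, which is why I consider this approach problematic beyond prototyping.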

So what do you guys think? Are there other options I failed to list here? Which do you like best?

Best regards,
Samuel Neves