[curves] Distribution-ready optimized code

Thu Mar 19 06:15:18 PDT 2015

I really wish C were a good option here.

I was trying to get C working with vectorization hints when that's 
possible, and intrinsics/asm when it isn't.  Unfortunately, the current 
clang can't vectorize its way out of a paper bag, and GCC isn't much 
better.  So I've had to fall back to extensions and intrinsics a lot.  
But if there were some set of carefully written constructs that would 
make most of the things that need to be fast, fast, then it would lessen 
the amount of hairy, processor-specific code needed to deal with the rest.
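
To illustrate the kind of construct meant here, a minimal sketch follows. It is not code from this thread: the name `fe_add`, the 8 x 32-bit limb layout, and the choice of SSE2 as the intrinsic path are all assumptions made for the example. The portable branch relies on `restrict` and a fixed trip count as vectorization hints; the other branch spells the same operation out with intrinsics for compilers that won't vectorize it on their own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

#define NLIMBS 8  /* hypothetical layout: 8 x 32-bit limbs per element */

/* Limb-wise addition with no carry propagation (hypothetical helper).
 * `restrict` promises the compiler that out, a and b do not overlap. */
static void fe_add(uint32_t *restrict out,
                   const uint32_t *restrict a,
                   const uint32_t *restrict b)
{
    /* Document the no-aliasing contract implied by restrict. */
    assert(out != a && out != b);

#if defined(__SSE2__)
    /* Explicit path: four 32-bit limbs per 128-bit register. */
    for (size_t i = 0; i < NLIMBS; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_add_epi32(va, vb));
    }
#else
    /* Portable path: a simple fixed-length loop the compiler is free
     * to vectorize (or not) for whatever the target happens to be. */
    for (size_t i = 0; i < NLIMBS; i++)
        out[i] = a[i] + b[i];
#endif
}
```

Both branches compute the same result, so callers never see which one was compiled in; the processor-specific part stays confined to one `#if` block.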

If anyone has tips for making fast platform-independent C code, though, 
I'm all ears.  Or a competitive numerics library for C++.

Or maybe I should try FORTRAN?

-- Mike

On 3/18/2015 10:53 PM, Samuel Neves wrote:
> Suppose you have some amazing new CPU-specific code for your favorite field, curve, key exchange, or whatever. How do
> you distribute it in a way that minimizes its user's effort to integrate it in their own applications (presumably in C
> or via some FFI interface)?
>
> As I see it, there are 4 possible approaches:
>
> 1. Distribute the assembly. This is the obvious reply, and arguably the best. Nevertheless, this option leaves something
> to be desired:
>    - ABIs / calling conventions vary between operating systems and/or languages, e.g., SysV ABI vs Windows ABI. This
> requires either preprocessor usage or some sort of trampoline (e.g., https://github.com/floodyberry/asm-opt) to adjust
> parameters to the implemented convention.
>    - Syntaxes also vary, e.g., Intel vs AT&T x86 syntax, Plan9 assembler syntax, etc. This either requires a single
> assembler that works with all syntaxes, or distributing multiple versions of the same function.
>
> 2. Heavy preprocessor use / code generator. This is the OpenSSL approach, using Perl scripts to output suitable assembly
> for the relevant platform. Crypto++ does something similar, but abuses the C preprocessor for this instead. This
> approach is not too bad, but it easily makes the code unreadable when supporting multiple instruction sets, platforms,
> or other optional features. It may also require fluency in some otherwise unnecessary language.
>
> 3. Use compiler intrinsics. This is not always practical, since some instructions do not have suitable compiler
> intrinsics to take advantage of. When it is, however, it is still problematic for anything more than prototyping:
> performance is wildly dependent on the compiler, version, and switches used. In some cases the compiler does not even
> support the intrinsics. This is OK when the user can control these, but that is not always the case.
>
> 4. Use a "smart" assembler. This is an assembler that is slightly higher level and acts as a middle ground between options 1-2
> and 3. Besides automatic register allocation, such tools may also easily accommodate things like syntax and ABI if
> necessary. Examples of what I'm thinking here are qhasm (http://cr.yp.to/qhasm.html) or PeachPy
> (https://bitbucket.org/MDukhan/peachpy). I like this approach, but the current tools are prototypes at best, and
> therefore are not exactly suitable for distribution in their current state.
>
> So what do you guys think? Are there other options I failed to list here? Which do you like best?
>
> Best regards,
> Samuel Neves