[curves] Granger-Moss primes for 32 bit implementations

Wed Sep 27 12:17:51 PDT 2017

On Wed, Sep 27, 2017 at 8:01 AM, Robert Granger <robbieg.ranger at gmail.com>
wrote:

> Hi Kyle,
>
> Thanks for cc'ing me into this conversation. I'm glad you have been
> looking at the potential for vectorisation, since this is an aspect that
> has not been studied in detail yet for these primes, as far as I know.
>

I should explain what I did in a little detail then.
I flipped the order of summation for the odd terms. Instead of z_1 = (x_9 -
x_11)(y_11 - y_9) ... (x_1 - x_0)(y_0 - y_1)
This gives you a 2x vectorization of scalar vs vector z_0, z_1 = (x_1 -
(x_18,x_0))((y_18,y_0) - y_1) . . .
To get wider vectorization, you can work across the square diagonally.
16, 17 = (7,8)*9,
16,17,18 += (6,7,8)*10
16,17,18 += (5,6,7)*11
Because shuffling/permutation is relatively expensive, you can re-use the
shuffled vectors with new broadcasts:
12,13 = (5,6)*7
16,17,18 += (4,5,6)*12
12,13,14,15 += (4,5,6,7)*8
I realize now that I write this, that I could have flipped the even terms
instead, and then worked from the top down. It would probably be easier to
follow.

To get the shuffled vectors, you can either do an overlapping load/insert
and then shuffle. You can use those vectors 3 times before needing to
reload.
Intel multiplies the even slots in a widening multiply, so you can put the
values you'll need later in the odd slots

for instance to get the 3 vectors (6,7,8,9), (5,6,7,8), (4,5,6,7), you
construct the 8x32 vector
6 4 7 5 | 8 6 9 7.
The overlap is because moving values across lanes has higher latency.

You can also do a simple load and permute:
6 2 7 3 | 8 4 9 5

It's a tradeoff. For squaring, the overlapping load technique is faster.
For multiplication, because there are 2 permutes to hide some of the
latency (3 cycles) and because
of the increased register pressure to perform the overlapped load vs a
straight load, the permute is faster.

To facilitate the overlapping loads, you allocate extra room for the
coefficients and then copy the low coefficients into the high area before
starting.

>
> I just thought I should point out that the advantages you note for t near
> a power of 2, such as t = 2^26 - 2, were already detailed in the original
> paper of Chung and Hasan, "More Generalized Mersenne Numbers", Section 5.1
> (reference [13] in my paper with Moss). It is only when t is not of such a
> nice form that one should use the Montgomery representation. However, the
> multiplication we introduced should be used in this case.
>

Thanks for the pointer. I had read some of their papers, but had missed
that point.
C doesn't have to be that small, a small safety margin (I'd have to do the
math, but a few bits) less than sqrt(t) should suffice, because you have to
reduce twice anyway, and the first reduction is overkill.
For multiprecision, you want c to be smaller so that the 2nd reduction is
as few words as possible.
If you use signed coefficients, c can be positive as well, which isn't in
the Chung Hasan paper.

> All the best,
>  - Rob Granger
>
>
Thanks.

It means a lot to know I'm not wildly off the mark.
Kyle
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://moderncrypto.org/mail-archive/curves/attachments/20170927/4336e0d4/attachment.html>