# [curves] Compact square root algorithm for annoying primes

Mike Hamburg mike at shiftleft.org
Thu Jan 21 15:30:01 PST 2016

```Hello Curves,

I was porting an embedded EC library to work with other curves, and I was wondering if anyone knew a compact, fast, deterministic algorithm for square roots mod NIST p224 (or other annoying primes)?  I’m not worried about constant-time operation, since I probably have to blind anyway to reduce other side channels, but a deterministic algorithm is preferable because of possible real-time constraints.

OpenSSL’s implementation takes something like 4700 multiplications on average, which is more than a point*scalar multiply.

Dan’s paper [1] gives a variant of the Tonelli-Shanks algorithm which is fast and deterministic, requiring only 364 multiplies.  But it’s very heavy, with 1024 precomputed elements.  Reducing the table size makes the algorithm considerably slower.

I’ve tried to combine Dan’s Tonelli-Shanks variant with Sutherland’s algorithm, below.  A variant with 4 precomputed points, a 64-element discrete log table (not as big as it sounds) and 10 or so field elements in memory takes 900 multiplies worst-case (~760 average-case).

Cipolla’s algorithm is nondeterministic.

Is Pocklington the way to go?  It’s technically nondeterministic as well, but if I read [1] correctly the failure probability is 2^-96.  A straightforward implementation probably takes more than 900 multiplies, but hopefully it’s simpler and uses less memory, right?

Is there some other algorithm I’m missing?

Thanks,
— Mike

[1] http://cr.yp.to/papers/sqroot.pdf <http://cr.yp.to/papers/sqroot.pdf>

####################################################

p = 2^224 - 2^96 + 1
F = GF(p)
qnr = F(11)^(2^128-1)

# 3-element precomputed table with roots of unity
precmp = dict((
(3*2^i, qnr^2^(96-3*2^i))
for i in [2,3,4]
))

# 64-element discrete log table
tbits = 6
tlogger = qnr^2^(96-tbits)
dlog_table = dict((
(tlogger^i,2^tbits-i) for i in xrange(2^tbits)
))

def isqrt(x):
# inverse square root of x, assuming x is a QR
s = x^3
s = s^2 * x # 2^3 - 1
u = s^2^3 * s # 2^6 - 1
s = u^2^6 * u # 2^12 - 1
t = s^2^12 * s # 2^24 - 1
s = t^2^24 * t # 2^48 - 1
s = s^2^48 * s # 2^96 - 1
s = s^2^24 * t # 2^120 - 1
s = s^2^6 * u # 2^126 - 1
s = s^2 * x
# here s == x^(2^127-1)

ret = [s,s^2*x] # working powers of return value
rou = [qnr,qnr^2] # working roots of unity
stack = [95]

def leaf(n):
dlog = dlog_table[ret.pop()]>>(tbits-n)
for i in xrange(len(rou)):
for j in xrange(n):
if dlog & (1<<j): ret[i] = ret[i] * rou[i]
rou[i] = rou[i]^2

# max stack depth is 4, so 5+5 field elements in ret+rou
while len(stack):
n = stack.pop()
ret.append(ret[-1]^2^floor(n/2))

if n <= 2*tbits:
leaf(ceil(n/2))
rou.pop()
leaf(floor(n/2))
else:
rou.append(precmp[ceil(n/2)])
stack.append(floor(n/2))
stack.append(ceil(n/2))

return ret[0]
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://moderncrypto.org/mail-archive/curves/attachments/20160121/bd4bfbfc/attachment.html>
```