[noise] Analysis of Noise KDF

Thu Apr 28 10:06:45 PDT 2016

I wrote up some analysis of Noise's current KDF, including a security
model, security arguments, and comparison to alternatives.

TLDR: Current design has good theoretical support, is similar to other
protocols, and is simple to describe and implement.  I'm not seeing an
alternative that's much better.

Security Model
===============

KDF API:
------
 - Takes a hash function of output length HASHLEN = 256 or 512 bits
 - Called with a sequence of N "input bitstrings"
 - Returns a sequence of N 256-bit "output keys"

Each input bitstring has one of these properties:
------
   - SECRET:  Unknown to the attacker.  High entropy (eg > 128 bits),
from some distribution decided at the start of the game[*].  May be
replayed.
   - FRESH:  Known to the attacker.  High entropy, from some
distribution decided at the start of the game[*].  Has not been sent
to the KDF before.
   - SECRET+FRESH:  High-entropy from the Secret distribution, but has
not been sent to the KDF before, and can only be replayed as Secret,
not Secret+Fresh.
   - MALICIOUS

([*] In practice the Secret and Secret+Fresh distributions are
"typical" DH secrets or uniform pre-shared keys, and the Fresh
distributions are "typical" DH public keys (which Noise uses to
randomize the PSK).  For generality we could allow the attacker to
choose these distributions, subject to entropy requirements.  However
we'll need to assume the distributions are independent from any
initial values for KDF variables, for "randomized extractors".)

Desired output properties:
------
 - Secret inputs to a KDF instance make its following outputs Secret
(and uniformly random).
 - Secret+Fresh inputs to a KDF instance make its following outputs
Secret+Fresh (uniform, and never output before).
 - A Fresh or Secret+Fresh input to a KDF instance sometime after a
Secret input makes its following outputs Secret+Fresh.

Security game:
------
 - Attacker runs instances of the KDF, submitting a sequence of
attacker-chosen inputs to each instance.
 - The attacker can include Secret and/or Fresh inputs anywhere.
 - The attacker can repeat any particular Secret input as much as she
wants, including in different KDF instances.
 - The attacker doesn't see the outputs.
 - The attacker can query the encryption of chosen messages with
Secret+Fresh output keys, and the encryption should be
indistinguishable from random.
 - At the end of the game, the attacker can try to distinguish a
Secret output key from random, provided neither that output key nor
later keys from the same sequence were queried for encryption
(otherwise, there would be a trivial distinguisher).

Current design and "chaining key"
==================================
We'll consider KDF designs that process their input sequence by
updating a "chaining key" (ck) and returning each output key based on
an input bitstring.  The current Noise KDF is simply a chain of:

(ck, output_key) = HKDF(salt=ck, input)

Each HKDF expands to:

HKDF(ck, input):
  temp       = HMAC(key=ck, input)
  new_ck     = HMAC(key=temp, 0x01)
  output_key = HMAC(key=temp, new_ck || 0x02)
  return (new_ck, output_key)

Security Arguments
===================

Analysis #1:  Random Oracle Model
-------------
Hash is a RO, so HMAC is too.  RO queries with a secret input give
secret output.  RO queries with a fresh input give fresh output.

Analysis #2:  ROM for first HMAC; PRF for next HMACs
-------------
The RO extracts uniform secret keys from input secrets.  PRF with a
uniform secret key produces secret values, and PRF with a secret+fresh
key, or secret key and fresh input, produces secret+fresh values.

INTERLUDE:  Random-CK assumption
-----------
Analyses 3 through 5 assume that ck is known-but-random (until a
Secret input, at which point ck becomes secret-and-random).  This is
justified because:

 (1) The initial Noise ck is based on the protocol name.  We assume
the protocol name is chosen independently from the Secret and Fresh
distributions (protocol names are chosen from a rigid process, so are
unlikely to have enough flexibility to have bad interactions with DH
outputs).

 (2) Subsequent ck values are based on input bit-strings processed
through two layers of HMAC.  We assume this is "enough" hashing that
the function from (current ck, input) -> (next ck) can be modelled as
a RO, i.e. an attacker can't "steer" ck to chosen values.

Analysis #3:  Computational extractor for first HMAC; PRF for next HMACs
-------------
The [HKDF] paper analyzes HMAC as a computational extractor, i.e. a
function that extracts uniform output from high-entropy-but-nonuniform
input distributions when given a random salt (salt=ck here, justified
by Random-CK assumption).  Obviously, such an extractor + Random-CK
would achieve the security goals.

Hugo argues this property for HMAC in a few ways (Section 6 of [HKDF]):

 (Lemma 5) If the HMAC inner hash's compression function is "almost
universal", i.e. unlikely to hash a pair of inputs to the same output
when keyed with random IV (which is a weak assumption) then the output
from the inner hash has ~same entropy as the input bit string, which
the outer hash can extract if modelled as a RO.

 (Lemma 6) If the HMAC inner hash's compression function is a
"statistical extractor" and a "Dual-PRF", i.e. a PRF when keyed
through the message, not the IV, then HMAC is a computational
extractor when all hash input blocks have high entropy, a condition
met by 25519 with all current Noise hash functions, and 448 with the
512-bit hash functions.

(Note on "Dual-PRF": The HMAC proof in [BELLARE2006] assumes the
compression function is a "Dual-PRF", i.e. a PRF when keyed either
through the message, or through the IV.  Bellare uses this to go from
NMAC -> HMAC, since the HMAC key is passed through the message input.
Dual-PRF is a reasonable assumption, since hash functions are designed
to be random if any part of the input is random, not just the IV.)

If we zoom out from HMAC (made of inner-then-outer hashes) to HKDF
(made of extract-then-expand HMACs), we realize the KDF is somewhat
fractal, and similar "almost universal + RO" and "dual-PRF" arguments
can be made for the entire HKDF (Analyses 4 through 6):

Analysis #4:  PRF for first HMAC; ROs for next HMACs
-------------
In the terminology of [BELLARE2006], PRF implies "computational
almost-universal" (cAU), i.e. it's hard to choose a pair of inputs
that PRF to the same output when the PRF is keyed randomly.  If the
first HMAC is a PRF, it is unlikely to have detectable collisions for
the Secret and Fresh distributions, with Random-CK assumption.

Thus the output and input to first HMAC have ~same entropy from
attacker's perspective, and the next two HMACs (as independent Random
Oracles) would extract uniform values.

Analysis #5:  Dual-PRF / Weak PRF for first HMAC; PRF for next HMACs
-------------
Assume HMAC (or the inner hash's compression function) is a "Weak PRF"
when keyed through the message.  A "Weak PRF" can only be queried on
random inputs, not chosen inputs, so this is weaker (i.e. better) than
a PRF assumption.

[DGKM2012], Section 7, shows that a PRF or Weak PRF can be made into a
computational extractor if it is keyed by the entropy source and
queried on random inputs.  So the first HMAC (or its inner hash),
interpreted as a Dual-PRF and Weak PRF, is a computational extractor
for the input bit-string given Random-CK.

Analysis #6: Dual-PRF (keyed with ECDH output) for first HMAC; PRF for
next HMACs
-------------
To avoid the Random-CK assumption, we can consider the Dual-PRF as a
"normal" PRF (not weak PRF, like #5).

With Noise's current ECDH functions, and the standard assumption that
the main subgroup on these curves is a "DDH subgroup" [GKR2006], the
input bit-strings are indistinguishable from random x-coordinates in
this subgroup.

Unless the hash function is checking the curve equation and group
order of its inputs (or querying a DDH oracle, if we use something
like Gap-DH for modelling DH), it's reasonable to assume the hash
function will behave the same as if these inputs were entirely random
bit strings. (In fact, a large chunk of the value _will_ be
indistinguishable from random - see [CFP2009]).

So the first HMAC (or its inner hash) can be viewed as applying a PRF
key taken directly from ECDH output.

ALTERNATIVES
=============
The first HMAC is the most interesting, because it's both extracting
from the input bitstring, and combining with ck.

Let's consider alternatives for this "HKDF-Extract" phase:

H() could be HASH(), HMAC(), or HKDF()

Current:
  temp = HMAC(ck, input)

"Prehash":
  temp = HMAC(ck, H(input))

"XOR":
  temp = ck XOR H(input)

Prehash:
 + Slightly improves Analysis #1 and #2, since extraction from input,
and combining with ck, occur in separate ROM queries.
 + Slightly improves Dual-PRF analyses, since the entropy is
RO-extracted into a key prior to the Dual-PRF.
 - However, we already have two layers of extra hashing to mitigate
hash weakness (HMAC has inner and outer hashes; HKDF has extract and
expand HMACs).  A third is probably unnecessary.

XOR:
 + Eliminates the possibility that independent ck and input could
somehow cancel the other's entropy
 - But adds the worry that secret-but-equal values _would_ cancel.  So
this is trading one concern for another, and would require careful
analysis now and with any future changes, to ensure random variables
can't be cancelled.

(In other words, if A and B are random variables, A XOR B preserves
entropy from either input - considering entropy from an attacker's
perspective - without needing assumptions about HMAC(A, B).  But A XOR
A _destroys_ entropy, which HMAC(A, A) is unlikely to do.)

 - Known ck's with a repeated secret input gives related-keys for the
following HMACs, invalidating the PRF assumption for them.

Comparison to other KDFs
---------
IPsec uses a similar chain of HKDF to rekey with new DHs (see
[RFC4306], SK_d (similar to ck), Section 2.18).

The Signal Protocol uses a similar chain of HKDF to rekey its "double
ratchet" with new DHs [DOUBLE_RATCHET].

The OPTLS and TLS 1.3 drafts use a more complicated KDF currently, but
are considering switching to a "Simplified" KDF which is just a chain
of HKDF [TLS].

Hugo's [HKDF] paper endorses this "dual use of salt" in Appendix D.6.

By relying on more widely-used constructions, we benefit from
researchers analyzing the construction:  If a hash function is
discovered to be weak, people will ask "what does this mean for HMAC",
and "what does this mean for HKDF".

Conclusion
===========
After some in-depth analysis, the current KDF still looks pretty good:
 - Simplicity of description and implementation (it's just HKDF,
chained via salt)
 - Good theoretical security arguments
 - Good heuristic security arguments (two extra layers of hashing to
mitigate weakness)
 - Similar to other designs, meaning we benefit from 3rd-party
analysis and it's easier to justify
 - No obvious alternatives that appear better

Trevor

[HKDF] https://eprint.iacr.org/2010/264.pdf
[BELLARE2006] https://cseweb.ucsd.edu/~mihir/papers/hmac-new.pdf
[DGKM2012] https://eprint.iacr.org/2011/708.pdf
[GKR2006] https://eprint.iacr.org/2004/099.pdf
[CFP2009] https://www.di.ens.fr/~fouque/pub/euro09.pdf
[RFC4306] https://tools.ietf.org/html/rfc4306
[DOUBLE_RATCHET] https://github.com/trevp/double_ratchet/wiki
[TLS] https://www.ietf.org/mail-archive/web/tls/current/msg19227.html