[messaging] Short Auth Strings
Daniel Kahn Gillmor
dkg at fifthhorseman.net
Fri Jan 31 11:17:39 PST 2014
On 01/31/2014 12:24 PM, Trevor Perrin wrote:
> In practice, SAS are mostly used by phone
> protocols, since users can speak the SAS to each other (assuming voice
> impersonation is hard).
Do we have backing for the assumption that "voice impersonation is
hard"? This assumption seems like the Achilles heel of these schemes,
and i wonder how much work has been done to test it.
I see three factors that suggest that the signal-processing work for
such an attack is likely to be within the realm of feasibility:
* most users have extremely low expectations of audio quality for
synchronous voice chat
* there is a large corpus of example conversations of any given user's
voice that are presumably available to a powerful adversary (e.g.
consider the data available to the operators of google voice)
* the words used are not natural to conversation, so stilted or
awkward phrasing is likely to be accepted by the recipient.
Indeed, Wikipedia suggests that the NSA built systems to attack this
problem 8 years ago, citing "Cryptologic Quarterly, Volume 26, Number 4",
which i don't have.
The argument in the ZRTP specification for why this attack isn't a
risk is the following:
> Some questions have been raised about voice spoofing during the short
> authentication string (SAS) comparison. But it is a mistake to think
> this is simply an exercise in voice impersonation (perhaps this could
> be called the "Rich Little" attack). Although there are digital
> signal processing techniques for changing a person's voice, that does
> not mean a MiTM attacker can safely break into a phone conversation
> and inject his own SAS at just the right moment. He doesn't know
> exactly when or in what manner the users will choose to read aloud
> the SAS, or in what context they will bring it up or say it, or even
> which of the two speakers will say it, or if indeed they both will
> say it. In addition, some methods of rendering the SAS involve using
> a list of words such as the PGP word list[Juola2], in a manner
> analogous to how pilots use the NATO phonetic alphabet to convey
> information. This can make it even more complicated for the
> attacker, because these words can be worked into the conversation in
> unpredictable ways. If the session also includes video (an
> increasingly common usage scenario), the MiTM may be further deterred
> by the difficulty of making the lips sync with the voice-spoofed SAS.
> The PGP word list is designed to make each word phonetically
> distinct, which also tends to create distinctive lip movements.
> Remember that the attacker places a very high value on not being
> detected, and if he makes a mistake, he doesn't get to do it over.
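For concreteness, here is a minimal sketch of the kind of SAS rendering the quoted text describes: both ends hash the session key they negotiated, keep a few leading bits, and read those bits aloud as two words. This is not the ZRTP derivation itself; the use of SHA-256, the 16-bit length, the function names, and the placeholder word lists (standing in for the real 256-entry PGP lists) are all assumptions made for illustration.

```python
import hashlib

def sas_bits(shared_secret: bytes, nbits: int = 16) -> int:
    """Return the leading nbits of SHA-256(shared_secret) as an integer.

    Assumption: a real protocol derives the SAS with its own KDF;
    plain SHA-256 is used here only to keep the sketch self-contained.
    """
    digest = hashlib.sha256(shared_secret).digest()
    return int.from_bytes(digest, "big") >> (256 - nbits)

def render_words(sas: int, even_words, odd_words) -> str:
    """Render a 16-bit SAS as two words, one per byte.

    The high byte indexes one list and the low byte the other, so
    that swapped or dropped words are easier to notice.
    """
    high, low = (sas >> 8) & 0xFF, sas & 0xFF
    return f"{even_words[high]} {odd_words[low]}"

# Hypothetical placeholder lists; the real PGP word list has 256
# distinct English words in each of its two lists.
EVEN = [f"even{i:03d}" for i in range(256)]
ODD = [f"odd{i:03d}" for i in range(256)]

# With no MiTM, both ends hold the same key and read the same words.
alice = render_words(sas_bits(b"session key"), EVEN, ODD)
bob = render_words(sas_bits(b"session key"), EVEN, ODD)
print(alice == bob)
```

A MiTM leaves each end holding a different key, so the two renderings disagree except with probability about 2^-16 for a 16-bit SAS. The cryptography only guarantees that much; everything past that point depends on the humans actually hearing each other's words, which is where the spoofing question comes in.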
There is already automated work done on phrasing analysis that is
sufficient to decrypt some VBR voice streams,
so i wouldn't be surprised if analysis of cleartext streams (given large
corpuses) could do reasonable guesswork about where to place the special
If the effectiveness of the authentication mechanism relies on either
(a) high-enough quality video such that out-of-sync lips are noticeable
or (b) clever humans smoothly working words like "miser" and
"retraction" into conversation, can we consider this protocol to be much
better than fingerprint exchange (which humans are also bad at)?
i'd be curious to know what other people's intuitions are on this. I
haven't done signal-processing work in years, and i know very little
about speech timing analysis mechanisms.