[messaging] Pour one out for "voice authentication"

Joseph Bonneau jbonneau at gmail.com
Thu Jan 1 11:34:08 PST 2015

A number of schemes (prominently ZRTP) involve both parties computing a
short authentication string from randomness both have contributed during key
exchange. A common idea is to have one or both parties read this string out
over the newly established audio channel to ensure there was no MITM
attack, with the assumption that a MITM cannot change the users'
vocalization of the string to another string without the human users
noticing that the voice does not sound correct. In essence, the assumption
is that Mallory cannot synthesize Alice's voice well enough to fool Bob.
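
For concreteness, here is a minimal sketch of how a short authentication
string can be derived from both parties' random inputs. This is an
illustration only and is not ZRTP's actual derivation: the nonce exchange,
the hash commitment, the function names and the 4-digit format are all
assumptions made for the example.

```python
import hashlib
import secrets


def commit(nonce: bytes) -> bytes:
    """Hash commitment to a nonce (hypothetical helper for this sketch)."""
    return hashlib.sha256(b"commit" + nonce).digest()


def short_auth_string(nonce_a: bytes, nonce_b: bytes, digits: int = 4) -> str:
    """Derive a short decimal string from both parties' random inputs."""
    digest = hashlib.sha256(b"sas" + nonce_a + nonce_b).digest()
    value = int.from_bytes(digest[:4], "big") % (10 ** digits)
    return f"{value:0{digits}d}"


# Alice commits to her nonce before seeing Bob's, so neither side can
# choose its input after learning the other's.
nonce_a = secrets.token_bytes(16)
commitment = commit(nonce_a)
nonce_b = secrets.token_bytes(16)

# Alice reveals her nonce; Bob checks it against the commitment.
assert commit(nonce_a) == commitment

# Both ends compute the same string and compare it out loud.
sas_alice = short_auth_string(nonce_a, nonce_b)
sas_bob = short_auth_string(nonce_a, nonce_b)
```

A MITM running separate exchanges with each side ends up with two
unrelated strings with high probability, so the spoken comparison catches
her only as long as she cannot alter what each party hears the other say.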

We've known this was a shaky assumption and the NSA allegedly invested in
capability to defeat these systems long ago. But last year this came up on
this list [1] and we concluded there wasn't much published on the
feasibility of this attack.

At CCS 2014 a paper came out by Shirvanian and Saxena with some
initial user experiments
(http://www.sheridanprinting.com/CCS2014ep_11_4/ccs/p868.pdf) based on
"voice morphing", which has been a research interest outside the security
community for a while and on which much has been published. The basic idea
of voice morphing is this: Mallory obtains some samples of Alice speaking.
These can be of arbitrary text, but there are also specific training texts
which produce better results. Typically only a few minutes are needed. Mallory
then, in her own natural speaking voice, reads the same text. The voice
morphing software is then trained on the difference between these samples,
and comes up with a "delta" that it has to apply to Mallory's sample to
make it as close as possible to Alice's sample.

Later, Mallory can submit audio of herself speaking any other utterance,
and the voice morphing engine can apply this same delta to produce
plausible audio of Alice speaking this same utterance. This is not science
fiction: there are working open-source implementations available such as
Festvox [2], used in the CCS paper. There are still many limitations, such
as matching emotional inflection and emphasis. But it turns out that for
reading a fingerprint, which is a fairly constrained subset of speech (no
emotion, even pacing), this already works very well.
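
To make the "delta" idea concrete, here is a deliberately oversimplified
sketch. It assumes each utterance has already been turned into a list of
per-frame feature vectors (a hypothetical representation) and models the
delta as a constant offset between the two speakers' average features.
Real voice-conversion software learns a much richer statistical mapping;
none of the names below come from Festvox.

```python
from statistics import fmean

Frame = list[float]  # one feature vector per audio frame (assumed format)


def mean_frame(frames: list[Frame]) -> Frame:
    """Average each feature dimension over all frames."""
    return [fmean(column) for column in zip(*frames)]


def train_delta(mallory_frames: list[Frame], alice_frames: list[Frame]) -> Frame:
    """Learn the offset that moves Mallory's average features onto Alice's."""
    mallory_mean = mean_frame(mallory_frames)
    alice_mean = mean_frame(alice_frames)
    return [a - m for a, m in zip(alice_mean, mallory_mean)]


def morph(frames: list[Frame], delta: Frame) -> list[Frame]:
    """Apply the learned offset to a new utterance spoken by Mallory."""
    return [[x + d for x, d in zip(frame, delta)] for frame in frames]
```

Training happens once on the parallel samples; after that, `morph` can be
applied to anything Mallory says, which is what makes the attack reusable.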

Shirvanian/Saxena's experiments found that untrained human subjects had no
more capacity to distinguish between a "true" voice and a morphed voice
than between a "true" voice and a "true" voice with different background
noise. They also point out some other attack avenues: if you get enough of
Alice's audio (for example, when the fingerprint is read out as hex digits)
you don't even have to synthesize, just re-order samples you already have.
And often in one direction you only need to synthesize "yes" or "looks
good", if only one party reads the fingerprint aloud.
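
The re-ordering attack needs almost no machinery. A rough sketch, assuming
Mallory has already cut one recorded clip of Alice per hex digit (the
`clips` dictionary and the raw-bytes audio representation are hypothetical):

```python
def splice_fingerprint(fingerprint: str, clips: dict[str, bytes]) -> bytes:
    """Concatenate previously recorded per-symbol clips in a new order."""
    symbols = fingerprint.lower()
    missing = sorted(set(symbols) - set(clips))
    if missing:
        raise KeyError(f"no recorded clip for symbols: {missing}")
    return b"".join(clips[symbol] for symbol in symbols)
```

With a hex fingerprint there are only 16 symbols to collect, so a handful
of earlier calls may be enough to cover the whole alphabet. Making the
joins sound natural is a separate problem that this sketch ignores.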

This paper is a bit short of a working attack. For example, they tested
human recognition of voices they had just heard for the first time. Humans
might be better at recognizing long-term social contacts. And you need
considerable engineering to figure out where to splice in your synthesized
audio fingerprints in real time. On the other hand, the users in this
study were primed to be suspicious about the audio. In real use, people
will probably assume it's fine and not be paying much attention.

One could also argue there is still *some* value to voice authentication
because not every attacker will have this capability, and it might not be
able to fool 100% of users 100% of the time. However, the open-source
nature of this tool means we should assume some enterprising individuals
will sell it in a turnkey middlebox to intelligence agencies around the
world soon enough. Furthermore, voice authentication has a high opportunity
cost, because it inherently requires user activity and cognition for which
there is always a very low budget. So I don't think it can be justified on
"defense-in-depth" grounds.

Conclusion: I firmly believe that voice authentication is broken and should
not be deployed in new systems. Synthesis and morphing will only get
better, while human auditory processing will not. There might be some
creative new designs to make humans read speech which is harder to
synthesize, but this is probably a losing battle.

[1] https://www.mail-archive.com/messaging@moderncrypto.org/msg00036.html
[2] http://festvox.org/