> If (a) the users will tolerate enough latency for the MITM box to
> buffer about a word of speech, and (b) the attacker can get some
> advance information about the users' dialect, it sounds like a fun
> problem for a grad student.

I'll believe it when the "uncanny valley" is no longer a thing.

While I don't doubt that in the future we'll have systems that can
edit video and audio in realtime and swap out a SAS, in the meantime I
think our brains do a remarkable job of 1) identifying the voices of people
we know well even if they say just one word, and 2) discriminating between
3D animations and the real thing, to the point that it's disturbing to see
a near-perfect (but slightly off) recreation of a person.

Now compound the innate human ability to detect this with a security
context where people are hopefully inherently skeptical, and you have a
problem that's much harder than it appears at first glance.

> My understanding is that the major part of speaker recognition is the
> 'glottal pulse' (which can easily be extracted from any voiced
> phoneme)

Reminds me of Cryptonomicon where you need an impersonator who understands
the precise timing (errors) in your particular rendition of Morse Code ;)

