tom at ritter.vg
Thu Aug 21 14:33:39 PDT 2014
On 21 August 2014 00:47, Tony Arcieri <bascule at gmail.com> wrote:
> On Wed, Aug 20, 2014 at 7:54 AM, Tom Ritter <tom at ritter.vg> wrote:
>> I have strong doubts about accelerometer-based audio pickup in
>> real-world settings.
Yup, that's what I was talking about. Specifically things like:
* As the sampling rate of the gyroscope is limited, one
cannot fully reconstruct a comprehensible speech from
measurements of a single gyroscope. Therefore, we resort
to automatic speech recognition.
* We extract features from the gyroscope measurements using
various signal processing methods and train machine learning
algorithms for recognition. We achieve about 50% success
rate for speaker identification from a set of 10 speakers.
* Our setup consisted of a set of loudspeakers that included
a sub-woofer and two tweeters (depicted in Figure 5).
The sub-woofer was particularly important for experimenting
with low-frequency tones below 200 Hz. The
playback was done at volume of approximately 75 dB to
obtain as high SNR as possible for our experiments. This
means that for more restrictive attack scenarios (farther
source, lower volume) there will be a need to handle low
SNR, perhaps by filtering out the noise or applying some
other preprocessing for emphasizing the speech signal.
* Due to the low sampling frequency of the gyro, a recognition
of speaker-independent general speech would be
an ambitious long-term task.
* Therefore, in this work we
set out to recognize speech of a limited dictionary, the
recognition of which would still leak substantial private
information. For this work we chose to focus on the
digits dictionary, which includes the words: zero, one,
two..., nine, and "oh". Recognition of such words would
enable an attacker to eavesdrop on private information,
such as credit card numbers, telephone numbers, social
security numbers and the like. This information may be
eavesdropped when the victim speaks over or next to the
* This is a subset of a corpus published
in . It includes speech of isolated digits, i.e., 11
words per speaker where each speaker recorded each
word twice. There are 10 speakers (5 female and 5 male).
In total, there are 10 x 11 x 2 = 220 recordings. The corpus
is digitized at 20 kHz.
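To make the sampling-rate point concrete, here is a minimal sketch of
why a slow sensor cannot capture most of the speech band. The 200 Hz
rate below is an assumption for illustration only (the excerpts above
do not state the gyro's sampling rate), and alias_frequency is a
hypothetical helper, not anything from the paper. With ideal sampling,
any tone above half the sampling rate folds back into the range below
it:

```python
def alias_frequency(tone_hz: float, sample_rate_hz: float) -> float:
    """Apparent frequency of a pure tone after ideal sampling at sample_rate_hz."""
    # Find the integer multiple of the sampling rate closest to the tone.
    nearest_multiple = round(tone_hz / sample_rate_hz) * sample_rate_hz
    # The sampled tone is indistinguishable from one at this distance from it.
    return abs(tone_hz - nearest_multiple)


# Assumption for illustration only; not a figure taken from the excerpts.
ASSUMED_GYRO_RATE_HZ = 200.0

for tone in (90.0, 250.0, 1000.0):
    print(tone, "Hz appears at", alias_frequency(tone, ASSUMED_GYRO_RATE_HZ), "Hz")
```

Under that assumed rate, a 250 Hz tone shows up at 50 Hz and cannot be
told apart from a real 50 Hz tone, which is the kind of information
loss that no amount of post-processing recovers.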
When doing this type of research, there's things that can be solved
with more work, and there's things that turn out to just be
limitations, beyond which you're really interpreting data in unfounded
ways. (Like 'Zoom and Enhance' on digital images.) Their work is
great, and I think it's important to push the limits of what can be
done with it to understand the attack surface - just like we're doing
with RF retroreflectors.
But I have my doubts that this could be extended so far as to
enable really good audio pickup from a phone in someone's pocket with
ambient noises around. Similarly, if someone pointed at this research
and said "My laptop has an accelerometer for turning the hard drive
off! I removed the microphone, but am I still vulnerable via this
specific vector?" - I would seriously doubt it.