[messaging] Can we do the Crypto Usability Study on Mechanical Turk?

Joseph Bonneau jbonneau at gmail.com
Wed Apr 30 21:18:18 PDT 2014

[Starting a new thread on this]

Sorry for being a week late, but I read over Tom's proposal at:

Having spent much of the last two weeks reviewing papers for SOUPS
(Symposium on Usable Privacy and Security) and discussing design flaws in
many security usability studies, there are a few points I'm concerned about:

*Assigning study participants to multiple treatments (e.g. having them test
multiple methods of fingerprint comparison) introduces a number of issues.
You have data points that are now correlated, so you can't just average
performance for each treatment and compare. The statistics get vastly more
complicated to do correctly and your statistical power goes down. More
importantly, there are real external validity questions here. Users will
learn and be more clever and alert after seeing multiple systems. Most
users will only ever see one.

*While not really stated, I'm imagining the same participant will be asked
to perform multiple trials for each treatment. This also introduces some
complexity when doing the statistical analysis, but is more manageable.

*Also not really stated is the "base rate" of errors. If users are doing
the experiment multiple times, we want the base rate to be extremely low to
have reasonable validity. In reality, fewer than 1% of fingerprints you
ever compare are going to mismatch. People do an approximation of Bayesian
reasoning, and a participant whose prior probability is 0.99 that the
fingerprints match behaves very differently from one in an experiment where
fingerprints mismatch half the time and participants come to expect it.
There are two ways around this: a deception study (the "head fake" approach
as Tom put it) or having users do the task many times with the vast
majority of fingerprints matching.

*This proposal includes 10 experimental treatments. That's a lot. If we
don't re-use participants, we already need 10 people just to have one
person try each experiment, and that's assuming the phone method doesn't
take two people. If we ask them all to do many dummy trials with matching
fingerprints to screen for errors, this is an utterly impractical study.
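
To make the base-rate point above concrete, here is a rough sketch of the
Bayesian argument. The two likelihoods (how often a participant feels
"something looks off" when the fingerprints really mismatch, and how often
when they actually match) are made-up numbers for illustration only, not
measurements:

```python
def posterior_mismatch(prior_mismatch, p_flag_given_mismatch, p_flag_given_match):
    """P(mismatch | participant notices something odd), by Bayes' rule."""
    num = prior_mismatch * p_flag_given_mismatch
    den = num + (1.0 - prior_mismatch) * p_flag_given_match
    return num / den

# Hypothetical likelihoods, chosen only to illustrate the effect.
P_FLAG_GIVEN_MISMATCH = 0.80  # assumed: odd feeling when there is a real mismatch
P_FLAG_GIVEN_MATCH = 0.05     # assumed: odd feeling caused by a misreading

for label, prior in [("realistic, 1% mismatch", 0.01), ("lab, 50% mismatch", 0.50)]:
    post = posterior_mismatch(prior, P_FLAG_GIVEN_MISMATCH, P_FLAG_GIVEN_MATCH)
    print(f"{label}: P(mismatch | looks odd) = {post:.2f}")
# realistic, 1% mismatch: P(mismatch | looks odd) = 0.14
# lab, 50% mismatch: P(mismatch | looks odd) = 0.94
```

Under these assumed numbers, the same "looks odd" signal is probably a
misreading at a realistic base rate but almost certainly a real mismatch in
a 50-50 lab setting, so participants in the latter have good reason to act
differently.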

Overall I think it will be nigh-impossible to do this study in-person and
have a sufficient sample size. I propose doing the study online using
Amazon Mechanical Turk (mTurk). This is now standard for psychology
experiments in general and security usability experiments as well. While
mTurk is not perfect, multiple studies have confirmed it gives a much more
representative user population than in-person studies ever obtain.

I would do the experiment as follows:
*For the phone comparison method, play an audio recording of somebody
reading the fingerprint and display it on screen for comparison. This isn't
perfect, as it's non-interactive, but it's a start.
*For the business card method, show a JPEG of a business card, and have
them compare to a version rendered in text.
*Assign each user to only 1 treatment and have many more total users.
Because you pay users for time, this is basically cost-neutral.
*Give each user 50-100 trials, and have perhaps 1-5 of them be incorrect.
This would best be randomized since occasionally users talk out of band
about studies. Have a 50-50 mix of random and targeted errors.
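
The per-user trial list described above could be generated along these
lines. This is only a sketch: the function name, the labels, and the choice
to randomize both the number and the position of the mismatches are
assumptions, and with an odd number of mismatches the split is only
approximately 50-50 (the extra one is made "random" here).

```python
import random

def make_schedule(n_trials=50, n_mismatch=None, seed=None):
    """Return a list of trial labels for one participant (hypothetical helper)."""
    rng = random.Random(seed)
    if n_mismatch is None:
        n_mismatch = rng.randint(1, 5)  # 1-5 incorrect fingerprints per user
    # Pick distinct trial positions for the mismatches.
    positions = rng.sample(range(n_trials), n_mismatch)
    # Roughly half targeted, half random errors, in shuffled order.
    n_targeted = n_mismatch // 2
    kinds = ["targeted"] * n_targeted + ["random"] * (n_mismatch - n_targeted)
    rng.shuffle(kinds)
    schedule = ["match"] * n_trials
    for pos, kind in zip(positions, kinds):
        schedule[pos] = kind + "_mismatch"
    return schedule

schedule = make_schedule(n_trials=50, n_mismatch=4, seed=1)
print(schedule.count("match"), "matching trials out of", len(schedule))
```

Seeding per participant keeps each schedule reproducible for later analysis
while still differing between users.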

Now for 10 treatments, we can aim for 100 users per treatment.
We may want to adjust this based on some pilot studies. If the experiment
takes 10 minutes, this is probably about $1 per user. So we're talking
$1000. That's a lot, but an in-person user study would cost far more.
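
The budget arithmetic, spelled out so it is easy to rerun once pilot studies
suggest different numbers. The function name is made up for illustration,
and the figure excludes any platform fees:

```python
def study_cost(treatments, users_per_treatment, pay_per_user):
    """Total participant payments in dollars (hypothetical helper)."""
    return treatments * users_per_treatment * pay_per_user

total = study_cost(treatments=10, users_per_treatment=100, pay_per_user=1.00)
hourly_rate = 1.00 * (60 / 10)  # $1 for a 10-minute task

print(f"total: ${total:.0f}, implied hourly rate: ${hourly_rate:.2f}")
# total: $1000, implied hourly rate: $6.00
```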