[messaging] Can we do the Crypto Usability Study on Mechanical Turk?

Joseph Bonneau jbonneau at gmail.com
Thu May 1 07:02:40 PDT 2014

> 2 persons should be enough for a pilot to work out kinks, and 30
> people would work in an academic setting for the statistics we want.

(a) Could you present a sketch of the experimental design? Exactly what
would each participant be asked to do, step by step?
(b) Given that, could you give a statistical power argument for why
N=30 is sufficient for the effect size you're hypothesizing?

Putting external validity concerns aside, I'm very skeptical that N=30 is
anywhere close to enough.
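
To make the power question concrete, here is a rough sketch of the kind of
calculation I mean. It is not your design (which hasn't been specified): it
assumes a simple between-subjects comparison of two fingerprint mechanisms
with equal group sizes, a standardized effect size d (Cohen's d), a two-sided
test at alpha = 0.05, and a normal approximation in place of the exact t
distribution. The function names are mine, purely for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD = NormalDist()  # standard normal, used for the approximation


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test for effect size d."""
    z_crit = _STD.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)  # expected value of the test statistic
    # Probability of landing in either rejection region.
    return _STD.cdf(shift - z_crit) + _STD.cdf(-shift - z_crit)


def n_per_group_needed(d, power=0.8, alpha=0.05):
    """Approximate participants per group needed to reach the target power."""
    z_alpha = _STD.inv_cdf(1 - alpha / 2)
    z_beta = _STD.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)


if __name__ == "__main__":
    # N=30 split into two groups of 15, assuming a "medium" effect (d = 0.5).
    print(f"power with 15 per group: {approx_power(0.5, 15):.2f}")
    print(f"needed per group for 80% power: {n_per_group_needed(0.5)}")
```

Under those assumptions, 15 per group leaves you well short of the usual 80%
power target for a medium effect, and splitting N=30 across more than two
treatments only makes it worse.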

> A Mech Turk study could be complementary but it doesn't really
> replace. The idea for a scientific study would be to vary one variable
> at a time (the fingerprint mechanism) and a mechanical turk study
> varies potentially thousands (lighting, time of day, tiredness,
> screen, instructions, etc).

I disagree strongly. Several of the factors you mentioned (tiredness, time
of day) are going to vary regardless of an in-person vs. online study.
These factors are addressed by randomly assigning participants to
treatments and doing statistical testing to control for variance. Larger
sample size is the only real cure. Since they will vary in real life
(people don't only check fingerprints at 11 AM on a 17-inch screen), you're
losing external validity by fixing them, even if it may slightly reduce
experimental noise.
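
As a minimal sketch of what random assignment looks like in practice: shuffle
the participant list and deal participants out to treatments in turn, so that
nuisance factors like tiredness or screen size are spread across groups by
chance rather than by the experimenter. The treatment labels and the function
name below are hypothetical placeholders, not anything from the proposed
study.

```python
import random


def assign_treatments(participants, treatments, seed=None):
    """Randomly assign participants to treatments with near-equal group sizes."""
    rng = random.Random(seed)  # seedable so an assignment can be reproduced
    shuffled = list(participants)
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    # Deal the shuffled participants out round-robin to keep groups balanced.
    for i, person in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(person)
    return groups


if __name__ == "__main__":
    # Hypothetical treatment labels, for illustration only.
    groups = assign_treatments(range(30), ["mechanism-A", "mechanism-B", "mechanism-C"], seed=1)
    for name, members in groups.items():
        print(name, sorted(members))
```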

The instructions given are strictly easier to control in an online study
and you largely eliminate the potential for observer-expectancy effects.

Most importantly, with an online study you'll get a much more diverse
participant pool, and they'll do your task in the middle of tons of other
computer work in their own home, which is far closer to reality than
driving into a lab and sitting down to do research.

> It could be good to do the scientific
> study to inform a larger mechanical turk study varying more variables.
> As for so many trials, the "learning effect" is a thing, and you can
> account for it statistically by varying the order the trials are
> presented in per subject.

Yes, but with N=30 you don't have nearly enough people to present every
possible permutation of 10 treatments (10! is about 3.6 million orderings),
so you will be introducing potential error. Also, this approach further
weakens validity: in real life, people never "learn" how to compare
fingerprints by trying 10 different ways to do it in a row.
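
To put numbers on the ordering problem, the sketch below counts the possible
orderings of 10 treatments and builds a simple cyclic Latin square, which is
one common partial fix and not something proposed in this thread. It
guarantees each treatment appears equally often in each position, but it does
not balance which treatment precedes which, so carryover effects remain. The
function names are illustrative only.

```python
from math import factorial


def cyclic_latin_square(k):
    """Return k orderings of treatments 0..k-1, each a rotation of the first."""
    return [[(row + pos) % k for pos in range(k)] for row in range(k)]


def assign_orders(n_participants, k):
    """Give each participant one row of the square, cycling through the rows."""
    square = cyclic_latin_square(k)
    return [square[i % k] for i in range(n_participants)]


if __name__ == "__main__":
    k, n = 10, 30
    print("possible orderings:", factorial(k))
    orders = assign_orders(n, k)
    # Count how often treatment 0 lands in each position across participants.
    counts = [sum(1 for order in orders if order[pos] == 0) for pos in range(k)]
    print("times treatment 0 appears in each position:", counts)
```

With 30 participants that covers 10 of the possible orderings, three times
each.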