[messaging] Can we do the Crypto Usability Study on Mechanical Turk?

Thu May 1 02:00:30 PDT 2014

On 1 May 2014 05:18, Joseph Bonneau <jbonneau at gmail.com> wrote:
> [Starting a new thread on this]
>
> Sorry for being a week late, but I read over Tom's proposal at:
> https://github.com/tomrittervg/crypto-usability-study
>
> Having spent much of the last two weeks reviewing papers for SOUPS
> (Symposium on Usable Privacy and Security) and discussing design flaws in
> many security usability studies, there are a few points I'm concerned about:
>
> *Assigning study participants to multiple treatments (e.g. having them test
> multiple methods of fingerprint comparison) introduces a number of issues.
> You have data points that are now correlated, so you can't just average
> performance for each treatment and compare. The statistics get vastly more
> complicated to do correctly and your statistical power goes down. More
> importantly, there are real external validity questions here. Users will
> learn and be more clever and alert after seeing multiple systems. Most users
> will only ever see one.
>
> *While not really stated, I'm imagining the same participant will be asked
> to perform multiple trials for each treatment. This also introduces some
> complexity when doing the statistical analysis, but is more manageable.
>
> *Also not really stated is the "base rate" of errors. If users are doing the
> experiment multiple times, we want the base rate to be extremely low to have
> reasonable validity. In reality, fewer than 1% of fingerprints you ever
> compare are going to mismatch. People do an approximation of Bayesian
> reasoning, and if their prior probability is 0.99 that the fingerprints
> match, this is much different than in an experiment if they're mismatching
> half the time and participants come to expect it. There are two ways around
> this: a deception study (the "head fake" approach as Tom put it) or having
> users do the task many times and the vast majority of fingerprints match.
>
> *This proposal includes 10 experimental treatments. That's a lot. If we
> don't re-use participants, we already need 10 people just to have one person
> try each experiment, and that's assuming the phone method doesn't take two
> people. If we ask them all to do many dummy trials with matching
> fingerprints to screen for errors, this is an utterly impractical study.
>
> Overall I think it will be nigh-impossible to do this study in-person and
> have a sufficient sample size. I propose doing the study online using Amazon
> Mechanical Turk (mTurk). This is now standard for psychology experiments in
> general and security usability experiments as well. While not perfect,
> multiple studies have confirmed this is a much more representative user
> population than in-person studies ever obtain.
>
> I would do the experiment as follows:
> *For the phone comparison method, play an audio recording of somebody
> reading the fingerprint and display it on screen for comparison. This isn't
> perfect, as it's non-interactive, but it's a start.
> *For the business card method, show a JPEG of a business card, and have them
> compare to a version rendered in text.
> *Assign each user to only 1 treatment and have many more total users.
> Because you pay users for time, this is basically cost-neutral.
> *Give each user 50-100 trials, and have perhaps 1-5 of them be incorrect.
> This would best be randomized since occasionally users talk out of band
> about studies. Have a 50-50 mix of random and targeted errors.
>
> Now, for 10 treatments, we can aim for 100 users per treatment. We
> may want to adjust this based on some pilot studies. If the experiment takes
> 10 minutes, this is probably about $1 per user. So we're talking $1000.
> That's a lot, but an in-person user study would be far more.

I like this plan a lot. I am curious whether it's a good idea to give
people so many trials - presumably for most people, verification is a
relatively infrequent thing and so allowing them to become familiar
with it seems like it would bias the results.
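
To make the trade-off concrete, here is a rough sketch of the per-user trial schedule and the cost arithmetic from the quoted proposal. The function names, the default of 3 mismatches in 50 trials, and the per-mismatch coin flip between "random" and "targeted" errors are assumptions for illustration only (the proposal says 50-100 trials with perhaps 1-5 mismatches), and the cost figure ignores any platform fees.

```python
import random


def make_schedule(n_trials=50, n_mismatch=3, seed=None):
    """Build one participant's trial order (hypothetical helper).

    Most trials show matching fingerprints. A few randomly placed
    trials mismatch, each labeled 'random' or 'targeted' with equal
    probability, so the 50-50 mix holds only in expectation.
    """
    rng = random.Random(seed)
    # Pick which trial positions carry a mismatching fingerprint.
    mismatch_positions = set(rng.sample(range(n_trials), n_mismatch))
    schedule = []
    for i in range(n_trials):
        if i in mismatch_positions:
            schedule.append(rng.choice(["random", "targeted"]))
        else:
            schedule.append("match")
    return schedule


def estimated_cost(treatments=10, users_per_treatment=100, pay_per_user=1.00):
    """Total participant payments, one treatment per user."""
    return treatments * users_per_treatment * pay_per_user


if __name__ == "__main__":
    schedule = make_schedule(n_trials=50, n_mismatch=3, seed=1)
    print(schedule.count("match"), "matching trials out of", len(schedule))
    print("Estimated cost: $%.2f" % estimated_cost())
```

Lowering n_trials addresses the familiarity concern, but with the mismatch count held fixed it also raises the base rate of mismatches each participant sees, so the two knobs have to be tuned together.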