[messaging] Can we do the Crypto Usability Study on Mechanical Turk?

Thu May 1 02:18:35 PDT 2014

Hey all,

Just to let you know, I had lunch with a UX researcher to discuss the
proposal. I have some thoughts and notes, which I will type up this week.

Two people should be enough for a pilot to work out the kinks, and 30
people would work in an academic setting for the statistics we want.

A Mech Turk study could be complementary, but it doesn't really
replace a controlled one. The idea for a scientific study would be to
vary one variable at a time (the fingerprint mechanism), whereas a
Mechanical Turk study potentially varies thousands (lighting, time of
day, tiredness, screen, instructions, etc.). It could be good to do the
scientific study first to inform a larger Mechanical Turk study varying
more variables.

As for so many trials, the "learning effect" is a thing, and you can
account for it statistically by varying the order the trials are
presented in per subject.
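
To make the order-varying idea concrete, here is a rough sketch of one
common way to do it: a cyclic Latin square, where each treatment shows
up in each position equally often across subjects. The treatment labels
and function names are placeholders I made up for illustration, not
anything from Tom's proposal.

```python
import random


def latin_square_orders(treatments):
    """Return n orderings; row i is the treatment list rotated left by i.

    Every treatment appears exactly once in every position across rows.
    """
    n = len(treatments)
    return [[treatments[(i + j) % n] for j in range(n)] for i in range(n)]


def assign_orders(subject_ids, treatments, seed=0):
    """Give each subject one row of the square, cycling through the rows.

    The rows are shuffled once (with a fixed seed for reproducibility)
    so that the mapping from subject to order is not predictable.
    """
    rng = random.Random(seed)
    rows = latin_square_orders(treatments)
    rng.shuffle(rows)
    return {s: rows[k % len(rows)] for k, s in enumerate(subject_ids)}


if __name__ == "__main__":
    # Hypothetical treatment labels, for illustration only.
    treatments = ["phone", "business_card", "on_screen"]
    for subject, order in assign_orders(range(6), treatments).items():
        print(subject, order)
```

Note that a plain cyclic square only balances position. It does not
balance which treatment comes right before which, so if carryover
between specific mechanisms is a worry, a different design would be
needed. The analysis would also still need to include trial order as a
factor.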

C

On Thu, May 1, 2014 at 11:00 AM, Ben Laurie <ben at links.org> wrote:
> On 1 May 2014 05:18, Joseph Bonneau <jbonneau at gmail.com> wrote:
>> [Starting a new thread on this]
>>
>> Sorry for being a week late, but I read over Tom's proposal at:
>> https://github.com/tomrittervg/crypto-usability-study
>>
>> Having spent much of the last two weeks reviewing papers for SOUPS
>> (Symposium on Usable Privacy and Security) and discussing design flaws in
>> many security usability studies, there are a few points I'm concerned about:
>>
>> *Assigning study participants to multiple treatments (e.g. having them test
>> multiple methods of fingerprint comparison) introduces a number of issues.
>> You have data points that are now correlated, so you can't just average
>> performance for each treatment and compare. The statistics get vastly more
>> complicated to do correctly and your statistical power goes down. More
>> importantly, there are real external validity questions here. Users will
>> learn and be more clever and alert after seeing multiple systems. Most users
>> will only ever see one.
>>
>> *While not really stated, I'm imagining the same participant will be asked
>> to perform multiple trials for each treatment. This also introduces some
>> complexity when doing the statistical analysis, but is more manageable.
>>
>> *Also not really stated is the "base rate" of errors. If users are doing the
>> experiment multiple times, we want the base rate to be extremely low to have
>> reasonable validity. In reality, fewer than 1% of fingerprints you ever
>> compare are going to mismatch. People do an approximation of Bayesian
>> reasoning, and if their prior probability is 0.99 that the fingerprints
>> match, this is much different than in an experiment if they're mismatching
>> half the time and participants come to expect it. There are two ways around
>> this: a deception study (the "head fake" approach as Tom put it) or having
>> users do the task many times and the vast majority of fingerprints match.
>>
>> *This proposal includes 10 experimental treatments. That's a lot. If we
>> don't re-use participants, we already need 10 people just to have one person
>> try each experiment, and that's assuming the phone method doesn't take two
>> people. If we ask them all to do many dummy trials with matching
>> fingerprints to screen for errors, this is an utterly impractical study.
>>
>> Overall I think it will be nigh-impossible to do this study in-person and
>> have a sufficient sample size. I propose doing the study online using Amazon
>> Mechanical Turk (mTurk). This is now standard for psychology experiments in
>> general and security usability experiments as well. While not perfect,
>> multiple studies have confirmed this is a much more representative user
>> population than in-person studies ever obtain.
>>
>> I would do the experiment as follows:
>> *For the phone comparison method, play an audio recording of somebody
>> reading the fingerprint and display it on screen for comparison. This isn't
>> perfect, as it's non-interactive, but it's a start.
>> *For the business card method, show a JPEG of a business card, and have them
>> compare to a version rendered in text.
>> *Assign each user to only 1 treatment and have many more total users.
>> Because you pay users for time, this is basically cost-neutral.
>> *Give each user 50-100 trials, and have perhaps 1-5 of them be incorrect.
>> This would best be randomized since occasionally users talk out of band
>> about studies. Have a 50-50 mix of random and targeted errors.
>>
>> Now for 10 treatments, we can aim for 100 users per treatment. We
>> may want to adjust this based on some pilot studies. If the experiment takes
>> 10 minutes, this is probably about $1 per user. So we're talking $1000.
>> That's a lot, but an in-person user study would be far more.
>
> I like this plan a lot. I am curious whether it's a good idea to give
> people so many trials - presumably for most people, verification is a
> relatively infrequent thing and so allowing them to become familiar
> with it seems like it would bias the results.
> _______________________________________________
> Messaging mailing list
> Messaging at moderncrypto.org
> https://moderncrypto.org/mailman/listinfo/messaging