[messaging] Can we do the Crypto Usability Study on Mechanical Turk?

Thu May 1 07:51:12 PDT 2014

Yes, I will be writing up my notes directly on GitHub, as I
mentioned. These are from a professional UX researcher on experimental
design for an academic study.

c

On Thu, May 1, 2014 at 4:02 PM, Joseph Bonneau <jbonneau at gmail.com> wrote:
>> 2 persons should be enough for a pilot to work out kinks, and 30
>> people would work in an academic setting for the statistics we want.
>
>
> (a) Could you present a sketch of the experimental design? Exactly what
> would each participant be asked to do step-by-step?
> (b) Given that, could you make a statistical power argument for why N=30
> is satisfactory for the effect size you're hypothesizing exists?
>
> Putting external validity concerns aside, I'm very skeptical that N=30 is
> close to enough.
>
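
For reference, the kind of power argument asked for in (b) can be
sketched as below. This is only an illustration under assumptions that
are not in the thread: a between-subjects comparison of two conditions,
a two-sided test at alpha = 0.05, 80% power, the normal approximation,
and Cohen's d values picked as examples. The function names are
hypothetical.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate participants needed per group to compare two means.

    Normal approximation: n = 2 * ((z_(1-alpha/2) + z_power) / d)^2.
    An exact t-test calculation gives slightly larger numbers.
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect_size_d) ** 2)

def min_detectable_d(n_group, alpha=0.05, power=0.80):
    """Smallest Cohen's d detectable with n_group people per group."""
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) * math.sqrt(2 / n_group)

if __name__ == "__main__":
    # Illustrative effect sizes only; the thread does not state one.
    for d in (0.2, 0.5, 0.8):
        print(f"d = {d}: about {n_per_group(d)} participants per group")
    # N=30 split evenly across two conditions leaves 15 per group.
    print(f"15 per group detects d >= {min_detectable_d(15):.2f}")
```

Under these assumptions, 15 people per condition would only reliably
detect an effect of roughly one standard deviation, and splitting 30
people across more conditions makes each group smaller still.
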
>> A Mech Turk study could be complementary, but it doesn't really
>> replace an in-person one. The idea for a scientific study would be to
>> vary one variable at a time (the fingerprint mechanism), whereas a
>> Mechanical Turk study potentially varies thousands (lighting, time of
>> day, tiredness, screen, instructions, etc.).
>
>
> I disagree strongly. Several of the factors you mentioned (tiredness, time
> of day) are going to vary regardless of an in-person vs. online study. These
> factors are addressed by randomly assigning participants to treatments and
> doing statistical testing to control for variance. Larger sample size is the
> only real cure. Since they will vary in real life (people don't only check
> fingerprints at 11 AM on a 17-inch screen), you're losing external validity
> by fixing them, even if it may slightly reduce experimental noise.
>
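
The random assignment described above can be sketched as follows. This
is a minimal example, not a description of the planned study: the
participant IDs, the treatment labels, and the function name are all
hypothetical. Shuffling first and then dealing treatments round-robin
keeps the group sizes balanced while leaving uncontrolled factors
(tiredness, time of day, screen) unrelated to the treatment on average.

```python
import random
from collections import Counter

def assign_treatments(participant_ids, treatments, seed=None):
    """Randomly assign each participant to one treatment, balanced by count."""
    rng = random.Random(seed)  # a fixed seed makes the assignment reproducible
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    # Deal treatments round-robin over the shuffled order.
    return {pid: treatments[i % len(treatments)]
            for i, pid in enumerate(shuffled)}

if __name__ == "__main__":
    # Hypothetical fingerprint mechanisms, used only as labels here.
    mechanisms = ["hex", "words", "image"]
    assignment = assign_treatments(range(30), mechanisms, seed=2014)
    print(Counter(assignment.values()))
```
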
> The instructions given are strictly easier to control in an online study and
> you largely eliminate the potential for observer-expectancy effects.
>
> Most importantly, with an online study you'll get a much more diverse
> participant pool, and they'll do your task in the middle of tons of other
> computer work in their own home, which is far closer to reality than driving
> into a lab and sitting down to do research.
>
>> It could be good to do the scientific
>> study to inform a larger Mechanical Turk study varying more variables.
>>
>> As for so many trials, the "learning effect" is a thing, and you can
>> account for it statistically by varying the order the trials are
>> presented in per subject.
>
>
> Yes, but with N=30 you don't have nearly enough people to present every
> possible permutation of 10 treatments, so you will be introducing potential
> error. Also, this approach further weakens validity: In real life people
> never "learn" how to compare fingerprints by trying 10 different ways to do
> it in a row.
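
To put numbers on the ordering point: 10 treatments can be presented in
10! different orders, far more than 30 participants can cover. A common
partial remedy is a Latin square, which only guarantees that each
treatment appears once in every position. The sketch below uses a
simple cyclic square as an assumption for illustration; nothing in the
thread says which counterbalancing scheme, if any, is planned, and a
cyclic square balances position only, not which treatment precedes
which.

```python
import math

def cyclic_latin_square(n):
    """Return n presentation orders; row r is the order 0..n-1 rotated by r."""
    return [[(row + col) % n for col in range(n)] for row in range(n)]

n_treatments = 10
n_participants = 30

# Every possible presentation order of 10 treatments.
full_orders = math.factorial(n_treatments)

# Ten orders in which each treatment occupies each position exactly once.
square = cyclic_latin_square(n_treatments)

if __name__ == "__main__":
    print(f"possible orders: {full_orders}")
    print(f"orders covered by one Latin square: {len(square)}")
    print(f"participants per order with N={n_participants}: "
          f"{n_participants // len(square)}")
```
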