[messaging] Let's run a usability study (was Useability of public-key fingerprints)

Tom Ritter tom at ritter.vg
Mon Mar 10 10:19:51 PDT 2014


On 10 March 2014 09:36, Daniel Kahn Gillmor <dkg at fifthhorseman.net> wrote:
>  0) "reading aloud" comes from a video to keep a constant rate
>
> This does not match my experience of how fingerprints are compared when
> read aloud.  In practice, even if there are multiple listeners, there is
> often pushback from the listeners, asking for a pair of octets to be
> repeated, or for the reader to slow down.  A video played at constant
> speed (which the user cannot control) doesn't feel at all like the
> same interaction, let alone representative of the same cognitive
> challenge.  I'm not sure how to solve this.  Maybe we just focus on the
> business card approach?


My goal with the video was to introduce consistency in the speed of the
reader.  Allowing or not allowing repeats could go either way.

I think I leaned away from allowing repeats, because I suspect that
letting subjects ask for repetition (and telling them they can) will
introduce a bias towards 'Success': people now have a mechanism by
which to ensure success, namely making the reader repeat a lot.  (It's
like taking the time gate away from the time-gated approach.)  But,
like I said, we can change that.

I do have to say that the times I've done read-aloud key fingerprint
comparisons... I'm slow. And I'm usually too embarrassed to make the
other person slow down or repeat themselves.  And I know that
producing a key whose fingerprint matches except for a single
hex-character error (a transposition, say, or one character left out)
is still an extremely computationally difficult task, so I just go
with it ;)
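For a sense of scale, here's some back-of-the-envelope arithmetic (my
own illustration, not anything from the study doc), assuming a 160-bit
(40 hex character) fingerprint and a verifier who would miss a single
wrong hex character anywhere:

    from math import log2

    NIBBLES = 40                      # 160-bit fingerprint, 4 bits per hex char

    exact_work = 2 ** (4 * NIBBLES)   # search space for an exact match

    # Fingerprints a sloppy verifier would accept: the exact one, plus any
    # fingerprint differing in exactly one nibble (40 positions x 15 values).
    accepted = 1 + NIBBLES * 15

    sloppy_work = exact_work / accepted

    print(f"exact match:         ~2^{log2(exact_work):.1f} attempts")
    print(f"one nibble forgiven:  ~2^{log2(sloppy_work):.1f} attempts")

That only takes the attacker from roughly 2^160 down to roughly 2^151
attempts, both far outside anything brute-forceable, which is why I
don't sweat the occasional dropped character.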


>  1) error rate for computationally-chosen flaw
>
> The spec currently suggests we'd run the ssh fingerprint look-alike
> tool.  Assuming this tool works, it seems clear that this would bias the
> results against the hex fingerprint, and toward the pseudoword or
> English word outcome, since the look-alike tool currently embeds some
> knowledge about sensitive cognitive comparisons in the hexadecimal
> output space (e.g. that it is a "better match" to flip two bits in the
> first or second nybble of an octet than to flip one bit in each nybble).
> Maybe we could craft a look-alike tool for each of the different
> mechanisms, encoding whatever domain-specific knowledge we have?


Yes, absolutely.  That wasn't clear in the doc, but I did intend to
create a look-alike tool for each fingerprint type.
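To make "domain-specific knowledge per format" concrete, here's a rough
sketch of the kind of scoring I mean for the hex case (purely
illustrative; hex_similarity is my own made-up function, not how the
existing ssh look-alike tool works).  It encodes the one piece of
knowledge you mention: an octet that differs only in one of its two hex
digits reads as a closer match than an octet that differs in both.

    def hex_similarity(fp_a: str, fp_b: str) -> float:
        """Score in [0, 1]; higher means the fingerprints look more alike."""
        a = fp_a.replace(":", "").replace(" ", "").lower()
        b = fp_b.replace(":", "").replace(" ", "").lower()
        if len(a) != len(b) or len(a) % 2 != 0:
            raise ValueError("expected equal-length hex strings of whole octets")

        score = 0.0
        for i in range(0, len(a), 2):
            hi_same = a[i] == b[i]
            lo_same = a[i + 1] == b[i + 1]
            if hi_same and lo_same:
                score += 1.0      # octet matches exactly
            elif hi_same or lo_same:
                score += 0.5      # difference confined to a single nybble
            else:
                score += 0.1      # both nybbles differ: reads as "furthest"
        return score / (len(a) // 2)

    print(hex_similarity("de:ad:be:ef", "de:ad:be:ef"))   # 1.0
    print(hex_similarity("de:ad:be:ef", "de:ad:be:e0"))   # 0.875, one nybble off

A look-alike generator would then hash candidate keys and keep whichever
fingerprint maximizes the score against the target; the pseudoword and
English-word formats would get analogous scorers with their own notions
of "looks close".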


>  2) prevalence
>
> Are we planning on showing the users three fingerprints: one which
> matches exactly, one with a subtle flaw, and one with the
> computationally chosen flaw?  This would result in a mismatch prevalence
> of 67% -- far, far higher than the prevalence of actual fingerprint
> mismatches in most daily use.  A known baseline prevalence rate seems
> likely to affect the way most users think about these matches.


Hm... so you're saying that because users see 2 out of 3 mismatches on
the first couple of tests, they'll be skewed towards 'Success' later
because they expect more mismatches?

Actually, thinking about it, having an obvious pattern of '2 out of 3
for any type are incorrect' is also likely to skew things quite a bit.

I suppose we need to throw in more matching ones and avoid any
distinguishable pattern.  (This will balloon the number of tests per
subject, though, which I was initially trying to minimize.)
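Something along these lines (a sketch with made-up parameters, nothing
settled) is what I'd imagine for keeping the prevalence lower and the
ordering unpredictable per subject:

    import random

    FORMATS = ["hex", "pseudoword", "english"]   # fingerprint encodings under test
    TRIALS_PER_FORMAT = 10                       # placeholder, not from the doc
    MISMATCH_PREVALENCE = 0.2                    # placeholder, not from the doc

    def build_trials(subject_seed):
        """Return a shuffled list of (format, trial_kind) with no fixed pattern."""
        rng = random.Random(subject_seed)        # per-subject randomization
        trials = []
        for fmt in FORMATS:
            n_bad = round(TRIALS_PER_FORMAT * MISMATCH_PREVALENCE)
            kinds = ["match"] * (TRIALS_PER_FORMAT - n_bad)
            # split the mismatches between the two flaw types
            kinds += ["subtle_flaw" if i % 2 else "computed_flaw" for i in range(n_bad)]
            trials += [(fmt, kind) for kind in kinds]
        rng.shuffle(trials)                      # no "2 of every 3" giveaway
        return trials

    print(build_trials(subject_seed=42)[:5])

The per-subject seed means no two subjects see the same ordering, and
shuffling across formats hides any per-type ratio, at the cost of the
larger trial count I mentioned.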


> I'm trying to think about what the user experience of the study would be
> -- do they see the fingerprints to match in rapid succession?  Are they
> embedded in some other task?  (I think the other task would be the "head
> fake" you describe; I'm not sure what that task would be, though.)

Exactly. And I'm not sure what the head fake is either.  Maybe it's
something like "Do task A, then talk with Bob, which requires this
match, and Bob's going to tell you to do tasks B and C," where tasks
A, B, and C appear to be complicated things we're interested in
measuring.

-tom

