[messaging] Let's run a usability study (was Useability of public-key fingerprints)

Mon Apr 7 10:36:49 PDT 2014

On Mon, Apr 7, 2014 at 8:48 AM, Tom Ritter <tom at ritter.vg> wrote:
> On 27 March 2014 21:24, Trevor Perrin <trevp at trevp.net> wrote:
>>>  - We have two participants speaking fingerprints aloud to each other.
>>> Do we want them to do it over a cell phone to add difficulty, or just
>>> omit that bit?
>>
>> I think speaking over a phone is a good case to test, because in
>> person you could also use QR codes, or just look at each other's
>> screens.  A landline might provide more consistent voice quality than
>> cellphone.
>
> It's true two tests over cellphones may get different voice quality.
> I tend to think the difference between tests would in this case be
> okay, because it's a real-world factor people have to deal with... but
> I could go either way.

My concern is that if you do a bunch of base32 tests one day and
sentence tests the next, differences in the phones used or cell
reception might bias the results.

Landlines might make it easier to provide a constant voice-quality
experience.  Alternatively, if you mixed the tests across days and
across different cellphones, maybe you could argue that the
voice-quality variability is the same for everything, though that
seems more complicated.
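
To make the "mix everything" option concrete, here is a rough sketch of
what such a schedule could look like.  The format names, phone labels,
and number of days are placeholders I made up for illustration, not
part of the study plan:

```python
import itertools
import random


def mixed_schedule(formats, phones, days, seed=0):
    """Run every (format, phone) pair on every day, in shuffled order.

    Because each format meets each phone on each day, day-to-day and
    phone-to-phone voice-quality differences fall on all formats alike.
    """
    rng = random.Random(seed)  # fixed seed so the schedule is reproducible
    schedule = {}
    for day in range(1, days + 1):
        sessions = list(itertools.product(formats, phones))
        rng.shuffle(sessions)  # randomize session order within the day
        schedule[day] = sessions
    return schedule


# Hypothetical inputs: two fingerprint formats, two phones, two days.
schedule = mixed_schedule(["base32", "sentences"], ["phone_a", "phone_b"], days=2)
for day, sessions in schedule.items():
    print(day, sessions)
```

This only balances formats against phones and days; it says nothing
about how participants are assigned, which the researchers would still
need to decide.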

Perhaps just flag this debate, and Christine's UI researchers could
decide, since I think this is getting close to the point at which
professionals should be involved.

>> Other:
>>
>>  *  I think a printed biz card may not work well w/high-resolution
>> visual fingerprints, so maybe doing it on a small phone screen is
>> better?
>
> I actually wasn't planning on doing a printed visual fingerprint... I
> suppose I could though.  I don't think the phone screen would work,
> because it's not terribly common to have someone's fingerprint on your
> phone and then try to verify it on your desktop.
>
>>  * When comparing aloud, I would suggest also having a time limit
>> designed to provoke a fairly high error rate, so the same methodology
>> is applied to all modes of use.  Otherwise, there are two variables in
>> the read-aloud case (time, error rate), so not as easy to compare
>> different formats.  Having the tester record "how many times the
>> participant asks the other to repeat the last token, slow down, or
>> otherwise change how they're reciting it" seems subjective and
>> unnecessary.
>
> I added in the time limit. I don't think it would be subjective (it
> seems pretty clear to me that if someone asks for a repeat we record
> it, if someone asks for them to slow down, we record it, etc).

What if I say "uhhh...ok" to confirm I've heard the last group, but
also subtly slow down the rate at which you're saying things?  What if
I say "that was an 'A', right?"  What if I repeat every group after
you say it?

>  As for
> necessity, we could of course not capture that information, but I
> feel like it's relevant.  For example: if we get successful results
> for English words with no repetitions, but successful results on
> pseudowords with tons of repetitions - are English words not better?

I'd argue that success rate on the fixed-time distinguishing test is a
better metric.

I.e. if given 12 seconds to talk, people can distinguish
matching/not-matching pseudowords 80% of the time, but only 30% of
the time for English words, that seems more meaningful, even if more
repetitions / interactivity are used for the pseudowords.
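
A minimal sketch of how that metric could be computed, assuming each
trial is recorded as a pair (were the fingerprints actually matching,
did the participant say they matched).  The trial data below is
invented purely to reproduce the 80% / 30% figures in my example; it is
not a result:

```python
def success_rate(trials):
    """Fraction of trials where the participant's verdict was correct.

    Each trial is (actually_matching, participant_said_matching).
    Repetitions and other interaction are deliberately not counted.
    """
    if not trials:
        raise ValueError("no trials recorded")
    correct = sum(1 for truth, verdict in trials if truth == verdict)
    return correct / len(trials)


# Invented data: 10 fixed-time trials per format.
pseudoword_trials = (
    [(True, True)] * 4 + [(False, False)] * 4 + [(False, True)] * 2
)
english_trials = (
    [(True, True)] * 2 + [(False, False)] * 1
    + [(True, False)] * 4 + [(False, True)] * 3
)

print("pseudowords:", success_rate(pseudoword_trials))
print("english:    ", success_rate(english_trials))
```

With every format measured under the same time limit, this single
number can be compared directly across formats.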

>>  * Should there be a handwritten test, i.e. one user handwrites the
>> fingerprint for the other?
>
> I don't think it's worth the additional testing...

It may be too much given resources, but perhaps worth mentioning as
another test which might bring different things to light.

Trevor