[messaging] Word list for English Words

Michael Hamburg mike at shiftleft.org
Wed Jun 4 13:37:56 PDT 2014


A while ago, I tried to make a word list by compiling several other lists (Wikipedia, Google frequency analysis, COCA, 12dicts).  I limited it to words of 6 letters or fewer.  I never made a finished product, but I made a few alternatives with different trade-offs.  We’d still need to remove potentially offensive words, rare words that slipped through, slang, jargon, and celebrity names.  Here are some samples from a 3470-word list (attached).  11.76 bits/word, 11 words for 128 bits.

per - trip - system - inside - remark - wife - drum - fringe - pulp - japan - atomic

seal - boiler - scream - fault - earl - studio - weird - former - effect - number - skull

eddy - sperm - jam - jobs - flank - coffin - say - salad - time - canton - sonic

craft - plaza - dog - crisis - pepper - autumn - crew - wreck - paper - scarce - metro

bolt - earn - cinema - pep - cash - breed - than - fury - pupil - bad - tribal

feast - vase - begin - mortal - hamlet - java - hogan - sonic - escort - reform - veto

fruit - fund - off - alone - draw - glue - crime - barley - guy - pact - expand

golf - wage - alloy - sore - serial - gig - amount - exact - comet - plea - garb

insert - arrest - fee - tap - rand - dame - wrong - fork - dance - shear - noon

cam - acute - must - seldom - strong - weigh - bunker - port - brill - sang - aside

I think with a little more work this could be a good list, at least for printed fingerprints.  It has many more words than Basic English or Mnemonicode, but its words are shorter and simpler than Mnemonicode's.  However, you can see that maybe 5% of the words are questionable: in particular “hogan”, “sperm”, “canton”, “brill” and “rand”.
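
For reference, the bits-per-word and word-count figures quoted for these lists come from the base-2 logarithm of the list size. Here is a minimal Python sketch, assuming each word is drawn uniformly from the list and a 128-bit target. The function names and the tiny demo list are hypothetical, not taken from rough.txt or from the keyname code.

```python
import math

def bits_per_word(list_size: int) -> float:
    """Entropy of one uniformly chosen word, in bits."""
    return math.log2(list_size)

def words_needed(list_size: int, target_bits: int = 128) -> int:
    """Smallest word count whose combinations cover 2**target_bits values.

    Uses integer arithmetic to avoid floating-point edge cases.
    """
    count, span = 0, 1
    while span < 2 ** target_bits:
        span *= list_size
        count += 1
    return count

def encode(value: int, words: list[str], count: int) -> list[str]:
    """Write `value` in base len(words), most significant word first."""
    base = len(words)
    if not 0 <= value < base ** count:
        raise ValueError("value does not fit in the requested word count")
    digits = []
    for _ in range(count):
        value, digit = divmod(value, base)
        digits.append(words[digit])
    return digits[::-1]

print(round(bits_per_word(3470), 2))  # 11.76
print(words_needed(3470))             # 11
print(words_needed(256))              # 16 (an 8-bit list such as the PGP words)
```

With a real list, `encode(fingerprint_as_int, words, words_needed(len(words)))` would give one way to produce a word string like the samples above. Other encodings, such as fixed bit groups per word, are equally possible.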

This list makes no attempt to be acoustically distinct, which is a major advantage of the PGP word list.  Also, for an audio-focused word list, we might want to clamp on syllables instead of on letters.
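
To illustrate the difference between limiting by letters and limiting by syllables, here is a rough Python sketch. The syllable counter is only a vowel-group heuristic, assumed here for illustration (it miscounts words such as "studio"). A real audio-focused list would more likely use a pronunciation dictionary. All function names and thresholds are hypothetical.

```python
import re

def rough_syllables(word: str) -> int:
    """Crude estimate: count runs of vowels, discounting a silent final 'e'."""
    w = word.lower()
    count = len(re.findall(r"[aeiouy]+", w))
    # A final 'e' is usually silent ("crime"), except in "-le" ("people").
    if w.endswith("e") and not w.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

def clamp_by_letters(words, max_letters=6):
    """Keep words with at most `max_letters` letters."""
    return [w for w in words if len(w) <= max_letters]

def clamp_by_syllables(words, max_syllables=2):
    """Keep words with at most `max_syllables` estimated syllables."""
    return [w for w in words if rough_syllables(w) <= max_syllables]

sample = ["drum", "system", "cinema", "fringe"]
print(clamp_by_letters(sample))    # ['drum', 'system', 'cinema', 'fringe']
print(clamp_by_syllables(sample))  # ['drum', 'system', 'fringe']
```

The two filters disagree on "cinema": it fits the 6-letter limit but has three syllables.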

Cheers,
— Mike

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: rough.txt
URL: <http://moderncrypto.org/mail-archive/messaging/attachments/20140604/70e244a5/attachment.txt>
-------------- next part --------------

On Jun 4, 2014, at 12:34 PM, Trevor Perrin <trevp at trevp.net> wrote:

> On Wed, Jun 4, 2014 at 6:00 AM, Tom Ritter <tom at ritter.vg> wrote:
>> I ran across the PGP Word List, which was specifically engineered for
>> communication over a voice channel.  It seems this would be a better choice
>> than using a random list of English words or the Diceware list.
>> 
>> It would make the fingerprint significantly longer however.  20 words,
>> instead of 8 to 10.
> 
> Why not 128-bit fingerprints?
> 
> 
>> Obviously there's a whole slew of tests that could be done comparing word
>> lists and English word fingerprint lengths...  But since we don't (I don't)
>> have time to come up with 65,000 English words that fit the PGP Word List's
>> criteria, I feel our options are an existing word list (shorter fingerprint)
>> or the PGP Word List (longer).
> 
> Below are examples of 128 bits encoded as (PGP words, Basic English,
> Mnemonicode, Diceware, and "random words" from a large dictionary).
> Code at [1].
> 
> My sense is there's a sweet spot in the 10-12 bit dictionary range.
> Smaller dictionaries produce awkwardly-long fingerprints (PGP), and
> larger dictionaries get into weird words.
> 
> So I'd go with Basic English or Mnemonicode.  If you're comparing
> against Michael Rogers' poems, I think that uses Basic English, so
> perhaps it makes sense to use that to reduce the number of variables.
> 
> 
> PGP (8 bits/word, 16 words)
> 
> concert - sociable - shamrock - reproduce - repay - outfielder -
> cranky - Norwegian - tactics - unravel - reward - decadence - orca -
> antenna - sweatband - consensus
> 
> transit - direction - dreadful - aggregate - eating - typewriter -
> prefer - positive - showgirl - article - endow - molecule - drifter -
> inception - scorecard - warranty
> 
> spigot - processor - breadline - fortitude - acme - pyramid - crumpled
> - fascinate - beeswax - underfoot - sweatband - unify - assume -
> maritime - music - potato
> 
> nightbird - amulet - klaxon - crossover - briefcase - article -
> keyboard - determine - gazelle - penetrate - transit - Orlando -
> regain - Chicago - shadow - underfoot
> 
> stapler - Pacific - topmost - chambermaid - absurd - enterprise -
> flatfoot - megaton - tunnel - belowground - payday - Jupiter -
> spheroid - gadgetry - befriend - Eskimo
> 
> 
> BASIC ENGLISH (9.7 bits/word, 13 words)
> 
> regular - disease - see - complex - ink - but - top - shake - eye -
> other - forward - strong - payment
> 
> goat - flat - walk - country - cruel - white - manager - food - voice
> - peace - farm - till - comb
> 
> front - insect - need - play - authority - rod - letter - school -
> slip - winter - parcel - wing - roll
> 
> father - opinion - fruit - off - he - cheap - mine - common - month -
> of - record - name - humor
> 
> look - comb - net - young - thunder - rat - machine - slow - degree -
> station - stocking - knot - wind
> 
> 
> MNEMONICODE (~10.7 bits/word, 12 words)
> 
> cuba - edgar - method - fortune - amazon - aspirin - demo - winter -
> shave - donor - russian - mineral
> 
> octopus - toga - social - declare - special - micro - gemini - indigo
> - artist - justin - gibson - fidel
> 
> chicago - cave - alice - scorpio - totem - dynamic - degree - manual -
> bombay - forbid - jessica - ceramic
> 
> pioneer - dynasty - organic - plume - people - zipper - caramel -
> english - flood - scuba - ethnic - animal
> 
> coral - shelf - project - arthur - poem - news - saturn - axiom -
> trumpet - yoga - mission - portal
> 
> 
> DICEWARE (~13 bits/word, 10 words)
> 
> ns - mill - tacoma - penna - slim - mung - 8000 - hu - camber - olav
> 
> puerto - qualm - crag - ripley - bogey - skid - nee - pad - cahill - crowd
> 
> sachs - ate - anglo - fugue - arab - denny - might - atlas - cpa - ks
> 
> blip - bulge - jill - joule - scion - sk - squad - alma - micky - alan
> 
> dizzy - oslo - sf - incest - goofy - hater - y's - chopin - getty - tung
> 
> exult - wish - skin - envoy - jenny - basin - apron - gulp - roomy - veer
> 
> 
> RANDOM WORD GENERATOR AT WORDREFERENCE.COM (~16.5 bits/word, 8 words) [2]
> 
> cowhage - ekasilicon - democratist - clum - dyslexia - farfetched -
> furrier - mangosteen
> 
> matric - beadsman - enterlace - oarswoman - secretitious - incisor -
> danite - linstock
> 
> potash - intersert - possum - verbarfunambulo - additionally -
> enterotome - turrethead - telegrammic - clupeid
> 
> 
> Trevor
> 
> 
> [1] https://github.com/trevp/keyname
> [2] http://www.wordreference.com/random/definition


