[messaging] PIR (in Pynchon Gate)

Brian Warner warner at lothar.com
Sun Mar 23 23:48:40 PDT 2014


On 3/23/14 4:59 PM, Trevor Perrin wrote:

> My vague understanding of PIR is that "single-server" schemes are less
> practical than just sending the whole database, but there are
> "multi-server" schemes which are somewhat-efficient and secure as long
> as all servers don't collude. (Is that right? Could anyone explain PIR
> in a separate thread?)

I can explain the multi-server PIR scheme that the late Len Sassaman
created for Pynchon Gate[1] (his anonymous-remailer mailbox scheme).
This was designed to handle the "last mile" of a message addressed to a
specific pseudonym, giving recipients a way to retrieve their messages
from a mailbox without revealing to the server which messages they were
interested in.

Email lands on a "collator" box, which manages a large ring of slots and
hashes H(pseudonym+date) to decide into which slot the message should be
placed. If that slot is full, the message goes into the following one,
as with linear probing in a hash table. At the end of each day, the
collator has a large dataset, containing both full slots and empty ones.
If there are a million slots, each of which contains a 1kB message, the
dataset will be 1GB in size.
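
In rough Python, the placement logic looks something like this (the
function names are invented, SHA-256 stands in for H, and the padding
details are my own; the paper's exact choices may differ):

    import hashlib

    NUM_SLOTS = 1_000_000   # a million slots
    SLOT_SIZE = 1024        # 1kB each -> 1GB daily dataset

    def slot_index(nym: bytes, date: bytes) -> int:
        digest = hashlib.sha256(nym + date).digest()
        return int.from_bytes(digest[:8], "big") % NUM_SLOTS

    def place_message(slots, nym, date, msg):
        # Linear probing: if the hashed slot is taken, try the next.
        i = slot_index(nym, date)
        while slots[i] is not None:
            i = (i + 1) % NUM_SLOTS
        slots[i] = msg.ljust(SLOT_SIZE, b"\x00")  # pad to slot size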

The collator then sends this dataset out to a moderate number of
"distributor" hosts. It's sending the same thing to everybody, so
BitTorrent is an ideal distribution mechanism. Each day, a new dataset
is distributed.

A recipient figures out which slots they're interested in for a given
day's dataset (both H(nym+date) and a handful following it), then for
each slot, they compute some number of million-bit vectors, one for each
distributor they're going to use. (You must use at least two;
unlinkability holds as long as at least one of the distributors you
use does not collude with the rest.) All but the last vector are
randomly generated. The final vector is created by XORing all the
other vectors together and then flipping the bit at the index of the
one slot you're interested in. The recipient sends one vector to each
distributor.
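
A sketch of the query construction, reusing NUM_SLOTS from above (the
bit ordering within bytes and the randomness source are my own
choices, not the paper's):

    import secrets

    def make_query_vectors(target_slot, num_distributors,
                           num_slots=NUM_SLOTS):
        # One bit vector per distributor; the XOR of all of them is
        # zero everywhere except a single 1 at target_slot.
        nbytes = (num_slots + 7) // 8
        vectors = [bytearray(secrets.token_bytes(nbytes))
                   for _ in range(num_distributors - 1)]
        last = bytearray(nbytes)
        for v in vectors:
            for i in range(nbytes):
                last[i] ^= v[i]
        last[target_slot // 8] ^= 1 << (target_slot % 8)  # flip target
        vectors.append(last)
        return vectors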

The distributor computes the XOR of all slots that have a 1-bit in the
vector, and returns the result (equal in size to a single slot, e.g.
1kB). The recipient XORs together all of the results to recover just
the desired slot's contents: every other slot is selected by an even
number of vectors, so its contributions cancel to zero, while the
target slot is selected an odd number of times and survives.
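
Continuing the sketch (reusing SLOT_SIZE from above), the distributor
side and the final recombination look roughly like:

    def distributor_response(slots, vector):
        # XOR together every slot whose bit in the query vector is 1;
        # the result is the size of a single slot.
        acc = bytearray(SLOT_SIZE)
        for idx, slot in enumerate(slots):
            if (vector[idx // 8] >> (idx % 8)) & 1:
                for i in range(SLOT_SIZE):
                    acc[i] ^= slot[i]
        return bytes(acc)

    def recover_slot(responses):
        # XOR the per-distributor results; all non-target slots cancel.
        out = bytearray(SLOT_SIZE)
        for r in responses:
            for i in range(SLOT_SIZE):
                out[i] ^= r[i]
        return bytes(out)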

If you don't need information-theoretic security, you can send PRNG
seeds for all but one of the vectors, saving a lot of upstream
bandwidth. With three distributors, the recipient's upstream bandwidth
is roughly 1Mbit per queried slot per day (the one full vector; the
seeds are negligible), and the downstream bandwidth is SLOTSIZE per
queried slot from each distributor. But the distributors are each
pushing 1GB up and down every day.
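
A sketch of the seeded variant (SHA-256 in counter mode stands in for
the PRNG here; a real system would use a proper stream cipher, and the
16-byte seed length is my own assumption):

    import hashlib, secrets

    def expand_seed(seed, nbytes):
        # Stand-in stream expander: SHA-256 in counter mode.
        out = bytearray()
        ctr = 0
        while len(out) < nbytes:
            out += hashlib.sha256(seed + ctr.to_bytes(8, "big")).digest()
            ctr += 1
        return bytearray(out[:nbytes])

    def make_seeded_query(target_slot, num_distributors,
                          num_slots=NUM_SLOTS):
        nbytes = (num_slots + 7) // 8
        seeds = [secrets.token_bytes(16)
                 for _ in range(num_distributors - 1)]
        last = bytearray(nbytes)
        for s in seeds:
            v = expand_seed(s, nbytes)
            for i in range(nbytes):
                last[i] ^= v[i]
        last[target_slot // 8] ^= 1 << (target_slot % 8)
        # The short seeds go to n-1 distributors; only `last` is sent
        # in full (num_slots bits, ~1Mbit for a million slots).
        return seeds, last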

For I/O efficiency, the distributors should behave like clock dials,
continuously sweeping through the slots and accumulating responses for
any pending requests, to avoid duplicate reads.
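
In sketch form, one pass over the dataset services every outstanding
query (the pending-request bookkeeping here is invented for
illustration):

    def sweep(slots, pending):
        # pending maps query_id -> (vector, accumulator). Each slot is
        # read from disk exactly once per pass, no matter how many
        # queries are waiting.
        for idx, slot in enumerate(slots):
            for vector, acc in pending.values():
                if (vector[idx // 8] >> (idx % 8)) & 1:
                    for i in range(SLOT_SIZE):
                        acc[i] ^= slot[i]
        return {qid: bytes(acc)
                for qid, (vector, acc) in pending.items()}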

There were also a bunch of fiddly bits involving how to scale the
slot/ring/window sizes, how to deal with overload (recipients could send
anonymous messages back to the collator to release the next batch of
messages), how to detect byzantine distributors (and complain about them
safely), stuff like that. And clearly there are a lot of tradeoffs
between ring size, message size, time window, latency, dataset size,
total system bandwidth, number-of-distributors safety margins, etc.

cheers,
 -Brian

[1]: http://www.freehaven.net/anonbib/cache/sassaman:wpes2005.pdf

