[messaging] Modern anti-spam and E2E crypto

Tue Sep 9 21:22:25 PDT 2014

In my mind, the natural thing to pair with E2E encrypted email is
searchable symmetric encryption (SSE) because it offers some trivial
solutions to the above problems (in addition to... search).

Basically, the client generates an encrypted version of an inverted index
that allows the server, given a trapdoor for a keyword X, to learn which
files contain X and no more (including the plaintext value of the keyword
or word distribution).  You can then build on top of that more complex
predicates, like "find files that contain X or Y" and some schemes use
order preserving symmetric encryption (OPSE) to enable ranked results
(RSSE).

When a user checks their mail, the client fetches new messages and tells
the server how to update the index.  The operations involved are largely
symmetric (hence fast) and the information sent back to the server is
typically small--it depends on the particular construction.

ANYWAYS

A solution that was talked a lot about before was a reputation based on
source address.  E2E encryption itself doesn't really interfere with that.

In the previous post, it was mentioned that people who email a user can be
divided into the sets {1} (personal friends), {2} (associates), and {3}
(strangers).  Assigning an email to set {1} can be as simple as getting
trapdoors for the addresses a user frequently contacts.  Set {2} might be
emails that failed to go into {1}, but contain the user's name or diction
related to the user's job or interests.  Set {3} would be what remains.

However, a generalization of the above is Naive Bayes spam filtering,
similar to the 'feature extraction' mentioned:  Given several trapdoors for
positive and negative keywords and an indexed ciphertext (preferably with
ranks), the server can calculate the probability that a message is spam
with some formula by inputing an estimate of the total number of tokens
given by the ciphertext's size and what proportion of that is
positive/negative keywords.

That's a static system.  It can be made to learn by giving trapdoors for
the top n keywords--positive or negative by "not spam" or "spam"
respectively.

At this point I should note a few things:
1.)  There's no advantage to lying in this system.  You'll only mess up
your own account.
2.)  The server hasn't actually learned anything about the user through all
of this.  A 'trapdoor' reveals only what files contain a keyword--not the
keyword or statistical information about the file.
3.)  The user is largely unaware of the specifics of how the server
implements spam filtering, other than as a function of certain trapdoors.
In the case where a server might request a trapdoor from the user, creative
use of blinding can hide even that.

To allow information to propagate across accounts, the above information
can be considered along with the reputation of a sender's IP/source
address, computed by through "spam" and "not spam" reports and whether or
not other peoples' filters have consistently marked a user/source as spam.

Of course, it's practically impossible to preserve all data utility, so
plaintext samples with the user's consent are plausible ways to fine tune
functions like those used in Naive Bayes, but a random sample with
negligible non-response bias would be significantly smaller than *all*
available data.

At the moment, I'm not aware of any *major* short-comings in the above.  It
systematically requires more server disk space, but that's usually
considered the cheapest part of a system to scale.  Client CPU and
bandwidth usage is slightly higher, but in most (R)SSE constructions
there's no significant increase in required computational power for the
server.

(Apologies if I've ballsed up somehow.  I'm new to mailing lists.  >.>)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://moderncrypto.org/mail-archive/messaging/attachments/20140910/47e38889/attachment.html>