[messaging] Modern anti-spam and E2E crypto

Fri Sep 5 08:07:30 PDT 2014

Hey,

Trevor asked me to write up some thoughts on how spam filtering and fully
end to end crypto would interact, so it's all available in one message
instead of scattered over other threads. Specifically he asked for brain
dumps on:

   - how does antispam currently work at large email providers
   - how would widespread E2E crypto affect this
   - what are the options for moving things to the client (and pros, cons)
   - is this feasible for email?
   - How do things change when moving from email to other sorts of
async messaging
   (e.g. text messaging) or new protocols - i.e. are there unique aspects
   of existing email protocols, or are these general problems?

Brief note about my background, to establish credentials:  I worked at
Google for about 7.5 years. For about 4.5 of those I worked on the Gmail
abuse team, which is very tightly linked with the spam team (they use the
same software, share the same on-call rotations etc).

Starting around mid-2010 we had put sufficient pressure on spammers that
they were unable to make money using their older techniques, and some of
them switched to performing industrial-scale hacking of accounts using
compromised passwords (and then sending spam to the account's contacts), so
I became tech lead of a new anti-hijacking team. We spent about 2.5 years
beating the hijackers. In early 2013 we declared victory
<http://googleblog.blogspot.ch/2013/02/an-update-on-our-war-against-account.html>
and
a few months later, Edward Snowden revealed that the NSA/GCHQ was tapping
the security system we had designed
<http://www.theguardian.com/technology/2013/nov/06/google-nsa-gchq-spying-judicial-process>
.

Since then things seem to be pretty quiet. It's not implausible to say that
from Gmail's perspective the spam war has been won .... for now, at least.

In case you prefer videos to reading a few years ago I gave a talk at the
RIPE64 conference in Ljubljana:

   https://ripe64.ripe.net/archives/video/25/

In January I left Google to focus on Bitcoin full time. My current project
is a p2p crowdfunding app I want to use as a way to fund development of
decentralised infrastructure.

OK, here we go.

*A brief history of the spam war*

In the beginning ... there was the regex. Gmail does support regex
filtering but only as a last resort. It's easy to make mistakes, like the
time we accidentally blackholed email for an unfortunate Italian woman
named "Oli*via Gra*dina". Plus this technique does not internationalise,
and randomising text to miss the blacklists is easy.

The email community began sharing abusive IPs. Spamhaus was born. This
approach worked better because it involved burning something that the
spammer had to pay money to obtain. But it caused huge fights because the
blacklist operators became judge, jury and executioner over people's mail
streams. What spam actually is turned out to be a contentious issue. Many
bulk mailers didn't think they were spamming, but in the absence of a clear
definition sometimes blacklisters disagreed.

Botnets appeared as a way to get around RBLs, and in response spam fighters
mapped out the internet to create a "policy block list" - ranges of IPs
that were assigned to residential connections and thus should not be
sending any email at all. Botnets generate enormous amounts of spam by
volume, but it's also the easiest spam to filter. Very little of my time on
the Gmail spam/abuse team was spent thinking about botnets.

Webmail services like Gmail came on the scene. The very first release of
Gmail simply used spamassassin on the backend, but this was quickly deemed
not good enough and a custom filter was built. The architect of the Gmail
filter wrote a paper in 2006 which you can find here:

   http://ceas.cc/2006/19.pdf

I'll summarise it. The primary technique the new filter used was attempting
to heuristically guess the sending domain for email (domains being harder
to obtain and more stable than IPs), and then calculating *reputations* over
them. A reputation is a score between 0-100 where 100 is perfectly good and
0 means always spam. For example if a sender had a reputation of 70 that
means about 30% of the time we think their mail is spam and the rest of the
time it's legit. Reputations are moving averages that are calculated based
on a careful blend of manual feedbacks from the Report Spam/Not Spam
buttons and "auto feedbacks" generated by the spam filter itself.
Obviously, manual feedbacks have a lot more weight in the system and that
allows the filter to self correct.

This approach has another advantage - it eliminates all the political
fighting. The new definition of spam is "whatever our users say spam is", a
definition that cannot be argued with and is simultaneously crisp enough to
implement, yet vague enough to adapt to whatever spammers come up with.

It's worth noting a few things here:

   - Reputation systems require the ability to read *all* email. It's not
   good enough to be able to see only spam, because otherwise the reputations
   have no way to self correct. The flow of "not spam" reports is just as
   important as the flow of spam reports. Most not spam reports are generated
   implicitly of course, by the act of not marking the message at all.

   - You need to calculate reputations *fast*. If you receive mail with
   unknown reputations, you have no choice but to let it pass as otherwise you
   can't figure out if it's spam or not. That in turn incentivises spammers to
   try and outrun the learning system. The first version of the reputation
   system used MapReduce and calculated reputations in batch, so convergence
   took hours. Eventually it had to be replaced with an online system that
   recalculates scores on the fly. This system is a tremendously impressive
   piece of engineering - it's basically a global, real time peer to peer
   learning system. There are no masters. The filter is distributed throughout
   the world and can tolerate the loss of multiple datacenters.

   I don't want to think about how you'd build one of these outside a
   highly controlled environment, it was enough of a headache even in the
   proprietary/centralised setting ....

   - Reputations propagate between each other. If we know a link is bad and
   it appears in mail from an IP with unknown reputation, then that IP gets a
   bad reputation too and vice versa. It turns out that this is important - as
   the number of things upon which reputations are calculated goes up, it
   becomes harder and harder for spammers to rotate all of them
   simultaneously. Especially this is true if using a botnet where precise
   control over the sending machines is hard. If a spammer fails to randomize
   even one tiny aspect of their mail at the same time as the others, all
   their links and IPs get automatically burned and they lose money.

   - Reputation contains an inherent problem. You need lots of users, which
   implies accounts must be free. If accounts are free then spammers can sign
   up for accounts and mark their own email as not spam, effectively doing a
   sybil attack on the system. This is not a theoretical problem.

The reputation system was generalised to calculate reputations over
*features* of messages beyond just sending domain. A message feature can
be, for example, a list of the domains found in clickable hyperlinks. Links
would turn out to be a critical battleground that would be extensively
fought over in the years ahead. The reason is obvious: spammers want to
sell something. Therefore they must get users to their shop. No matter how
they phrase their offer, the URL to the destination must work. The fight
went like this:

   1. They start with clear clickable links in HTML emails. Filters start
   blocking any email with those links.

   2. They start obfuscating the links, and requesting users put the link
   back together. But this works poorly because many users either can't or
   won't figure it out, so profits fall.

   3. They start buying and creating randomised domains in bulk. TLDs like
   .com are expensive but others are cheap or free and the reputations of the
   entire TLDs went into freefall (like .cc)

   4. Spammers run out of abusable TLDs as registrars begin to crack down.
   They begin performing *reputation hijacking*, e.g. by creating blogs on
   sites which allow you to register *.blogspot.com,  *.livejournal.com and
   so on. URL shorteners become a spammers best friend. Literally every URL
   shortener immediately becomes a war zone as the operators and spammers
   fight to defend and attack the URL domain reputations.

   5. Spammers also start hacking websites but this doesn't work that well,
   because many websites don't often appear in legitimate mail often so they
   don't have strong reputations. Great source of passwords though.

   6. Big content hosting sites like Google begin connecting their spam
   filters to their hosting engines so once the reputation of a user-generated
   URL falls it's automatically terminated. The first iterations of this are
   too slow. One of my projects at Google was to build a real-time system to
   do this automatic content takedown.

Obtaining fresh sending IP addresses was a problem for them too of course.
The best fix was to use webmail services as anonymizing proxies. Gmail was
hit especially hard by this because early on Paul Buchheit (the creator)
decided not to include the client IP address in email headers. This was
either a win for user privacy or a blatant violation of the RFCs, depending
on who you asked. It also turned Gmail into the worlds biggest anonymous
remailer - a real asset for spammers that let them sail right past most
filters which couldn't block messages from a sender as large as Google.

Between about 2006 (open signups) and 2010 a lot of the anti-spam work
involved building a spam filter for account signups. We did a pretty good
job, even though I say so myself. You can see the prices of different kinds
of "free" webmail accounts at http://buyaccs.com (a Russian account shop).
Note that hotmail/outlook.com accounts cost $10 per thousand and gmails
cost an order of magnitude more. When we started gmails were about $25 per
1000 so we were able to quadruple the price. Going higher than that is hard
because all big websites use phone verification to handle false positives
and at these price levels it becomes profitable to just buy lots of SIM
cards and burn phone numbers.

There's a significant amount of magic involved in preventing bulk signups.
As an example, I created a system that randomly generates encrypted
JavaScripts that are designed to resist reverse engineering attempts. These
programs know how to detect automated signup scripts and entirely wiped
them out
<http://webcache.googleusercontent.com/search?q=cache:v6Iza2JzJCwJ:www.hackforums.net/archive/index.php/thread-2198360.html+&cd=8&hl=en&ct=clnk&gl=ch>
.

*How would widespread E2E crypto affect all this*

You can see several themes in the above story:

   - Large volumes of data is really important, of both legit and spam
   messages.
   - Extremely high speed is important. A lot of spam fights boil down to a
   game of who is faster. If your reputations converge in 3 minutes then
   you're going to be outrun.
   - Being able to police your user base is important. You can't establish
   reputations if you can't trust your user reports and that means creating a
   theoretically impossible situation:   accounts that are free yet also cost
   money   (if you need lots of them)

The first problem we have in the E2E context is that reputation databases
require input from *all* mail. We can imagine an email client that knows
how to decrypt a message, performs feature extraction and then uploads a
"good mail" or "bad mail" report to some <handwave> central facility. But
then that central facility is going to learn not only who you are talking
with but also what links are in the mail. That's probably quite valuable
information to have. As you add features this problem gets worse.

The second problem we have is that if the central reputation aggregator
can't read your mails, it doesn't know if you did feature extraction
honestly. This is not a problem in the unencrypted context because the spam
filter extracts features itself. Whilst spammers can try to game the
system, they still have to actually send their spams to themselves for
real, and this imposes a cost. In a world where spam filters cannot read
the message, spammers can just submit entirely fictional "good mail"
reports. Worse, competitors could interfere with each others mail streams
by submitting false reports. We see this sort of thing with AdWords.

The third problem is that spam filters rely quite heavily on security
through obscurity, because it works well. Though some features are well
known (sending IP, links) there are many others, and those are secret. If
calculation was pushed to the client then spammers could see exactly what
they had to randomise and the cross-propagation of reputations wouldn't
work as well.

It might be possible to resolve the above two problems using trusted
computing. With TC you can run encrypted software on private data and the
hardware will "prove" what it ran to a remote server. But security through
obscurity and end to end crypto are hard to mix - if you run your email
content through a black box, that black box could potentially steal the
contents. You have to trust the entity calculating the secret sauce with
your message, and then you could just use Gmail in the regular way as today.

The fourth problem we have is that anonymous usage and spam filters don't
really mix. Ultimately there's no replacement for cutting spam off at the
source. Account termination is a fundamental spam fighting tool.  All major
webmail and social services force users to perform phone verification if
they trip an abuse filter. This sends a random code via SMS or voice call
to a phone number and verifies the user can receive it. It works because
phone numbers are a resource that have a cost associated with them, yet
~all users have one. But in many countries it's illegal to have anonymous
mobile numbers and operators are forced to do ID verification before
handing out a SIM card. The fact that you can be "name checked" at any
moment with plausible deniability means that whilst you don't have to
provide any personal data to get a webmail account, a government could
force you to reveal your location and/or identity at any time. They don't
even have to do anything special; if they can phish your password they can
forcibly trip the abuse filter, wait for the user to pass phone
verification, then get a warrant for the users account metadata knowing
that it now contains what they need   (I never saw any evidence of this,
but it's theoretically possible).

The final problem we have is that spam filtering is resource intensive CPU
and disk wise. Many, many users now access their email *exclusively* via a
smartphone. Smartphones do not have many resources and the more work you
do, the worse the battery life. Simply waking up the radio to download a
message uses battery. Attempting to do even obsolete 1990's style spam
filtering of all mail received with a phone would probably be a non starter
unless there's some fundamental breakthrough in battery technology.

In conclusion, I don't see a return to pure client side filtering being
feasible.

*How do things change when moving from email to other sorts of
async messaging ?*

Well. SMS spam is a thing. It doesn't happen much because phone companies
act as spam filters. Also, because governments tend to get involved with
the punishment of SMS spammers, in order to discourage copycat offenders
and send a message (pun totally intended). Email spam blew up way before
governments could react to it, so it's interesting to see the different
paths these systems have taken.

Systems like WhatsApp don't seem to suffer spam, but I presume that's just
an indication that their spam/abuse team is doing a good job. They are in
the easiest position. When you have central control everything becomes a
million times easier because you can change anything at any time. You can
terminate accounts and control signups. If you don't have central control,
you have to rely exclusively on inbound filtering and have to just suck it
up when spammers try to find ways around your defences. Plus you often lose
control over the clients.

*General thoughts and conclusions*

When you look at what it's taken to win the spam war with cleartext, it's
been a pretty incredible effort stretched over many years. "War" is a good
analogy: there were two opposing sides and many interesting battles,
skirmishes tactics and weapons. I could tell stories all day but this email
is already way too long.

Trying to refight that in the encrypted context would be like trying to
fight a regular war blindfolded and handcuffed. You'd be dead within
minutes.

So I think we need totally new approaches. The first idea people have is to
make sending email cost money, but that sucks for several reasons; most
obviously - free global communication is IMHO one of humanities greatest
achievements, right up there with putting a man on the moon. Someone from
rural China can send me a message within seconds, for free, and I can
reply, for free! Think about that for a second.

The other reason it sucks is that it confuses bulk mail with spam. This is
a very common confusion. Lots of companies send vast amounts of mail that
users want to receive. Think Facebook, for example. If every mail cost
money, some legit and useful businesses wouldn't work, let alone things
like mailing lists.

A possibly better approach is to use money to create deposits. There is a
protocol that allows bitcoins to be sacrificed to miners fees, letting you
prove that you threw money away by signing challenges with the keys that
did so. This would allow very precise establishment of an anonymous yet
costly credential that can then send as much mail as it wants, and have
reputations calculated over it. Spam/not spam reports that *only* contain
proof of sending could then be scatter/gathered and used to calculate a
reputation, or if there is none, then such mails could be throttled until a
few volunteers have peeked inside. Another approach would be to allow
cross-signing - an entity with good reputation can temporarily countersign
mail to give it a reputational boost and trigger cross-propagation of
reputations. That entity could employ whatever techniques they liked to
verify the senders legitimacy.

It's for these reasons that I'm interested in the overlap between Bitcoin
and E2E messaging. It seems to me they are fundamentally linked.

Final thought. I'm somewhat notorious in the Bitcoin community for making
radical suggestions, like maybe there exists a tradeoff between privacy and
abuse. Lots of people in the crypto community passionately hate this idea
and (unfortunately) anyone who makes it. I guess you can see based on the
above stories why I think this way though. It's not clear to me that
chasing perfect privacy whilst ignoring abuse is the right path for any
system that wishes to achieve mainstream success.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://moderncrypto.org/mail-archive/messaging/attachments/20140905/e09e4700/attachment.html>