[messaging] Modern anti-spam and E2E crypto

Fri Sep 5 09:27:12 PDT 2014

Howdy Mike,

Thanks for the informative write-up!

In your conclusion you mention using bitcoin either as transactional
or initial buy-in payments, but you say it conflicts with your
philosophy of free communication. Would a basic proof-of-work header
(i.e. X-Hashcash[1]) be compatible with your free communication
philosophy?

Daniel

[1] - http://www.hashcash.org/

On Fri, Sep 5, 2014 at 8:07 AM, Mike Hearn <mike at plan99.net> wrote:
> Hey,
>
> Trevor asked me to write up some thoughts on how spam filtering and fully
> end to end crypto would interact, so it's all available in one message
> instead of scattered over other threads. Specifically he asked for brain
> dumps on:
>
> how does antispam currently work at large email providers
> how would widespread E2E crypto affect this
> what are the options for moving things to the client (and pros, cons)
> is this feasible for email?
> How do things change when moving from email to other sorts of async
> messaging (e.g. text messaging) or new protocols - i.e. are there unique
> aspects of existing email protocols, or are these general problems?
>
> Brief note about my background, to establish credentials:  I worked at
> Google for about 7.5 years. For about 4.5 of those I worked on the Gmail
> abuse team, which is very tightly linked with the spam team (they use the
> same software, share the same on-call rotations etc).
>
> Starting around mid-2010 we had put sufficient pressure on spammers that
> they were unable to make money using their older techniques, and some of
> them switched to performing industrial-scale hacking of accounts using
> compromised passwords (and then sending spam to the account's contacts), so
> I became tech lead of a new anti-hijacking team. We spent about 2.5 years
> beating the hijackers. In early 2013 we declared victory and a few months
> later, Edward Snowden revealed that the NSA/GCHQ was tapping the security
> system we had designed.
>
> Since then things seem to be pretty quiet. It's not implausible to say that
> from Gmail's perspective the spam war has been won .... for now, at least.
>
> In case you prefer videos to reading a few years ago I gave a talk at the
> RIPE64 conference in Ljubljana:
>
>    https://ripe64.ripe.net/archives/video/25/
>
> In January I left Google to focus on Bitcoin full time. My current project
> is a p2p crowdfunding app I want to use as a way to fund development of
> decentralised infrastructure.
>
> OK, here we go.
>
> A brief history of the spam war
>
> In the beginning ... there was the regex. Gmail does support regex filtering
> but only as a last resort. It's easy to make mistakes, like the time we
> accidentally blackholed email for an unfortunate Italian woman named "Olivia
> Gradina". Plus this technique does not internationalise, and randomising
> text to miss the blacklists is easy.
>
> The email community began sharing abusive IPs. Spamhaus was born. This
> approach worked better because it involved burning something that the
> spammer had to pay money to obtain. But it caused huge fights because the
> blacklist operators became judge, jury and executioner over people's mail
> streams. What spam actually is turned out to be a contentious issue. Many
> bulk mailers didn't think they were spamming, but in the absence of a clear
> definition sometimes blacklisters disagreed.
>
> Botnets appeared as a way to get around RBLs, and in response spam fighters
> mapped out the internet to create a "policy block list" - ranges of IPs that
> were assigned to residential connections and thus should not be sending any
> email at all. Botnets generate enormous amounts of spam by volume, but it's
> also the easiest spam to filter. Very little of my time on the Gmail
> spam/abuse team was spent thinking about botnets.
>
> Webmail services like Gmail came on the scene. The very first release of
> Gmail simply used spamassassin on the backend, but this was quickly deemed
> not good enough and a custom filter was built. The architect of the Gmail
> filter wrote a paper in 2006 which you can find here:
>
>    http://ceas.cc/2006/19.pdf
>
> I'll summarise it. The primary technique the new filter used was attempting
> to heuristically guess the sending domain for email (domains being harder to
> obtain and more stable than IPs), and then calculating reputations over
> them. A reputation is a score between 0-100 where 100 is perfectly good and
> 0 means always spam. For example if a sender had a reputation of 70 that
> means about 30% of the time we think their mail is spam and the rest of the
> time it's legit. Reputations are moving averages that are calculated based
> on a careful blend of manual feedbacks from the Report Spam/Not Spam buttons
> and "auto feedbacks" generated by the spam filter itself. Obviously, manual
> feedbacks have a lot more weight in the system and that allows the filter to
> self correct.
>
> This approach has another advantage - it eliminates all the political
> fighting. The new definition of spam is "whatever our users say spam is", a
> definition that cannot be argued with and is simultaneously crisp enough to
> implement, yet vague enough to adapt to whatever spammers come up with.
>
> It's worth noting a few things here:
>
> Reputation systems require the ability to read all email. It's not good
> enough to be able to see only spam, because otherwise the reputations have
> no way to self correct. The flow of "not spam" reports is just as important
> as the flow of spam reports. Most not spam reports are generated implicitly
> of course, by the act of not marking the message at all.
>
> You need to calculate reputations fast. If you receive mail with unknown
> reputations, you have no choice but to let it pass as otherwise you can't
> figure out if it's spam or not. That in turn incentivises spammers to try
> and outrun the learning system. The first version of the reputation system
> used MapReduce and calculated reputations in batch, so convergence took
> hours. Eventually it had to be replaced with an online system that
> recalculates scores on the fly. This system is a tremendously impressive
> piece of engineering - it's basically a global, real time peer to peer
> learning system. There are no masters. The filter is distributed throughout
> the world and can tolerate the loss of multiple datacenters.
>
> I don't want to think about how you'd build one of these outside a highly
> controlled environment, it was enough of a headache even in the
> proprietary/centralised setting ....
>
> Reputations propagate between each other. If we know a link is bad and it
> appears in mail from an IP with unknown reputation, then that IP gets a bad
> reputation too and vice versa. It turns out that this is important - as the
> number of things upon which reputations are calculated goes up, it becomes
> harder and harder for spammers to rotate all of them simultaneously.
> Especially this is true if using a botnet where precise control over the
> sending machines is hard. If a spammer fails to randomize even one tiny
> aspect of their mail at the same time as the others, all their links and IPs
> get automatically burned and they lose money.
>
> Reputation contains an inherent problem. You need lots of users, which
> implies accounts must be free. If accounts are free then spammers can sign
> up for accounts and mark their own email as not spam, effectively doing a
> sybil attack on the system. This is not a theoretical problem.
>
> The reputation system was generalised to calculate reputations over features
> of messages beyond just sending domain. A message feature can be, for
> example, a list of the domains found in clickable hyperlinks. Links would
> turn out to be a critical battleground that would be extensively fought over
> in the years ahead. The reason is obvious: spammers want to sell something.
> Therefore they must get users to their shop. No matter how they phrase their
> offer, the URL to the destination must work. The fight went like this:
>
> They start with clear clickable links in HTML emails. Filters start blocking
> any email with those links.
>
> They start obfuscating the links, and requesting users put the link back
> together. But this works poorly because many users either can't or won't
> figure it out, so profits fall.
>
> They start buying and creating randomised domains in bulk. TLDs like .com
> are expensive but others are cheap or free and the reputations of the entire
> TLDs went into freefall (like .cc)
>
> Spammers run out of abusable TLDs as registrars begin to crack down. They
> begin performing reputation hijacking, e.g. by creating blogs on sites which
> allow you to register *.blogspot.com,  *.livejournal.com and so on. URL
> shorteners become a spammers best friend. Literally every URL shortener
> immediately becomes a war zone as the operators and spammers fight to defend
> and attack the URL domain reputations.
>
> Spammers also start hacking websites but this doesn't work that well,
> because many websites don't often appear in legitimate mail often so they
> don't have strong reputations. Great source of passwords though.
>
> Big content hosting sites like Google begin connecting their spam filters to
> their hosting engines so once the reputation of a user-generated URL falls
> it's automatically terminated. The first iterations of this are too slow.
> One of my projects at Google was to build a real-time system to do this
> automatic content takedown.
>
> Obtaining fresh sending IP addresses was a problem for them too of course.
> The best fix was to use webmail services as anonymizing proxies. Gmail was
> hit especially hard by this because early on Paul Buchheit (the creator)
> decided not to include the client IP address in email headers. This was
> either a win for user privacy or a blatant violation of the RFCs, depending
> on who you asked. It also turned Gmail into the worlds biggest anonymous
> remailer - a real asset for spammers that let them sail right past most
> filters which couldn't block messages from a sender as large as Google.
>
> Between about 2006 (open signups) and 2010 a lot of the anti-spam work
> involved building a spam filter for account signups. We did a pretty good
> job, even though I say so myself. You can see the prices of different kinds
> of "free" webmail accounts at http://buyaccs.com (a Russian account shop).
> Note that hotmail/outlook.com accounts cost $10 per thousand and gmails cost
> an order of magnitude more. When we started gmails were about $25 per 1000
> so we were able to quadruple the price. Going higher than that is hard
> because all big websites use phone verification to handle false positives
> and at these price levels it becomes profitable to just buy lots of SIM
> cards and burn phone numbers.
>
> There's a significant amount of magic involved in preventing bulk signups.
> As an example, I created a system that randomly generates encrypted
> JavaScripts that are designed to resist reverse engineering attempts. These
> programs know how to detect automated signup scripts and entirely wiped them
> out.
>
> How would widespread E2E crypto affect all this
>
> You can see several themes in the above story:
>
> Large volumes of data is really important, of both legit and spam messages.
> Extremely high speed is important. A lot of spam fights boil down to a game
> of who is faster. If your reputations converge in 3 minutes then you're
> going to be outrun.
> Being able to police your user base is important. You can't establish
> reputations if you can't trust your user reports and that means creating a
> theoretically impossible situation:   accounts that are free yet also cost
> money   (if you need lots of them)
>
> The first problem we have in the E2E context is that reputation databases
> require input from all mail. We can imagine an email client that knows how
> to decrypt a message, performs feature extraction and then uploads a "good
> mail" or "bad mail" report to some <handwave> central facility. But then
> that central facility is going to learn not only who you are talking with
> but also what links are in the mail. That's probably quite valuable
> information to have. As you add features this problem gets worse.
>
> The second problem we have is that if the central reputation aggregator
> can't read your mails, it doesn't know if you did feature extraction
> honestly. This is not a problem in the unencrypted context because the spam
> filter extracts features itself. Whilst spammers can try to game the system,
> they still have to actually send their spams to themselves for real, and
> this imposes a cost. In a world where spam filters cannot read the message,
> spammers can just submit entirely fictional "good mail" reports. Worse,
> competitors could interfere with each others mail streams by submitting
> false reports. We see this sort of thing with AdWords.
>
> The third problem is that spam filters rely quite heavily on security
> through obscurity, because it works well. Though some features are well
> known (sending IP, links) there are many others, and those are secret. If
> calculation was pushed to the client then spammers could see exactly what
> they had to randomise and the cross-propagation of reputations wouldn't work
> as well.
>
> It might be possible to resolve the above two problems using trusted
> computing. With TC you can run encrypted software on private data and the
> hardware will "prove" what it ran to a remote server. But security through
> obscurity and end to end crypto are hard to mix - if you run your email
> content through a black box, that black box could potentially steal the
> contents. You have to trust the entity calculating the secret sauce with
> your message, and then you could just use Gmail in the regular way as today.
>
> The fourth problem we have is that anonymous usage and spam filters don't
> really mix. Ultimately there's no replacement for cutting spam off at the
> source. Account termination is a fundamental spam fighting tool.  All major
> webmail and social services force users to perform phone verification if
> they trip an abuse filter. This sends a random code via SMS or voice call to
> a phone number and verifies the user can receive it. It works because phone
> numbers are a resource that have a cost associated with them, yet ~all users
> have one. But in many countries it's illegal to have anonymous mobile
> numbers and operators are forced to do ID verification before handing out a
> SIM card. The fact that you can be "name checked" at any moment with
> plausible deniability means that whilst you don't have to provide any
> personal data to get a webmail account, a government could force you to
> reveal your location and/or identity at any time. They don't even have to do
> anything special; if they can phish your password they can forcibly trip the
> abuse filter, wait for the user to pass phone verification, then get a
> warrant for the users account metadata knowing that it now contains what
> they need   (I never saw any evidence of this, but it's theoretically
> possible).
>
> The final problem we have is that spam filtering is resource intensive CPU
> and disk wise. Many, many users now access their email exclusively via a
> smartphone. Smartphones do not have many resources and the more work you do,
> the worse the battery life. Simply waking up the radio to download a message
> uses battery. Attempting to do even obsolete 1990's style spam filtering of
> all mail received with a phone would probably be a non starter unless
> there's some fundamental breakthrough in battery technology.
>
> In conclusion, I don't see a return to pure client side filtering being
> feasible.
>
> How do things change when moving from email to other sorts of async
> messaging ?
>
> Well. SMS spam is a thing. It doesn't happen much because phone companies
> act as spam filters. Also, because governments tend to get involved with the
> punishment of SMS spammers, in order to discourage copycat offenders and
> send a message (pun totally intended). Email spam blew up way before
> governments could react to it, so it's interesting to see the different
> paths these systems have taken.
>
> Systems like WhatsApp don't seem to suffer spam, but I presume that's just
> an indication that their spam/abuse team is doing a good job. They are in
> the easiest position. When you have central control everything becomes a
> million times easier because you can change anything at any time. You can
> terminate accounts and control signups. If you don't have central control,
> you have to rely exclusively on inbound filtering and have to just suck it
> up when spammers try to find ways around your defences. Plus you often lose
> control over the clients.
>
>
> General thoughts and conclusions
>
> When you look at what it's taken to win the spam war with cleartext, it's
> been a pretty incredible effort stretched over many years. "War" is a good
> analogy: there were two opposing sides and many interesting battles,
> skirmishes tactics and weapons. I could tell stories all day but this email
> is already way too long.
>
> Trying to refight that in the encrypted context would be like trying to
> fight a regular war blindfolded and handcuffed. You'd be dead within
> minutes.
>
> So I think we need totally new approaches. The first idea people have is to
> make sending email cost money, but that sucks for several reasons; most
> obviously - free global communication is IMHO one of humanities greatest
> achievements, right up there with putting a man on the moon. Someone from
> rural China can send me a message within seconds, for free, and I can reply,
> for free! Think about that for a second.
>
> The other reason it sucks is that it confuses bulk mail with spam. This is a
> very common confusion. Lots of companies send vast amounts of mail that
> users want to receive. Think Facebook, for example. If every mail cost
> money, some legit and useful businesses wouldn't work, let alone things like
> mailing lists.
>
> A possibly better approach is to use money to create deposits. There is a
> protocol that allows bitcoins to be sacrificed to miners fees, letting you
> prove that you threw money away by signing challenges with the keys that did
> so. This would allow very precise establishment of an anonymous yet costly
> credential that can then send as much mail as it wants, and have reputations
> calculated over it. Spam/not spam reports that only contain proof of sending
> could then be scatter/gathered and used to calculate a reputation, or if
> there is none, then such mails could be throttled until a few volunteers
> have peeked inside. Another approach would be to allow cross-signing - an
> entity with good reputation can temporarily countersign mail to give it a
> reputational boost and trigger cross-propagation of reputations. That entity
> could employ whatever techniques they liked to verify the senders
> legitimacy.
>
> It's for these reasons that I'm interested in the overlap between Bitcoin
> and E2E messaging. It seems to me they are fundamentally linked.
>
> Final thought. I'm somewhat notorious in the Bitcoin community for making
> radical suggestions, like maybe there exists a tradeoff between privacy and
> abuse. Lots of people in the crypto community passionately hate this idea
> and (unfortunately) anyone who makes it. I guess you can see based on the
> above stories why I think this way though. It's not clear to me that chasing
> perfect privacy whilst ignoring abuse is the right path for any system that
> wishes to achieve mainstream success.
>
> _______________________________________________
> Messaging mailing list
> Messaging at moderncrypto.org
> https://moderncrypto.org/mailman/listinfo/messaging
>