Proactive Spam Fighting with Akonadi

One year ago I finished my Master's Thesis and I just realized that I never really blogged about it. I had the chance to build my implementation on top of the Akonadi framework, which is why I want to share what I worked on. For anyone who is interested in more detail, a publisher was kind enough to publish the Thesis.

The title of my Thesis was "Implementation of Proactive Spam Fighting Techniques" and it covered the implementation of two orthogonal techniques. Both techniques share the idea of eliminating Spam before it hits the user’s inbox. Current Spam fighting techniques, for example SpamAssassin, are reactive. SpamAssassin uses a rule-based approach and a Bayesian filter. Rule-based approaches cannot identify Spam messages reliably and risk incorrectly marking ham messages, as happened last year when a date that a rule treated as "grossly in the future" became the present. Bayesian filtering requires collecting a rather large number of mail messages before anything can be filtered. This illustrates that reactive Spam fighting is no real solution.

Both implementations had to interact with the user’s inbox, and in one case mail had to be sent out automatically whenever a mail was received. The original idea of my tutors at the Laboratory for Dependable Distributed Systems was to develop a plugin for either Mozilla Thunderbird or Microsoft Outlook. With my background in KDE development I immediately thought of a framework which handles this much better: Akonadi. Thanks to Akonadi I was able to implement the solution in a client- and platform-independent way. Instead of supporting only Mozilla Thunderbird, the solution works on all systems supporting Akonadi and no specific mail client is required. It can even be used by users who only use a web mail client.

Mail-Shake

Mail-Shake is the name of one of the two projects. The concept is difficult to grasp, so I will simply quote the description from my thesis:

The basic idea behind Mail-Shake is to block all mails from unauthenticated senders and to provide senders an easy way to authenticate themselves. The process of authentication is done in a way so that humans are able to participate, while computers – and by that spam bots – are not. After authentication the sender’s address is put on a whitelist. The whitelist is used by Mail-Shake to decide if a mail is authorized or not. By that the concept is proactive as it blocks spam before it is read by the user.

The Mail-Shake process

These initial steps of authentication are illustrated in the Figure above. A sender (User A) has to send a mail to User B’s public email address. All mails sent to the public address are discarded, but answered with a challenge mail containing a unique identifier. User A has to solve the challenge, which reveals User B’s private address. Now User A can resend the original mail with the identifier in the subject. Mail-Shake compares the identifier and puts User A’s address on a whitelist. In the future User A can send mails directly to User B’s private address. The authentication step is required only once, and there is no need to include the identifier in every mail. Other mails sent to the private address are discarded if the sender address is not on the whitelist.

For the implementation I wrote an Akonadi Agent which monitors a set of public and private collections. Whenever a mail is received at the public address, the agent generates the unique id, stores it in a database and sends the challenge mail to the sender’s address. The handling of the private collection is more complex. The agent has to decide whether the mail is authorized to be delivered to the user or not. It first checks the whitelist. If the sender is not on the whitelist, it has to decide whether the mail is a reply to a challenge mail and, if it is, whether the id is correct; in that case it updates the whitelist.
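The decision logic described above can be sketched in a few lines. This is only an illustration in Python, not the Akonadi agent itself: the class `MailShake`, its method names and the in-memory `pending` dictionary (standing in for the database) are hypothetical, and binding each id to the address it was issued to is an assumption of this sketch.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class MailShake:
    """Hypothetical in-memory model of the agent's state (not the Akonadi code)."""
    whitelist: set = field(default_factory=set)
    pending: dict = field(default_factory=dict)  # challenge id -> sender address

    def on_public_mail(self, sender: str) -> str:
        """Mail to the public address: discard it and issue a challenge id."""
        challenge_id = secrets.token_hex(8)
        self.pending[challenge_id] = sender
        return challenge_id  # would be embedded in the challenge mail

    def on_private_mail(self, sender: str, subject: str) -> str:
        """Mail to the private address: returns 'deliver' or 'reject'."""
        if sender in self.whitelist:
            return "deliver"
        # Not whitelisted: accept only if the subject carries an id
        # that was issued to this very sender (assumption of this sketch).
        for challenge_id, expected_sender in list(self.pending.items()):
            if challenge_id in subject and expected_sender == sender:
                del self.pending[challenge_id]
                self.whitelist.add(sender)
                return "deliver"
        return "reject"  # the agent would notify the sender and move the mail
```

A sender who has never solved a challenge is rejected; once a mail with the correct id in the subject arrives, the address is whitelisted and later mails pass without an id.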

The original paper assumed that a mail is simply dropped if its sender is not on the whitelist. That is where I disagreed: I implemented an automated mail notification to the sender for the case that the sender is not on the whitelist or copied the unique id incorrectly. Furthermore I did not just delete the mails but used the functionality of Akonadi to allow the mails to be moved to a different collection.

The concept so far works wonderfully for human communication. Humans are able to solve the challenge (in the case of my implementation, reCAPTCHA Mailhide) and to resend the mail. All spam messages are automatically deleted without disturbing the user. But wait, there is more: communication with machines. I receive far more mails sent from machines than from humans: forums, mailing lists, web shops, review board and many, many more. Nobody at the web shop will answer the challenge mail, so Mail-Shake breaks.

The solution to this problem is to allow users to modify the whitelist themselves. Of course this is tricky and could make the usage too difficult. Because of that I played with KDE technology again. My Mail-Shake agent uses a StatusNotifier which is hidden most of the time. Whenever a mail is received which would be discarded, the StatusNotifier changes to the active state and becomes visible. It lets the user add the sender addresses of the mails currently in the to-be-discarded queue to the whitelist.

Notification on receipt of a not-whitelisted mail

To make this even more usable, Mail-Shake can show normal Plasma notifications with an action to add the address to the whitelist. The idea is that whenever you expect a mail from an automated system, you turn on the notification through the StatusNotifier and wait for the mail to arrive. (By the way, this is a nice example of a notification which would be completely useless without actions on it.)

After the implementation I tested Mail-Shake on my own mail system for two months. This showed that the implementation and the concept work. Spam mails were successfully removed and humans authenticated themselves. Nevertheless some problems occurred. The first is that my assumption about the interaction with web shops does not hold. Just because you buy at shop foo and receive a mail notification from @foo.com does not mean that further mails (e.g. the bill) will not be sent from @bar.com. This makes interacting with automated systems hardly possible without manually checking the filtered mails, in which case nothing is gained compared to the existing reactive systems.

The second problem occurred when sending out the challenge mails. As most challenge mails are sent in reply to Spam mails, we can expect that there is no mail system to accept them. Our mail transfer agent (MTA) will therefore send us a delivery status notification (DSN). Mail-Shake can handle those: it does not send another challenge mail in response to the DSN, and it removes the DSN so that the user does not see it.

Mail-Shake supports delivery status notifications as described in RFC 3464. The RFC does not specify a reference to the original mail, which makes it difficult to map a DSN to the sent mail and to decide whether the DSN needs to be delivered to the user or not. Luckily, during my evaluation each DSN included either the complete original message or at least its header section, so this does not seem to be a problem in practice. The real problem is that there are MTAs which do not send DSNs as described in RFC 3464 but use a custom format. Some of the non-conforming MTAs are Exim, QMail (at least in older versions) and the MTA used by Google. Because the format is custom, it is impossible to parse these mails automatically. In the case of Exim it seems to be possible to customize the notifications. Mail-Shake is not able to recognize such a DSN, answers it with another challenge mail and ends up in an infinite mail sending loop.
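To illustrate what recognizing a standard DSN involves, here is a small sketch using Python's standard `email` package instead of the KDE PIM libraries. RFC 3464 DSNs are `multipart/report` messages with `report-type=delivery-status`, and the returned original is carried either as `message/rfc822` or as `text/rfc822-headers`. The function names are hypothetical, and mapping the DSN back via the `Message-ID` header is an assumption of this sketch, not necessarily what Mail-Shake does.

```python
from email import message_from_string
from email.message import Message


def is_rfc3464_dsn(msg: Message) -> bool:
    """True if msg is a multipart/report with report-type=delivery-status."""
    if msg.get_content_type() != "multipart/report":
        return False
    report_type = msg.get_param("report-type")
    return isinstance(report_type, str) and report_type.lower() == "delivery-status"


def original_message_id(msg: Message):
    """Message-ID of the returned original mail, or None if it is not included."""
    for part in msg.walk():
        ctype = part.get_content_type()
        if ctype == "message/rfc822":
            # The complete original message is attached.
            payload = part.get_payload()
            if isinstance(payload, list) and payload:
                return payload[0].get("Message-ID")
        elif ctype == "text/rfc822-headers":
            # Only the header section of the original message is attached.
            return message_from_string(part.get_payload()).get("Message-ID")
    return None
```

A bounce in a custom format fails the first check, which is exactly the case where a challenge-response system cannot tell the bounce apart from a normal mail.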

My personal conclusion is that Mail-Shake is not in a state to be used in a production environment. The problem is not in the concept or in the implementation, but in MTAs not supporting RFC 3464. It is unlikely that MTAs which have not supported the RFC for years will start to support it, and even if they did, there are enough legacy installations to destroy any hope of getting this concept into a workable state. It was rather frustrating to see the concept break because of the state of external systems.

Mail-Shake Client Integration

In my implementation I relied on the reCAPTCHA Mailhide API to protect the private mail address. The recipient of a challenge mail has to visit the reCAPTCHA website to reveal the address. This means we encourage users to click on links in mails. That is of course the opposite of what we teach users: never click a link in a mail you did not expect! To solve this issue we need an integration into the mail clients. The user should not even notice that he received a challenge mail; the client has to guide the user through solving the challenge.

As KMail was being heavily ported to Akonadi at the time of writing my Master's Thesis, I decided to implement the proof-of-concept integration in Mailody (which already supported Akonadi and WebKit at that time). The integration was straightforward and easy to implement. Mailody was a very nice application from that point of view. I extracted the Mail-Shake headers, used them to recognize that the mail is a challenge, and integrated a link to solve the challenge. If you clicked the link, a dialog opened, connected to reCAPTCHA, extracted the CAPTCHA and offered an input field to enter the text. The solved CAPTCHA was submitted to the web server and the result parsed, providing a localized message stating whether the challenge was solved or the CAPTCHA was incorrect.

Solving a Mail-Shake Challenge in Mailody

The next step would have been to automatically resend the mail with the solved challenge, but I did not implement that. The idea was just to show that a client implementation is possible and solves the link clicking problem.

Spam Templates

The second part of my Thesis is much smaller, but in my opinion the more interesting one. The laboratory had set up a system to execute Spam Bots in a controlled environment and to intercept the Spam mails they were sending out. The communication with their Master&Control servers was kept alive, so both the bot and the Master&Control server believe that they are working correctly. The Spam Bot regularly fetches new templates from the control server and generates mails based on those templates. These mails are intercepted and the template is reverse engineered. The exact process is described in the paper Towards Proactive Spam Filtering.

From the previously existing setup I had a set of regular expression based templates and the set of spam messages which had been used to construct the templates. As the system was not up and running, I did not have recent templates. Because of that I could only evaluate against the existing messages.

My implementation was based on another Akonadi Agent which reacts to the reception of new mails. Any new mail is checked against the templates and a score is generated from the matched lines of the template. The Agent would receive new templates through an RSS system (thanks to libsyndication). In the ideal situation new templates would be created at the same time as a new Spam campaign is started and distributed to all clients. So at the time the client receives the first Spam message of the new campaign, the template would already exist. As the template is only generated from Spam messages, it would be impossible for it to match Ham messages. This has an incredible advantage compared to the existing reactive filtering solutions: you don’t have to collect Spam messages before you are able to filter them.
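The scoring idea can be illustrated with a short Python sketch. The post only says that the score is derived from the matched template lines, so the concrete formula here (fraction of template lines that fully match some line of the mail) is an assumption, and both `template_score` and the example `TEMPLATE` are hypothetical.

```python
import re


def template_score(template_lines, mail_body):
    """Fraction of template line patterns that fully match some line of the mail."""
    patterns = [re.compile(p) for p in template_lines]
    body_lines = mail_body.splitlines()
    matched = sum(
        1 for pattern in patterns
        if any(pattern.fullmatch(line) for line in body_lines)
    )
    return matched / len(patterns) if patterns else 0.0


# Hypothetical template; real templates were reverse engineered from bot output.
TEMPLATE = [
    r"Dear \w+,",
    r"Buy cheap (pills|watches) at http://[a-z0-9.]+/\w+",
    r"Offer ends in \d+ days\.",
]
```

A mail generated from the template matches every line and scores 1.0, while a normal mail that merely shares a greeting line scores much lower; where to put the threshold between the two is left open here.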

Unfortunately this is just a research project, so there is no server broadcasting the Spam templates. This makes my implemented system rather useless and did not allow me to verify that the implementation works correctly. I only know that it does not filter Ham messages.

Where to get the Code?

Of course the complete code has been written as free software, so where can you get it? I used git as the version control system, so the code was never available in the KDE repository, and after KDE's switch to git I have not yet considered uploading it. The code and packages are available through the openSUSE Build Service and a live CD is available through SUSE Studio (awesome tool). It is possible that some parts are not working anymore, as it seems that reCAPTCHA changed their web site (which breaks my parser).

What else?

Of course I could only present a small part of my Thesis in one blog post. There are many more interesting things to find in my Thesis. For example I was able to break the scr.im email address protection system using OpenGL Shading Language and OCR with a reliability of more than 99 %. I also did an investigation of existing CAPTCHA systems which showed me that most systems are in fact completely broken and just an annoyance to the user who has to use them.

Last but not least I want to thank the KDE PIM team for providing such an awesome framework. Without Akonadi this Thesis would not have been possible. It is just incredible what can be done with Akonadi and I have been looking forward to KMail 2 ever since I started to work with Akonadi. I always feel sorry when I read the negative user responses to Akonadi. Don’t get frustrated, keep up the good work!

10 thoughts on “Proactive Spam Fighting with Akonadi”

  1. I would love to see a pro active solution to actually be pro active and attack spam machines :) Something like a p2p network of mail clients that would communicate together and coordinate attacks on compromised machines/servers :) That way we would get rid of spam faster than in a blink of an eye, and if i understood it right then the “sniffed” templates could lead us to the source :)

    Anyway, different topic this week but still i love it as your beautiful piece of programming will/might improve our lives :)

    BTW, anything new on kwin front? I am awaiting all your blog posts regarding KDE technology especially kwin.

    1. Actively attacking the spam bots would, at least in Germany, be against the law. And it would not help much: a botnet has thousands of spam bots, so what does it care about a few bots being attacked? Even the Master&Control servers cannot be easily attacked, as they use a P2P approach nowadays. It is seldom that a botnet can be destroyed like the Waledac botnet (which was, by the way, infiltrated by a student of the same laboratory where I did my Thesis).

      Concerning KWin: new blog post probably later the week.

      1. Unfortunately you are right.

        Question: Pro active approach to spam mail is only possible on desktop side right? If so then the “traditional” approaches to anti spam will still exist on server side with no other alternative. And judging by my spam folder on gmail it works :)
        In theory, wouldn’t some sort of crowd sourcing work for getting rid of spam? Everyone would sort of “chip in” marking mail as spam etc.

        Isn’t it a bit better than too strict rules on traditional spam solutions?

  2. The problem with Mail-Shake is not the lack of RFC 3464-compliant systems; rather, the concept itself is flawed. If I read your overview correctly, Mail-Shake is just a newfangled name for yet another challenge-response system. I had hoped that the time of C/R was long over, as it is a bad solution which actually worsens the Spam problem with e.g. backscatter mail to forged sender addresses. I recommend this link list from 2006 by Justin Mason (original creator of SpamAssassin): http://taint.org/2006/12/14/130136a.html

    (A slightly better working alternative might be the integration of C/R with something like greylisting — ie. your outgoing server announces that it supports greylisting, your mail is rejected at SMTP level with a special temporary error code and then your C/R system jumps in and allows you to have your mail delivered on time instead of you having to wait for your server to redeliver. Still, you want to accept mail which didn’t pass your C/R — or maybe not, but expect to lose correspondence then.)

    Yes, reactive solutions like SpamAssassin do have their problems, but as anybody outside academia can tell you, they are the only working solutions which do not break the currently existing mail system.

    BTW: I would have looked at your thesis or the linked article to see if there’s anything new inside, but come on, 90$ just to read your thesis?

  3. Your assumption that your mail to a spam address will not go to a real account is seriously flawed. A system like you are describing should NEVER be allowed to be put to use.

    Actually, any spam prevention system that sends a response to a spam e-mail should automatically be put on a blacklist and the internet access for the user should be cut off!

    Why? You do not need any kind of thesis these days to realise that the from address is forged. And it often ends up in a real address. So some poor user out there will suddenly receive a few thousand messages about spam messages that he never sent in the first place. Sending any kind of response to a spam message doubles the problem. The only real solution is to delete it with no response at all.

    The spam problem can only be tackled with some kind of authentication between servers. Spam should never be allowed to leave the transmitting server. That is the place to stop it.

  4. This is a fail, historically. What will happen is that an attacker will find two parties with challenge systems and send both of them fake emails so that they end up mailing each other.

    It's a multiplier for the attack. The attacker sends 2 messages and, if the attacker is really lucky, an unlimited number get sent between the 2 machines, i.e. a challenge request sent in response to a challenge request. If the attacker is unlucky, the attack force is normally still doubled: 4 messages sent and received. Even better, if the attacker is really lucky you blacklist the email address that was faked, doing even more damage.

    Jimm everything ASSP can do can be used by spamassassin. Both ASSP and spamassassin are only as good as the person setting them up.

  5. Martin, it was nice to discover you used Mailody for it! I still regret having to give up development…

    As long as the From address can be manipulated, spam will exist. Making SPF mandatory might solve some issues.

    Feel free to keep using Mailody :)

    Best,

    Toma
