One year ago I finished my Master Thesis and I just realized that I never really blogged about it. I had the chance to build my implementation during the Thesis upon the Akonadi framework, which is the reason why I want to share what I worked about. For all who might be more interested in my Thesis, a publisher was so kind to publish it.
The title of my Thesis was "Implementation of Proactive Spam Fighting Techniques" and the Thesis covered the implementation of two orthogonal techniques. Both techniques shared the idea to eliminate Spam before it hits the user’s inbox. Current Spam fighting techniques like for example SpamAssassin are reactive. SpamAssassin uses a rule based approach and a Bayesian filter. Rule based approaches cannot identify Spam messages reliable and have the danger to incorrectly mark ham messages if like last year the date grossly in the future becomes present. Bayesian filtering requires to collect a rather large amount of mail messages before messages can be filtered. This illustrates that reactive Spam fighting is no real solution.
Both implementations required to interact with the user’s inbox and in one case it was required to automatically send out mail on mail reception. The original idea of my tutors at the Laboratory for Dependable Distributed Systems was to develop either a plugin for Mozilla Thunderbird or Microsoft Outlook. With my background of KDE development I had to think of one framework which handles this much better: Akonadi. Thanks to Akonadi I was able to implement the solution in a client and platform independent way. Instead of just supporting Mozilla Thunderbird the solution works on all systems supporting Akonadi and no specific mail client is required. It can even be used by users who just use a web mail client.
Mail-Shake
Mail-Shake is the name of one of the two projects. The concept is difficult to grasp, therefore I just copy the description from my thesis:
The basic idea behind Mail-Shake is to block all mails from unauthenticated senders and to provide senders an easy way to authenticate themselves. The process of authentication is done in a way so that humans are able to participate, while computers – and by that spam bots – are not. After authentication the sender’s address is put on a whitelist. The whitelist is used by Mail-Shake to decide if a mail is authorized or not. By that the concept is proactive as it blocks spam before it is read by the user.
These initial steps of authentication are illustrated in the Figure above. A sender (User A) has to send a mail to User B’s public email address. All mails sent to the public address are discarded, but answered with a challenge mail containing a unique identifier. User A has to solve the challenge, which reveals User B’s private address. Now User A can resend the original mail with the identifier in the subject. Mail-Shake compares the identifier and puts User A’s address on a whitelist. In future User A can send mails directly to User B’s private address. The authentication step is required only once. As well there is no need to include the identifier in each single mail. Other mails sent to the private address are discarded if the sender address is not on the whitelist.
For the implementation I wrote an Akonadi Agent which monitors a set of public and private collections. Whenever a mail is received at the public address, the agent generates the unique Id, stores it in a database and sends the challenge mail to the sender’s address. The handling of the private collection is more complex. The agent has to decide whether the mail is authorized to be send to the user or not. It first checks the whitelist, if not on the whitelist it has to decide whether it’s a reply to a challenge mail and if it is whether the id is correct and update the whitelist.
The original paper considered that a mail gets just dropped if it is not on the whitelist. That is where I disagreed and implemented an automated mail notification to the sender in the case of not being on the whitelist or incorrect copying of the unique id. Furthermore I did not just delete the mails but used the functionality of Akonadi to allow the mails to be be moved to a different collection.
The concept so far works wonderful for human communication. Humans are able to solve the challenge (in case of my implementation reCAPTCHA Mailhide) and to resend the mail. All spam messages are automatically deleted without disturbing the user. Wait, there is more: communication with machines. I receive far more mails send from machines than from humans: forum, mailing lists, web shops, review board and many, many more. Nobody in the web shop will answer the challenge mail; Mail-Shake will break.
The solution to this problem is to allow users to modify the whitelist by themselves. Of course this is tricky and could make the usage too difficult. Because of that I played again with KDE technology. My Mail-Shake agent uses a StatusNotifier which is hidden most of the time. Whenever a mail is received, which would be discarded, the StatusNotifier changes to the active state and becomes visible. It allows to just add the addresses of the mails in the current to be discarded queue to the whitelist.
In order to make it even more usable to the user, Mail-Shake can show normal Plasma notifications with an Action to add the address to the whitelist. The idea behind it is, that whenever you expect a mail from an automated system, you turn on the notification through the StatusNotifier and wait for the mail to arrive. (Btw a nice example for a notification which is completely useless when not having actions on it).
After the implementation I had tested Mail-Shake on my own mail system for two months. It illustrated that the implementation and the concept works. Spam mails were successfully removed and humans authenticated themselves. Nevertheless some problems occurred. The first is that the assumption for interaction with web-shops does not hold. Just because you buy at shop foo and receive a mail notification from @foo.com does not mean that further mails (e.g. the bill) will not be sent from @bar.com. This makes interacting with automated systems hardly possible without manually checking the filtered mails, in which case nothing is won in comparison to the existing reactive systems.
The second problem occurred when sending out the response challenge mails. As most mails are sent in reply to Spam mails we can expect that there is no mail system to accept the mails. Our mail transfer agent (MTA) will therefore send us a delivery status notification (DSN). Mail-Shake can handle those and will not send another challenge mail in response to the DSN and remove it, so that the user does not see it.
Mail-Shake supports delivery status notifications as described in RFC 3464, which does not specify a reference to the original mail, which makes it difficult to map the DSN to the sent mail and to decide if the DSN needs to be delivered to the user or not. Luckily during my evaluation each DSN included either the complete original message or at least the header section. So this seems to not be a problem in praxis. But the real problem is that there are MTAs not sending DSNs as described in RFC 3464. They use a custom format. Some of the non-standard conform MTAs are Exim, QMail (at least in older versions) and the MTA used by Google. The mail format is custom which makes it impossible to parse the mail automatically. In case of Exim it seems to be possible to customize the notifications. Mail-Shake is not able to recognize the DSN and starts to go into an infinite mail sending loop.
My personal resume is that Mail-Shake is not in a state to be used in a productive environment. The problem is not in the concept or in the implementation, but in MTAs not supporting RFC 3464. It is unlikely that MTAs not supporting the RFC for years will start to support it and even if, there are enough legacy installations to destroy any hope in having this concept in a workable state. It was rather frustrating to notice that the concept breaks because of the external state of the system.
Mail-Shake Client Integration
In my implementation I relied on reCAPTCHA mailhide API to protect the private mail address. The recipient of a challenge mail has to visit the reCAPTCHA website to reveal the address. This means we encourage users to click on links in mails. That is of course the opposite of what we teach users: never click a link in a mail you did not expect! To solve this issue we need an integration into the mail clients. The user should not notice that he received a challenge mail, the client has to guide the user in solving the challenge.
As at the time of writing my Master Thesis KMail was in heavy porting to Akonadi, I decided to implement the proof-of-concept integration in Mailody (which already supported Akonadi and WebKit at that time). The integration was straight forward and easy to implement. Mailody was a very nice application from that point of view. I extracted the Mail-Shake headers and used them to recognize that the mail is a challenge and integrated a link to solve the challenge. If you clicked the link a dialog opened, connected to reCAPTCHA, extracted the CAPTCHA and offered an input field to enter the text. The solved CAPTCHA was submitted to the web server and the result parsed, providing a localized info whether the challenge was solved or the CAPTCHA incorrect.
The next step would have been to automatically resend the mail with the solved challenge, but I did not implement that. The idea was to just show that a client implementation is possible to solve the mail clicking problem.
Spam Templates
The second part of my Thesis is much smaller, but in my opinion the more interesting one. The laboratory had setup a system to execute Spam Bots in a controlled environment and to intercept the Spam mails it was sending out. The communication to their Master&Control servers was kept alive. The bot and the Master&Control server is by that thinking that they are working correctly. The Spam Bot is regularly fetching new templates from the control server and generates mails based on that template. These mails are intercepted and the template is reverse engineered. The exact process is described in the paper Towards Proactive Spam Filtering.
From the previous existing setup I had a set of regular expression based templates and a set of spam messages which were used to construct the templates. As the system was not up and running I did not have recent templates. Because of that I could only evaluate against the existing messages.
My implementation was based on another Akonadi Agent which reacted on the reception of new mails. Any new mail is checked against the templates and a score is generated from the matched lines of the template. The Agent would receive new templates through an RSS system (thanks to libsyndication). In the ideal situation new templates would be created at the same time as a new Spam campaign is started and distributed to all clients. So at the time the client receives the first Spam message of the new campaign the template would already exist. As the template is only generated from Spam messages it would be impossible to match Ham messages. This has an incredible advantage compared to the existing reactive filtering solutions: you don’t have to collect Spam messages before you are able to filter them.
Unfortunately this is just a research project, so there is no server broadcasting the Spam templates. Which makes my implemented system rather useless and did not allow me to verify if the implementation works correctly – I only know that it does not filter Ham messages.
Where to get the Code?
Of course the complete code has been written as free software, so where can you get it? I used git as version control system, so it has not been available in the KDE repository and after the switch to git I have not yet considered to upload it. The code and packages are available through the openSUSE Build Service and a live CD is available through SUSE Studio (awesome tool). It is possible that some parts are not working anymore as it seems that reCAPTCHA changed their web site (which fails my parser).
What else?
Of course I could only present a small part of my Thesis in one blog post. There are lots more interesting things to find in my Thesis. For example I was able to break the scr.im email address protection system using OpenGL Shading Language and OCR with a reliability of more than 99 %. I also did an investigation on existing CAPTCHA systems which showed me that most systems are in fact completely broken and just an annoyance to the user who has to use them.
Last but not least I want to thanks the KDE PIM team for providing such an awesome framework. Without Akonadi this Thesis would not have been possible. It is just incredible what can be done with Akonadi and I’m looking forward to KMail 2 ever since I started to work with Akonadi. I always feel sorry if I have to read the bad user responses to Akonadi. Don’t get frustrated, keep up the good work!
=-=-=-=-=
Powered by Blogilo