Steve Kemp's Blog

Debian & Free Software

About This Site

This is a simple blog relating to Debian & Free Software issues.

Archive

Entries tagged "spam".

10th September 2007

Recently the topic of spam on the Debian lists was revisited. I laugh at somebody who recieves 200 spam messages a day.

Here's my stats for yesterday:

                                          Total Mails    : 6399
                                          Total SPAM     : 6077
                                          Total Accepted : 322

                                          Spam Percentage: 94.97%

That's 6077 mails rejected at SMTP time via my filters, and only 322 mails accepted.

The breakdown of the spam rejected looks like this:

                                  Plugin      Count
--------------------------------------------------------------
                                   dnsbl       3755
                             hosts_allow        724
                             greylisting        661
                       check_earlytalker        303
                          check_spamhelo        238
             require_resolvable_fromhost        219
                           virus::clamav         79
                         check_badrcptto         75
                       check_badmailfrom         23
--------------------------------------------------------------


29th June 2008

There exist many bayasian/statistical spam filters, ranging from products such as spambayes, and spamassassin, to crm114. Each of them works in their own way. Having used and tested almost all of them I've noticed a common flaw.

The vast majority of spam-filters struggle to correctly classify "419 scam" mails, lottery fraud, and similar mails.

Why is that? In general, having read hundreds of these mails, I can see several things that are common in these kind of the mails:

  • Mention of currency in both numeric and word forms. ($1,000,000 + 1 million US dollars)
  • Mention of a country / nationality (Sierra Lione, Nigerian)
  • Mention of a reference/claim number and often "official address".
  • Christian references.
  • Greetings such as "dear friend", and mentions of discretion/secrecy.
  • Size. (A scam mail is typically greater in length than an average spam mail).

Whilst none of these individually are indicative of a scam mail it is interesting to count their combined occurance.

I've written a toy program to count these things, and so far the success rate is >60% which is a reasonable start - providing this kind of detection occurs after normal filtering.

I may experiment further, but I figured a public query on scam detection might be appropriate.

Whilst the detecting a scam mail is a subset of detecting a spam email there are probably simplifications that may be made, and exploring those wouldn't be a bad thing.

ObQuote: Buffy.

3 comments Tags: spam.
31st August 2008

When volume becomes high enough you start to observe patterns in SPAM pretty easily. I think that this is primarily because people like to see patterns, whether they are present or not.

The trick is determining whether they are real patterns or not, and then to a lesser extent whether they are useful patterns.

For example I host mail for a business domain. That means that incoming messages come primarily from existing customers, and very rarely from potential new ones.

In practise that means that email is expected to arrive from 9am til 6pm (+/-2hours) Email received at 2AM? Either it is somebody working remotely, a foreign contact, or much more likely it is SPAM.

Now clearly you cannot dump all messages received at unusual times of the day, but it is a surprisingly robust SPAM indicator for that particular domain.

All heuristics are fallable, but some are useful regardless..

I'd love to know what people can learn from their SPAM. This week I'm handling approximately 80,000 messages a day, per MX, which isn't huge (ie. 2-3 million a month).

ObQuote: Highlander

5 comments Tags: spam.

RSS feed

Tags

Created by Chronicle v3.1