
Entries tagged "blogspam".

You do know there are more guns in the country than there are in the city.

Lenny Backports

After a couple of days I've spotted a few things that don't work so well on Lenny:

gtk-gnutella

gtk-gnutella is a client for a peer-to-peer filesharing system. Unfortunately the version of the client in Lenny dies on startup with the message "This version is too old to connect".

gimp

The graphics program, The Gimp, doesn't show a live preview when carrying out things such as colour desaturation.

Although not an insurmountable problem it is moderately annoying if you do such things often.

So I've placed backported packages online.

I expected to have to backport KVM, and I guess I realised I needed a new kernel to match too. So they're available in the kvm-hosting repository; take the kernel with "birthday" in its name - the other is more minimal and has no USB support, etc.

blog spam

Since I last reset the statistics the blog spam detector has reported, rejected, and refused just over half a million bogus comments.

It can and should do better.

I've been planning on overhauling this for some time; even to the extent of wondering if I can move the XML::RPC service into a C daemon with embedded lua/perl to do the actual analysis.

(Right now the whole service is Perl, but I'm a little suspicious of the XML::RPC library - my daemon dies at times and I don't understand why.)

I'd say "test suggestions welcome", but then I'd have to explain what is already done. If you're curious take a look at the code...

ObSubject: Hot Fuzz

 

Is my personal life of interest to you?

This weekend I mostly fiddled around migrating machines from Xen hosting to KVM hosting. Ultimately it was largely a waste of time, due to various other factors. Still, with a bit of luck it will be possible to move the machines next week.

That aside I spent a while updating my blogspam detection site. As a brief recap this site offers a simple XML-RPC service which allows you to test whether incoming blog comments are spam or not.

Originally this was put together to fight an invasion of comments submitted to the Debian Administration website. The site currently shows:

Site                        Spam   Non-Spam   % spam
debian-administration.org   238    372        60.98%

Depressing. But not as depressing as the real live stats, which show that since I last reset the counters there have been 36,995 spam comments vs. 1,206 non-spam comments. (live updating counters here)

Anyway I updated the service today to add two new plugins, both of which are a little reactionary.

The first new plugin is called "multilink" and is based upon the observation that spammers rarely know the markup of the site they are submitting comments to. This means you can frequently see submitted comments like this:

 <a href="http://spam.com">buy viagra</a>
 [url=http://spam.com]buy viagra[/url]
 [link=http://spam.com]buy me[/link]

Here we have three different styles of links - "a href", "link=", and "url=". I figure this is a clear indicator of a confused mind, or more likely a spammer.
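As a rough illustration of the idea (not the plugin's actual code, which lives in the Perl service), here is a minimal Python sketch. The regular expressions, the function names and the "more than one style" threshold are all my own assumptions:

```python
import re

# Hypothetical patterns, one per link style mentioned above.
LINK_STYLES = {
    "a href": re.compile(r"<a\s+[^>]*href\s*=", re.IGNORECASE),
    "url=": re.compile(r"\[url=", re.IGNORECASE),
    "link=": re.compile(r"\[link=", re.IGNORECASE),
}


def link_styles_used(comment):
    """Return the set of link-style names that appear in the comment."""
    return {name for name, pattern in LINK_STYLES.items() if pattern.search(comment)}


def is_multilink(comment, max_styles=1):
    """Flag a comment that mixes more than max_styles link styles (assumed threshold)."""
    return len(link_styles_used(comment)) > max_styles
```

A comment which sticks to a single markup style passes; one which mixes two or more gets flagged.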

The second new plugin is designed to stop people who enter "<strong>" words. It is a little coarse, but it has actually had zero false positives in the real world, so I'm going to leave it live to see how it works out.

In happier news I'm just back from a trip to the beach. Sand rocks. Even if it wasn't windy enough for my kite ..

ObFilm: Dracula ("Bram Stoker's Dracula" - 1992)

 

You didn't faint. That's always a good sign

Joey has implemented blogspam support for ikiwiki, which you can find here.

He was also good enough to provide valuable & useful feedback so I can present a moderately updated API - most notably with the ability to train comments when/if they are incorrectly marked.

I can imagine several other useful additions, but mostly I think the code should be cleaned up (written in an evening, posted to keep myself honest, didn't really expect anybody to be terribly interested) and then it should just tick over until comment submitters adapt and improve to the extent that my sites suffer again ..

ObFilm: Kissed

 

Some of us are just better at hiding it, that's all.

Since I managed to get my blogspam plugin listed on wordpress.org I've seen a surge of interest.

In the past few hours capture stats are:

Good comments   1
SPAM comments   100

Almost exactly 1% of submitted comments are OK, giving a SPAM rate of 99%. Either forum and blog spam are even more rife than I'd expected or I am over-zealous!

Anyway that's enough on this topic for a while; if you want to follow the work you know where to look, and if you don't, my writing about it again will just drive you mad..

ObFilm: The Breakfast Club

 

Named after a hot dog, you poor man, you poor, poor man.

I've updated the blogspam site a little. Now:

Still has a sucky site / layout...

ObFilm: Ghostbusters II

 

When the light is green, the trap is clean.

Recently the Debian Administration website has been besieged by spammers submitting bogus comments.

I also run a couple of blogs which are constantly having random gibberish submitted to them - although it is less of a problem there because I approve them manually due to the offline nature of my blog, and the blog compiler I use.

Regardless I figured it was time to do something about it, in part because Chris Searle suggested that I should be able to query arbitrary text for spammyness due to my experience with fighting email spam. (As it turns out that experience is not so useful. Comments and emails are not the same. The header-trickery you see in bogus email, false EHLOs, etc, are not useful concepts to carry across.)

Anyway we were saying, I was going to do something? So I did. My solution? An XML-RPC server which you submit your comments to. This will return either:

OK              The comment is clean.
SPAM:[reason]   The comment is spam, and the optional reason explains further.
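On the client side the two possible results can be told apart with a few lines of string handling. This is only a sketch in Python; the function name is made up, and treating anything else as an error is my assumption rather than documented behaviour:

```python
def parse_verdict(result):
    """Split a result string into (is_spam, reason).

    Assumes the two forms described above: "OK" or "SPAM" optionally
    followed by ":" and a reason.
    """
    text = result.strip()
    if text == "OK":
        return False, None
    if text.startswith("SPAM"):
        # Everything after the first colon is the optional reason.
        _, _, reason = text.partition(":")
        return True, reason.strip() or None
    raise ValueError(f"unexpected result: {result!r}")
```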

The server itself is very simple, and it comes with a collection of simple plugins which do the job of testing different things. More plugins will almost certainly be added over time - so far I've seen the lotsaurls plugin capturing most of the SPAM, although the DNSRBL lookup is also working nicely having captured live spam without my touching it. Huzzah. & etc..
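To give a feel for the shape of this, here is a hedged Python sketch of a plugin chain. The real server is Perl and its plugin interface is not shown here, so the calling convention is hypothetical. I am also guessing that a plugin called lotsaurls counts URLs against a limit; the limit of 3 is invented:

```python
import re

URL_RE = re.compile(r"https?://", re.IGNORECASE)


def lotsaurls(comment, max_urls=3):
    """Return a reason string if the comment has too many URLs, else None.

    The limit is an assumed value, not the service's real setting.
    """
    count = len(URL_RE.findall(comment))
    if count > max_urls:
        return f"too many URLs ({count})"
    return None


def run_plugins(comment, plugins):
    """Run each plugin in turn; the first one to object decides the result."""
    for plugin in plugins:
        reason = plugin(comment)
        if reason:
            return f"SPAM:{reason}"
    return "OK"
```

Adding a new test is then just a matter of appending another function to the list passed to run_plugins.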

If there is any interest feel free to let me know, although as the server is open, and the client side usage is documented I imagine other people could run with it easily enough too - avoiding me becoming a single point of failure.

For reference my initial change to hook this up live on the site was only a few lines of code - although later I did need to make some additional changes to make it cope with failures, & etc.
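Coping with failures mostly means deciding what to do when the server cannot be reached. A minimal Python sketch follows, assuming you want to accept comments when the check itself breaks; that policy, and the names used, are my choices for illustration rather than what the site actually does:

```python
def accept_comment(comment, check, fail_open=True):
    """Return True if the comment should be accepted.

    check is any callable which submits the comment to the service and
    returns its result string. If it raises (timeout, connection refused,
    malformed reply), fall back to the fail_open policy instead of
    breaking comment submission.
    """
    try:
        result = check(comment)
    except Exception:
        return fail_open
    return not str(result).startswith("SPAM")
```

In practice check would probably wrap a call made through something like xmlrpc.client.ServerProxy; the method name and arguments depend on the service, so they are left out here.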

ObFilm: Ghostbusters