Entries tagged "blogspam".

A partial perl-implementation of Redis

So recently I got into trouble running Redis on a host, because the data no longer fits into RAM.

As an interim measure I fixed this by bumping the RAM allocated to the guest, but a real solution was needed. I figure there are three real alternatives:

  • Migrate to Postgres, MySQL, or similar.
  • Use an alternative Redis implementation.
  • Do something creative.

Looking around I found a couple of Redis-alternatives, but I was curious to see how hard it would be to hack something useful myself, as a creative solution.

This evening I spotted Protocol::Redis, which is a Perl module for decoding/encoding data to/from a Redis server.

Thinking "Ahah" I wired this module up to AnyEvent::Socket. The end result was predis - a Perl implementation of Redis.

It's a limited implementation which stores data in an SQLite database, and currently has support for:

  • get/set
  • incr/decr
  • del/ping/info

It isn't hugely fast, but it is fast enough, and it should be possible to use alternative backends in the future.

I suspect I'll not add sets/hashes, but it could be done if somebody was keen.
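For the curious, the interesting part is just parsing the Redis wire protocol (RESP) and dispatching on the command name. A minimal sketch in JavaScript - predis itself is Perl, and a plain in-memory Map stands in here for the SQLite backend:

```javascript
// Sketch of the RESP handling behind a predis-style server. A command
// arrives as "*3\r\n$3\r\nSET\r\n$3\r\nfoo\r\n$3\r\nbar\r\n"; we turn it
// into ["SET", "foo", "bar"] and dispatch on the verb.
function parseCommand(buf) {
  const lines = buf.split("\r\n");
  const argc = parseInt(lines[0].slice(1), 10); // "*3" -> 3 arguments
  const args = [];
  for (let i = 0; i < argc; i++) {
    args.push(lines[2 + i * 2]); // skip the "$len" length lines
  }
  return args;
}

const store = new Map(); // the real thing would write to SQLite instead

// Handle the handful of commands predis supports, returning a RESP reply.
function dispatch(args) {
  switch (args[0].toUpperCase()) {
    case "PING": return "+PONG\r\n";
    case "SET":  store.set(args[1], args[2]); return "+OK\r\n";
    case "GET": {
      const v = store.get(args[1]);
      return v === undefined ? "$-1\r\n" : `$${v.length}\r\n${v}\r\n`;
    }
    case "INCR": {
      const n = (parseInt(store.get(args[1]), 10) || 0) + 1;
      store.set(args[1], String(n));
      return `:${n}\r\n`;
    }
    case "DEL": return `:${store.delete(args[1]) ? 1 : 0}\r\n`;
    default:    return `-ERR unknown command '${args[0]}'\r\n`;
  }
}
```

Wired up to net.createServer, with each complete command fed through dispatch(parseCommand(chunk)) and written back to the socket, this is roughly enough for a standard Redis client to talk to - modulo buffering of partial commands, which the sketch ignores.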

 

Blogspam moved, redis alternatives being examined

As my previous post suggested I'd been running a service for a few years, using Redis as a key-value store.

Redis is lovely. If your dataset will fit in RAM. Otherwise it dies hard.

Inspired by Memcached, which is a simple key=value store, Redis allows for more operations: sets, hashes, etc.

As it transpires I mostly just set keys to values, so it crossed my mind last night that an alternative to rewriting the service against a non-RAM-constrained store might be to juggle Redis out and replace it with something else.

If it were possible to have a Redis-compatible API which secretly stored the data in leveldb, sqlite, or even Berkeley DB, then that would solve my problem of RAM-constraints, and also be useful.

Looking around there are a few projects in this area: the nds fork of redis, ssdb, etc.

I was hoping to find a Perl Redis::Server module, but sadly nothing exists. I should look at the various node.js stub-servers which exist as they might be easy to hack too.

Anyway the short version is that this might be a way forward. The real solution might be to use sqlite or postgres, but that would take a few days' work. For the moment the service has been moved to a donated guest, and has 2Gb of RAM instead of the paltry 512Mb it was running on previously.

Happily the server is installed/maintained by my slaughter tool so reinstalling took about ten minutes - the only hard part was migrating the Redis contents, and that's trivial thanks to the integrated "SLAVEOF" support. (I should write that up regardless though.)

 

What do you do when your free service is too popular?

Once upon a time I set up a centralized service for spam-testing blog/forum comments in real time; that service is BlogSpam.net.

This was created because the Debian Administration site was getting hammered with bogus comments, as was my personal blog.

Today the unfortunate thing happened: the virtual machine this service was running on ran out of RAM and died. The Redis store that holds all the state has now exceeded the paltry 512Mb allocated to the guest, so the OOM-killer took it down.

So I'm at an impasse - I either recode it to use MySQL instead of Redis, or something similar to allow the backing store to exceed the RAM-size, or I shut the thing down.

There seems to be virtually no likelihood that somebody would sponsor a host to run the service, because people just don't pay for this kind of service.

I've temporarily given the guest 1Gb of RAM, but that comes at a cost. I've had to shut down my "builder" host - which is used to build Debian packages via pbuilder.

Offering an API, for free, which has become increasingly popular and yet equally gets almost zero feedback or "thanks" is a bit of a double-edged sword. Because it has so many users it provides a better service - but equally costs more to run in terms of time, effort, and attention.

(And I just realized over the weekend that my Flattr account is full of money (~50 euro) that I can't withdraw - since I deleted my paypal account last year. Ooops.)

Meh.

Happy news? I'm avoiding the issue of free service indefinitely with the git-based DNS product which was covering costs and now is .. doing better. (20% off your first months bill with coupon "20PERCENT".)

 

So load-balancers are awesome

When I was recently talking about load-balancers, and automatically adding back-ends, not just removing bad ones that go offline, I obviously spent a while looking over some.

There are several dedicated load-balancers packaged for Debian GNU/Linux.

In addition to actual dedicated load-balancers there are things that can be coerced into running in that way: apache2, varnish, squid, nginx, & etc.

Of the load-balancers I was immediately drawn to both pen and pound, because they have command line tools ("penctl" and "poundctl" respectively) for adding/removing/updating the running configuration.

Pen I've been using for a couple of days now, and although it suffers from some security issues I'm confident they will be resolved in the near future. (#741370)

My only outstanding task is to juggle some hosts around and stress-test the pair of them a little more before deciding on a winner.

In other news I kinda regret the whole blogspam.net API. I'd have had a far simpler life if I'd just run the damn thing as a DNSBL in the first place. (That's essentially how it operates on the whole anyway: submit spammy comments for long enough and you're just blacklisted thereafter.)

 

A beginning is a very delicate time.

Recently I wrote about docker, after a brief diversion into using runit for service management, I then wrote about it some more.

I'm currently setting up a new PXE-boot environment which uses docker for serving DHCP and TFTPD, which is my first "real" usage of any note. It is fun, although I now discover I'm not alone in using docker for this purpose.

Otherwise life is good, and my blog-spam detection service recently broke through the 11 million-rejected-comment barrier. The Wordpress Plugin is seeing a fair amount of use, which is encouraging - but more reviews would be nice ;)

I could write about work, I've not done that since changing job, but I'm waiting for something disruptive to happen first..

ObQuote: Dune. (film)

 

A difficult day

Today was my last day working at Bytemark, and I found it a lot harder than expected.

For better or worse I finished earlier than expected; having been gradually removing my accounts and privileges over the past few weeks I'd revoked my OpenVPN key this morning.

Mid-afternoon my openvpn connection tried to renegotiate session keys, or similar, and failed. So I stopped work a few hours early. That meant I managed to avoid sending my "goodbye world" email, which is probably for the best - after all a lovely company, lovely people, and a good environment, what can you say besides things that are lovely?

I think I largely wrapped things up neatly, and I'm pleased that one of my photos is hanging on the office wall. (I look forward to seeing that actually, I've only rarely made canvas prints.)

The only other thing of note this week has been the sharp rise in blogspam I've detected. Black Friday alive and well, on the internets ..

 

Some thoughts ..

It has taken just over two weeks for blogspam to reject 1 million SPAM comments.

I'm not sure how paranoid I should be about false-positives now, (I accept false-negatives easily enough).

Using node.js is pretty good for making toy servers, and on that basis here's another toy server:

This is a small server which is designed to accept HTTP-POSTs containing a payload of a message, these are stored and later retrieved. Seems like a simple thing, right? Imagine how it is used:

root@server1:~# record-log Upgraded mysql

root@server2:~# record-log Tweaked /etc/sysctl.conf

root@server3:~# record-log Added user 'bob'
root@server3:~# record-log Added user 'steve'

Later:

root@server3:~# get-recent
1.2.3.4 2013-09-28T08:08:09.211Z
root:Added user 'bob'

1.2.3.4 2013-09-28T08:08:10.211Z
root:Added user 'steve'

In short it makes it easy to record "activity", and later retrieve it. A host can only fetch the entries it stored, but if you've got access to the remote server then you can get all logs.

I suspect a more standard solution is to use syslog-ng, and logger, or similar. But it is a cute hack and I suspect if you've the discipline to record actions then this is actually reasonably useful.
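The behaviour described above - each host may only read back what it wrote - boils down to very little code. An in-memory model (names illustrative; the real mstore speaks HTTP and persists its data):

```javascript
// Model of the record-log/get-recent behaviour: hosts submit log lines,
// and each host can later fetch only the entries it stored itself.
const entries = []; // { ip, user, message, stamp }

function recordLog(ip, user, message, stamp) {
  entries.push({ ip, user, message, stamp });
}

// A host sees only its own entries, formatted as in the transcript above;
// only someone with access to the server itself could read everything.
function getRecent(ip) {
  return entries
    .filter((e) => e.ip === ip)
    .map((e) => `${e.ip} ${e.stamp}\n${e.user}:${e.message}`);
}
```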

 

A new wordpress plugin

There is now a new wordpress plugin, for testing against my blogspam site/service.

Now time to talk about something else.

This week my partner's sister & niece are visiting from Helsinki, so we've both got a few days off work, and we'll be acting like tourists.

Otherwise the job of this week is to find a local photographer to shoot the pair of us. I've shot her many, many, many times, and we also have many nice pictures of me but we have practically zero photos of the pair of us.

I spent a lot of time talking to local volunteers & models, because I like to shoot them, but I know only a couple of photographers.

Still a big city, we're bound to find somebody suitable :)

 

CIDR-matching, in node.js

I recently mentioned that there wasn't any built-in node.js functionality to perform IP matching against CIDR ranges.

This surprised me, given that lots of other functionality is available by default.

As a learning experience I've hacked a simple cidr-matching module, and published it as an NPM module.

I've written a few other javascript "libraries", but this is the first time I've published a module. Happy times.

The NPM documentation was pretty easy to follow:

  • Write a package.json file.
  • Run "npm publish".
  • Wait for time to pass, and Thorin to sit down and sing about gold.

Now I can take a rest, and stop talking about blog-spam.

 

The blogspam code is live.

Living dangerously I switched DNS to point to the new codebase on my lunch hour.

I found some problems immediately; but nothing terribly severe. Certainly nothing that didn't wait until I'd finished work to attend to.

I've spent an hour or so documenting the new API this evening, and now I'm just going to keep an eye on things over the next few days.

The code is faster, definitely. The load is significantly lower than it would have been under the old codebase - although it isn't a fair comparison:

  • I'm using redis to store IP-blacklists, which expire after 48 hours. Not the filesystem.
  • The plugins are nice and asynchronous now.
  • I've not yet coded a "Bayesian filter", but looking at the user-supplied options that's the plugin that everybody seems to want to disable. So I'm in no rush.
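The auto-expiry is the part Redis makes effortless - SETEX key 172800 value, and the key removes itself after 48 hours. To illustrate the semantics, the same idea in-process (with the clock passed in explicitly so the expiry is easy to see):

```javascript
// Sketch of the auto-expiring IP-blacklist idea. Redis does this natively
// via SETEX/EXPIRE; here the same behaviour is modelled in-process, with
// timestamps supplied by the caller rather than read from Date.now().
const TTL_SECONDS = 48 * 60 * 60; // 48 hours, as used by the service

function makeCache() {
  const data = new Map(); // key -> { value, expiresAt (ms) }
  return {
    set(key, value, now, ttl = TTL_SECONDS) {
      data.set(key, { value, expiresAt: now + ttl * 1000 });
    },
    get(key, now) {
      const e = data.get(key);
      if (!e || now >= e.expiresAt) {
        data.delete(key); // lazily drop expired entries
        return null;
      }
      return e.value;
    },
  };
}
```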

The old XML-RPC API is still present, but now it just proxies to the JSON-version, which is a cute hack. How long it stays alive is an open question, but at least a year I guess.

God knows what my wordpress developer details are. I suspect it's not worth my updating the wordpress plugin, since nobody ever seemed to love it.

These days the consumers of the API seem to be, in rough order of popularity:

  • Drupal.
  • ikiwiki.
  • Trac.

There are a few toy users, like my own blog, and a few other similar small blogs. All told since lunchtime I've had hits from 189 distinct sources, the majority of which don't identify themselves. (Tempted to not process their requests in the future, but I don't think I can make such a change now without pissing off the world. Oops.)

PS. Those ~200 users? rejected 12,000 spam comments since this afternoon. That's cool, huh?

 

I've always relied upon the kindness of strangers

Many thanks to Vincent Meurisse who solved my node.js callback woe.

Some history of the blogspam service:

Back in 2008 I was annoyed by the many spam-comments that were being submitted to my Debian Administration website. I added some simple anti-spam measures, which reduced the flow, but it was a losing battle.

In the end I decided I should test comments, as the users submitted them, via some kind of external service. The intention being that any improvements to that central service would benefit all users. (So I could move to testing comments on my personal blog too, for example).

Ultimately I registered the domain-name "blogspam.net", and set up a simple service on it which would test comments and judge them to be "SPAM" or "OK".

The current statistics show that this service has stopped 20 million spam comments, since then. (We have to pretend I didn't wipe the counters once or twice.)

I've spent a while now re-implementing most of the old plugins in node.js, and I think I'll be ready to deploy the new service over the weekend. The new service will have to handle two different kinds of requests:

New Requests

These will be submitted via HTTP POSTed JSON data, and will be handled by node.js. These should be nice and fast.

Legacy Requests

These will come in via XML-RPC, and be proxied through the new node.js implementation. Hopefully this will mean existing clients won't even notice the transition.

I've not yet deployed the new code, but it is just a matter of time. Hopefully, being node.js-based and significantly easier to install, update, and tweak, the new code will attract more contributions too. The dependencies are happily very minimal:

  • A redis-server for maintaining state:
    • The number of SPAM/OK comments for each submitting site.
    • An auto-expiring cache of blacklisted IP addresses. (I cache the results of various RBL lookups for 48 hours).
  • node.js

The only significant outstanding issue is that I need to pick a node.js library for performing CIDR lookups - "Does 10.11.12.23 lie within 10.11.12.0/24?" - I'm surprised that functionality isn't available out of the box, but it is the only omission I've found.
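The check itself is only a few lines of bit-twiddling once the dotted quad is converted to a 32-bit integer, which makes the omission all the stranger. A sketch:

```javascript
// Convert a dotted-quad IPv4 address to an unsigned 32-bit integer.
// The >>> 0 forces JavaScript's signed 32-bit result back to unsigned.
function ipToInt(ip) {
  return ip.split(".")
    .reduce((acc, octet) => (acc << 8) + parseInt(octet, 10), 0) >>> 0;
}

// Does `ip` lie within `cidr`? e.g. cidrContains("10.11.12.23", "10.11.12.0/24")
function cidrContains(ip, cidr) {
  const [net, prefixStr] = cidr.split("/");
  const prefix = parseInt(prefixStr, 10);
  // A /0 matches everything; otherwise keep the top `prefix` bits.
  const mask = prefix === 0 ? 0 : (~0 << (32 - prefix)) >>> 0;
  return ((ipToInt(ip) & mask) >>> 0) === ((ipToInt(net) & mask) >>> 0);
}
```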

I've been keeping load & RAM graphs, so it will be interesting to see how the node.js service competes. I expect that if clients were using it, in preference to the XML-RPC version, then I'd get a hell of a lot more throughput, but with it hidden behind the XML-RPC proxy I'm less sure what will happen.

I guess I also need to write documentation for the new/preferred JSON-based API...

https://github.com/skx/blogspam.js

 

node.js is kicking me

Today I started hacking on a re-implementation of my BlogSpam service - which tests that incoming comments are SPAM/HAM - in node.js (blogspam.js)

The current API uses XML::RPC and a perl server, along with a list of plugins, to do the work.

Having had some fun and success with the HTTP+JSON mstore toy I figured I'd have a stab at making BlogSpam more modern:

  • Receive a JSON body via HTTP-POST.
  • Deserialize it.
  • Run the body through a series of Javascript plugins.
  • Return the result back to the caller via HTTP status-code + text.

In theory this is easy, I've hacked up a couple of plugins, and a Perl client to make a submission. But sadly the async-stuff is causing me .. pain.

This is my current status:

shelob ~/git/blogspam.js $ node blogspam.js
Loaded plugin: ./plugins/10-example.js
Loaded plugin: ./plugins/20-ip.js
Loaded plugin: ./plugins/80-sfs.js
Loaded plugin: ./plugins/99-last.js
Received submission: {"body":"This is my body ..","ip":"109.194.111.184","name":"Steve Kemp"}
plugin 10-example.js said next :next
plugin 20-ip.js said next :next
plugin 99-last.js said spam SPAM: Listed in StopForumSpam.com

So we've loaded plugins, and each has been called. But the end result was "SPAM: Listed .." and yet the caller didn't get that result. Instead the caller got this:

shelob ~/git/blogspam.js $ ./client.pl
200 OK 99-last.js

The specific issue is that I iterate over every loaded plugin, and wait for them to complete. Because they complete asynchronously the plugin which should run last, and just return "OK", has executed before the 80-sfs.js plugin (which makes an outgoing HTTP request).

I've looked at async, I've looked at promises, but right now I can't get anything working.
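For what it's worth, the usual callback-era way out is to chain the plugins explicitly rather than looping over them: each plugin's callback decides whether to invoke the next one, so ordering survives async I/O. A sketch - the plugin signature here is assumed, not the real blogspam.js one:

```javascript
// Run plugins strictly in order, even when some complete asynchronously.
// Each plugin is assumed to be (comment, next) => void, where next is
// called with "next" to pass to the following plugin, or with a final
// verdict ("spam"/"ok") plus an optional reason to stop the chain.
function runPlugins(plugins, comment, done) {
  function step(i) {
    if (i >= plugins.length) return done("ok");
    plugins[i](comment, (verdict, reason) => {
      if (verdict === "next") return step(i + 1); // chain, don't loop
      done(verdict, reason);
    });
  }
  step(0);
}
```

With this shape a plugin that fires off an outgoing HTTP request simply calls next("next") from its response handler, and the final catch-all plugin cannot run early.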

Meh.

Surprise me with a pull request ;)

 

This week in brief

This week in brief:

I've rejoined the Debian Security Team

My first (recent) DSA was released earlier today, with assistance from various team members. (My memory of the process was poor, and some things have changed in my absence.)

BlogSpam gains a new user

The BlogSpam API is now available for users of Trac.

Finally, before I go, I've noticed several people on Planet Debian report their photo-challenges; either a picture a day or one a week. I too take pictures, and I'm happy if I get one session a month.

I suspect some of my content might be a bit too racy for publication here. If you're not avoiding friendface-style sites you can follow "highlights" easily enough - or just look at the site.

ObQuote: "Be strong and you will be renewed. Identify. " - Logan's Run (1976)

 

A new Blog::Spam release.

I've just released a new Blog::Spam module to CPAN.

The Blog::Spam module is 99% of the code behind the blog & forum spam detection service.

On the whole I'm pleased with the way that development has gone, and although I'm going to keep tinkering it is essentially "complete".

Now I need to go back and fix bugs in chronicle, asql, and rinse.

ObSubject: "You are just an ordinary man in a cape!" - Batman Begins.

 

As promised a new blogspam.net

A while back I mentioned that I was going to be updating and overhauling the blogspam.net service. That process is now almost complete. A couple of nights ago I overhauled the website, and today I've finally committed my last (planned) change to the repository for the purposes of migration. I started reworking the code a week or so ago, but as of this evening the code in the repository is the code the server is actually running.

The previous codebase was functional but a little hasty - and was implemented before I switched to per-UID server-hosting - so there was a need to clean things up and make sure permissions and similar niggles were checked.

The new, modular, codebase requires no root access, and will store all state (logs & transient caches) in a clean extensible fashion. The code is also much more flexible making use of Module::Pluggable rather than Class::Pluggable. This allowed me to overhaul the API of the plugins (primarily to add an expire method such that each plugin has a well-defined means to expire any state they may maintain). Module::Pluggable is a great module - allows me to treat plugins as first class objects, which wasn't the case with C::P.

Since all the code behind the service is Perl it is also now available on CPAN in addition to the mercurial repository where it is developed..

I see that the server is getting pretty popular these days, used by the likes of embedders.org, publiclive.com, & etc. It doesn't hurt that ikiwiki, identi.ca, and other people include support in their distributions these days. Me? I mostly use it on debian-administration.org where it does a great job.

ObQuote: What's the name of that thing that if I eat it real fast, it's free? - Whip It.

 

You do know there are more guns in the country than there are in the city.

Lenny Backports

After a couple of days I've spotted a few things that don't work so well on Lenny:

gtk-gnutella

gtk-gnutella is a client for a peer-to-peer filesharing system. Unfortunately the version of the client in Lenny dies on startup "This version is too old to connect".

gimp

The graphics program, The Gimp, doesn't show a live preview when carrying out things such as colour desaturation.

Although not an insurmountable problem it is moderately annoying if you do such things often.

So I've placed backported packages online.

I expected to have to backport KVM, and I guess I realised I needed a new kernel to match too. So they're available in the kvm-hosting repository; take the kernel with "birthday" in its name - the other is more minimal and has no USB support, etc.

blog spam

Since I last reset the statistics the blog spam detector has reported, rejected, and refused just over half a million bogus comments.

It can and should do better.

I've been planning on overhauling this for some time; even to the extent of wondering if I can move the XML::RPC service into a C daemon with embedded lua/perl to do the actual analysis.

(Right now the whole service is Perl, but I'm a little suspicious of the XML::RPC library - my daemon dies at times and I don't understand why.)

I'd say "test suggestions welcome", but then I'd have to explain what is already done. If you're curious take a look at the code...

ObSubject: Hot Fuzz

 

Is my personal life of interest to you?

This weekend I mostly fiddled around migrating machines from Xen hosting to KVM hosting. Ultimately it was largely a waste of time, due to various other factors. Still with a bit of luck it will be possible to move the machines next week.

That aside I spent a while updating my blogspam detection site. As a brief recap this site offers a simple XML-RPC service which allows you to test whether incoming blog comments are spam or not.

Originally this was put together to fight an invasion of comments submitted to the Debian Administration website. The site currently shows:

Site                        Spam    Non-Spam    % spam
debian-administration.org   238     372         60.98%

Depressing. But not as depressing as the real live stats which show since I last reset the counters 36,995 spam comments vs. 1,206 non-spam comments. (live updating counters here)

Anyway I updated the service today to add two new plugins, both of which are a little reactionary.

The first new plugin is called "multilink" and is based upon the observation that spammers rarely know the markup of the site they are submitting comments to. This means you can frequently see submitted comments like this:

 <a href="http://spam.com">buy viagra</a>
 [url=http://spam.com]buy viagra[/url]
 [link=http://spam.com]buy me[/link]

Here we have three different styles of links - "a href", "link=", and "url=". I figure this is a clear indicator of a confused mind, or more likely a spammer.
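As a sketch, the whole heuristic is a handful of regexps - count how many distinct link-markup styles appear in the comment, and flag it when more than one does (function name and the "more than one" rule are illustrative of the idea, not necessarily the plugin's exact code):

```javascript
// The "multilink" heuristic: real authors know which markup their target
// site accepts; spammers hedge their bets and use several styles at once.
function multilink(body) {
  const styles = [
    /<a\s+href=/i, // HTML:   <a href="...">
    /\[url=/i,     // BBCode: [url=...]
    /\[link=/i,    // forum:  [link=...]
  ];
  const seen = styles.filter((re) => re.test(body)).length;
  return seen > 1 ? "spam" : "ok";
}
```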

The second new plugin is designed to stop people who enter "<strong>" words. It is a little coarse, but it has had zero false positives in the real world so far, so I'm going to leave it live to see how it works out.

In happier news I'm just back from a trip to the beach. Sand rocks. Even if it wasn't windy enough for my kite ..

ObFilm: Dracula ("Bram Stoker's Dracula" - 1992)

 

You didn't faint. That's always a good sign

Joey has implemented blogspam support for ikiwiki, which you can find here.

He was also good enough to provide valuable & useful feedback so I can present a moderately updated API - most notably with the ability to train comments when/if they are incorrectly marked.

I can imagine several other useful additions, but mostly I think the code should be cleaned up (written in an evening, posted to keep myself honest, didn't really expect anybody to be terribly interested) and then it should just tick over until comment submitters adapt and improve to the extent that my sites suffer again ..

ObFilm: Kissed

 

Some of us are just better at hiding it, that's all.

Since I managed to get my blogspam plugin listed on wordpress.org I've seen a surge of interest.

In the past few hours capture stats are:

Good comments: 1
SPAM comments: 100

Almost exactly 1% of submitted comments are OK, and a SPAM rate of 99%. Either forum and blog spam are even more rife than I'd expected or I am over-zealous!

Anyway that's enough on this topic for a while, if you want to follow the work you know where to look and if you don't my writing about it again will just drive you mad..

ObFilm: The Breakfast Club

 

Named after a hot dog, you poor man, you poor, poor man.

I've updated the blogspam site a little.

Still has a sucky site / layout...

ObFilm: Ghostbusters II

 

When the light is green, the trap is clean.

Recently the Debian Administration website has been besieged by spammers submitting bogus comments.

I also run a couple of blogs which are constantly having random gibberish submitted to them - although it is less of a problem there because I approve them manually due to the offline nature of my blog, and the blog compiler I use.

Regardless I figured it was time to do something about it, in part because Chris Searle suggested that I should be able to query arbitrary text for spammyness given my experience with fighting email spam. (As it turns out that experience is not so useful. Comments and emails are not the same: the header-trickery you see in bogus email, false EHLOs, etc, doesn't carry across.)

Anyway, as I was saying, I was going to do something? So I did. My solution? An XML-RPC server which you submit your comments to. This will return either:

OK - the comment is clean.
SPAM:[reason] - the comment is spam, and the optional reason explains further.

The server itself is very simple, and it comes with a collection of simple plugins which do the job of testing different things. More plugins will almost certainly be added over time - so far I've seen the lotsaurls plugin capturing most of the SPAM, although the DNSRBL lookup is also working nicely having captured live spam without my touching it. Huzzah. & etc..
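The lotsaurls idea is about as simple as a plugin can get - count the URLs and flag comments that contain too many. A sketch, with an illustrative threshold (the real plugin's cut-off may well differ):

```javascript
// Sketch of a lotsaurls-style check: a comment stuffed with links is
// almost certainly spam. The threshold of 4 is illustrative only.
function lotsaurls(body, max = 4) {
  const urls = body.match(/https?:\/\//gi) || []; // count http/https links
  return urls.length > max ? "spam" : "ok";
}
```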

If there is any interest feel free to let me know, although as the server is open, and the client side usage is documented I imagine other people could run with it easily enough too - avoiding me becoming a single point of failure.

For reference my initial change to hook this up live on the site was only a few lines of code - although later I did need to make some additional changes to make it cope with failures, & etc.

ObFilm: Ghostbusters