Steve Kemp's Blog Writings relating to Debian & Free Software

Accidental data-store .. is go!

Thursday, 19 May 2016

A couple of days ago I wrote:

The code is perl-based, because Perl is good, and available here on github:


TODO: Rewrite the thing in #golang to be cool.

I might not be cool, but I did indeed rewrite it in golang. It was quite simple, and a simple benchmark of uploading two million files, balanced across 4 nodes worked perfectly.



Accidental data-store ..

Wednesday, 18 May 2016

A few months back I was looking over a lot of different object-storage systems, giving them mini-reviews, and trying them out in turn.

While many were overly complex, some were simple. Simplicity is always appealing, providing it works.

My review of camlistore was generally positive, because I like the design. Unfortunately it also highlighted a lack of documentation about how to use it to scale, replicate, and rebalance.

How hard could it be to write something similar, but also paying attention to keep it as simple as possible? Well perhaps it was too easy.


First of all we write a blob-storage system. We allow three operations to be carried out:

  • Retrieve a chunk of data, given an ID.
  • Store the given chunk of data, with the specified ID.
  • Return a list of all known IDs.


API Server

We write a second server that consumers actually use, though it is implemented in terms of the blob-storage server listed previously.

The public API is trivial:

  • Upload a new file, returning the ID which it was stored under.
  • Retrieve a previous upload, by ID.
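
A minimal Python sketch of that API layer, built on top of blob-servers, might look like the following. Two things here are assumptions rather than anything stated above: the ID is taken to be the SHA-1 of the content, and the blob-server is chosen from the ID. `Blobs` and `ApiServer` are hypothetical names.

```python
import hashlib


class Blobs:
    """Tiny in-memory blob-server stand-in: store and fetch chunks by ID."""

    def __init__(self):
        self.data = {}

    def put(self, blob_id, chunk):
        self.data[blob_id] = chunk

    def get(self, blob_id):
        return self.data.get(blob_id)


class ApiServer:
    """Public API implemented purely in terms of blob-servers."""

    def __init__(self, blob_servers):
        self.blob_servers = blob_servers

    def upload(self, content):
        # Assumption: the ID is the SHA-1 hex digest of the content.
        file_id = hashlib.sha1(content).hexdigest()
        # Assumption: write to exactly one server, picked from the ID.
        index = int(file_id, 16) % len(self.blob_servers)
        self.blob_servers[index].put(file_id, content)
        return file_id

    def download(self, file_id):
        # Ask each blob-server in turn until one of them has the data.
        for server in self.blob_servers:
            content = server.get(file_id)
            if content is not None:
                return content
        return None


api = ApiServer([Blobs(), Blobs()])
file_id = api.upload(b"some file contents")
print(file_id, api.download(file_id))
```

Deriving the ID from the content means uploading the same file twice gives the same ID, which is one simple way to avoid a central ID allocator.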


Replication Support

The previous two services are sufficient to write an object storage system, but they don't necessarily provide replication. You could add immediate replication; an upload of a file could involve writing that data to N blob-servers, but in a perfect world servers don't crash, so why not replicate in the background? You save time if you only save uploaded-content to one blob-server.

Replication can be implemented purely in terms of the blob-servers:

  • For each blob server, get the list of objects stored on it.
  • Look for that object on each of the other servers. If it is found on N of them we're good.
  • If there are fewer copies than we like, then download the data, and upload to another server.
  • Repeat until each object is stored on a sufficient number of blob-servers.
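
The steps above can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the repository code: each blob-server is modelled as a plain dict mapping ID to data, and `replicate` and `copies` are names made up for the example.

```python
def replicate(servers, copies=2):
    """Copy under-replicated blobs until each lives on `copies` servers.

    `servers` is a list of dicts mapping blob ID -> bytes, each dict
    standing in for one blob-server. Returns the number of copies made.
    """
    made = 0
    for source in servers:
        # Snapshot the IDs, since other servers are modified as we go.
        for blob_id in list(source):
            holders = [s for s in servers if blob_id in s]
            for target in servers:
                if len(holders) >= copies:
                    break
                if blob_id not in target:
                    # "Download" from the source and "upload" to the target.
                    target[blob_id] = source[blob_id]
                    holders.append(target)
                    made += 1
    return made


servers = [{"a": b"1", "b": b"2"}, {"b": b"2"}, {}]
print(replicate(servers, copies=2))
```

If there are fewer servers than the requested number of copies, the sketch simply copies to every server and stops.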


My code is reliable, the implementation is almost painfully simple, and the only difference in my design is that rather than having an API-server which allows both "uploads" and "downloads" I split it into two - that means you can leave your "download" server open to the world, so that it can be useful, and your upload-server can be firewalled to only allow a few hosts to access it.

The code is perl-based, because Perl is good, and available here on github:

TODO: Rewrite the thing in #golang to be cool.



Recycling old ideas ..

Saturday, 9 April 2016

My previous blog post was about fuzzing and finding segfaults in GNU Awk. At the time of this update they still remain unfixed.

Reading about a new release of mutt I've seen a lot of complaints about how it handles HTML mail, by shelling out to lynx or w3m. As I have a vested interest in console based mail-clients I wanted to have a quick check to see how dangerous that could be. After all it wasn't so long ago that I discovered that printing a fingerprint of an SSH key could be dangerous, so the idea of parsing untrusted HTML being risky is something I could see.

In fact back in 2005 I reported that some specific HTML could crash Mozilla's Firefox. Due to some ordering issues my Firefox bug was eventually reported as a duplicate, and although it seemed to qualify for the Mozilla bug-bounty and a CVE assignment I never received any actual cash. Shame. I'd have been more interested in testing the browser if I had a cheque to hang on my wall (and never cash).

Anyway full-circle. Fuzzing the w3m console-based browser resulted in a bunch of segfaults when running this:

 w3m -dump $file.html

Anyway each of the two bugs I reported was fixed in a day or two, and both involved gnarly UTF-8/encoding transformations. Many thanks to Tatsuya Kinoshita for such prompt attention and excellent debugging skills.

And lynx? Still no segfaults. I'll leave the fuzzer running over the weekend and if there are no faults found by Monday I guess I'll move on to links.



If line-noise is a program, all fuzzers are developers

Monday, 29 February 2016

Recently I had a conversation with a programmer who repeated the adage that programming in perl consists of writing line-noise. This isn't true but it reminded me of my love of fuzzers. Fuzzers are often used to generate random input files which are fed to tools, looking for security problems, segfaults, and similar hilarity.

To the untrained eye the output of most fuzzers is essentially line-noise, since you often start with a valid input file and start flipping bits, swapping bytes, and appending garbage.
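
To make that concrete, here is a small Python sketch of the three mutations mentioned: flipping a bit, swapping two bytes, and appending garbage. It is not the fuzzer I used, just a hypothetical `mutate` helper showing the general idea.

```python
import random


def mutate(data, rng):
    """Return a mutated copy of `data` using one randomly chosen mutation."""
    buf = bytearray(data)
    choice = rng.choice(["flip", "swap", "append"])
    if choice == "flip" and buf:
        # Flip a single random bit in a random byte.
        pos = rng.randrange(len(buf))
        buf[pos] ^= 1 << rng.randrange(8)
    elif choice == "swap" and len(buf) >= 2:
        # Swap two bytes at distinct random positions.
        i, j = rng.sample(range(len(buf)), 2)
        buf[i], buf[j] = buf[j], buf[i]
    else:
        # Append between one and eight random garbage bytes.
        buf.extend(rng.randrange(256) for _ in range(rng.randrange(1, 9)))
    return bytes(buf)


rng = random.Random()
print(mutate(b"BEGIN { print 1 }", rng))
```

A real fuzzer would write each mutated input to a file, run the target against it, and keep any input that causes a crash.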

Anyway this made me wonder what would happen if you fed random garbage into a perl interpreter? I wasn't brave enough to try it, because knowing my luck the fuzzer would write a program like so:

system( "rm -rf /home/steve" );

But I figured it was still an interesting idea, and I could have a go at fuzzing something else. I picked gawk, the GNU implementation of awk because the codebase is pretty small, and I understand it reasonably well.

Almost immediately my fuzzer found some interesting segfaults and problems. Here's a nice simple example:

 $ gawk 'for (i = ) in steve kemp rocks'
 gawk: cmd. line:1: fatal error: internal error: segfault

I look forward to seeing what happens when other people fuzz perl..



Redesigning my clustered website

Sunday, 7 February 2016

I'm slowly planning the redesign of the cluster which powers the Debian Administration website.

Currently the design is simple, and looks like this:

In brief there is a load-balancer that handles SSL-termination and then proxies to one of four Apache servers. These talk back and forth to a MySQL database. Nothing too shocking, or unusual.

(In truth there are two database servers, and rather than a single installation of HAProxy it runs upon each of the webservers - One is the master which is handled via ucarp. Logically though traffic routes through HAProxy to a number of Apache instances. I can lose half of the servers and things still keep running.)

When I set up the site it all ran on one host. It was simpler, but it was less highly available. It also struggled to cope with the load.

Half the reason for writing/hosting the site in the first place was to document learning experiences though, so when it came time to make it scale I figured why not learn something and do it neatly? Having it run on cheap and reliable virtual hosts was a good excuse to bump the server-count and the design has been stable for the past few years.

Recently though I've begun planning how it will be deployed in the future and I have a new design:

Rather than having the Apache instances talk to the database I'll indirect through an API-server. The API server will handle requests like these:

  • POST /users/login
    • POST a username/password and return 200 if valid. If bogus details return 403. If the user doesn't exist return 404.
  • GET /users/Steve
    • Return a JSON hash of user-information.
    • Return 404 on invalid user.
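
The status-code rules for those two endpoints could be sketched in Python like this. It only models the decision logic, with no HTTP layer; the `USERS` table, the plain-text password, and the function names are hypothetical, and a real server would store password hashes.

```python
import hmac
import json

# Hypothetical user table for illustration only.
USERS = {"Steve": "s3cret"}


def login_status(username, password):
    """Status for POST /users/login: 404 unknown user, 403 bad password, 200 OK."""
    if username not in USERS:
        return 404
    # Constant-time comparison, to avoid leaking how much of the password matched.
    if hmac.compare_digest(USERS[username].encode(), password.encode()):
        return 200
    return 403


def get_user(username):
    """(status, body) for GET /users/<name>: a JSON hash, or 404 if unknown."""
    if username not in USERS:
        return 404, ""
    return 200, json.dumps({"username": username})


print(login_status("Steve", "s3cret"), get_user("Steve"))
```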

I expect to have four API handler endpoints: /articles, /comments, /users & /weblogs. Again we'll use a floating IP and a HAProxy instance to route to multiple API-servers. Each of which will use local caching to cache articles, etc.

This should turn the middle layer, running on Apache, into simpler things, and increase throughput. I suspect, but haven't confirmed, that making a single HTTP-request to fetch a (formatted) article body will be cheaper than making N-database queries.

Anyway that's what I'm slowly pondering and working on at the moment. I wrote a proof of concept API-server based CMS two years ago, and my recollection of that time is that it was fast to develop, and easy to scale.



Best practice - Don't serve writeable PHP files

Tuesday, 2 February 2016

I deal with compromises of PHP-based websites often enough that I wish to improve hardening.

One obvious way to improve things is to not serve PHP files which are writeable by the webserver-user. This would ensure that things like wp-content/uploads didn't get served as PHP if a compromise wrote valid PHP there.

In the past using php5-suhosin would have allowed this via the suhosin.executor.include.allow_writable_files flag.

Since suhosin is no longer supported under Debian Jessie I wonder if there is a simple way to achieve this?

I've written a toy-module which allows me to call stat on every request, and return a 403 on access to writeable files/directories. But it seems like I shouldn't need to write my own code for this functionality.
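
The check itself is simple enough to sketch in Python. This is not the toy-module, just an illustration of the stat-based idea under assumptions: `writable_by` and `status_for` are hypothetical names, only the classic owner/group/other permission bits are considered (no ACLs), and only the file and its immediate parent directory are examined.

```python
import os
import stat


def writable_by(mode, owner, group, uid, gids):
    """True if permission bits `mode` let user `uid` (in groups `gids`) write."""
    if owner == uid:
        return bool(mode & stat.S_IWUSR)
    if group in gids:
        return bool(mode & stat.S_IWGRP)
    return bool(mode & stat.S_IWOTH)


def status_for(path, uid, gids):
    """Return 403 if `path` or its directory is writable by the given user."""
    for candidate in (path, os.path.dirname(path) or "."):
        st = os.stat(candidate)
        if writable_by(st.st_mode, st.st_uid, st.st_gid, uid, gids):
            return 403
    return 200
```

Here `uid` and `gids` would be those of the webserver user, so a file dropped into a directory such as wp-content/uploads would be refused because the directory is writeable.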

Any pointers welcome; happy to post my code if that is useful but suspect not - it just shouldn't exist.



So life in Finland goes on

Wednesday, 20 January 2016

So after living here in Finland for 6 months I've now bought a flat.

We have a few days to sort out mortgage paperwork, and assuming there are no problems we'll be moving into the new place on/around the 1st of March.

Finally I'll be living in Finland, with a sauna of my very own.

Interesting times.

In more developer-friendly news I made a new release of Lumail with the integrated support for IMAP. Let us hope people like it.



Lumail has IMAP .. almost

Saturday, 16 January 2016

A couple of years ago I was dissatisfied with mutt, mostly because the mutt-sidebar patch was dropped from the Debian package. That led to me thinking "How hard can it be to write a modal, console-based mail-client?"

It turns out writing a client is pretty simple if you limit yourself solely to Maildirs, and as I typically read my mail over SSH on the mailhost itself that suited me pretty well.

Recently I restarted the mail-client, putting it together from scratch to simplify the implementation, and unify a lot of the adhoc scripting which is provided by Lua. People seem to like the client, but the single largest complaint was "Can't use it - no IMAP."

This week I've mostly been adding IMAP support, and today I'll commit the last few bits that mean it is roughly-functional:

  • Connecting to a mail-server works.
  • Getting the folders works.
  • Getting the messages works.

The outstanding niggles will be relating to getting/setting the new/read/seen/unseen flags, and similar. But I'm pleased that the job wasn't insurmountable.

I've used libcurl to provide the IMAP functionality because most of the IMAP libraries I looked at were big, scary, and complex. Using curl to access IMAP is pretty neat, simple, and straightforward. The downside is you're making a lot of "http" requests. So I might need to revisit things.

Happily my imap wrapper doesn't need much functionality. So if I can find a better library swapping it out will be simple.

In conclusion: Lumail almost has IMAP support, and that might mean it'll be more useful to others.



Restoring my system .. worked

Saturday, 2 January 2016

A while back I wrote about some issues with converting a two-disk RAID system to a one-disk system, but just to recap:

  • We knew we were moving to Finland.
  • The shared/main computer we used in the UK was old and slow.
  • A new computer in Finland would be more expensive than it should be.
  • Equally transporting a big computer from the UK would also be silly.

In the end we bought a small form-factor PC, with only a single drive and I moved one of the two drives from the old machine into it. Then converted it to run happily with only a single drive, and not email every day to say "device missing".

So there things stood, we had a desktop with a single drive, and I ensured that I took a full daily backup via attic.

Over Christmas the two-year old drive failed. To the extent I couldn't even get it to be recognized by the BIOS, and thus couldn't pull data off it. Time to test my backups in anger! I bought a new drive, installed a minimal installation of the Jessie release of Debian onto the system, and then ran:

 cd /
 .. restore latest backup ..

Two days later I'd pulled 1.3TB over the network, and once I fixed up grub, /etc/fstab, and a couple of niggles it all just worked. Rebooted to make sure the temporary.home hostname, etc, was all gone and life was good.

Restored backup! No errors! No data-loss! Perfect!

The backup-script I use every day was very very good at making sure nothing was missed:

attic create --stats --checkpoint-interval=7200 attic@${remote}:/attic/storage::${host}-$(date +%Y-%m-%d-%H) \
  --exclude=/proc      \
  --exclude=/sys       \
  --exclude=/run       \
  --exclude=/dev       \
  --exclude=/tmp       \
  --exclude=/var/tmp   \
  --exclude=/var/log   \

In other news I published my module for controlling the new smart lights I've bought.



I joined the internet of things.

Wednesday, 30 December 2015

In my old flat I had a couple of simple radio-controlled switches, which allowed me to toggle power to a pair of standing lamps - one at each side of the bed. This was very lazy, but also really handy and I've always been curious about automation..

When it comes to automation there seem to be three main flavours:


X10

The original standard, with stuff produced by many vendors and good Linux support.

X10 supports two ways of sending/receiving commands - over the electrical wiring, and over RF.


Z-Wave

This is the newcomer, which despite that seems to be well-supported and extensible. It allows "measurements" to be sent/received in addition to the broadcast of events like "switch on", and "switch off".

Other systems - often lighting-centric

There are toy-things like the previously noted power-controlling things, and there are also stand-alone devices from people like Philips with their Philips Hue system. Given how Philips recently crippled their devices to disable third-party bulbs I've no desire to use them.

One company caught my eye though, Osram make a smart lightbulb and mini-hub to work with it.

So I bought one of the Osram Lightify systems, consisting of a magic box and a pair of lightbulbs. The box connects to your wifi, and gets an IP address. The IP address is then used by the application on your mobile phone (i.e. the magic box does the magic, not the bulbs). The phone application can be used to trigger "on", "off", "dim", "brighter", and the various colour-changing commands, as you would expect.

You absolutely must use the phone-based application to do the setup, but after that the whole point was that I could automate things. I wanted to be able to set up my desktop computer to schedule events, and started hacking.

I've written a simple Perl module to let me discover bulbs, and turn them off and on. No doubt it'll be on CPAN in the near future, once I can pick a suitable name for it:

$ ol --bridge= --list
hall       MAC:8418260000d9c70c RGBW:255,255,255,255 STATE:On
kitchen    MAC:8418260000cb433b RGBW:255,255,255,255 STATE:On

$ ol --bridge= --off=kitchen

$ ol --bridge= --list
hall       MAC:8418260000d9c70c RGBW:255,255,255,255 STATE:On
kitchen    MAC:8418260000cb433b RGBW:255,255,255,255 STATE:Off

The only niggle was the fiddly pairing, and the lack of any decent documentation. The code I wrote was loosely based on the python project python-lightify written by Mikael Magnusson. Also worth noting that the bridge/magic-box only exposes a single port so you can find the device on your VLAN by nmapping for port 4000:

$ nmap -v -p 4000
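
The same check can be sketched in Python without nmap, by simply trying to open a TCP connection to port 4000 on each address. This is an illustration only: `port_open` and `find_bridges` are made-up names, and an open port 4000 does not prove the host is a Lightify bridge.

```python
import ipaddress
import socket


def port_open(host, port=4000, timeout=0.5):
    """True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat all of these as closed.
        return False


def find_bridges(cidr, port=4000):
    """Return every host address in `cidr` that accepts connections on `port`."""
    return [str(ip) for ip in ipaddress.ip_network(cidr).hosts()
            if port_open(str(ip), port)]


# Example call; the network range is a placeholder, not taken from my setup:
# find_bridges("192.168.1.0/24")
```

Scanning addresses one after another like this is slow on a large range, since each silent host costs a full timeout.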

The device doesn't seem to allow any network setup at all - it only uses DHCP. So you might want to make sure it gets assigned a stable IP.

Anyway I'm going to bed. When I do so I'll turn the lights off with my mobile phone. Neat.

In the future I will look at more complex automation, and I think Z-wave is the way I'll go. Right now I'm in a rented flat so replacing wall-switches, etc, is something I can't do. But the systems I've looked at seem neat, and this current setup will keep me amused for several months!


