
Is lumail a stepping stone?

I'm pondering a rewrite of my console-based mail-client.

While it is "popular" it is not popular.

I suspect "console-based" is the killer.

I like console, and I ssh to a remote server to use it, but having different front-ends would be neat.

In the world of mailpipe, etc, is there room for a graphic console client? Possibly.

The limiting factor would be the lack of POP3/IMAP.

Reworking things such that there is a daemon to which a GUI, or a console client, could connect seems simple. The hard part would obviously be working the IPC and writing the GUI. Any toolkit selected would rule out 40% of the audience.

In other news I'm stalling on replying to emails. Irony.

 

Putting the finishing touches to a nodejs library

For the past few years I've been running a simple service to block blog/comment-spam, which is (currently) implemented as a simple JSON API over HTTP, with a minimal core and all the logic in a series of plugins.

One obvious thing I wasn't doing until today was paying attention to the anchor-text used in hyperlinks, for example:

  <a href="http://fdsf.example.com/">buy viagra</a>

Blocking on the anchor-text is less prone to false positives than blocking on keywords in the comment/message bodies.

Unfortunately there seem to be no simple nodejs modules for extracting all the links, and associated anchors, from an arbitrary JavaScript string. So I had to write such a module, but .. given how small it is there seems little point in sharing it. I guess this is one of the reasons why there are often large gaps in the module ecosystem.

(Equally some modules are essentially applications; great that the authors shared, but virtually unusable, unless you 100% match their problem domain.)

I've written about this before when I had to construct, and publish, my own cidr-matching module.

Anyway expect an upload soon, currently I "parse" HTML and BBCode. Possibly markdown to follow, since I have an interest in markdown.
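
As an illustration of the idea only (this is not the module described above), here is a minimal regex-based sketch that pulls link targets and anchor text out of HTML and BBCode. The function name `extractLinks` and the returned object shape are assumptions made for the example, and a regex will miss plenty of malformed markup that a real parser would handle.

```javascript
// Hypothetical helper: a regex-based sketch, not a full HTML/BBCode parser.
function extractLinks(text) {
  const links = [];

  // HTML: <a ... href="URL" ...>anchor</a>
  const html = /<a\b[^>]*\bhref\s*=\s*["']([^"']*)["'][^>]*>([\s\S]*?)<\/a>/gi;

  // BBCode: [url=URL]anchor[/url]
  const bbcode = /\[url=([^\]]+)\]([\s\S]*?)\[\/url\]/gi;

  for (const re of [html, bbcode]) {
    let match;
    while ((match = re.exec(text)) !== null) {
      links.push({
        href: match[1],
        // Drop any nested tags so only the visible anchor text remains.
        anchor: match[2].replace(/<[^>]*>/g, "").trim(),
      });
    }
  }
  return links;
}
```

A spam plugin could then test each `anchor` against a keyword list rather than scanning the whole comment body.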

 

A small assortment of content

Today I took down my KVM-host machine, rebooting it and restarting all of my guests. It had been a while since I'd done so and I was a little nervous; as it turned out this nervousness was prophetic.

I'd forgotten to hardwire the use of proxy_arp so my guests were all broken when the systems came back online.

If you're curious this is what my incoming graph of email SPAM looks like:

I think it is obvious where the downtime occurred, right?

In other news I'm awaiting news from the system administration job I applied for here in Edinburgh, if that doesn't work out I'll need to hunt for another position..

Finally I've started hacking on my console based mail-client some more. It is a modal client which means you're always in one of three states/modes:

  • maildir - Viewing a list of maildir folders.
  • index - Viewing a list of messages.
  • message - Viewing a single message.

As a result of a lot of hacking there is now a fourth mode/state, "text-mode", which allows you to view arbitrary text: for example scrolling up and down a file on-disk, reading the manual, or viewing messages in interesting ways.

Support is still basic at the moment, but both of these work:

  --
  -- Show a single file
  --
  show_file_contents( "/etc/passwd" )
  global_mode( "text" )

Or:

function x()
   txt = { "${colour:red}Steve",
           "${colour:blue}Kemp",
           "${bold}Has",
           "${underline}Definitely",
           "Made this work" }
   show_text( txt )
   global_mode( "text")
end

x()

There will be a new release within the week, I guess, I just need to wire up a few more primitives, write more of a manual, and close some more bugs.

Happy Thursday, or as we say in this house, Hyvää torstai!

 

So that distribution I'm not-building?

The other week I was toying with using GNU stow to build an NFS-share, which would allow remote machines to boot from it.

It worked. It worked well. (Standard stuff, PXE booting with an NFS-root.)

Then I started wondering about distributions, since in one sense what I'd built was a minimal distribution.

On that basis yesterday I started hacking something more minimal:

  • I compiled a monolithic GNU/Linux kernel.
  • I created a minimal initrd image, using busybox.
  • I built a static version of the tcc compiler.
  • I got the thing booting, via KVM.

Unfortunately here is where I ran out of patience. Using tcc and the static C library I can compile code. But I can't link it.

$ cat > t.c <<EOF
int main ( int argc, char *argv[] )
{
        printf("OK\n" );
        return 1;
}
EOF
$ /opt/tcc/bin/tcc t.c
tcc: error: file 'crt1.o' not found
tcc: error: file 'crti.o' not found
..

Attempting to fix this up resulted in nothing much better:

$ /opt/tcc/bin/tcc t.c -I/opt/musl/include -L/opt/musl/lib/

And because I don't have a full system I cannot compile t.c to t.o and use ld to link (because I have no ld.)

I had a brief flirt with the portable c-compiler, pcc, but didn't get any further with that.

I suspect the real solution here is to install gcc onto my host system, with something like --prefix=/opt/gcc, and then rsync that into my (suddenly huge) initramfs image. Then I have all the toys.

 

Tagging images, and maintaining collections?

I'm an amateur photographer, although these days I tend to drop the amateur prefix, given that I shoot people for cash at least once a month.

(It isn't my main job, and I'd never actually want it to be, because I'm certain I'd become unhappy hustling for jobs and doing the promotion thing.)

Anyway over the years I've built up a large library of images, mostly organized in a hierarchy of directories beneath ~/Images.

Unlike most photographers I don't use aperture, lighttable, or any similar library management. I shoot my images in RAW, convert to JPG via rawtherapee, and keep both versions of the images.

In short I don't want to mix the "library management" functions with the "RAW conversion" because I do regard them as two separate steps. That said I'm reaching a point where I do want to start tagging images, and finding them more quickly.

In the past I wrote a couple of simple tools to inject tags into the EXIF data of images, and then indexed them. But that didn't work so well in practice. I'm starting to think instead I should index images into sqlite:

  • Size.
  • Date.
  • Content hash.
  • Tags.
  • Path.

The downside is that this breaks utterly as soon as you move images around on-disk. Which is something my previous exif-manipulation was designed to avoid.

Anyway I'm thinking at the moment, but I know that the existing tools such as F-Spot, shotwell, DigiKam, and similar aren't suitable. So I either need to go standalone and use EXIF tags, accepting the fact that the tags I enter won't be visible to other tools, or I cope with the file-rename issues by attempting to update an existing sqlite database via hash/size/etc.
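
To make the "cope with renames via the hash" option concrete, here is a small sketch under stated assumptions: an in-memory `Map` stands in for the sqlite table, image contents are passed in as a `Buffer`, and the names `contentHash`, `createIndex` and `indexImage` are invented for the example. Keying on the content hash means a moved file updates the stored path instead of creating a second record.

```javascript
const crypto = require("crypto");

// The content hash identifies an image regardless of where it lives on disk.
function contentHash(buffer) {
  return crypto.createHash("sha256").update(buffer).digest("hex");
}

// In-memory stand-in for the sqlite table: hash -> record.
function createIndex() {
  return new Map();
}

// Add an image, or update the stored path if the same content was seen before.
function indexImage(index, path, buffer, tags = [], date = null) {
  const hash = contentHash(buffer);
  const existing = index.get(hash);
  if (existing) {
    existing.path = path; // same bytes at a new location: treat as a move
    return existing;
  }
  const record = { path, size: buffer.length, date, hash, tags: [...tags] };
  index.set(hash, record);
  return record;
}
```

Re-scanning ~/Images after a reorganisation would then keep the tags attached to each image, at the cost of hashing every file again.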

 

Some things on DNS and caching

Although there weren't too many comments on my "what would you pay for?" post I did get some mails.

I was reminded about this via Mario Lang's post, which echoed a couple of private mails I received.

Despite being something that I take for granted, perhaps because my hosting comes from Bytemark, people do seem willing to pay money for DNS hosting.

Which is odd. I mean you could do it very very very cheaply if you had just four virtual machines. You can get complex and be geo-fancy, and you could use anycast on a small AS, but really? You could just deploy four virtual machines to provide a.ns, b.ns, c.ns, d.ns, and be better than 90% of DNS hosters out there.

The thing that many people mentioned was Git-backed, or Git-based DNS. Which would be trivial if you used tinydns, and not much harder if you used bind.

I suspect I'm "not allowed" to do DNS-things for a while, due to my contract with Dyn, but it might be worth checking...

ObRandom: Beat me to it. Register gitdns.io, or similar, and configure hooks from github to compile tinydns records.
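
As a rough sketch of what such a hook might do, the function below turns a list of records from a repository into lines in the tinydns-data format (`+fqdn:ip:ttl` for an A record, `Cfqdn:target:ttl` for a CNAME). The `{ type, name, value, ttl }` record shape and the name `toTinydnsData` are assumptions made for the example; only two record types are handled.

```javascript
// Hypothetical record shape: { type, name, value, ttl }.
function toTinydnsData(records) {
  const lines = records.map(({ type, name, value, ttl = 300 }) => {
    switch (type) {
      case "A":
        return `+${name}:${value}:${ttl}`; // A record, no PTR
      case "CNAME":
        return `C${name}:${value}:${ttl}`;
      default:
        throw new Error(`unsupported record type: ${type}`);
    }
  });
  return lines.join("\n") + "\n";
}
```

The hook would write the result to tinydns's `data` file and then run `tinydns-data` to compile it into `data.cdb`.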

In other news I started documenting early thoughts about my caching reverse proxy, which has now got a name: stockpile.

I wrote some stub code using node.js, and although it was functional it soon became callback hell:

  • Is this resource cachable?
  • Does this thing exist in the cache already?
  • Should we return the server's response to the client, archive to memcached, or do both?
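
For comparison, here is how those three questions could be answered in one flat function with async/await instead of nested callbacks. This is a sketch and not the stub code mentioned above: a `Map` stands in for memcached, and `handle`, `isCachable` and `fetchFromBackend` are hypothetical names.

```javascript
// Hypothetical names throughout; `cache` is a Map standing in for memcached.
async function handle(request, { cache, isCachable, fetchFromBackend }) {
  // 1. Is this resource cachable?
  const cachable = isCachable(request);

  // 2. Does it exist in the cache already?
  if (cachable && cache.has(request.url)) {
    return { body: cache.get(request.url), from: "cache" };
  }

  // 3. Fetch from the backend, archive if allowed, and return the response.
  const body = await fetchFromBackend(request);
  if (cachable) {
    cache.set(request.url, body);
  }
  return { body, from: "backend" };
}
```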

Expressing the rules neatly is also a challenge. I want the server core to be simple and the configuration to be something like:

is_cachable ( vhost, source, request, backend )
{
    /**
     * If the file is static, then it is cachable.
     */
    if ( request.url.match( /\.(jpg|png|txt|html?|gif)$/i ) ) {
        return true;
    }

    /**
     * If there is a cookie then the answer is false.
     */
    if ( request.has_cookie? ) { return false ; }

    /**
     * If the server is alive we'll now pass the remainder through it
     * if not then we'll serve from the cache.
     */
    if ( backend.alive? ) {
        return false;
    }
    else {
        return true;
    }
}
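
The same rules written as plain, runnable JavaScript might look like the sketch below. The shapes of `request` (`{ url, headers }`) and `backend` (`{ alive }`) are assumptions for the example; the rules and their order are taken from the pseudocode above.

```javascript
// Assumed shapes: request = { url, headers }, backend = { alive }.
function isCachable(vhost, source, request, backend) {
  // If the file is static, then it is cachable.
  if (/\.(jpg|png|txt|html?|gif)$/i.test(request.url)) {
    return true;
  }

  // If there is a cookie then the answer is false.
  if (request.headers && request.headers.cookie) {
    return false;
  }

  // If the backend is alive pass the request through to it,
  // otherwise serve from the cache.
  return !backend.alive;
}
```

Note that a URL with a query string, such as /logo.png?v=2, would not match the static-file rule as written.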

I can see there is value in judging the cachability based on the server response, but I plan to ignore that (except for "Expires:", "Etag", etc.).

Anyway callback hell does make me want to reexamine the existing C/C++ libraries out there. Because I think I could do better.

 

A diversion on off-site storage

Yesterday I took a diversion from thinking about my upcoming cache project, largely because I took some pictures inside my house, and realized my offsite backup was getting full.

I have three levels of backups:

  • Home stuff on my desktop is replicated to my wife's desktop, and vice-versa.
  • A simple server running rsync (content-free http://rsync.io/).
  • A "peering" arrangement of a small group of friends. Each of us makes available a small amount of space and we copy to-from each others shares, via rsync / scp as appropriate.

Unfortunately my rsync-based personal server is getting a little too full, and will certainly be full by next year. S3 is pricey, and I don't trust the "unlimited" storage people (Backblaze, etc.) to be sustainable and reliable long-term.

The pricing on Google-drive seems appealing, but I guess I'm loath to share more data with Google. Perhaps I could dedicate a single "backup.account@gmail.com" login to that, separate from all-else.

So the diversion came along when I looked for Amazon S3-compatible, self-hosted, servers. There are a few, most of them are PHP-based, or similarly icky.

So far cloudfoundry's vblob looks the most interesting, but the main project seems stalled/dead. Sadly using s3cmd to upload files failed, but certainly the `curl` based API works as expected.

I looked at Gluster, Ceph, and similar, but haven't yet come up with a decent plan for handling offsite storage, and I know I have only six months or so before the need becomes pressing. I imagine the plan has to be using N-small servers with local storage, rather than 1-Large server, purely because pricing is going to be better that way.

Decisions decisions.

 

New GPG-key

I've now generated a new GPG-key for myself:

$ gpg --fingerprint 229A4066
pub   4096R/0C626242 2014-03-24
      Key fingerprint = D516 C42B 1D0E 3F85 4CAB  9723 1909 D408 0C62 6242
uid                  Steve Kemp (Edinburgh, Scotland) <steve@steve.org.uk>
sub   4096R/229A4066 2014-03-24

The key can be found online via mit.edu : 0x1909D4080C626242

This has been signed with my old key:

pub   1024D/CD4C0D9D 2002-05-29
      Key fingerprint = DB1F F3FB 1D08 FC01 ED22  2243 C0CF C6B3 CD4C 0D9D
uid                  Steve Kemp <steve@steve.org.uk>
sub   2048g/AC995563 2002-05-29

If there is anybody who has signed my old key who wishes to sign my new one then please feel free to get in touch to arrange it.

 

So I failed at writing some clustered code in Perl

Until this time next month I'll be posting code-based discussions only.

Recently I've been wanting to explore creating clustered services, because clusters are definitely things I use professionally.

My initial attempt was to write an auto-clustering version of memcached, because that's a useful tool. Writing the core of the service took an hour or so:

  • Simple KeyVal.pm implementation.
  • Give it the obvious methods get, set, delete.
  • Make it more interesting by creating an append-only log.
  • The logfile will be replayed for clustering.

At the point I was done the following code worked:

use KeyVal;

# Create an object, and set some values
my $obj = KeyVal->new( logfile => "/tmp/foo.log" );
$obj->incr( "steve" );
$obj->incr( "steve" );

print $obj->get( "steve" ); # prints 2.

# Now replay the append-only log
my $replay = KeyVal->new( logfile => "/tmp/foo.log" );
$replay->replay();

print $replay->get( "steve" ); # prints 2.

In the first case we used the primitives to increment a value twice, and then fetch it. In the second case we used the logfile the first object created to replay all prior transactions, then output the value.
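
For readers who want to see the shape of such a store, here is a minimal sketch in JavaScript rather than Perl. It is not KeyVal.pm: the JSON-lines log format, the `apply` helper and the `since` argument to `replay` are all assumptions made for the example. Every mutation is appended to the logfile, and a fresh object can rebuild its state by replaying that file.

```javascript
const fs = require("fs");

class KeyVal {
  constructor({ logfile }) {
    this.logfile = logfile;
    this.data = new Map();
  }

  // Apply one operation in memory, optionally appending it to the log.
  apply(op, record = true) {
    if (op.type === "set") {
      this.data.set(op.key, op.value);
    } else if (op.type === "incr") {
      this.data.set(op.key, (this.data.get(op.key) || 0) + 1);
    } else if (op.type === "delete") {
      this.data.delete(op.key);
    } else {
      throw new Error(`unknown operation: ${op.type}`);
    }
    if (record) {
      const entry = JSON.stringify({ time: Date.now(), ...op });
      fs.appendFileSync(this.logfile, entry + "\n");
    }
  }

  set(key, value) { this.apply({ type: "set", key, value }); }
  incr(key) { this.apply({ type: "incr", key }); }
  delete(key) { this.apply({ type: "delete", key }); }
  get(key) { return this.data.get(key); }

  // Replay logged operations made at or after `since` (milliseconds).
  // Intended for a fresh object; replayed operations are not logged again.
  replay(since = 0) {
    if (!fs.existsSync(this.logfile)) return;
    const lines = fs.readFileSync(this.logfile, "utf8").split("\n");
    for (const line of lines) {
      if (!line) continue;
      const op = JSON.parse(line);
      if (op.time >= since) this.apply(op, false);
    }
  }
}
```

The `since` argument is the hook a peer would use to answer "do you have updates made since $time?".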

Neat. The next step was to make it work over a network. Trivial.

Finally I wanted to autodetect peers, and deploy replication. Each host would send out regular messages along the lines of "Do you have updates made since $time?". Any that did would replay the logfile from the given unixtime offset.

However here I ran into problems. Peer discovery was supposed to be basic, and I figured I'd write something that did leader election by magic. Unfortunately Perl's threading code is .. unpleasant:

  • I wanted to store all known-peers in a singleton.
  • Then I wanted to create threads that would announce and receive updates.

This failed. Majorly. Because you cannot launch the implementation of a class-method as a thread. Equally you cannot make a variable which is "complex" shared across threads.

I wrote some demo code which works without packages and a shared singleton:

The Ruby version, by contrast, is much more OO and neater. Meh.

I've now shelved the project.

My next, big, task was to make the network service utterly memcached compatible. That would have been fiddly, but not impossible. Right now I just use a simple line-based network protocol.

I suspect I could have got what I wanted using EventMachine, or similar, but that's a path I've not yet explored, and I'm happy enough with that decision.

 

That is it, I'm going to do it

That's it, I'm going to do it: I have now committed myself to writing a scalable, caching, reverse HTTP proxy.

The biggest question right now is implementation language; obviously "threading" of some kind is required so it is a choice between Perl's AnyEvent, Python's Twisted, Ruby's EventMachine, or node.js.

I'm absolutely, definitely, not going to use C, or C++.

Writing a reverse proxy in node.js is almost trivial; the hard part will be working out in which language to express the caching behaviour, on a per-type and per-resource basis.
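
To show roughly how little code the non-caching part needs, here is a sketch of a pass-through reverse proxy using only node's built-in `http` module. The helper name `createProxy` and the single fixed backend are assumptions for the example; there is no caching, virtual-host handling or header rewriting here.

```javascript
const http = require("http");

// Hypothetical helper: forwards every request to one fixed backend.
function createProxy(backendHost, backendPort) {
  return http.createServer((clientReq, clientRes) => {
    const upstream = http.request(
      {
        host: backendHost,
        port: backendPort,
        method: clientReq.method,
        path: clientReq.url,
        headers: clientReq.headers,
      },
      (backendRes) => {
        // Relay status, headers and body back to the client.
        clientRes.writeHead(backendRes.statusCode, backendRes.headers);
        backendRes.pipe(clientRes);
      }
    );

    // If the backend cannot be reached, answer with 502.
    upstream.on("error", () => {
      if (!clientRes.headersSent) {
        clientRes.writeHead(502);
      }
      clientRes.end("Bad gateway");
    });

    // Stream the request body (if any) to the backend.
    clientReq.pipe(upstream);
  });
}
```

Calling `createProxy("127.0.0.1", 8080).listen(3000)` would forward requests arriving on port 3000 to a backend on port 8080; both ports are examples only.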

I will ponder.