Steve Kemp's Blog Writings relating to Debian & Free Software

On de-duplicating uploaded file-content.

Thursday, 7 May 2015

This evening I've been mostly playing with removing duplicate content. I've had this idea for the past few days about object-storage, and obviously in that context if you can handle duplicate content cleanly that's a big win.

The naive implementation of object-storage involves splitting uploaded files into chunks, storing them separately, and writing database-entries such that you can reassemble the appropriate chunks when the object is retrieved.

If you store chunks on-disk, by the hash of their contents, then things are nice and simple.

The end result is that you might upload the file /etc/passwd, split that into four-byte chunks, and then hash each chunk using SHA256.

This leaves you with some database-entries, and a bunch of files on-disk:

/tmp/hashed/ef267892ee080862c96a8d2d05de62f48e20f0875f27379e7d58c73ea4455bf1
/tmp/hashed/a378977155fb42bb006496321cbe31f74cbda803c3f6ca590f30e76d1afad921
..
/tmp/hashed/3805b0245bc8375be7125ae228eef711552ac082ffb9bf8756e2964a2393a9de

In my toy-code I wrote out the data in 4-byte chunks, which is grossly ineffeciant. But the value of using such small pieces is that there is liable to be a lot of collisions, and that means we save-space. It is a trade-off.

So the main thing I was experimenting with was the size of the chunks. If you make them too small you lose I/O due to the overhead of writing out so many small files, but you gain because collisions are common.

The rough testing I did involved using chunks of 16, 32, 128, 255, 512, 1024, 2048, and 4096 bytes. As sizes went up the overhead shrank, but also so did the collisions.

Unless you could handle the case of users uploading a lot of files like /bin/ls which are going to collide 100% of the time with prior uploads using larger chunks just didn't win as much as I thought they would.

I wrote a toy server using Sinatra & Ruby, which handles the splitting/hashing/and stored block-IDs in SQLite. It's not so novel given that it took only an hour or so to write.

The downside of my approach is also immediately apparent. All the data must live on a single machine - so that reassmbly works in the simple fashion. That's possible, even with lots of content if you use GlusterFS, or similar, but it's probably not a great approach in general. If you have large capacity storage avilable locally then this might would well enough for storing backups, etc, but .. yeah.

| 7 comments.

 

A weekend of migrations

Monday, 4 May 2015

This weekend has been all about migrations:

Host Migrations

I've migrated several more systems to the Jessie release of Debian GNU/Linux. No major surprises, and now I'm in a good state.

I have 18 hosts, and now 16 of them are running Jessie. One of them I won't touch for a while, and the other is a KVM-host which runs about 8 guests - so I won't upgraded that for a while (because I want to schedule the shutdown of the guests for the host-reboot).

Password Migrations

I've started migrating my passwords to pass, which is a simple shell wrapper around GPG. I generated a new password-managing key, and started migrating the passwords.

I dislike that account-names are stored in plaintext, but that seems known and unlikely to be fixed.

I've "solved" the problem by dividing all my accounts into "Those that I wish to disclose post-death" (i.e. "banking", "amazon", "facebook", etc, etc), and those that are "never to be shared". The former are migrating, the latter are not.

(Yeah I'm thinking about estates at the moment, near-death things have that effect!)

| No comments

 

Validating puppet manifests via git hooks.

Monday, 27 April 2015

It looks like I'll be spending a lot of time working with puppet over the coming weeks.

I've setup some toy deployments on virtual machines, and have converted several of my own hosts to using it, rather than my own slaughter system.

When it comes to puppet some things are good, and some things are bad, as exected, and as any similar tool (even my own). At the moment I'm just aiming for consistency and making sure I can control all the systems - BSD, Debian GNU/Linux, Ubuntu, Microsoft Windows, etc.

Little changes are making me happy though - rather than using a local git pre-commit hook to validate puppet manifests I'm now doing that checking on the server-side via a git pre-receive hook.

Doing it on the server-side means that I can never forget to add the local hook and future-colleagues can similarly never make this mistake, and commit malformed puppetry.

It is almost a shame there isn't a decent collection of example git-hooks, for doing things like this puppet-validation. Maybe there is and I've missed it.

It only crossed my mind because I've had to write several of these recently - a hook to rebuild a static website when the repository has a new markdown file pushed to it, a hook to validate syntax when pushes are attempted, and another hook to deny updates if the C-code fails to compile.

| 3 comments.

 

skx-www upgraded to jessie

Saturday, 18 April 2015

Today I upgraded my main web-host to the Jessie release of Debian GNU/Linux.

I performed the upgraded by changing wheezy to jessie in the sources.list file, then ran:

apt-get update
apt-get dist-upgrade

For some reason this didn't upgrade my kernel, which remained the 3.2.x version. That failed to boot, due to some udev/systemd issues (lots of "waiting for job: udev /dev/vda", etc, etc). To fix this I logged into my KVM-host, chrooted into the disk image (which I mounted via the use of kpartx), and installed the 3.16.x kernel, before rebooting into that.

All my websites seemed to be OK, but I made some changes regardless. (This was mostly for "neatness", using Debian packages instead of gems, and installing the attic package rather than keeping the source-install I'd made to /opt/attic.)

The only surprise was the significant upgrade of the Net::DNS perl-module. Nothing that a few minutes work didn't fix.

Now that I've upgraded the SSL-issue I had with redirections is no longer present. So it was a worthwhile thing to do.

| No comments

 

Subject - Verb Agreement

Tuesday, 14 April 2015

There's pretty much no way that I can describe the act of cutting a live, 240V mains-voltage, wire in half with a pair of scissors which doesn't make me look like an idiot.

Yet yesterday evening that is exactly what I did.

There were mitigating circumstances, but trying to explain them would make little sense unless you could see the scene.

In conclusion: I'm alive, although I almost wasn't.

My scissors? They have a hole in them.

| 5 comments.

 

Some things get moved, some things get doubled in size.

Saturday, 11 April 2015

Relocation

We're about three months away from relocating from Edinburgh to Newcastle and some of the immediate panic has worn off.

We've sold our sofa, our spare sofa, etc, etc. We've bought a used dining-table, chairs, and a small sofa, etc. We need to populate the second-bedroom as an actual bedroom, do some painting, & etc, but things are slowly getting done.

I've registered myself as a landlord with the city council, so that I can rent the flat out without getting into trouble, and I'm in the process of discussing the income possabilities with a couple of agencies.

We're still unsure of precisely which hospital, from the many choices, in Newcastle my wife will be stationed at. That's frustrating because she could be in the city proper, or outside it. So we need to know before we can find a place to rent there.

Anyway moving? It'll be annoying, but we're making progress. Plus, how hard can it be?

VLAN Expansion

I previously had a /28 assigned for my own use, now I've doubled that to a /27 which gives me the ability to create more virtual machines and run some SSL on some websites.

Using SNI I've actually got the ability to run SSL almost all sites. So I configured myself as a CA and generated a bunch of certificates for myself. (Annoyingly few tutorials on running a CA mentioned SNI so it took a few attempts to get the SAN working. But once I got the hang of it it was simple enough.)

So if you have my certificate authority file installed you can browse many, many of my interesting websites over SSL.

SSL

I run a number of servers behind a reverse-proxy. At the moment the back-end is lighttpd. Now that I have SSL setup the incoming requests hit the proxy, get routed to lighttpd and all is well. Mostly.

However redirections break. A request for:

  • https://lumail.org/docs

Gets rewritten to:

  • http://lumail.org/docs/

That is because lighttpd generates the redirection and it only sees the HTTP connection. It seems there is mod_extforward which should allow the server to be aware of the SSL - but it doesn't do so in a useful fashion.

So right now most of my sites are SSL-enabled, but sometimes they'll flip to naked and unprotected. Annoying.

I don't yet have a solution..

| 5 comments.

 

Moving to Newcastle

Saturday, 14 March 2015

Although things are not 100% certain it seems highly likely we'll be moving to Newcastle in five months time.

If I seem distracted/absent/busy over the next month or two this will be a good excuse!

| 4 comments.

 

Free hosting, and key-signing

Friday, 6 March 2015

Over the past week I've mailed many of the people who had signed my previous GPG key and who had checked my ID as part of that process. My intention was to ask "Hey you trusted me before, would you sign my new key?".

So far no replies. I may have to be more dedicated and do the local-thing with people.

In other news Bytemark, who have previously donated a blade server, sponsored Debconf, and done other similar things, have now started offering free hosting to Debian-developers.

There is a list of such offers here:

I think that concludes this months blog-posting quota. Although who knows? I turn 39 in a couple of days, and that might allow me to make a new one.

| 2 comments.

 

Recording gym-visits on Linux.

Tuesday, 27 January 2015

I go to the gym every couple of days. I lift things up, then put them down, and sometimes I repeat this process another 30 times. When I'm done I write down what I've done, how many times I did the lifty-droppy thing, and so on.

I want to see pretty graphs. I want to have records of different things. I guess I just need some simple text-boxes:

   deadlift  3 x 7 @ 210lbs.

etc. Sometimes I use machines so I'd say instead:

  converging seated-row  3 x 8 @ 150lbs

Anyway that's it. I want a simple GUI, a bit like a spreadsheet where I can easily add rows of each session. (A session might have 10-15 exercises in it, so not many.) I imagine some kind of SQLite database for the back-end. Or CSV. Either works.

Writing GUI software is hard. I guess I should look at GtK or Qt over the next few days and see if it is going to be easier to do it online via a jQuery + CGI system instead. To be honest I expect doing it "online" is liable to be more popular, but I think a desktop toy-application is just as useful.

| 6 comments.

 

Here we go again.

Tuesday, 6 January 2015

Once upon a time I worked from home for seven years, for a company called Bytemark. Then, due to a variety of reasons, I left. I struck out for adventures and pastures new, enjoyed the novelty of wearing clothes every day, and left the house a bit more often.

Things happened. A year passed.

Now I'm working for Bytemark again, although there are changes and the most obvious one is that I'm working in a shared-space/co-working setup, renting a room in a building near my house instead of being in the house.

Shame I have to get dressed, but otherwise it seems to be OK.

| 3 comments.

 

Spiral Logo

Search

Recent Posts

Recent Tags

Links

RSS Feed

  • Subscribe to feed