
A diversion on off-site storage

Yesterday I took a diversion from thinking about my upcoming cache project, largely because I took some pictures inside my house and realized my off-site backup was getting full.

I have three levels of backups:

  • Home stuff on my desktop is replicated to my wife's desktop, and vice-versa.
  • A simple server running rsync (content-free http://rsync.io/).
  • A "peering" arrangement among a small group of friends. Each of us makes available a small amount of space, and we copy to/from each other's shares via rsync or scp as appropriate.

Unfortunately my rsync-based personal server is getting a little too full, and will certainly be full by next year. S3 is pricey, and I don't trust the "unlimited" storage people (Backblaze, etc.) to be sustainable and reliable long-term.

The pricing on Google Drive seems appealing, but I guess I'm loath to share more data with Google. Perhaps I could dedicate a single "backup.account@gmail.com" login to that, separate from everything else.

So the diversion came along when I looked for Amazon S3-compatible, self-hosted servers. There are a few, but most of them are PHP-based, or similarly icky.

So far Cloud Foundry's vblob looks the most interesting, but the main project seems stalled/dead. Sadly uploading files with s3cmd failed, though the `curl`-based API works as expected.

I looked at Gluster, Ceph, and similar, but haven't yet come up with a decent plan for handling off-site storage, and I know I have only six months or so before the need becomes pressing. I imagine the plan has to involve N small servers with local storage, rather than one large server, purely because the pricing will be better that way.

Decisions decisions.

Comments On This Entry

  1. [gravitar] Andy Cater

    HP Microserver and 8TB worth of disks might be suitable and small enough to carry around.

  2. [author] Steve Kemp

    When I think off-site I think "remote", rather than remembering to juggle hardware around manually, though as you suggest external drives and similar will work for that.

    (+/- pi, cube, microserver to drive them.)

  3. [gravitar] Charles Darke

    What about using google and just encrypting all the data?

  4. [author] Steve Kemp

    Encryption is pretty much a given anyway, regardless of who hosts things.

    The trade-off is control vs. privacy and reliability.
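The encrypt-before-upload step can be as simple as symmetric gpg on the archive before it leaves the machine. A sketch, with placeholder file names and passphrase (real use would read the passphrase from a file or an agent, not the command line):

```shell
# Encrypt a backup archive locally before handing it to any hosted storage.
# Passphrase on the command line is for illustration only (hypothetical).
echo "secret backup data" > /tmp/backup.tar
gpg --batch --yes --pinentry-mode loopback --passphrase "example-pass" \
    --symmetric --cipher-algo AES256 -o /tmp/backup.tar.gpg /tmp/backup.tar

# Round-trip check: decrypt to stdout.
gpg --batch --yes --pinentry-mode loopback --passphrase "example-pass" \
    -d /tmp/backup.tar.gpg
```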

  5. [gravitar] yuval

    Have you seen camlistore.org?

  6. [gravitar] Uli Martens

    Depending on your data, you might take a look at git-annex (https://git-annex.branchable.com/walkthrough/ for a good start).

    It will probably work better with larger files than millions of small ones, but that's not an issue of whether it works, but rather of how fast. :)

    I'm using a mix of local storage, USB storage and remote storage to keep backups and media files both directly available and redundantly offsite/offline. It works great, as each repository knows which data content (it's hashed) should be there, and I can configure how often (and where) the actual data is stored.

  7. [author] Steve Kemp

    I spent a while looking at camlistore and XtreemFS. Beyond that, I've not been thinking about git-annex because I'm more interested in the storage side than the transport.

    I figure a filesystem would be ideal, but equally a naive "blob-server" with good client support and the ability to retrieve files will work. (Which is why I was looking at S3-compatible servers as well as replicating filesystems.)
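A naive blob-server really is just "store a blob under a key, fetch it back". A minimal content-addressed sketch of the idea, with local files standing in for the HTTP layer (all paths are placeholders); keying blobs by a hash of their content is the same trick content-addressed stores like camlistore use:

```shell
# Content-addressed blob store sketch: the key is the SHA-256 of the content,
# so identical blobs deduplicate for free and fetches can be verified.
STORE=/tmp/blobstore-demo
mkdir -p "$STORE"

printf 'hello blobs' > /tmp/blob-in
KEY=$(sha256sum /tmp/blob-in | cut -d' ' -f1)

cp /tmp/blob-in "$STORE/$KEY"   # the "PUT" side
cat "$STORE/$KEY"               # the "GET" side
```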

    My only immediate comment is that Ceph feels fragile, camlistore seems nice, and Gluster is best avoided.

  8. [gravitar] Paulo Almeida

    Any particular reason why Ceph feels fragile? I only ask because I'm looking into deploying it myself.

  9. [author] Steve Kemp

    Perhaps fragile isn't the right word. There are lots of components to the system (the object store, the index store), and I don't have a good feeling for which ones you can kill and restart and which you can't.

    Instead I should say that the software is broken down into distinct parts that seem to make sense, from the outside looking in, but I'm not yet sure how failures at different levels are handled, and I can't find a decent discussion of that.

    (i.e. If you're storing data on, say, cheap virtual machines, it seems obvious that one or two will die every now and again, so you need to be failure-tolerant, and I'm not sure how Ceph handles that.)

  10. [gravitar] Nux

    Also check out https://tahoe-lafs.org/

  11. [gravitar] Tobe

    Backupsy works out well for me.

  12. [gravitar] Paul Walker

    Hi

    Just wondering if you'd already considered/discarded rsync.net for some reason...? (Cost is a valid reason.)

    Paul

  13. [author] Steve Kemp

    I think the reason I avoided rsync.net was price, but I know I didn't look at it recently. I'm not sure why, actually; I should go look.