Thursday, 3 April 2014
I'm an amateur photographer, although these days I tend to drop the amateur prefix, given that I shoot people for cash at least once a month.
(It isn't my main job, and I'd never actually want it to be, because I'm certain I'd become unhappy hustling for jobs and doing the promotion thing.)
Anyway, over the years I've built up a large library of images, mostly organized in a hierarchy of directories beneath ~/Images.
Unlike most photographers I don't use Aperture, lighttable, or any similar library-management tool. I shoot my images in RAW, convert to JPG via rawtherapee, and keep both versions of the images.
In short, I don't want to mix the "library management" functions with the "RAW conversion", because I regard them as two separate steps. That said, I'm reaching a point where I do want to start tagging images, and finding them more quickly.
In the past I wrote a couple of simple tools to inject tags into the EXIF data of images, and then indexed them. But that didn't work so well in practice. I'm starting to think that I should instead index images into sqlite:
- Content hash.
The downside is that this breaks utterly as soon as you move images around on-disk, which is something my previous EXIF manipulation was designed to avoid.
Anyway, I'm still thinking at the moment, but I know that the existing tools such as F-Spot, shotwell, DigiKam, and similar aren't suitable. So I either need to go standalone and use EXIF tags, accepting the fact that the tags I enter won't be visible to other tools, or I cope with the file-rename issues by attempting to update an existing sqlite database via hash/size/etc.
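A minimal sketch of the second option, assuming a two-table sqlite layout keyed on a SHA-256 content hash. The schema, the function names, and the extension list are all hypothetical and only for illustration. Because tags hang off the hash rather than the path, re-running the indexer after a move simply records the new path:

```python
import hashlib
import os
import sqlite3


def file_hash(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def open_index(db_path):
    """Open (or create) the index. The schema is an assumption, not a standard."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS images ("
        " hash TEXT PRIMARY KEY, size INTEGER NOT NULL, path TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tags ("
        " hash TEXT NOT NULL, tag TEXT NOT NULL, UNIQUE (hash, tag))"
    )
    return conn


def index_tree(conn, root, extensions=(".jpg", ".cr2")):
    """Walk root and record every image; a known hash at a new path is a move."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            # Replacing the row keyed on the hash updates the stored path.
            conn.execute(
                "INSERT OR REPLACE INTO images (hash, size, path) VALUES (?, ?, ?)",
                (file_hash(path), os.path.getsize(path), path),
            )
    conn.commit()


def add_tag(conn, path, tag):
    """Attach a tag to the content of the file at path, not to the path itself."""
    conn.execute(
        "INSERT OR IGNORE INTO tags (hash, tag) VALUES (?, ?)",
        (file_hash(path), tag),
    )
    conn.commit()


def paths_for_tag(conn, tag):
    """Return the current paths of all images carrying the given tag."""
    rows = conn.execute(
        "SELECT images.path FROM images JOIN tags ON tags.hash = images.hash"
        " WHERE tags.tag = ? ORDER BY images.path",
        (tag,),
    )
    return [row[0] for row in rows]
```

This sketch rehashes every file on each run, which would be slow on a large library, and it collapses byte-identical copies into a single row. Editing a JPG changes its hash, so its tags would be orphaned until they are re-attached.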
Tags: images, itag, tags.
The alternative is that I create a text-file alongside every directory, or image, with data in it.
The former copes with renames, and the latter allows image-specific things. Generally all images in one "event" will have many common tags, whether a name such as "Elizabeth" or an event such as "Fun Run Edinburgh 2013", with only some per-image variation such as "monochrome" or "colour".
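The two kinds of text file could be combined, so that an image inherits its directory's tags and adds its own. A rough sketch, assuming one tag per line, a per-directory file called tags.txt, and a per-image file named after the image with ".tags" appended; both names are made up for illustration:

```python
from pathlib import Path

DIR_TAG_FILE = "tags.txt"    # hypothetical per-directory file name
IMAGE_TAG_SUFFIX = ".tags"   # hypothetical: IMG_0001.jpg -> IMG_0001.jpg.tags


def read_tag_file(path):
    """Return the set of tags in a file, skipping blank lines and # comments."""
    path = Path(path)
    if not path.is_file():
        return set()
    tags = set()
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            tags.add(line)
    return tags


def tags_for(image_path):
    """Combine the directory-wide tags with the image-specific ones."""
    image_path = Path(image_path)
    directory_tags = read_tag_file(image_path.parent / DIR_TAG_FILE)
    image_tags = read_tag_file(
        image_path.with_name(image_path.name + IMAGE_TAG_SUFFIX)
    )
    return directory_tags | image_tags
```

With "Elizabeth" and "Fun Run Edinburgh 2013" in the directory file, only the odd "monochrome" needs a per-image file. The per-image file does have to be renamed along with its image.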
Hm, you already mentioned hashes in the last paragraph; even before that I'd started thinking about something like the way git-annex does things: it replaces files with symlinks to its own copy of the file, and its own copy's name is based on the file size and a hash of the file's contents.
So... maybe something like this? Keep the hash of the file in the SQLite database and, for easier maintenance (and not to have to recalculate the hash each time you actually want to access/reindex/check the file), keep the files themselves in a git-annex-style repository?
Of course, all of this has... interesting issues once you decide to modify a file :)
I suspect, but don't know for sure, that using symlinks would best be avoided.
Specifically because I'd fear that lots of the CR2 -> JPG conversion assumes that output will go to the same directory and that might break if it went to the real location, rather than the linked one.
I guess there's also the tension between trying to keep useful names - currently mine are largely based around $Year.$month/$name/ - and the hash. Obviously RAW files don't get edited, but JPGs do, even if just resized or cropped differently at different times. I think that's enough to make me think renaming based on hashes is a bad idea, but that using hashes for detecting duplicates might work.
(Image duplicate-detection is a fun problem. 100% identical files can be found with ease, but if the meta-data changes, or the image is cropped/resized you need to spend a lot of processing power to discover that. Ditto for even colour->monochrome conversion.)
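The easy half, finding files that are 100% identical, can be sketched roughly as below. The function names are illustrative. Files are grouped by size first so that only files which could possibly match get hashed; cropped, resized, or re-tagged copies are not detected at all:

```python
import hashlib
import os
from collections import defaultdict


def file_hash(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root):
    """Return lists of paths beneath root whose contents are byte-identical."""
    # Cheap pass: files of different sizes can never be identical.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    # Expensive pass: hash only the files that share a size with another file.
    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        for path in paths:
            by_hash[file_hash(path)].append(path)

    return [sorted(paths) for paths in by_hash.values() if len(paths) > 1]
```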
Do you know 'pho'? It's an image viewer with keyword capabilities built for mass keyword entry.
From what I have understood, it stores the keywords in a separate file.
Maybe this is what you are looking for.
Rather than keeping parallel metadata files paired up with images, what are your thoughts about keeping specific metadata fields as extended (user) attributes?
Also, your sqlite database could merely be a cache of those values, which you could rebuild by comparing a conceptually ephemeral list of attributes in the database (e.g. device+inode, full path, atime/ctime/mtime, sha1sum stored in metadata, sha1sum computed from content, etc.) against the actual filesystem contents. That should give your cache refresh job a list of attributes to check in ascending order of computational cost, plus opportunities to recognize when files have merely been moved around.
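A rough sketch of both ideas, under several assumptions: the attribute name "user.tags" and the comma-separated encoding are invented for illustration, os.setxattr and os.getxattr are only available on Linux and only work on filesystems with user attributes enabled, and SHA-256 stands in for the sha1sum mentioned above. The cached entry is shown as a plain dict rather than a database row:

```python
import hashlib
import os

TAG_ATTR = "user.tags"  # hypothetical attribute name in the "user." namespace


def write_tags(path, tags):
    """Store tags as a comma-separated user attribute (Linux-only call)."""
    os.setxattr(path, TAG_ATTR, ",".join(sorted(tags)).encode("utf-8"))


def read_tags(path):
    """Return the tags stored on a file, or an empty set if none are readable."""
    try:
        raw = os.getxattr(path, TAG_ATTR)
    except OSError:  # attribute missing, or filesystem lacks xattr support
        return set()
    return {tag for tag in raw.decode("utf-8").split(",") if tag}


def file_hash(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(path):
    """Build a cache entry: cheap stat fields plus the expensive content hash."""
    info = os.stat(path)
    return {
        "path": path,
        "device": info.st_dev,
        "inode": info.st_ino,
        "size": info.st_size,
        "mtime_ns": info.st_mtime_ns,
        "sha256": file_hash(path),
    }


def check_entry(cached):
    """Classify a cache entry, trying the cheapest comparison first."""
    try:
        info = os.stat(cached["path"])
    except FileNotFoundError:
        return "missing"
    if (info.st_size, info.st_mtime_ns) == (cached["size"], cached["mtime_ns"]):
        return "unchanged"  # decided from stat alone, no file read
    if info.st_size == cached["size"] and file_hash(cached["path"]) == cached["sha256"]:
        return "touched"  # same content, different timestamp
    return "modified"


def find_moved(cached, candidate_paths):
    """Look for a missing entry elsewhere: inode match first, content hash last."""
    for path in candidate_paths:
        info = os.stat(path)
        # Inodes can be reused after a delete, so also require the same size.
        if (info.st_dev, info.st_ino, info.st_size) == (
            cached["device"], cached["inode"], cached["size"]
        ):
            return path
    for path in candidate_paths:
        if os.path.getsize(path) == cached["size"] and file_hash(path) == cached["sha256"]:
            return path
    return None
```

A refresh job would call check_entry on every cached row and fall back to find_moved, with the list of not-yet-indexed paths, only for the rows reported as missing.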
For me, directories with place or theme names work well. Sometimes there are subdirectories by date. I have another annoyance: the photo numbers/file names. I own two Canon cameras, and I also had a third. There is always a chance that a file name already exists when I copy from CF to a directory on my USB disk. Cameras should have a menu option to choose a file name prefix. I hate that ugly IMG. For quick selecting and copying I prefer geeqie.
git-annex can also run in 'direct' mode, where the files are not replaced by symlinks as long as they are locally available. The downside is that most git commands can not be used anymore. Regarding sqlite, git-annex just added a metadata mechanism (http://git-annex.branchable.com/design/metadata/) which sounds a lot like what you are thinking about.
pho seems interesting, but I'd also want to be able to add tags/keywords on a whole directory, rather than individually. I will experiment to see if it supports that.
User-attributes on the filesystem is an interesting idea, not something I'd considered. Thanks!
Brun0, if you have a big enough camera you can set the image prefix, e.g. my Canon 5D MKIII allows me to change "IMG_" to "MK3_" to avoid collisions with the MKII/backup body I have.
Brun0 et al., you can use Rapid Photo Downloader to rename your images to your own specification as you simultaneously download them and back them up: http://www.damonlynch.net/rapid
Steve, I suspect you may profit from learning how photographers who work with hundreds of thousands of RAW files manage their collections. In my case I keep my RAW files totally separate from JPEGs.
IMHO what we need on Linux is something similar to the metadata editing features of Photo Mechanic: one that works with industry-standard metadata formats, i.e. IPTC & XMP. I don't think it would be particularly hard, especially for someone well versed in Qt or GTK, but sadly I don't have time myself.
Such a program could potentially be quite powerful in conjunction with dmedia.
I'm curious what you think I'd learn from folk who deal with thousands of images, that I don't already know?
I've frequently interacted with photographers to discuss work-flow, and what I've learned most is that people have very personal routines, largely as a result of personal preference. Beyond that everybody, me included, will first "import" their images of the day (optionally taking a backup at that time), then take an initial pass at the images, discarding the misfires and selecting the good images.
Later folks convert, discard some more, edit/process.
I've watched folks process "a wedding", or a corporate shoot, and although specifics vary due to different tools, such as Lightroom or Aperture, the broad use-cases are always roughly the same. (Tagging/Rating images, and similar.)
git-annex recently gained support for tagging content.
I personally like the idea of putting tags right into the image information (and I wonder where I can find your code for doing so, so that I can experiment without reinventing the wheel). If an image carried its own tags you would be independent of any tool and could move all your images around without losing the data stored in whatever database. If this idea became popular there should be ways to patch shotwell + friends to deal with these tags.
My home-made exif-tagger is about four years old at this point, but you can get it via:
hg clone http://itag.repository.steve.org.uk/
ResourceSpace is a web-based DAM application that uses metadata (EXIF & XMP) to organize images. Unfortunately, it keeps the images in its own filestore. It does have an option to leave images alone in their separate folder, but it still keeps the metadata in its own database.
Hi Steve, the Library of Congress thought it was worth funding the American Society of Media Photographers to research the issue. I think we could all learn from their insights, yes? http://www.dpbestflow.org
If any code comes out of this discussion I very much hope sanity prevails and industry metadata standards are rigorously adhered to.