Who’s got the tag? Database truth versus file truth, part 3

I’ve recently been exploring the implications of the following mantra:

The truth is in the file.

In this context it refers to a strategy for managing metadata (e.g., tags) primarily in digital files (e.g., JPEG images, Word documents) and only secondarily in a database derived from those files.

Commenting on an entry that explores how Vista uses this technique for photo tags, Brian Dorsey throws down a warning flag:

Many applications are guilty of changing JPEGs [ed: RAW files, not JPEGs, are the issue, see below] behind the scenes and there is nothing forcing them to do it in compatible ways. Here is a recent example with Vista.

A cautionary tale, indeed. This is the kind of subject that doesn’t necessarily yield right and wrong answers. But we can at least put the various options on the table and discuss them.

There is an interesting comparison to be made, for example, between OS X and Vista. While researching this topic I found this Lifehacker article on a feature of OS X that I had completely missed. You can tag a file in the Get Info dialog, and when you do, the file will be instantly findable (by that tag) in Spotlight.

My purpose here is not to discuss or debate the OS X and Vista interfaces for tagging files and searching for tagged files. I do however want to explore the implications of two different strategies: “the truth is in the file” versus “the truth is in the database”.

In Vista, if I tag yellowflower.jpg with iris, that tag lives primarily in the file yellowflower.jpg and secondarily in a database. An advantage is that if I transfer that file to another operating system, or to a cloud-based service like Flickr, the effort I’ve invested in tagging that file is (or anyway can be) preserved. A disadvantage, as Brian points out, is that when different applications try to manage data that’s living inside JPEG files, my investment in tagging can be lost.

Conversely, if I tag yellowflower.jpg with iris in OS X, yellowflower.jpg is untouched; the tag lives only in Spotlight’s database. If I transfer the file elsewhere, my investment in tagging is lost. But on my own system, my tags are less vulnerable to corruption.
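To make the tradeoff concrete, here is a toy model of the two strategies. It is only a sketch: the class and function names (TaggedFile, System, transfer, legacy_resave) are invented for illustration and do not correspond to any Vista or OS X API.

```python
from dataclasses import dataclass, field


@dataclass
class TaggedFile:
    """A toy file: a name, some bytes, and tags embedded in the file itself."""
    name: str
    pixels: bytes
    embedded_tags: set = field(default_factory=set)


@dataclass
class System:
    """A toy operating system with a tag index keyed by file name."""
    index: dict = field(default_factory=dict)

    def tag_in_file(self, f: TaggedFile, tag: str) -> None:
        # "The truth is in the file": write to the file, mirror into the index.
        f.embedded_tags.add(tag)
        self.index.setdefault(f.name, set()).add(tag)

    def tag_in_database(self, f: TaggedFile, tag: str) -> None:
        # "The truth is in the database": the file itself is left untouched.
        self.index.setdefault(f.name, set()).add(tag)

    def import_file(self, f: TaggedFile) -> set:
        # A receiving system can only index what travels inside the file.
        self.index[f.name] = set(f.embedded_tags)
        return self.index[f.name]


def transfer(f: TaggedFile) -> TaggedFile:
    """Copy the file, and only the file, to another system."""
    return TaggedFile(f.name, f.pixels, set(f.embedded_tags))


def legacy_resave(f: TaggedFile) -> TaggedFile:
    """An application unaware of tags rewrites the file and drops them."""
    return TaggedFile(f.name, f.pixels)


file_truth, db_truth = System(), System()
a = TaggedFile("yellowflower.jpg", b"stand-in for image data")
b = TaggedFile("yellowflower.jpg", b"stand-in for image data")
file_truth.tag_in_file(a, "iris")
db_truth.tag_in_database(b, "iris")

print(System().import_file(transfer(a)))  # {'iris'}: the tag travelled
print(System().import_file(transfer(b)))  # set(): the tag stayed behind
print(legacy_resave(a).embedded_tags)     # set(): the file copy was clobbered
print(db_truth.index[b.name])             # {'iris'}: the local index is intact
```

The first two prints show the portability difference; the last two show the local-consistency difference.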

Arguably these are both valid strategies. The Vista way optimizes for cross-system interoperability and collaboration, while the OS X way optimizes for single-system consistency. Of course as always we’d really like to have the best of both worlds. Can we?

It’s a tough problem. Vista tries to help with consistency by offering APIs in the .NET Framework for manipulating photo metadata. But those APIs don’t yet cover all the image formats, and even if they did, there’s nothing to prevent developers from going around them and writing straight to the files.

For its part, OS X offers APIs for querying the Spotlight database. So an application that wanted to marry up images and their metadata could do so, but there’s no guarantee that a backup application or a Flickr uploader would do so.

It’s an interesting conundrum. Because I am mindful of the lively discussion over at Scoble’s place about what matters to people in the real world, though, I don’t want to leave this in the realm of technical arcana. There are real risks and benefits associated with each of these strategies. And while it’s true that people want things to Just Work, that means different things to different people.

If you’re an avid Flickr user, if you invest effort tagging photos in OS X, and if that effort is lost when you upload to Flickr, then OS X did not Just Work for you. Conversely if you don’t care about online photo sharing, if you invest effort tagging photos in Vista, and then another application corrupts your tags, then Vista did not Just Work for you.

I think many people would understand that explanation. In principle, both operating systems could frame the issue in exactly those terms, and could even offer a choice of strategy based on your preferred workstyle. In practice that’s problematic because people don’t really want choice, they want things to Just Work, and they’d like technology to divine what Just Work means to them, which it can’t. It’s also problematic because framing the choice requires a frank assessment of both risks and benefits, and no vendor wants to talk about risks.

I guess that in the end, both systems are going to have to bite the bullet and figure out how to Just Work for everybody.

36 thoughts on “Who’s got the tag? Database truth versus file truth, part 3”

  1. Perhaps we could get from two 80% solutions to a 95% solution by simply modifying Vista to automatically replace tags in files using the secondary copy in the database whenever the primary tag appears to have been erased.

    While this won’t protect against a program deliberately or otherwise overwriting a tag with an empty value, it would address the common case where an old utility removes current tags when saving a file.

    I’d like to think that over time, this will become less of a problem — programs that “don’t play nice” with tags will gain a bad reputation, which in turn acts as an incentive for them to be updated or replaced.

    I strongly agree that users should not be asked to make a choice on this though; it needs to Just Work.
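A rough sketch of the repair rule proposed in this comment, under one stated assumption: the system can tell a tag field that is missing from the file (passed here as None) from one that is present but empty. The function name and the action labels are hypothetical.

```python
from typing import Optional, Set, Tuple


def reconcile(file_tags: Optional[Set[str]], db_tags: Set[str]) -> Tuple[Set[str], str]:
    """Decide which copy of the tags wins and what to do about it.

    file_tags is None when the tag field is missing from the file entirely,
    which is what an old utility that strips metadata on save would leave.
    """
    if file_tags is None:
        if db_tags:
            # The primary copy appears erased: restore it from the database.
            return set(db_tags), "restore-file-from-database"
        return set(), "nothing-to-do"
    if file_tags != db_tags:
        # The field is present, even if empty, so the file is still the truth.
        return set(file_tags), "update-database-from-file"
    return set(file_tags), "in-sync"


print(reconcile(None, {"iris"}))             # ({'iris'}, 'restore-file-from-database')
print(reconcile(set(), {"iris"}))            # (set(), 'update-database-from-file')
print(reconcile({"iris", "yellow"}, {"iris"})[1])  # update-database-from-file
```

The second call is the case the comment concedes it cannot protect against: a program that writes an empty tag field looks the same as a user who removed the tag on purpose.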

  2. The OS X thing surprises me. There’s all sorts of hooks in the filesystem so that saving anything triggers a callback to spotlight for it to update its data from the saved file. Given all that infrastructure it seems odd that Finder Info doesn’t write tags into the file (for file formats where that makes sense). In theory it should be possible to write a super info box that does exactly that, it’s just a shame it’s not the default behaviour.

  3. A lot of the balance of the choice can be on how much the type of data will move, in that, how many times would you transfer this binary thing from place to place. If you go for a search/index strategy and you move things around a few times then the ‘source of truth’ for the metadata is probably best ‘on’ the item. Conversely for things that don’t ‘travel’ then there is attractive efficiency of keeping that nice hash all together.

    Having spent a year tagging things that ‘move’, i.e. emails, I’ve come around to the opinion that ultimately the best place for metadata on opaque things is as close as you can get it with them. What makes a tag a tag, in my opinion, and not just a category, label or topic, is that the ‘putting on’ part can be a community activity, and not just the community ‘finding’ part: putting the tag *with* the data makes that community of adding the tags stand a better chance of working.

    David

  4. I recently got a new DSLR camera and was shocked at how much information was placed in the JPGs I downloaded from it. The lens I had on it, the f-stop and shutter speed, basically more than I would have thought to write down were I the type of photographer to write such things down. I’ve heard many cameras will also have GPS data in them so you will even know where the picture was taken. This “metadata” shows up in the properties box for the Linux KDE desktop. In fact I think you can add metadata to just about any file, though I haven’t experimented with how well such information travels to other systems.

    One annoying thing about OS X is that for some files (and I’m guessing it is the same data you are referring to) each file in a folder is accompanied by a similarly named file with a dot as the first character. In OS X such files are completely invisible using the GUI (as far as I can tell) but can be viewed in the terminal interface using “ls -a”. Unfortunately when you copy a folder of such files to a non-OS X system the files become much more visible and can be easily confused with their primary data counterparts. More than once I’ve erased the “dot” file or copied it to another folder thinking I was dealing with the actual data file. After doing that a time or two I started routinely deleting all these dot files when I transfer them from OS X to Linux. Of course if I wanted to preserve this association I’d have to remember with every movement of the file on a Linux box to move BOTH files. I’ve never tried to modify one of these files to see what would happen when transferred back to OS X. Hopefully it would handle such changes gracefully. I don’t know if these files are naturally invisible to Windows or if there is a way to make them so.

    As some there at MS might remember OS/2 also had something akin to metadata that I think was referred to as the extended file attributes (something like that). If memory serves me all that metadata was hidden in one gigantic file that (in my case) seemed to constantly be getting corrupted. I’d love to be a fly on the wall at some of the design meetings where they decide where to hide all this additional info. I think making it part of the actual file is the most sensible approach, as long as it can be done in such a way that an empty metadata component doesn’t take up much space (I think XML satisfies this). With the exception of “flat” files like CSV or TXT don’t most file types have means to tack on this extended data that is part of the definition of the file structure?

  5. If I tag a file on OS X in the Finder, that tag is portable and search-able on other OS X systems just fine, so I believe that tag is moving with the file in some way, not only residing in a system database.

  6. “In practice that’s problematic because people don’t really want choice, they want things to Just Work, and they’d like technology to divine what Just Work means to them, which it can’t.”

    As a computer support person I have gotten lots of people calling up asking why they couldn’t do ‘X’ on their computer. It never mattered what it was exactly, from setting a control panel to setting a program preference; people absolutely didn’t want to ‘know’ anything.

    So I would joke about the one stop shop, be-all, end-all solution for these folks. It was called the “Intention Control Panel”. For those people, all they needed to do was go and turn on the Intention Control Panel and suddenly the computer would ‘know’ what they wanted it to do. It might have a slider that would allow the amount of guessing to increase or decrease, but the Computer would always try to divine the intention of whomever was sitting at the keyboard. We need an intention control panel if not for apps maybe for all OSes as well.

  7. Jon, I just moved all of my documents over to a new Mac. I didn’t even think about backing up Spotlight settings, so it was a genuinely fresh install, with my Word docs, PDFs etc the only things making the move.

    Lo and behold, the Spotlight comments for all of my files were still intact. Therefore the metadata must be stored somewhere other than the Spotlight index. I’ve since learned the data is embedded into the hidden .DS_Store file. Not sure I like the solution. Seems like a stopgap from Apple.

  8. “The Vista way optimizes for cross-system interoperability and collaboration, while the OS X way optimizes for single-system consistency.”

    I think you missed what’s really going on here. The “Vista way” (truth is in the file) optimizes for files that the OS knows how to write metadata into. The “OS X way” (truth is in the database) allows for independence of files from OS, so the OS can tag files whose content it can’t understand or which simply don’t support metadata.

    The only good answer I can see is the “OS X” way, but augmented by syncing of database-hosted metadata into metadata-supporting files, where the user allows it. I hate proliferation of user options, but the same person can reasonably say “don’t touch my damn file” and “don’t lose my metadata just because I change computers” at the same time (about different files, of course).

  9. “Having spent a year tagging things that ‘move’, i.e. emails, I’ve come around to the opinion that ultimately the best place for metadata on opaque things is as close as you can get it with them.”

    Funny you should mention that. Email is of course where we need tagging the most. In my Internet Groupware book way back in 1999 I talked about a couple of ways tags could live in email — either as custom X- headers, or as XML attachments. The idea was that tags would accumulate with email threads, so everyone on the thread could contribute and benefit.

    I’m sure there are a zillion usability problems with that scheme but, until we try it, we’ll never find out what they are, and we’ll never get a chance to see if we can make it work.
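As a minimal sketch of the custom-header variant, here is how tags might accumulate on a message using Python's standard email package. The header name X-Tags and the comma-separated encoding are assumptions made for this example, not anything specified in the book or in any mail standard.

```python
from email import message_from_string
from email.message import EmailMessage

TAG_HEADER = "X-Tags"  # hypothetical header name, chosen for illustration


def get_tags(msg) -> list:
    """Read the comma-separated tag list, or an empty list if there is none."""
    raw = msg.get(TAG_HEADER, "")
    return [t.strip() for t in raw.split(",") if t.strip()]


def add_tags(msg, new_tags) -> None:
    """Merge new tags into whatever the message already carries."""
    merged = sorted(set(get_tags(msg)) | {t.strip() for t in new_tags if t.strip()})
    del msg[TAG_HEADER]  # no error if the header is absent
    msg[TAG_HEADER] = ", ".join(merged)


msg = EmailMessage()
msg["Subject"] = "Who's got the tag?"
msg.set_content("Where should the tags live?")

add_tags(msg, ["tagging"])              # first participant
add_tags(msg, ["metadata", "tagging"])  # second participant adds one more

# Serialize and parse again, as a receiving mail client would.
received = message_from_string(msg.as_string())
print(get_tags(received))  # ['metadata', 'tagging']
```

Whether real mail clients and servers would preserve such a header across replies and forwards is exactly the kind of usability question that would only surface by trying it.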

  10. I think I agree with the last two posts :)

    Of course the BEST solution would be an industry standard for tacking metadata onto the front or back of any file. Don’t almost all PC file systems have two indicators for file size? One being the directory file size and the other being a special “zero” byte, that could at least be used as a delimiter for non-binary data. It would of course take years for such a standard to catch on. But it will be forever if nobody starts.

    The other approach would be to define a standard API for all metadata that all OSs could agree to implement in (at least externally) much the same way, Java and .Net and C library calls that would work the same everywhere, and even a command line interface that would be identical. We need to move from the era where OS vendors “innovate” by changing “ls” to “dir” and “/” to “\”.

    The bulk of what OSs do is a commodity at this point. If the vendors would just recognize this (I think Apple largely has, but still needs work) and leave the product differentiation completely in the applications the world would be a better place.

    I want to be able to run Windows with Firefox and Open Office, I want to be able to run the Konqueror file manager on OS X (well, I can but it’s a PITA) or Windows, and I would gladly buy a copy of Office to run on Linux (but not using an emulator). The glass is about 1/3 full. Let’s keep working on it.

  11. “The only good answer I can see is the “OS X” way, but augmented by syncing of database-hosted metadata into metadata-supporting files”

    I guess that’s sort of what Vista does. Or actually, not Vista but PhotoGallery. If the file can’t host metadata, or — as we saw in the earlier item, if it could but there isn’t yet Framework support for tweaking the metadata — then the fallback is to use a database.

    (Though in this case not, I don’t think, a systemwide one, but one that’s private to PhotoGallery. If I’ve got that wrong, I hope Scott Dart will jump in to correct me.)

    Note that there’s another subtle usability issue with this hybrid approach. How do you know, when copying a set of files someplace else, which will bring their metadata with them and which won’t? You don’t know. And warning people about that kind of thing tends to be a nonstarter.

    This is tricky stuff.

  12. Sounds like what’s needed is a service something like Windows File Protection but for tags. It would monitor changes on files and if the tags are changed and are no longer equal to what’s in the database, it could notify the user of the change. They could also have the option of turning off notifications on a by-application basis similar to how a personal firewall works with apps trying to access outbound ports.

  13. “Got 3 words fer ya: OS/2 Extended Attributes. :-) Sounds pretty close to what the .DS_Store files on OS X do, actually.”

    But OS/2 EAs and NTFS streams don’t travel to foreign filesystems. The .DS stuff does, though it’s problematic what you do with it when it gets there.

  14. Jon,
    As an emacs neophyte I am deeply interested in how you use emacs to do, in your words, “structured writing.”

  15. The answer to the problem of ‘just working for everybody’ is pretty well established: standards. The examples you’ve used illustrate the emergence of several of them (though clearly there is a long way yet to go in this space)
    – XMP: an Adobe syntax for carrying arbitrarily scheme-declared metadata – authority derives from the fact that a lot of the world’s documents are managed at some time in Adobe products.
    – Dublin Core: a community-developed, ISO standard – authority derives from formal process and wide adoption.
    – Vista-photo metadata: which, based on the tiny fragment you exposed in your example, is just an MS-spin on a set of descriptors that will be found in many (most) photo systems. Authority derives from the OS hegemony of MS.

    Wondering if IPTC is anywhere in the mix?

    What is missing from this picture?

    1. cross platform standardization and elimination of redundancy in vocabularies (requires cooperation that has yet to emerge in this market).

    2. Standardization of process: everyone agrees on not just the vocabulary, but on the underlying model. Not holding my breath.

  16. Another wrinkle to consider: my backup scheme involves writing the directories containing my digital pictures to a CD-R, and performing a diff to ensure everything copied OK. Later on just before I write the next backup set, I also do a diff to ensure the images haven’t been corrupted on the hard disk since I wrote the original backup. If tags I create after writing the first backup land in the image files, then the diff I execute can’t be relied upon to show me “corrupted files”. So perhaps this suggests another option: put the metadata in a separate (XML?) file (in the same directory as the images?). Then the image files can be left pristine, the metadata is easy to see and backup and migrate to other systems, and also easy for a system like OS X/Spotlight or Vista to access and update.
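A small sketch of the sidecar idea from this comment. The assumptions are mine: a JSON file stands in for the XML file the commenter floats, a SHA-256 digest stands in for the diff, and the naming convention (photo.jpg gets photo.jpg.tags.json) is hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def digest(path: Path) -> str:
    """SHA-256 of the file's bytes; stands in for the commenter's diff."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def sidecar_for(image: Path) -> Path:
    # Hypothetical naming convention: photo.jpg -> photo.jpg.tags.json
    return image.with_name(image.name + ".tags.json")


def read_tags(image: Path) -> list:
    sidecar = sidecar_for(image)
    return json.loads(sidecar.read_text()) if sidecar.exists() else []


def add_tag(image: Path, tag: str) -> None:
    # Only the sidecar is written; the image bytes are never touched.
    tags = sorted(set(read_tags(image)) | {tag})
    sidecar_for(image).write_text(json.dumps(tags))


with tempfile.TemporaryDirectory() as folder:
    photo = Path(folder) / "yellowflower.jpg"
    photo.write_bytes(b"stand-in for image data")
    before = digest(photo)   # recorded when the backup is written
    add_tag(photo, "iris")   # tagging happens after the backup
    unchanged = digest(photo) == before
    tags = read_tags(photo)

print(unchanged, tags)  # True ['iris']
```

The cost is the one raised elsewhere in this thread about OS X dot files: the image and its sidecar have to be moved together, or the tags are left behind.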

  17. Hi Jon,
    What about a “universal file format metawrapper”, such that the wrapped files can only be opened by applications that support the format and treat metadata nicely? The format would have a header with metadata about the file type it contains (for OS file typing), a unique identifier (to solve your photo-tagging dump-in-a-directory issues), and other metadata that users add to the file. Applications not in the know wouldn’t be able to read the files, and so couldn’t corrupt the metadata, because of the wrapper.

    While it’s not a solution for existing application file types, as that would likely have to be done at file system level with an API, I believe it would cater for future usage.

    Operating systems of the future (and upgrades to existing ones) could natively support this metawrapper such that legacy applications could read and write the files. Redundancy could be achieved for legacy apps on native metawrapper OSs by storing the metadata in the existing file formats and in the metawrapper, with the OS backing up known filetype metadata. Metawrapper plugins could exist to extend that. When there’s a change the user could be alerted by the OS to take action. New formats that people create would use the metawrapper. The metawrapper itself would have redundancy in the form of a metadata update history.

    And maybe if we plugged this into web services, we could annotate URIs with metadata using a standardised ReSTful metawrapper API and build a data web.

    I hope I’m not missing something obvious.

  18. “What about a “universal file format metawrapper””

    I love it. How we get from here to there does, however, require some pretty energetic handwaving :-)

  19. Your discussion is a good one but that post from Lifehacker is grossly oversimplified and you have interpreted it inaccurately.

    > Many applications are guilty of changing JPEGs behind the scenes and there is nothing forcing them to do it in compatible ways. Here is a recent example with Vista.

    Flat out wrong. Read the article the post links to? There is no mention of JPEGs whatsoever.

    As the article quite clearly notes – The problem is not with JPEG files but with RAW. All camera makers have different versions of RAW, which is one reason why Adobe wants a standard RAW format. Almost any app mangles the makernote field in a RAW because all camera makers do it differently. Even the KB notes that “This metadata is specific to the manufacturer of the camera.” This is a typical lazy blogger, lazier blog reader problem because the issue is not just with Vista. If you edit RAW information with Photoshop you can have problems with other applications. It is an industry problem and there are applications that let you save the original metadata before using other apps to edit.

    JPEG is not a camera-specific format, so you can change the metadata with different applications with much less concern about losing anything. I have tagged JPEGs for years, back and forth with various apps, and there have never been any problems. With standard formats the issue of file corruption on your machine is much less of an issue than you make it out to be.

  20. “If you’re an avid Flickr user, if you invest effort tagging photos in OS X, and if that effort is lost when you upload to Flickr, then OS X did not Just Work for you.”

    Only if you cannot think of a simple system to auto-transfer that metadata.

    For instance, I just used a free script at Doug’s AppleScripts (http://www.dougscripts.com/itunes/) called “All Tag Data to File Comments” to transfer a bunch of tags including ratings from a selection of iTunes songs to their associated files on disk. Next those files were moved into a folder and deleted from iTunes: a simple archiving process. Metadata for those songs are now in the comments and thus can be read back into iTunes should I choose to import them with a reverse script.

    It’d be trivial then to do the same with a Flickr uploading app that parses the metadata in file Get Info comments and populates Flickr tags.

  21. “Only if you cannot think of a simple system to auto-transfer that metadata.”

    I can. You can. The vast majority of folks cannot.

  22. I’ve been contemplating the same thing, Jon. I figured that when I tagged my photos in iPhoto the tags would only work in iPhoto. I know that with MP3s there is an integrated tag structure. Not so with photos. If the designers of image formats could come up with a way to have tag fields within the file, that would be ideal. It would remove the risk of corruption because it would be built into the format and part of the standard.

  23. …Hmmm…

    Sounds like the problem in Vista is essentially a bug in either camera manufacturer’s software, Vista, or the design of metadata in Vista files. Basically, there’s no valid reason that that metadata should ever get “corrupted”.

    What I originally *thought* your article would be about is this: what happens if I replace all the pixels in yellowflower.jpeg with an image of a lily instead of an iris? Then the metadata is just wrong vs. the object it purports to describe (rather than “corrupted”). Now that’s a tricky problem.

  24. [1] On the same file, my tags may not be the same as your tags.
    [2] Identical file copy should not include any modifications by user.
    So, I think OS X’s way is better.

  25. Why not employ both techniques? OS X could have an option called “export file,” and then add the metadata to the file from Spotlight’s database in a way that something like Flickr or Vista would understand.

  26. “I would gladly buy a copy of Office to run on Linux (but not using an emulator). The glass is about 1/3 full. Let’s keep working on it.”

    You can. Crossover Office does not use any form of emulation. Or, if you’d rather not pay anything beyond Office’s ridiculous price tag, you can use vanilla WINE. Wine is not an emulator.

  27. I didn’t read all the other comments but it seems the best way of doing it would be storing tags primarily in the database. Then you could provide a hook API that allows the DB to check the file’s tag integrity whenever it gets referenced. Any files missing their corresponding tags get them replaced automatically.
