December 16, 2005

The Modern Palimpsest

The following is a brief summary of a talk I gave recently at the Ingenta Publisher Forum on the 28th November. The slides are available as a PowerPoint presentation.

In the presentation I tried to highlight some of the possibilities that could become available if academic publishers begin to share more metadata about the content they publish, ideally by engaging with the scientific community to expose "raw" data and results.

The conceit around which I hung the presentation was the suggestion that the scientific paper is the modern equivalent of a palimpsest.

A palimpsest, as Wikipedia will tell you, is a scroll or manuscript that has been written on, had its text scraped off, and then been reused. The practice was common in medieval times, when the costs of publishing were very high and it was cheaper to destroy a copy of another work than to manufacture new parchment.

There has been a great deal of success in extracting the original texts from these works. Probably the most famous example is the Archimedes Palimpsest (some nice photos here).

The underlying text is known as the scriptio inferior, and may actually be more valuable than the more visible content.

I likened the process of authoring a scientific paper to that of the creation of a palimpsest. Starting from original research results and working through the synthesis of a cogent explanation of the results or discovery, at each step the content becomes more abstracted from the original results, the previous work being "lost" to the reader.

Data is presented in pre-analysed forms and is not amenable to reuse. As with the palimpsest, the raw data has not really been lost; it's just not (easily) accessible to the reader.

If the scriptio inferior, the underlying data, were made available to the reader, then a lot of interesting possibilities would arise.

This idea isn't new, of course. Many scientists have been pushing for this for many years. However, with the general trends towards open data sharing and "Web 2.0", the time is perhaps ripe for the web development and scientific communities to engage with one another to try and make more of these ideas a reality.

In my presentation I tried to stick to a pragmatic and practical line and demonstrate the possibilities by referring to actual examples. I ended up pointing to three:

Firstly, I demonstrated how I could re-plot some of the results of a geological study using Google Maps. This served to highlight the interactivity in modern web applications, and to illustrate how more compelling and dynamic interfaces can be made from existing data sets. The very trivial demo is available online. It wouldn't be hard to envisage an application that amassed data from a range of related but independent studies in order to provide an alternative way of navigating through the corpus of documents.

iSpecies is a nice example of a science "mashup" that illustrates an alternative search interface for finding related content. I used the false results that can appear when performing simple keyword searches to reinforce the need for standard identifiers. (The need for a common, scoped identifier for authors is a particular hobby horse of mine.)

I also showed the excellent HubMed as an example both of how an alternative user interface can be better than the original, and of how content can be "enriched" by mixing in other sources. The "terms" feature, which dynamically links keywords in an abstract through to a number of data sources, demonstrates this very well. I used the fact that material can be sourced from user-contributed sources such as Wikipedia to promote the idea that content needn't be fixed at the point of publication but can be annotated after the fact.

The general theme of the forum was "reaching new markets" and I closed the presentation by suggesting that making more data open and the content more engaging might help promote the role of the amateur in science.

There seems to be a lot of current interest in exploring the possibilities available in "eScience", webbyscience and "Science 2.0", although I dislike that last term! The recent Nature articles on these topics are also definitely worth reading.

Posted by ldodds at 02:11 PM

December 12, 2005

OpenDocument and XMP

This is the second part of my look at XMP. This time I'm focusing on the potential for using XMP as the metadata format for OpenDocument (ODF).

This is part of a broader discussion to help define the future direction for the ODF metadata format; one proposal on the table is to use RDF, via a constrained RDF/XML syntax. There's a wiki available for discussing this issue, particularly how to map existing metadata to RDF.

At least some of the impetus for exploring richer metadata support has come from the bibliographic sub-project, which aims to build support for bibliography management into OpenOffice 3.0.

RDF is a good fit for the flexible storage and formatting requirements that arise from bibliographic metadata. As XMP is an RDF profile it's worthy of consideration, and in fact this is the proposal behind Alan Lilich's posting to the OpenDocument TC member list. Lilich's discussion document frames the rest of this posting.

Measuring Fit

Lilich's presumptions and biases are all very pragmatic. Here's something that I particularly agree with:

The completeness and quality of all applications, commercial or open source, depends quite a bit on the clarity and implementability of the OpenDocument specification. It needs to be easily, reliably, and consistently implemented.

This is probably the crux of the debate with respect to deciding whether to "import" a third-party specification as a portion of the ODF specification. In other words, does XMP meet all of these goals?

While Adobe does have a C++ toolkit, there aren't any other open implementations of the specification. The availability of several, independent, conformant implementations of a specification is a pre-requisite for the safe adoption of any technology. In my view, and irrespective of any of its other merits, XMP immediately fails on this point.

As to the "easily and reliably" aspects, to be able to measure those one needs to gather implementation experience from developers, ideally backed with a conformance suite against which those implementations can be compared. Again XMP doesn't, yet, measure up here, despite the specification having been available for several years.

To remedy this Adobe need to invest some additional effort in shepherding the specification to encourage wider comment and implementation. If they are unwilling to pursue this themselves, then moving the maintenance of the specification to a more open forum, e.g. an OASIS Technical Committee, would allow the community to organize itself better. This model is working for OpenDocument, and I don't see why it couldn't work for XMP.

My understanding is that there are existing concerns amongst vendors about consistent levels of XMP conformance and the lack of a formal schema for validating XMP documents. These concerns may ultimately result in a community-led initiative to shepherd XMP. Either way, the OpenDocument TC would do well to investigate further the experiences of vendors using XMP, to highlight possible pitfalls before adopting the technology wholesale.

Lilich's suggestion that:

...the OpenDocument metadata effort could succeed by starting with XMP, understanding how to work within XMP, and only looking for truly necessary changes.

seems backward to me: the real utility of XMP ought to be demonstrated first, before the OpenDocument TC look at re-framing their goals to work within whatever process Adobe are laying out for the future direction of XMP. Suggesting that the TC only look for "truly necessary changes" runs counter to the need to "get it right". XMP ought to be shown to be unequivocally better than the alternatives before making any technical compromises. A clear understanding of the trade-offs is necessary here.

The section of Lilich's post entitled "Latitude for change" does a good job of setting out how much room for compromise is really available. The reasoning here is entirely understandable from Adobe's perspective, but should be of concern to any vendor uncomfortable with technical aspects of the format.

In my previous posting I suggested that the real benefit of XMP is in its definition of how to embed metadata within a number of arbitrary binary formats. I stand by that. In evaluating its use within the OpenDocument format, I believe its key value is as a demonstration that an RDF-based metadata model is achievable.

Validation and Conformance

In my previous posting I raised some concerns over the interesting ways that XMP subsets RDF. It's clear that XMP subsets both the RDF model and the RDF/XML syntax.

The cleanest way to create an easily processable (i.e. by both RDF and XML toolkits) RDF subset is to apply constraints at the syntax level. This has several benefits.

Firstly, making the format validatable with an XML schema benefits conformance and allows developers to use tools like XSLT to perform basic manipulations of the data. RELAX NG seems to be the best fit for a schema language here.

A fixed XML serialization is also important in order to maintain the hackability of the OpenDocument format, without necessarily having to compromise on the richness of the metadata model.

Secondly, XML schemas can provide the hooks to enable "semantic anchors" that can help application authors tame some of the wildness of the RDF model.

Constraining syntax is the general intent behind the "Plain XMP" format that Lilich describes at the end of his posting. I think Adobe are missing an opportunity here: they can constrain the RDF/XML syntax without having to produce an alternate serialization. It would be interesting to know whether this approach has already been rejected.

To support the XML (RELAX NG) schemas, XMP ought to include equivalent RDF schemas for all of its additional properties, plus those for extension schemas where there aren't already public equivalents.

And, where the specification does make use of Classes and Properties from existing public schemas, it should respect the definitions in those schemas. XMP clearly doesn't do this for Dublin Core properties such as dc:creator and dc:subject, where it requires the use of RDF collections instead of repeated attributes. This does nothing to help interoperability.

Lilich suggests that when using repeated attributes "[c]lient application code becomes more complex and UI design more difficult if everything is potentially an array". I'm not convinced that the complexities are really that great, certainly not from an API design perspective. Schema cues can help address these concerns.

If there really are significant issues with using simple repeated literal properties then XMP, or the OpenDocument TC, should define new properties, or suggest that the relevant community extend existing schemas. One of the strengths of RDF is that this kind of schema evolution can happen in a distributed fashion.

Markup Escaping

I was shocked to see the suggestion that escaped markup in XML elements is an acceptable solution. OpenDocument should not recommend any format that uses markup escaping. To do so would undermine the whole benefit of the OpenDocument format being expressed as XML.

I'll just point to Norm Walsh's essay "Escaped Markup: Still Harmful" for further comment on that anti-pattern.
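
To make the objection concrete, here is a minimal Python sketch (standard library only) of what an XML parser sees in each case. The element name "description" and the sample text are invented for illustration; they are not taken from XMP or from Lilich's posting.

```python
import xml.etree.ElementTree as ET

# Hypothetical element carrying rich text, once escaped and once as real markup.
escaped = "<description>A &lt;em&gt;rich&lt;/em&gt; description</description>"
inline = "<description>A <em>rich</em> description</description>"

escaped_el = ET.fromstring(escaped)
inline_el = ET.fromstring(inline)

# Escaped markup arrives as one opaque string: the parser reports no child
# elements, so XML tools (XSLT, XPath, schemas) cannot see the <em> at all.
escaped_children = [child.tag for child in escaped_el]
escaped_text = escaped_el.text

# Real markup is part of the document tree and can be queried directly.
inline_children = [child.tag for child in inline_el]
```

A consumer of the escaped form has to run a second parse over the string and guess which markup language it holds, which is exactly the information the enclosing XML document was supposed to carry.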

Qualifiers in XMP

XMP allows "qualifiers" on attributes; essentially these are "properties of properties". In an RDF context the property value is simply a Resource which can then be annotated with multiple properties. In Lilich's example, an author name may be "annotated" with the location of the author's blog.

This is one area where XMP would do well to align itself more closely with RDF, and explicitly model the relevant properties as Resources from the outset.

While the qualifiers may be transparent and easy to use from an XMP context, they add further confusion to the RDF export: without a qualifier a dc:creator property may be a simple Literal, but add a qualifier and it becomes a Resource.
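
The practical effect on a consumer can be sketched in a few lines of Python. This is an illustration under assumptions, not XMP's exact output: it assumes the qualified form is serialized as a nested rdf:Description holding the value in rdf:value, and the ex:blog property, its namespace and the author name are all hypothetical.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
EX = "http://example.org/ns#"  # hypothetical namespace for the qualifier

def read_creator(rdf_xml: str):
    """Return (name, qualifiers) for the first dc:creator in the document."""
    prop = ET.fromstring(rdf_xml).find(f".//{{{DC}}}creator")
    node = prop.find(f"{{{RDF}}}Description")
    if node is None:
        # Unqualified: the property value is a plain literal.
        return (prop.text or "").strip(), {}
    # Qualified (assumed shape): the value moves into rdf:value and every
    # other child of the nested node is treated as a qualifier.
    value_tag = f"{{{RDF}}}value"
    name = (node.findtext(value_tag) or "").strip()
    qualifiers = {
        child.tag: (child.text or "").strip()
        for child in node
        if child.tag != value_tag
    }
    return name, qualifiers

plain = (
    f'<rdf:RDF xmlns:rdf="{RDF}" xmlns:dc="{DC}">'
    '<rdf:Description rdf:about="">'
    "<dc:creator>A. Author</dc:creator>"
    "</rdf:Description></rdf:RDF>"
)

qualified = (
    f'<rdf:RDF xmlns:rdf="{RDF}" xmlns:dc="{DC}" xmlns:ex="{EX}">'
    '<rdf:Description rdf:about="">'
    "<dc:creator><rdf:Description>"
    "<rdf:value>A. Author</rdf:value>"
    "<ex:blog>http://example.org/blog</ex:blog>"
    "</rdf:Description></dc:creator>"
    "</rdf:Description></rdf:RDF>"
)

plain_result = read_creator(plain)
qualified_result = read_creator(qualified)
```

Every reader of dc:creator has to carry both branches, because the shape of the value depends on whether anyone happened to attach a qualifier. Modelling the value as a Resource from the outset would leave only one branch.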

This leads to precisely the problems described here. In fact it has the DC folks in a bit of a crisis.

As I've already mentioned, XMP's use of RDF collections compounds this problem.

Conclusions

To sum up, my personal opinion on this is that XMP is not a good fit with the OpenDocument format. There are reasons to explore reliable conversions to and from XMP, but I don't see enough compelling reasons to adopt XMP in its current form.

The latitude for change to the XMP format itself seems very small, so the opportunities for adopting the format now and working on improving it later also seem slim.

XMP is a reasonably good exemplar of an RDF (or near RDF) based model for document metadata that can be used by both XML and RDF tools. But the technical issues outlined above limit its general utility. Its disregard for already published schemas (Dublin Core) and best practices (e.g. on markup escaping) is a concern.

Personally I'd be interested to see an open specification that built on the XMP experiences to separate out its different aspects (model, syntax, and format embedding) with a view to encouraging wider implementation and conformance.

From the OpenDocument perspective I think there is definite value in exploring an RDF interpretation of its metadata. Just as XMP does, this model can build on existing schemas such as Dublin Core, and possibly PRISM, to avoid having to create a whole new set of schemas.

Posted by ldodds at 01:24 PM

December 08, 2005

Looking at XMP

I've been taking a look at XMP as I've been considering different ways to "enrich" content. Embedding metadata is one option and XMP aims to fulfill the role of a metadata format suitable for embedding in a diverse range of media formats.

It's also under discussion as a way to embed metadata in the OpenDocument format. The alternatives available in that quarter have been under discussion in various circles for some time. Bruce D'Arcus points to the latest entry to that discussion in his recent "OpenDocument and XMP" posting.

I thought I'd write up some notes on XMP in general and contribute some thoughts towards that debate. This is the first of two postings on this topic.

Tools

After speed reading Bob DuCharme's XMP Lowdown article to get myself oriented, my first port of call was the Adobe XMP developer resources: I wanted to get my hands dirty working with the technology and needed some tools. After sifting through the site and the forums, all I could find was the C++ toolkit; not much use to me as a Java developer. Extending my search to Google, the best I could find was this regex(!) for extracting XMP documents.

That wasn't a promising start. I know that XMP has been around for a number of years and I'd expect there to be more tool support from Adobe or, failing that, from the broader development community. I gather that XMP is well supported in Adobe's own products and by a number of other vendors (e.g. of content management and asset tracking products), but it certainly hasn't garnered much interest in open source circles.
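
For a sense of what such a regex-based scan involves, here is a rough Python sketch. It is not the regex referred to above. It assumes the packet is stored uncompressed in the host file and is wrapped in xpacket processing instructions; the sample bytes, including the id value, are made up for the example.

```python
import re
from typing import Optional

# Assumption: a packet starts with <?xpacket begin=...?> and ends with
# <?xpacket end=...?>, and sits uncompressed inside the host file.
XPACKET_RE = re.compile(
    rb"<\?xpacket\s+begin=.*?\?>(.*?)<\?xpacket\s+end=.*?\?>",
    re.DOTALL,
)

def extract_xmp_packet(data: bytes) -> Optional[bytes]:
    """Return the bytes between the xpacket markers, or None if not found."""
    match = XPACKET_RE.search(data)
    return match.group(1).strip() if match else None

# Hypothetical file contents: binary noise around a tiny packet.
sample = (
    b"\x00\x01BINARY"
    b"<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>"
    b"<x:xmpmeta xmlns:x='adobe:ns:meta/'></x:xmpmeta>"
    b"<?xpacket end='w'?>"
    b"MORE\xff"
)

packet = extract_xmp_packet(sample)
```

A scan like this knows nothing about the host format, so it will miss packets that are compressed or split across blocks; it is a stopgap rather than a substitute for a real toolkit.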

XMP and RDF

Turning to the specification I was encouraged to find that XMP is based around RDF. It's an RDF profile of sorts, although it opts for some rather quirky restrictions on the allowed RDF/XML syntax. Syntactic profiles of RDF don't scare (or surprise) me, but this one left me with raised eyebrows. Rather than constraining the syntax to a fixed XML format, one that could be validated against an XML schema but still retain an RDF interpretation, the restrictions are placed elsewhere.

For example, in XMP one isn't allowed to use typed nodes and all children of the rdf:RDF element must be rdf:Description elements. Fair enough. But the specification states that a single Description element can only contain properties from a single namespace. So if you're mixing, say, XMP properties with Dublin Core and PRISM, then you're forced to use one rdf:Description element per namespace. I can't see the advantages here as an application could simply ignore what it didn't understand.
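
The restriction is easy to state as a check. The following Python sketch, using only the standard library, lists the namespaces used by the properties of each rdf:Description; the sample property values are invented, and the check reflects only the rule as described above, not a full XMP validator.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"
XMP_NS = "http://ns.adobe.com/xap/1.0/"

def namespace_of(tag: str) -> str:
    """ElementTree spells qualified names as '{namespace}local'."""
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""

def namespaces_per_description(rdf_xml: str) -> list:
    """For each rdf:Description, the set of namespaces its properties use."""
    root = ET.fromstring(rdf_xml)
    return [
        {namespace_of(child.tag) for child in desc}
        for desc in root.iter(f"{{{RDF_NS}}}Description")
    ]

HEADER = f'<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:dc="{DC_NS}" xmlns:xmp="{XMP_NS}">'

# Ordinary RDF/XML: one description mixing two vocabularies.
mixed = HEADER + (
    '<rdf:Description rdf:about="">'
    "<dc:title>Example document</dc:title>"
    "<xmp:CreatorTool>ExampleTool</xmp:CreatorTool>"
    "</rdf:Description></rdf:RDF>"
)

# The same statements split into one description per namespace.
per_namespace = HEADER + (
    '<rdf:Description rdf:about="">'
    "<dc:title>Example document</dc:title>"
    "</rdf:Description>"
    '<rdf:Description rdf:about="">'
    "<xmp:CreatorTool>ExampleTool</xmp:CreatorTool>"
    "</rdf:Description></rdf:RDF>"
)

violations = [ns for ns in namespaces_per_description(mixed) if len(ns) > 1]
```

Both documents make the same two statements about the same resource, so an RDF toolkit treats them identically; only the split form passes the one-namespace rule.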

XMP, despite having been revised in June 2005, also seems to be based on much earlier versions of RDF. The specification refers to RDF features that were removed in 2001. It also requires the use of the rdf:RDF element, which is now an optional part of the syntax.

I was more concerned about the recommendations on how some metadata should be encoded, particularly the use of rdf:Bag, rdf:Alt, and rdf:Seq. Current best practice (if that's not too strong a term) is simply to use repeated properties for many of the cases that the XMP specification discusses; alternate languages for article titles, for example. It simplifies both the syntax and working with the data in an RDF application.

XMP requires the use of an rdf:Seq in order to express multiple authors of a document. In other words it recommends using dc:creator as if it were defined to be a sequence rather than a simple literal value. Working with bibliographic metadata I understand the need to define ordering amongst authors, but not at the cost of deepening the confusion over using dc:creator.
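
The two encodings can be put side by side in a short Python sketch. It reads dc:creator from either form using only the standard library; the author names are placeholders, and the rdf:Seq layout shown is the generic RDF/XML container syntax rather than output captured from an XMP tool.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"

def wrap(body: str) -> str:
    """Put property elements inside a minimal rdf:RDF / rdf:Description shell."""
    return (
        f'<rdf:RDF xmlns:rdf="{RDF}" xmlns:dc="{DC}">'
        f'<rdf:Description rdf:about="">{body}</rdf:Description>'
        "</rdf:RDF>"
    )

# Repeated properties: one dc:creator per author, each a simple literal.
repeated = wrap(
    "<dc:creator>A. Author</dc:creator>"
    "<dc:creator>B. Writer</dc:creator>"
)

# Container form: a single dc:creator whose value is an rdf:Seq.
as_seq = wrap(
    "<dc:creator><rdf:Seq>"
    "<rdf:li>A. Author</rdf:li>"
    "<rdf:li>B. Writer</rdf:li>"
    "</rdf:Seq></dc:creator>"
)

def creators(rdf_xml: str) -> list:
    """Collect creator names from either encoding, in document order."""
    names = []
    for prop in ET.fromstring(rdf_xml).iter(f"{{{DC}}}creator"):
        items = prop.findall(f"{{{RDF}}}Seq/{{{RDF}}}li")
        if items:
            names.extend((li.text or "").strip() for li in items)
        else:
            names.append((prop.text or "").strip())
    return names
```

Note the asymmetry: the list recovered from the repeated form is in document order only by accident of serialization, since RDF attaches no order to repeated properties, whereas the rdf:Seq form asserts an order whether or not the source data had one.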

The XMP Lowdown article describes how perfectly valid RDF data is forced into this particular model, to the extent that it's not correctly round-tripped via XMP tools. Implying an ordering where one hasn't been stated originally seems like a bad idea to me.

So really XMP isn't a profile of RDF: it's a separate data model that happens to use RDF/XML as a serialization mechanism because it's a close fit.

I think there are some benefits being lost here. It wouldn't take much to bring the XMP and RDF models closer together, and still gain the benefits of both predictable structures for applications and the RDF model itself.

XMP and XML

I was surprised to discover that XMP is also a profile of XML.

An XMP document cannot include an XML declaration. This alters the definition of well-formedness from the XML specification. In the section on how to embed XMP within SVG, there's this note:


An XMP Packet is not intended to be a complete standalone XML document; therefore it contains no XML declaration.

Without an XML declaration there's no way to declare the encoding of the XMP document. The XMP equivalent of the XML declaration does include an encoding attribute, but it's deprecated, meaning that, to my understanding, one has to determine the specific encoding via other means.

Perhaps this is a consequence of defining a format for embedding in binary documents, but it certainly seems like an odd decision. I always hear alarm bells when I see XML being redefined in this way.

XMP also requires a value of x-default for the "default" language when defining alternatives. While this is permissible in XML and RFC 3066, it's hardly portable. Default in what (and whose) context?

The lack of a formal schema for XMP (of any variety) also seems a huge oversight.

The Goals of XMP

Disagreements over technical minutiae aside, I do see some real value in XMP. The ability to embed metadata in arbitrary binary document formats is a huge benefit. This is the real core of XMP and its primary use case.

Avoiding having to package up content and metadata makes many applications much simpler, especially as there's no formal XML packaging specification.

For formats that are already XML, and/or already have well-defined packaging mechanisms, I'm not clear on the immediate benefits of XMP. Its quirks from both an XML and RDF perspective, and its lack of tool support, make it a less than ideal choice IMO.

More on XMP and OpenDocument to follow.

Posted by ldodds at 08:54 PM