I've been meaning to have a play with the POI API for some time now. So, when a colleague mentioned how easy it is to work with, I decided it was high time I had a look. Whilst thinking of a suitable utility, it occurred to me that Office documents have metadata stored in them (see the File -> Properties dialog), and so I wondered whether it would be possible to extract this data as RDF.
The result is MORE (Microsoft Office RDF Extractor).
The tool is a simple command-line utility that generates an RDF document from one or more Office documents. Access to the embedded properties is made possible by the POI HPSF API, while the RDF manipulations are performed by Jena. So you'll need these classes in your CLASSPATH before running the application.
The command-line is simple:
java com.ldodds.more.MORE -help
...will get you a usage message describing the available options. To summarise, it's possible to extract RDF from several documents in one go, add RDF statements to an existing RDF document, and dump the results to a file rather than to the console, which is the default.
The key part of MORE is the "mapping schema". This is a concept that I've borrowed (read: "stolen") from Norman Walsh's rdfjpeg utility, which I've also been tinkering with lately. A mapping schema is basically just an RDF Schema that contains a number of rdf:Property elements. Each of these properties is annotated with a more:pidString property as follows:
<rdf:Property rdf:about="http://purl.org/dc/elements/1.1/title">
<rdfs:label>Title</rdfs:label>
<more:pidString>PID_TITLE</more:pidString>
</rdf:Property>
Here's a complete example schema.
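To make the idea concrete, here's a rough sketch of how a mapping schema like the fragment above could be turned into a simple lookup table from pidString to RDF property URI. This is not MORE's actual code (MORE uses Jena for its RDF handling); it uses only the JDK's DOM parser, and the class name, the SAMPLE constant and the http://example.org/more# namespace are placeholders invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of MORE.
class MappingSchemaSketch {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // The fragment from the post, wrapped in rdf:RDF. The more: namespace URI
    // is a placeholder; the real one is defined in the example schema.
    static final String SAMPLE =
        "<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'"
        + " xmlns:rdfs='http://www.w3.org/2000/01/rdf-schema#'"
        + " xmlns:more='http://example.org/more#'>"
        + "<rdf:Property rdf:about='http://purl.org/dc/elements/1.1/title'>"
        + "<rdfs:label>Title</rdfs:label>"
        + "<more:pidString>PID_TITLE</more:pidString>"
        + "</rdf:Property>"
        + "</rdf:RDF>";

    /** Returns pidString -> RDF property URI for every rdf:Property found. */
    static Map<String, String> load(String schemaXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(schemaXml.getBytes(StandardCharsets.UTF_8)));

            Map<String, String> mapping = new LinkedHashMap<>();
            NodeList properties = doc.getElementsByTagNameNS(RDF_NS, "Property");
            for (int i = 0; i < properties.getLength(); i++) {
                Element property = (Element) properties.item(i);
                String propertyUri = property.getAttributeNS(RDF_NS, "about");
                // Match pidString by local name in any namespace, because the
                // real more: namespace URI is an unknown in this sketch.
                NodeList pids = property.getElementsByTagNameNS("*", "pidString");
                for (int j = 0; j < pids.getLength(); j++) {
                    mapping.put(pids.item(j).getTextContent().trim(), propertyUri);
                }
            }
            return mapping;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read mapping schema", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(load(SAMPLE));
    }
}
```

A real RDF toolkit would cope with every RDF/XML serialisation of the schema, which this DOM-based version does not; it only handles the element layout shown in the fragment.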
Office documents store their metadata as name-value pairs. The property names are either "built-in", in which case they all start with the prefix "PID_", or defined by the user in the Custom tab of the File -> Properties dialog in the application (actually I'm glossing over a lot of details here, see the HPSF internals document for the ugly truth; HPSF makes things easy to handle). The pidString properties in the mapping schema are therefore just the names of metadata elements stored in a Word, Excel or PowerPoint document.
Upon encountering an item of metadata, MORE examines its mapping schema to determine which RDF properties it should add to the resulting RDF. The example mapping schema in the download shows how to create both Dublin Core and custom RDF properties. If an item of metadata doesn't have an entry in the mapping schema then it's just discarded, making it very easy to customise the tool to produce the output you desire. Also, if a property value starts with "http" or "mailto" then an rdf:resource reference is generated rather than a literal.
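The rules in the previous paragraph can be sketched in a few lines of plain Java. Again, this is an illustration under assumptions rather than MORE's real implementation: it emits N-Triples strings instead of building a Jena model, and the class name, the file:///docs/report.doc document URI, the "Homepage" custom property and the http://example.org/terms#homepage property URI are all made up for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of MORE.
class MetadataToTriplesSketch {

    /** Builds one N-Triples line per metadata item that has a mapping entry. */
    static List<String> toTriples(String documentUri,
                                  Map<String, String> metadata,
                                  Map<String, String> mapping) {
        List<String> triples = new ArrayList<>();
        for (Map.Entry<String, String> item : metadata.entrySet()) {
            String propertyUri = mapping.get(item.getKey());
            if (propertyUri == null) {
                continue; // no entry in the mapping schema: discard the item
            }
            String value = item.getValue();
            // Values starting with "http" or "mailto" become resources,
            // everything else becomes a plain literal.
            String object = (value.startsWith("http") || value.startsWith("mailto"))
                ? "<" + value + ">"
                : "\"" + escape(value) + "\"";
            triples.add("<" + documentUri + "> <" + propertyUri + "> " + object + " .");
        }
        return triples;
    }

    /** Escapes the characters that may not appear raw in an N-Triples literal. */
    static String escape(String s) {
        return s.replace("\\", "\\\\")
                .replace("\"", "\\\"")
                .replace("\n", "\\n")
                .replace("\r", "\\r");
    }

    public static void main(String[] args) {
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("PID_TITLE", "http://purl.org/dc/elements/1.1/title");
        mapping.put("Homepage", "http://example.org/terms#homepage"); // made-up custom property

        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("PID_TITLE", "Quarterly Report");
        metadata.put("PID_AUTHOR", "Someone");              // unmapped, so dropped
        metadata.put("Homepage", "http://example.org/");    // becomes a resource

        for (String triple : toTriples("file:///docs/report.doc", metadata, mapping)) {
            System.out.println(triple);
        }
    }
}
```

Because unmapped items are simply skipped, trimming or extending the output is just a matter of editing the mapping, which is the point of keeping it in a separate schema.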
Feedback is very welcome, particularly if it doesn't work for you or there are bugs! (One thing I'm not sure about is how best to assign a URI to each document resource. I've defaulted to just using the file name, because that's what jpegrdf does, and if it's good enough for Norm...)
While I've no firm plans to extend this tool further -- for me it's just another step down the road in learning various RDF tools and technologies -- I may add sensible new features if suggested. However I consider the code to be Public Domain (it's pretty trivial after all) so feel free to do with it what you will.
Posted by ldodds at July 1, 2003 09:45 PM
Distantly related: http://www.computerbytesman.com/privacy/blair.htm
"Microsoft Word documents are notorious for containing private information in file headers which people would sometimes rather not share. The British government of Tony Blair just learned this lesson the hard way."
...can you extract revision history metadata with MORE?
Posted by: Dan Brickley on July 2, 2003 11:09 AM
Revision history information, other than "date last modified", doesn't seem to be available through POI. Or at least I don't see anything in the javadoc or documentation anyway.
What use case were you thinking of?