10
Jan 10

Thoughts on Linked Data Business Models

Scott Brinker recently published a great blog post covering 7 business models for Linked Data. The post is well worth a read and reviews the potential for both direct and indirect revenue generation from a range of different business models. I’ve been thinking about these same issues myself recently so I’m pleased to see that others are doing similar analysis. Scott’s conclusion that, currently, Linked Data is more likely to drive indirection revenue is sound, and reflects where we are with the deployment of the technology.

The time is ripe though for organizations to begin exploring direct revenue generation models and it’s there that I wanted to add some thoughts and commentary to Scott’s posting.

Traffic

The traffic model, with its indirect revenue generation by driving traffic to existing content and services, is well understood. The same model has been used to encourage organizations to open up Web APIs, so its natural to consider this for Linked Data also.

Because it is tried and tested it’s currently one of the strongest arguments for driving adoption of Linked Data, so I’d put this right at the top of the list. The feedback loop that is in place now with search engines makes that traffic generation a reality.

Advertising

Scott mentions adverts as a possible revenue stream and raises the possibility of “data-layer ads”, by which I understand him to mean advertising included in the Linked Data itself. While I agree that an advertising model is a potential revenue stream, I don’t see that “data-layer ads” are really viable or actually useful in practice.

Adverts incorporated into raw data will be too easily stripped out or ignored by applications; by definition the adverts will be easily identifiable. RSS advertising doesn’t seem to have really taken off (I certainly never see them anyway) and I think this is for similar reasons: if the adverts are easily identifiable, then they can be stripped. And if they’re included in content or data values, then this causes problems for further machine-processing of the data and annoyances for end users.

Of course a business could enforce that users of its Linked Data should display ads through its terms and conditions, e.g. requiring data-layer ads to be displayed in some form to users of an application. In practice this can get problematic, especially if there’s not an obvious way to surface the ads to end users. But I think its also problematic as unlike a Web API where I sign up to gain access, for an arbitrary Linked Data site, there is no prior agreement required. My crawler or browser might fetch data without any knowledge of what those terms and conditions might be.

Adverts embedded into data is are not a useful way to distribute them to end-users. In an environment where adverts are increasingly profiled by a range of geographic, demographic or behavioural factors, incorporating blanket ads into data feeds loses all of that targetting capability. It also potentially loses the feedback, e.g. on click-throughs or impressions, that are useful for gauging the success of a campaign.

In my view advertising as a model to support Linked Data publishing is more likely to echo that used by the Guardian as part of its Open Platform terms and conditions (See Section 8, Advertising and Commercial Usage). The terms require users of the content to display ads from Guardian’s advertising network on its website. This avoids the need to include adverts in the data layer and supports a conventional model for delivering ads, making it play well with current advertising platforms and targeting options.

Subscriptions

As Brinker notes, subscription models for data, content and services have been around for some time. The interesting thing is to see how these models have been evolving of late due to pressures in various industries, and how these intersect with the open data movement. For Linked Data to be most useful some of its needs to be free: you need to make at least a bare minimum of data freely available, e.g. to identify objects of interest, to enable annotation and linking, etc. In my opinion a freemium model is the core of any subscription model for Linked Data.

Having previously worked in the academic publishing industry which is very heavily driven by subscription revenues, I’ve noticed a number of models that have come to the fore there, most recently driven by the Open Access movement. I think many of these are transferable to other contexts. So while the particulars will vary in different industries, the means of slicing up data into subscription packages are likely to be repeatable.

All of the following assume that some basic element of the Linked Data is free, but that one is paying for:

  • Full Access — Pay for access to detailed, denser data. The value-added data might include richer links to other datasets, more content, etc
  • Timely Access — Pay for access to the most recent, or more current version of the data. This leaves the bulk of the data open but delivers a commercial advantage to subscribers. As data gets older, it automatically becomes free
  • Archival Access — Putting archives of content, or large archival datasets on-line can be expensive in terms of data conversion, digitization, and service provision. So deep archives of data might only be available to subscribers. Commercial advantage derives from having more data to analyse and explore.
  • Block Access — paying for access to a dataset based on time, e.g. “for the next 24 hours”; or based on the number, frequency of accesses; or the number of concurrent accesses.
  • Convenient Access — paying for access to the data through a specific mechanism. This might seem at odds with Linked Data, but its reasonable to assume that some organizations might want data feeds or dumps rather than on-line only access. This might come at a premium.

These variants can combined and might also be separated out into personal (non-commercial) and commercial subscription packages.

It’s interesting to see how some of these (Timely Access, Convenient Access) are already in use in projects like Musicbrainz that blend Open Data with commercial models.

Sponsorship

One model that Scott Brinker doesn’t mention in his posting is Sponsorship. An organization might be funded to publish Linked Data, e.g. for the public good. The organization itself might be a charity and funded by donations.

It’s arguable that this might be more about cost recovery for service provision rather than a true business model, but I think its worth considering. Some of the open government data publishing efforts and possibly even the Linked Data from the BBC, could be seen as falling into this category.

It’s probably most viable for public sector, cultural heritage and similar organizations.

Closing Thoughts

What needs to happen to explore these different models? Is it just a matter of individual organizations experimenting to see what works and what doesn’t?

I think that is largely the case, and we’ll definitely be seeing that process begin to happen in earnest in 2010; a process that we’ll be supporting and enabling with the Talis Platform.

From a technical perspective I’m interested to see how well protocols like OAuth and FOAF+SSL can be deployed to mediate access to licensed Linked Data.


27
Dec 09

Thoughts on Enterprise Linked Data

There have been a number of discussions about “Enterprise Linked Data” recently, and I took part on a panel on precisely that topic at ESTC 2009. Unfortunately the panel was cut short due to time pressures so I didn’t get chance to say everything I’d hoped. In lieu of that debate here’s a blog post containing a few thoughts on the subject.

When we refer to enterprise use of Linked Data, there are a number of different facets to that discussion which are worth highlighting. In my opinion the issues and justifications relating to each of them are quite different. So different in fact that we’re in danger of having a confused debate unless we tease out this different aspects.

Aspects of the Debate

In my view there are three facets to the discussion:

  • Publishing Linked Data, the key question here being: What does an Enterprise have to benefit by publishing Linked Data?
  • Consuming Linked Data: What does an Enterprise have to benefit from consuming Linked Data?
  • Adopting Linked Data: What benefits can an Enterprise gain by deploying Linked Data technologies internally?

I think these facets whilst obviously closely related are largely orthogonal. For example I could see a scenario in which an organization consumed Linked Data but didn’t store or use it as RDF, but just fed it into existing applications. Similarly businesses could clearly adopt Linked Data as a technology without publishing or using any data to the web at all.

These issues are also largely orthogonal to the Open Data discussion: an enterprise might use, consume and publish Linked Data but this might not be completely open for others to reuse. The data may only be available behind the firewall, amongst authorised business partners, or only available to licensed third-parties. So, while the issue as to whether to publish open data is a very important aspect of the discussion, its not a defining one.

Here’s a few thoughts on each of these different facets.

Publishing Linked Data

So why might an enterprise publish Linked Data? And if that is a worthwhile goal, then is it clear how to achieve it? Lets tackle the second question first as its the simplest.

There is an increasingly large amount of good advice available online, as well as tools and applications, to support the publishing of Linked Data. We’re making good strides towards making the important transition from moving Linked Data out of the research area and into the hands of actual practitioners. The How to Publish Linked Data on the Web tutorial is an great resource but to my mind Jeni Tennison’s recent series on publishing Linked Data is an excellent end-to-end guide full of great practical advice.

We can declare victory when someone writes the O’Reilly book on the subject and do for Linked Data what RESTful Web Services did for REST. (And the two would make great companion pieces).

But technology issues aside, what are the benefits to an organization in publishing Linked Data? There are several ways to approach answering that question but I think in most discussions Linked Data tends to get compared with Web APIs. The value of creating an API is now reasonably well understood, and many of the benefits that come from opening data through an API also apply to Linked Data.

However the argument that Linked Data married with a SPARQL endpoint is as easy for developers to use as a Web API is still a little weak at this stage. SPARQL can be off-putting for developers used to simpler more tightly defined APIs. As a community we ought to consider it as a power tool and look for ways to make it easier to get started with. It’s also worth recognising that a search API is also a useful addition to a SPARQL endpoint as part of Linked Data deployment.

But publishing Linked Data can’t be directly compared to just creating an API, because its also largely a pattern for web publishing in general. Its increasingly easier to instrument existing content management systems to expose RDF(a) and Linked Data. So rather than create a custom API, which will involve expensive development costs, particularly if its going to scale, its possible to simply expose Linked Data as part of an existing website.

By following the Linked Data pattern for web publishing, in particular the use of strong identifiers, an enterprise can end up with a single point of presence on the web for publishing all of its human and machine-readable data, resulting in a website that is strongly Search Engine Optimised. Search engines can better crawl and index well structured websites and are increasingly ingesting embedded RDFa to improve search results and rankings. That’s a strong incentive to publish Linked Data by itself.

Adopting Linked Data, particularly as part of a reorganization of an existing web presence, could deliver improved search engine rankings and exposure of content whilst saving on the costs of developing and running a custom API. The longer term benefits of being part of the growing web of data can be the icing on the cake.

Consuming Linked Data

Next we can consider why an enterprise might want to consume Linked Data.

To my knowledge organizations are currently only publishing Linked Open Data (albeit with some wide variations in licensing terms), so we’ll skip for the present whether enterprises have an option of consuming non-open Linked Data, e.g. as part of a privately licensed dataset.

The LOD Cloud is still growing and provides a great resource of highly interlinked data. The main issues that face an organization consuming this data are ones of quantity (there’s still a lot more data that could be available); quality (how good is the data, and how well is it modelled); and trust (picking and choosing reliable sources).

To some extent these issues face any organization that begins relying on a third-party API or dataset. However at present a lot of the data in the LOD cloud is still from secondary sources. The same can’t be said for the majority of web APIs, which tend to be published by the original curators of the data.

These issues should resolve themselves over time as more primary sources join the LOD cloud. Because Linked Data is all based on the same data model bulk loading and merging data from external sources is very simple. This gives enterprises the option of creating their own mirrors of LOD data sources which will provide some additional reassurances around stability and longevity.

Linked Data, with its reliance on strong identifiers, is much easier to navigate and process than other sources, even if you’re not storing the results of that processing as RDF. There’s also a much greater chance of serendipity, resulting in the discovery of new data sources and new data items. Whereas there is virtually no serendipity in a Web API as each API needs to be explicitly integrated.

But this benefit is only going to become evident if we continue to put effort into helping (enterprise) developers understand how to consume Linked Data. E.g. as part of existing frameworks or using new data integration patterns is another area that needs more attention. The Consuming Linked Data tutorial at ISWC 2009 was a good step in that direction, although the message needs to be circulated wider, outside of the core semantic web community.

In my opinion it will be easier for enterprises to consume Linked Data if they first begin to publish it. By publishing data they are putting their identifiers out into the wild. These identifiers become points for annotation and reuse by the community, creating liminal zones from which the enterprise can harvest and filter useful data. This is a benefit that I think is unique to Linked Data as with an Web API the end results are typically mashups or widgets displaying in a third-party application; these are just new silos one step removed from the data publisher.

Adopting Linked Data

Finally, what value could be gained if an organization adopts Linked Data internally as a means to manage and integrate data behind the firewall?

The issues and potential benefits here are largely a mixture of the above, except that there are little or no issues with trust as all of the data comes from known sources. In a typical enterprise environment Linked Data as an integration technology will be compared to a wider range of systems ranging from integrated developer tools through to middleware systems. There’s a reason why SOAP based systems are still well used in enterprise IT as most organizations aren’t (yet?) internally organized as if they were true microcosms of the web.

Its interesting to see that Linked Data can potentially provide a means for solving many of the issues that Master Data Management is trying to address. Linked Data encourages strong identifiers; clean modelling; and linking to, rather than replicating data. These are core issues for data consolidation within the enterprise. Coupled with the ability to link out to data that is part of the LOD Cloud, or published by business partners, Linked Data has the potential to provide a unifying infrastructure for managing both internal and external data sources.

Its worth noting however that semantic technologies in general, e.g. document analysis, entity extraction, reasoning and ontologies seem to be much more widely deployed in enterprise systems than Linked Data. This is no doubt in large part because the advantages of those technologies may currently be much more easily articulated as they’re more easily packaged into a product.

Summary

In this post I wanted to tease out some of the questions that underpin the discussions about enterprise adoption of Linked Data. I’ve presented a few thoughts on those questions and I’d love to hear your opinions.

Along the way I’ve attempted to highlight some areas where we need to focus to help transition from a researcher-led to a practioner-led community. More data, more documentation, and more tools are the key themes.


21
Dec 09

SPARQL Extension Function Survey Summary

This post contains the first set of results from my SPARQL extension survey. I’ve completed an initial survey of a number of different SPARQL processors to itemise the extension functions that each of them have implemented. This will be an ongoing activity as implementations evolve continually, but I thought it would be useful to summarise my findings so far.

If you want to look at the results for yourselves, then I’ve created a publically accessible Google Spreadsheet that lists all of the results. The first tab of the spreadsheet includes the list SPARQL endpoints/processors that I’ve surveyed.

I completed the initial round of the survey a few weeks ago, so any updates since then won’t have been included.

List of Implementations

The full list of surveyed processors/endpoints consists of:

  • Allegrograph
  • ARQ
  • Corese
  • Geospatialweb project
  • Mulgara
  • OpenAnzo
  • Openlink Virtuoso
  • Sesame
  • TopBraid product suite
  • XMLArmyKnife.com

If I’ve missed any other implementations that support extension functions then please let me know. I’m aware that other engines also support property functions, but I’ve not included this type of extension in my first survey round. I’ll be exploring that area in the new year.

I want to thank the implementers of a number of these systems for providing me with additional information, feedback and support as I’ve compiled the results. If anything has been misrepresented or simply missed, then you have my apologies and I will endeavour to fix any reported problems ASAP. The goal is to perform a fair, objective survey of the current situation: I’m not pushing any agenda here, other than a desire for convergence and continual improvement.

Breakdown of Results

The currently implemented extension functions can be organised into the following categories:

  • String
  • Date/Time
  • Math/Logic
  • RDF/Graph Manipulation
  • Geospatial
  • Network

The first three categories, covering string, date, and mathematical manipulations have the largest number of functions. This is as expected as these areas are the most useful for any programming or query language. Given that extension functions are restricted to value testing in SPARQL 1.0, then you would also assume that they would be most commonly used to provide additional flexibility when comparing strings, manipulating and comparing dates, and performing simple mathematical functions.

Very few implementations offer any functions in the remaining categories. I had originally expected to find more functions in the Geospatial category but I think that the majority of exploration in that area has focused on using property functions instead.

I would expect to see the number of distinct functions in each area to grow with the delivery of SPARQL 1.1, if it becomes possible to use them as part of a SELECT expression, e.g. to create new values/bindings, as well as just in FILTER tests. Those implementations that already offer a wide range of additional functions, such as Virtuoso, already have additional SPARQL language extensions that allow functions to be used in this way.

Currently however the numbers are inflated due to repeated implementation of the same function in different engines. For example ARQ, Virtuoso and Corese all have their own variant of a “contains” function.

Portability

This brings me to the topic of query portability. A SPARQL query is portable if it can run unchanged on any SPARQL processor. A query is not portable if it uses proprietary extensions that are not supported on other processors. Implementers can increase portability by supporting each others extensions or by converging on a common set of functions. As a standard develops, you’d expect to see some replication of functions across engines before pressures from users, and a better understanding of the utility of various extensions, encourages convergence.

It’s encouraging to see that some replication of functions is happening across SPARQL engines. For example both Mulgara and TopQuadrant support a basic set of string functions that were originally provided by the ARQ engine. These functions are part of the XPath Functions and Operators library which acts as a handy “off-the-shelf” set of function definitions for SPARQL implementors to converge around. Mulgara also now supports a number of the EXSLT functions which can act as another reference point for useful function definitions.

Looking at the list of extensions, its easy to see that more convergence could take place as there are plenty of other extension functions that have been independently implemented. Expanding the set of commonly used functions in SPARQL is currently a time-permitting feature for SPARQL 1.1.

Replication of functions across implementations is partially hampered because of a couple of non-standard ways that extension functions have been implemented. For example both Corese and Virtuoso implement their extension functions as language extensions, i.e. they don’t quite conform to the SPARQL 1.0 recommendation. Corese doesn’t associate its functions with a URI, i.e. they are just functions that are exposed in the basic language. The Virtuoso “bif” (built-in) functions are used with a prefix (e.g. bif:contains) but this prefix is not (and cannot) be associated with a URI. In both cases this means that implementations cannot replicate the functions using existing extension points: they’d have to be implemented with similar language extensions, or query rewriting.

Conclusions and Recommendations

I’m encouraged to see the wide range of experimentation that has been taking place around SPARQL extensions as it illustrates that developers are exploring how to use the language in a variety of ways. Extensions also indicate areas where the query language could be extended to encourage interoperability and address common issues faced by developers.

There are clearly a common set of functions around strings, dates and mathematical operators that ought to be available as a core part of the language. If the SPARQL 1.1 specification doesn’t end up defining this then I’d like to encourage the implementer community to do further work to explore replicating useful extensions or converging on a common set outside of the Working Group.

To help this process along it would be useful for developers to provide more feedback on the functions they provide useful, and for some statistics to be gathered around which functions are being commonly used in practice.

Right now there are a common set of functions available from the ARQ engine that are implemented in at least two other SPARQL processors. The same functions can be ported to other engines with a minimum of query rewriting, often with little more than changes to query prefixes.

My other recommendation at this stage is that implementers need to work harder on documenting the extensions they provide. Some engines have pretty good documentation, but for others the documentation is either hard to find or clearly lagging behind the latest code base. Publishing documentation about extensions, ideally with examples, really does help developers get started much quicker.


20
Dec 09

Approaches to Publishing Linked Data via Named Graphs

This is a follow-up to my previous post on managing RDF using named graphs. In that post I looked at the basic concept of named graphs, how they are used in SPARQL 1.0/1.1, and discussed RESTful APIs for managing named graphs. In this post I wanted to look at how Named Graphs can be used to support publishing of Linked Data.

There are two scenarios I’m going to explore. The first uses Named Graphs in a way that provides a low friction method for publishing Linked Data. The second prioritizes ease of data management, and in particular the scenario where RDF is being generated by converting from other sources. Lets look at each in turn and their relative merits.

Publishing Scenario #1: One Resource per Graph

For this scenario lets assume that we’re building a simple book website. Our URI space is going to look like this:


http://www.example.org/id/{thing}/{id}
http://www.example.org/doc/{thing}/{id}

The first URI being the pattern for identifiers in our system, the second being the URI to which we’ll 303 clients in order to get the document containing the metadata about the thing with that identifier. We’ll have several types of thing in our system: books, authors, and subjects.

The application will obviously include a range of features such as authentication, registration, search, etc. But I’m only going to look at the Linked Data delivery aspects of the application here in order to highlight our Named Graphs can support that.

Our application is going to be backed by a triplestore that offers an HTTP protocol for managing Named Graphs, e.g. as specified by SPARQL 1.1. This triplestore will expose graphs from the following base URI:

http://internal.example.org/graphs

The simplest way to manage our application data is to store the data about resource in a separate Named Graph. Each resource will therefore be fully described in a single graph, so all of the metadata about:

http://www.example.org/id/book/1234

with be found in:

http://internal.example.org/graphs/book/1234

The contents of that graph will be the Concise Bounded Description of http://www.example.org/id/book/1234, i.e. all its literal properties, any related blank nodes, as well a properties referencing related resources.

This means delivering the Linked Data view for this resource is trivial. A GET request to http://www.example.org/doc/book/1234 will trigger our application to perform a GET request to our internal triplestore at http://internal.example.org/graphs/book/1234.

If the triplestore supports multiple serializations then there’s no need for our application to parse or otherwise process the results: we can request the format desired by the client directly from the store and then proxy the response straight-through. Ideally the store would also support ETags and/or other HTTP caching headers which we can also reuse. ETags will be simple to generate as it will be easy to track whether a specific Named Graph has been updated.

As the application code to do all this is agnostic to the type of resource being requested, we don’t have to change anything if we were to expand our application to store information about new types of thing. This is the sort of generic behaviour that could easily be abstracted out into a reusable framework.

Another nice architectural feature is that it will be easy to slot in internal load-balancing over a replicated store to spread requests over multiple servers. Because the data is organised into graphs there are also natural ways to “shard” the data if we wanted to replicate the data in other ways.

This gets us a simple Linked Data publishing framework, but does it help us build an application, i.e. the HTML views of that data? Clearly in that case we’ll need to parse the data so that it can be passed off to a templating engine of some form. And if we need to compose a page containing details of multiple resource then this can easily be turned into requests for multiple graphs as there’s a clear mapping from resource URI to graph URI.

When we’re creating new things in the system, e.g. capturing data about a new book, then the application will have to handle any newly submitted data, perform any needed validation and generate an RDF graph describing the resource. It then simply PUTs the newly generated data to a new graph in the store. Updates are similarly straight-forward.

If we want to store provenance data, e.g. an update history for each resource, then we can store that in a separate related graph, e.g. http://internal.example.org/graphs/provenance/book/1234.

Benefits and Limitations

This basic approach is simple, effective, and makes good use of the Named Graph feature. Identifying where to retrieve or update data is little more than URI rewriting. It’s well optimised for the common case for Linked Data, which is retrieving, displaying, and updating data about a single resource. To support more complex queries and interactions, ideally our triplestore would also expose a SPARQL endpoint that supported querying against a “synthetic” default graph which consists of the RDF union of all the Named Graphs in the system. This gives us the ability to query against the entire graph but still manage it as smaller chunks.

(Aside: Actually, we’re likely to want two different synthetic graphs: one that merges all our public data, and one that merges the public data + that in the provenance graphs.)

There are a couple of limitations which we’ll hit when managing data using the scenario. The first is that the RDF in the Linked Data views will be quite sparse, e.g the data wouldn’t contain the labels of any referenced resources. To be friendly to Linked Data browsers we’ll want to include more data. We can work around this issue by performing two requests to the store for each client request: the first to get the individual graph, the second to perform a SPARQL query something like this:


CONSTRUCT {
 <http://www.example.org/id/book/1234> ?p ?referenced.
 ?referenced rdfs:label ?label.
 ?referencing ?p2 <http://www.example.org/id/book/1234>.
 ?refencing rdfs:label ?label2.
} WHERE {
 <http://www.example.org/id/book/1234> ?p ?referenced.
 OPTIONAL {
   ?referenced rdfs:label ?label.
 }
 ?referencing ?p2 <http://www.example.org/id/book/1234>.
 OPTIONAL {
   ?refencing rdfs:label ?label2.
 }
}

The above query would be executed against the union graph of our triplestore and would let us retrieve the labels of any resources referenced by a specific book (in this case), plus the labels and properties of any referencing resources. This query can be done in parallel to the request for the graph and merged with its RDF by our application framework.

The other limitation is also related to how we’ve chosen to factor out the data into CBDs. Any time we need to put in reciprocal relationships, e.g. when we add or update resources, then we will have to update several different graphs. This could become expensive depending on the number of affected resources. We could potentially work around that by adopting an Eventual Consistency model and deferring updates using a message queue. This lets us relax the constraint that updates to all resources need to be synchronized, allowing more of that work to be done both asynchronously and in parallel. The same approach can be applied to manage list of items in the store, e.g. a list of all authors: these can be stored as individual graphs, but regenerated on a regular basis.

The same limitation hits us if we want to do any large scale updates to all resources. In this case SPARUL updates might become more effective, especially if the engine can update individual graphs, although handling updates to the related provenance graphs might be problematic. What I think is interesting is that in this data management model this is the only area in which we might really need something with the power of SPARUL. For the majority of use cases graph level updates using simple HTTP PUTs coupled with a mechanism like Changesets are more than sufficient. This is one reason why I’m so keen to see attention paid to the HTTP protocol for managing graphs and data in SPARQL 1.1: not every system will need SPARUL.

The final limitation relates to the number of named graphs we will end up storing in our triplestore. One graph per resource means that we could easily end up with millions of individual graphs in a large system. I’m not sure that any triplestore is currently handling this many graphs, so there may be some scaling issues. But for small-medium sized applications this should be a minor concern.

Publishing Scenario #2: Multiple Resources per Graph

The second scenario I want to introduce in this posting is one which I think is slightly more conventional. As a result I’m going to spend less time reviewing it. Rather than using one graph per resource, we instead store multiple resources per Named Graph. This means that each Named Graph will be much larger, perhaps including data about thousands of resources. It also means that there may not be a simple mapping from a resource URI to a single graph URI: the triples for each resource may be spread across multiple graphs, although there’s no requirement that this be the case.

Whereas the first scenario was optimised for data that was largely created, managed, and owned by a web application, this scenario is most useful when the data in the store is derived from other sources. The primary data sources may be a large collection of inter-related spreadsheets which we are regularly converting into RDF, and the triplestore is just a secondary copy of the data created to support Linked Data publishing. It should be obvious that the same approach could be used when aggregating existing RDF data, e.g. as a result of a web crawl.

To make our data conversion workflow system easier to manage it makes sense to use a Named Graph per data source, i.e. one for each spreadsheet, rather than one per resource. E.g:


http://internal.example.org/graphs/spreadsheet/A
http://internal.example.org/graphs/spreadsheet/B
http://internal.example.org/graphs/spreadsheet/C

The end result of our document conversion workflow would then be the updating or replacing of a single specific Named Graph in the system. The underlying triplestore in our system will need to expose a SPARQL endpoint that includes a synthetic graph which is the RDF union of all graphs in the system. This ensures that where data about an individual resource might be spread across a number of underlying graphs, that a union view is available where required.

As noted in the first scenario we can store provenance data in a separate related graph, e.g. http://internal.example.org/graphs/provenance/spreadsheet/A.

Benefits and Limitations

From a data publishing point of view our application framework can no longer use URI rewriting to map a request to a GET on a Named Graph. It must instead submit SPARQL DESCRIBE or CONSTRUCT queries to the triplestore, executing them against the union graph. This lets the application ignore the details of the organisation and identifiers, of the Named Graphs in the store when retrieving data.

If the application is going to support updates to the underlying data then it will need to know which Named Graph(s) must be updated. This information should be available by querying the store to identify the graphs that contain the specific triple patterns that must be updated. SPARUL request(s) can then be issued to apply the changes across the affected graphs.

The difficult of co-ordinating updates from the application with updates from the document conversion (or crawling) workflow means that this scenario may be best suited for read-only publishing of data.

Its clear that this approach is much more optimised to support the underlying data conversion and/or collection workflows that the publishing web application. The trade-off doesn’t add much more complexity to the application implementation, but doesn’t exhibit some of the same architectural benefits, e.g. easy HTTP caching, data sharding, etc, that the first model exhibits.

Summary

In this blog post I’ve explored two different approaches to managing and publishing RDF data using Named Graphs. The first scenario described an architecture that used Named Graphs in a way that simplified application code whilst exposing some nice architectural properties. This was traded off against ease of data management for large scales updates to the system.

The second scenario was more optimised data conversion & collection workflows and is particularly well suited for systems publishing Linked Data derived from other primary sources. This flexibility was traded off against slightly more complex application implementation.

My goal has been to try to highlight different patterns for using Named Graphs and how those patterns place greater or lesser emphasis on features such as RESTful protocols for managing graphs, and different styles of update language. In reality an application might mix together both styles in different areas, or even at different stages of its lifecycle.

If you’re using Named Graphs in your applications then I’d love to hear more about how you’re making use of the feature. Particularly if you’ve layered on additional functionality such as versioning and other elements of workflow.

Better understanding of how to use these kinds of features will help the community begin assembling good application frameworks to support Linked Data application development.


16
Dec 09

Annotated Data

One of the things I’ve always liked about the Semantic Web vision is the idea that “Anyone can say Anything, Anywhere” (hereafter: The AAA Principle). That I can publish data about anything; and which links to and annotates data that other people are publishing elsewhere. I’ve been thinking recently whether we’ve spent a lot of time focusing on the publishing of data and not enough about annotation. Some of this thinking is potentially heretical so I’m hoping for an interesting debate!

Before I leap into the heresy, lets review the key steps of publishing Linked Data:

  1. Use URIs as names for things
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL)
  4. Include links to other URIs. so that they can discover more things.

The dominant publishing pattern for Linked Data is for people to mint new URIs for their resources in a domain that they control. We then make links to other sources by using them as the object of statements in our data; owl:sameAs links are a special case of linking that asserts equality between the subject and object of that specific statement. Through this approach we tick off all of the Linked Data publishing steps.

Some people have argued that maybe we can drop the requirement of using RDF & SPARQL and still have “linked data”. I don’t agree with that, largely because the term already has a precise definition and so muddying it doesn’t really help the discussion. Publishing of data using HTTP URIs, using formats that natively define a linking mechanism, is to my mind simply “RESTful data publishing”. I’ve already recently referred to this as “web integrated data“. I mention this because its an approach to data publishing that only uses three of the four Linked Data publishing guidelines.

What would happen if we chose to follow some other subset of the guidelines? In fact, what if we didn’t assign URIs to things, or publish data at those URIs, and instead just published RDF to the web?

If we want to take advantage of The AAA Principle then technically we don’t need to assign URIs to things. Or rather, to be precise, we don’t need to assign new URIs to things. We can simply reuse someone else’s URI; no need to mint a new one. We also don’t need to publish data at those URIs: we just need to make sure that the data is linked into the growing web of data and is therefore discoverable. We can do this and still use/publish RDF. Lets refer to this form of publishing as “Annotated Data”, to distinguish it from Linked Data and Web Integrated Data.

Annotation is about publishing additional data about things that are already in the web. For that simple use case the need to deploy a Linked Data publishing framework is potentially overkill: publishing a document to a web server is all the machinery I need. Obviously by using someone else’s URIs I’m buying into the longevity of that URI space and the meaning of those identifiers. This may not be the right thing for some applications, but for many common use cases it may be good enough. Also, over time, as we get more hubs in the web of data, certain URI spaces are going to become much more stable because people will need them to be so in order to be reliable platforms upon which applications can be constructed. To put that another way: if we’re too fearful about relying on other peoples identifiers then we’ve got bigger problems.

Clearly if we’re just publishing RDF documents which contain statements about other people’s URIs then we can’t publish data at those URIs. So how will our annotations be found? How will it become part of the web of data? This is actually not that different to the current situation. Any given RDF data set may have links to a small number of other data sets, but it will never comprehensively have links to all possible related datasets. That level of co-ordination just isn’t achievable. It may also not be desirable: there may be valid reasons why I don’t want to have reciprocal links to everyone who links to me, e.g. spam or other untrusted data sources. The solution here is that services like sameas.org or sindice let us search and locate documents that refer to a specific resource, or other resources that have declared an equivalence. This same solution works for publishing Annotated Data: if we can ping a service or crawler that will index the content of our document then this small additional part can be linked into the whole. The current document web is not fully linked, so there’s no reason to expect the web of data to be either — there will always be the need for bridging/linking services.

What I’m describing here is broadly what we used to do in the early days of FOAF: we just published RDF documents with rdfs:seeAlso links and crawled them to compile data. This scruffy, lo-fi approach to the web of data was based on the assumption that having strong identifiers for things (particularly people) may not scale or be socially acceptable. It was also based on having more flexible notions of data merging; identification by description (”smushing”) gave us a little more leeway. Now we promote use of strong identifiers and strong notions of equality using owl:sameAs. This is clearly progress, as evidenced by the much larger collections of data we’ve created. But there are concerns about whether owl:sameAs may be too formal for lightweight Linked Data integration. Perhaps we could see these approaches as opposite ends of the spectrum, and be willing to explore more of the middle-ground?

Some questions that occur to me are:

  • Why not encourage people to reuse strong identifiers rather than create new ones. This reduces need for owl:sameAs linking, and makes it even easier to merge data.
  • Can smushing and approaches to using rdfs:seeAlso be more widely promoted/discussed as an approach to linking/fusion?
  • Can we create simple data annotation tools that let people contribute to the web of data without requiring that they follow all of the Linked Data principles?

The notion of Annotated Data I’ve described in this post is an attempt to start that conversation. Because it lowers the bar to contribution, it may be easier to move people up the “on ramp” to contributing to the web of data. And arguably as the web of data grows, increasingly what people and organizations will be doing is annotating existing resources rather than creating new ones.

As a concrete use cases, why not encourage publishers to simply publish RDF documents listing the foaf:topic’s of their content, but using dbpedia, or Freebase, or OpenCalais URIs as the topic URIs? This is simpler than publishing full Linked Data, is lower cost, and is fairly trivial to do using RDFa. They might later want to adopt more of the Linked Data publishing principles if they want more control over their URI schemes or are prepared to invest deeper in the technology.

Heresy or just good use of the full range of hypertext publishing mechanisms we have in RDF? Let me know your thoughts.


20
Nov 09

Web Integrated Data

Last Friday I spoke at the Open Knowledge Foundation Open Data & The Semantic Web event. I was giving the opening talk of the day and thought that I’d take the opportunity to lay out a view that I’ve been meaning to articulate for some time: that integrating data with the web maximises its utility. Moving from data dumps, through APIs, and to Linked Data we maximize utility by reducing the amount of effort required to interact with data.

While there’s clearly still a lot of work to do around creating ways to visualise and explore Linked Data, the simply utility of being able to browse a dataset means that we move beyond publishing for a developer audience to publishing for anyone who can wield a browser. This is the angle to the Semantic Web vision that is most often overlooked in my opinion.

Developers often claim that “I can do the same thing using technology X, so why use technology Y”. In this early adopter phase of the Semantic Web its perfectly valid and important to critique the technology; to measure its ease of use and benefits for developers. But for me the end game is to move to a world where anyone can easily do complex manipulations on data — without resorting to writing code — because there’s enough machine support to make it achievable. That’s what standard vocabularies and a common data model enables. And its a natural part of the evolution towards increasingly declarative ways of manipulating information.

I’ll do a proper write-up of the presentation some other time, but for now here are the slides:


19
Nov 09

Linked Data Liminal Zones

One of the things that has interested me for some time now is how RDF and Linked Data enables communities to enrich information published by organizations, e.g. by annotating it with additional properties and relationships (links). This is after all, one of the intended goals of the technology: to make it easier for people to converge on common names for things and collectively share data about those things.

The ability to publish URIs for things, and then have those URIs decorated by a motivated community with additional metadata, provides organizations with an interesting way to take advantage of Linked Data. The enriched data can be reused by the organization to improve its own datasets and used to drive improved processes, new product development, etc.

The interesting angle is that while both the organization and the community directly benefits from the sharing (both gain access to data they wouldn’t have normally, or at least without extra expense) there are some asymmetries in the relationship. Specifically, an organization worried about its brand is likely to have higher, or at least different, standards for reliability and quality than its community; especially so it we consider only the non-commercial users in that community. Before the organization may ingest and republish this data (e.g. on its website) then those standards and a certain amount of filtering may need to be applied.

I like to think of these contributions as being in a “liminal zone” between the authoritative content that is completely owned and managed by the publisher and the stuff that exists out there on the Wild Wild Web which is only tangentially related (at best). There’s a zone of transition between the two spaces, where the data and the URIs start out being owned by the publisher then embraced, adopted (and even co-opted) by a community. A user may want to freely navigate between these different areas and apply their own rules about quality, reliability or general bozo filtering. And they can end up in a very different space to where they started. An organization may want to act quite differently; in terms of what and how much they fetch, and how they use what data they collect.

The following diagram attempts to sketch out this liminal zone from the perspective of the BBC.

Its the user annotations that annotate or relate to the BBC URIs that form the liminal zone between the authoritative publisher-sourced data, and the rest of the content on the web. You could put almost any organization into that central space and the same relationship would hold. Its the strong identifiers associated with Linked Data that connects up the internal and external views of the data.

I recently commissioned a project at Talis called Fanhu.bz which aims to help surface content and contributions that exist in this liminal zone. I see it as a first step towards exploring some of these subtle data sharing issues. Mapping out the fringes of Linked Data sets, as exemplified by BBC Programmes and Music, and then exploring how that data can be remixed and reused not only by the community but also by the publisher themselves, is an attempt to explore models for consuming Linked Data that goes beyond simple re-publishing and visualisation. The technology has a lot more to offer. And when we talk about “Linked Data for the Enterprise” I think we need to be thinking beyond just internal data integration.


05
Nov 09

Describing SPARQL Extension Functions

At the end of my recent post on Surveying and Classifying SPARQL Extensions I noted that I wanted to help encourage implementors to publish useful documentation about their SPARQL Extensions. If you’re interested in the current state of that survey then you can check out my current spreadsheet listing known extension functions. There are more to add there, but its a good summary of the current state of play.

At VoCamp DC last week I did some work on designing a small vocabulary for describing SPARQL Extensions. The first draft of this is online here: SPARQL Extension Descriptions. There’s a little bit of background on the Vocamp wiki too, if you want to see my working :) .

Here’s an example of the vocabulary in use, describing some extensions to the ARQ SPARQL Engine:


<http://jena.hpl.hp.com/ARQ/function> a sed:FunctionLibrary;
  dc:title "ARQ Function Library";
  dc:description "A collection of SPARQL extension functions
      implemented by the ARQ engine";
  foaf:homepage <http://jena.sourceforge.net/ARQ/library-function.html>;
  sed:includes <http://jena.hpl.hp.com/ARQ/function#sha1sum>.

<http://jena.hpl.hp.com/ARQ/function#sha1sum>
  a ssd:ScalarFunction;
  rdfs:label "sha1sum";
  dc:description "Calculate the SHA1 checksum
       of a literal or URI.";
  sed:includedIn <http://jena.hpl.hp.com/ARQ/function#>.

<http://jena.hpl.hp.com/ARQ#self> a sed:SparqlProcessor;
  foaf:homepage <http://jena.hpl.hp.com/ARQ>;
  rdfs:label "ARQ";
  sed:implementsLibrary <http://jena.hpl.hp.com/ARQ/function>;

Ideally what should happen is that every URI associated with a filter function and property function should be dereferencable, and that terms from this vocabulary be used to describe those functions. There’s a lot more detail that could be included, but I suspect this is sufficient to cover the primary use cases, i.e. documentation and validation.

The draft SPARQL 1.1. Service Description specification does cover some of this ground, but falls short in a few places, and I think some of what I’ve described here could usefully be folded into that specification without greatly extending its scope. But thats a matter for the Working Group to decide.

One specific issue is that the specification doesn’t currently recognise “functional predicates” (to use Lee Feigenbaum’s preferred term; others include “property functions” and “magic properties”) as a distinct class of extensions. They clearly exist, so I think we should have a means to describe them. In fact arguably they are the most important class of SPARQL extensions that need describing.

Filter functions are relatively well understood and can clearly be identified based on where they appear in a query. Language extensions will generate a parser error if an endpoint doesn’t support them, so will easily be caught. But functional predicates use existing turtle triple pattern syntax, but typically involve triggering custom logic in the SPARQL processor, rather than actually appearing as triples within the dataset. Without the ability to dereference their URIs and identify them as a functional predicate, a SPARQL engine will simply treat them as a triple pattern and fail silently, rather than complaining that the extension is not supported.

The following example query illustrates this:


PREFIX list: <http://jena.hpl.hp.com/ARQ/list#>
PREFIX func: <http://jena.hpl.hp.com/ARQ/function#>
PREFIX dc: <http://purl.org/dc/terms/>
PREFIX ex: <http://example.org/vocab/>

SELECT ?doc ?contributor WHERE {
   ?s dc:modified ?created.
   ?s ex:authors ?authorList.
   ?authorList list:member ?author.
   LET ( ?contributor := ?author )
   FILTER ( ?created < func:now() )
}

The above query contains 3 extensions: a language extension (LET); a filter function (func:now()); and a functional predicate (list:member). Without prior knowledge of that predicate, or the ability to dereference its URI, there’s no way to know that the functional predicate is not really a triple that the query author is attempting to match against, rather than an extension.

I’d like to urge all implementors to consider making their extension URIs dereferencable. The schema I’ve drafted is very light-weight so shouldn’t be difficult to support. I’m also very happy to take comments on its design. I’m intending it as a starting point for others to help build upon.


05
Nov 09

Managing RDF Using Named Graphs

In this post I want to put down some thoughts around using named graphs to manage and query RDF datasets. This thinking is prompted is in large part by thinking how best to use Named Graphs to support publishing of Linked Data, but also most recently by the first Working Drafts drafts of SPARQL 1.1.

While the notion of named graphs for RDF has been around for many years now, the closest they have come to being standardised as a feature is through the SPARQL 1.0 specification which refers to named graphs in its specification of the dataset for a SPARQL query. SPARQL 1.1 expands on this, explaining how named graphs may be used in SPARQL Update, and also as part of the new Uniform HTTP Protocol for Managing RDF Graphs document.

Named graphs are an increasingly important feature of RDF triplestores and are very relevant to the notion of publishing Linked Data, so their use and specification does bear some additional analysis.

What Are Named Graphs?

Named Graphs turn the RDF triple model into a quad model by extending a triple to include an additional item of information. This extra piece of information takes the form of a URI which provides some additional context to the triple with which it is associated, providing an extra degree of freedom when it comes to managing RDF data. The ability to group triples around a URI underlies features such as:

  • Tracking provenance of RDF data — here the extra URI is used to track the source of the data; especially useful for web crawling scenarios
  • Replication of RDF graphs — triples are grouped into sets, labelled by a URI, that may then be separately exchanged and replicated
  • Managing RDF datasets — here the set of triples may be an entire RDF dataset, e.g. all of dbpedia, or all of musicbrainz, making it easier to identify and query subsets within an aggregation
  • Versioning — the URI identifies a set of triples, and that URI may be separately described, e.g. to capture the creation & modification dates of the triples in that set, who performed the change, etc.
  • Access Control — by identifying sets of triples we can then record access control related metadata

…and many more. There’s some useful background available on Named Graphs in general in a paper about NG4J, and specifically on their use in OpenAnzo.

Clearly there’s some degree of overlap between these different approaches, but then you’d expect that given that they’re all built on what is a fairly simple extension to the RDF model. Two of the key differentiators are:

  • Granularity: i.e. does the named graph relate to a discrete identifiable subset of a dataset, e.g. every statement about a specific resource, or does it identify the dataset itself, e.g. dbpedia
  • Concrete-ness: do the named graphs relate to how the data is actually being managed or stored; or does it instead reflect some other useful partitioning of the data?

One of the nice things about the simplicity of Named Graphs is that you can do so many things with that extra degree of freedom, i.e. by managing quads rather than triples.

Exchanging Named Graphs

Clearly if we’re working with Named Graphs then it would be useful if there were a way to exchange them. Being able to serialize RDF quads would allow a complete Named Graph to be transferred between stores. Actually, for some uses of Named Graphs this may not be required. For example if I’m using Named Graphs to as a means to track which triples came from which URIs during a web crawl I only need to serialize the quads if I decide to move data between the stores, not as part of the basic functionality.

Unsurprisingly none of the standard RDF vocabularies are capable of serializing Named Graphs, however there are two serializations that have been developed to support their interchange: TriG and TriX. TriG is a plain text format, which is a variant of Turtle, while TriX is a highly normalized XML format for RDF that includes the ability to name graphs.

Named Graphs in SPARQL 1.0

Lets look at how Named Graphs are used in SPARQL 1.0 and in the SPARQL 1.1 drafts. SPARQL 1.0 explains that a query executes against an RDF Dataset which “…represents a collection of graphs. An RDF Dataset comprises one graph, the default graph, which does not have a name, and zero or more named graphs, where each named graph is identified by an IRI. A SPARQL query can match different parts of the query pattern against different graphs“.

In practice one uses the FROM and FROM NAMED clauses to identify the default and named graphs, and the GRAPH keyword to apply triple pattern matches to specific graphs. There’s a few things to observe here already, some of which are consequences of the above, some from wording in the SPARQL protocol:

  • A SPARQL endpoint may not support Named Graphs at all
  • A SPARQL endpoint may let you define an arbitrary dataset for your query. Some open endpoints will fetch data by dereferencing the URIs mentioned in the FROM/FROM NAMED clause but thats quite rare; mainly because of efficiency, cost, and security reasons.
  • A SPARQL endpoint may not let you define the dataset for your query, i.e. it might use a fixed dataset scoped to some backing store. Any definition of the dataset in the protocol request or query is either optional, or must match the definition of the endpoint
  • A SPARQL endpoint may let you define the default graph to be used in a query, but may not be willing/able to do arbitrary graph merges. For example in an endpoint containing dbpedia and geonames, you might be able to select FROM one of them, but not both.
  • A SPARQL endpoint may be backed by a triple store that is organized around the model of an RDF Dataset, and therefore has a fixed default graph and any number of multiple named graphs. This limits flexibility of constructing the dataset for a query, as it is fixed by the underlying storage model.
  • A SPARQL endpoint may let you query graphs that don’t physically exist in the underlying tripestore. Such a synthetic graph may be, for example, the merge of all Named Graphs in the triple store.

There may be other variations, but I’m aware of implementations and endpoints that exhibit each of those outlined above. The important thing to realise is that while SPARQL doesn’t place any restrictions on how you use named graphs, implementation decisions of the endpoint and/or the underlying triple store may place some limits on how they can be used in queries. The other important point to draw out is that the set of named graphs exposed through a SPARQL query interface may be different than the set of named graphs managed in the backing storage. This is most obvious in the case of synthetic graphs.

Synthetic graphs are a very useful feature as they can provide some useful abstraction over how the underlying data is managed and how it is queried.

For example, one might use a large number of separate named graphs when managing data, thereby making it easy to merge and manage data from different sources (e.g. a web crawl). Some applications use thousands of very small Named Graphs simply because they’re easier to manage. By using a synthetic graph which exposes all of the data through a SPARQL endpoint as if it were in fact in a single graph, then its possible to abstract over those details of storage. There are a few stores that support this kind of technique, and it can be pushed further by making the definition of the synthetic graph more flexible, e.g. the set of all graphs that are valid for between particular dates, or the set of all graphs that are related by a specific URI pattern. This approach can help abstract away management/modelling issues that are necessary for dealing with issues like versioning.

Named Graphs in SPARQL 1.1

Lets look at how SPARQL 1.1 might impact on the above scenarios. I use “might” advisedly as its still early days, we’ve only just had the first public Working Drafts and so the state of play might change.

Section 4.1 of the SPARQL 1.1 Update draft notes that: If an update service is managing some graph store, then there is no presumption that this exactly corresponds to any RDF dataset offered by some query service. The dataset of the query service may be formed from graphs managed by the update service but the dataset requested by the query can be a subset of the graphs in the update service dataset and the names of those graphs may be different. The composition of the RDF dataset may also change.

So basically the set of RDF graphs exposed by an SPARQL 1.1 Update service may be disjoint from a Query service exposed by the same endpoint. This will always be the case if the Query endpoint exposes any synthetic graphs. These, presumably overlapping, sets make sense from the perspective of wanting some flexibility in how data is managed versus how it is queried. Its likely that we’ll see implementations offer a range of options with the most likely case being that the “core” set of graphs is identical, but that an additional set may be available for querying.

SPARQL 1.1 as it currently stands, includes an Uniform HTTP Protocol for Managing RDF Graphs. I’m very happy to see this and think that its an important part of the picture for publishing RDF data on the web in a RESTful way. As part of the overall Linked Data message we’ve been saying that “your website is your API”; that by assigning clear stable URIs to things in your system and then exposing both human and machine-readable data at those URIs, then Linked Data just drops out of the design. And this is also clearly a RESTful approach.

But to make things completely RESTful then we need to not only be able to read data from those URIs, we should be able to update the data at those URIs using the uniform protocol that HTTP defines. I was always a little wary of SPARQL Update because it seemed like it might supplant a more RESTful option, but I’m encouraged by the presence of this working draft that this won’t be the case. But I don’t think the draft goes far enough in a few places: I’d like to have the ability to make changes to individual statements within a graph, as well as just whole graphs, using techniques like Changesets.

The draft currently doesn’t get into the issues surrounding how URIs might be managed on a server, instead deferring that to the implementation. But I think its an important topic to explore, so lets devote some time to it here.

Approaches to Managing Named Graphs on the Web

For the most part the mapping of graph management to the web is uncontroversial, the four HTTP verbs of GET, PUT, POST and DELETE have obvious and intuitive meanings. Some of the subtleties arise out of issues such as how are URIs assigned to graphs, and what does that URI identify?

Client Managed Graph Identifiers

There are two ways that URIs can be assigned to graphs managed in a networked store. The first and simplest is that the client assigns all URIs. To create a new graph and populate it with data, we just PUT to a new URI. Starting from a base URI, distinct from any SPARQL endpoint the service might expose, the client can build out a URI space for the graphs but just PUTting to URIs. In this scenario one really only needs GET, PUT, and DELETE. POST doesn’t have any clear role, but could be used to handle, e.g. submissions of Changesets.

Even with the simple style of client-side URIs for graphs, there’s one wrinkle we need to address. As I explained in the start of the post there may be several different reasons why someone is using Named Graphs. Using the graph identifier to keep track of the source of the data is a fairly common requirement. So this means you have several options for how URIs might be supplied:

  1. /graphs/abc — here the client is building out a collection of named graphs whose identifiers all share a common prefix, with each having a suffix. We may end up with a relatively flat structure or a hierarchical one, e.g. /graphs/abc/123. There’s no implicit requirement that graphs URIs that have a hierarchical arrangement have any formal relationship, but this does have the useful property that the URIs are hackable.
  2. /graphs/http://www.example.org/abc — this is similar to the above except the unique portion of each graph name is a complete URI. This would probably need to be encoded but I’ve omitted that for readability. This approach is useful when using Named Graphs to track the source URI of a graph.
  3. /graphs?graph=http://www.example.org/abc — this is a variant of the second option but moving the graph identifier out into a parameter rather than allowing it to be put into the path info of the base URI. I think typically the value of the parameter would be a full rather than a relative URI, but a server could support resolving URIs against a base.

Its clear that while Option 1 provides nice clean identifiers for graphs, ultimately its limiting for scenarios where the graph may have another "natural" identifier, e.g. its source. for Options 2 & 3 we have to deal with URL encoding (especially if the URI itself contains parameters). Personally of the two alternate options I think 3 is nicer, if only aesthetically. I'm not aware of any problems or limitations with performing an HTTP PUT to a URI with parameters: it is the full Request URI, including any parameters, that identifies the resource being created or updated.

Server Assigned Graph Identifiers

A server managing Named Graphs may not allow clients to assign graph identifiers. For example, the server may want to enforce a particular naming conventions for graphs. This might also be useful for clients too, e.g. if they want to throw some data into a named graph as a scratch store. What restrictions does this scenario apply?

Firstly it would require the client to POST data to be stored to a generic graphs collection (/graphs), the server would then determine the graph URI, and then return an HTTP 201 response with a Location header indicating where the client can find the data it has just stored. This way the client would know whether to find the data and could then use further requests (GET/PUT/POST/DELETE) to manage it.

To support tracking of the source of a graph, one might allow a graph parameter to be added to the URI. And, to avoid a client having to maintain a local mapping from the original graph UI to the stored alias, the server could store the value of the graph parameter as metadata associated with the graph it creates. The server could support a GET request on /graphs?graph=X, returning a 302 redirect to the URI which is acting as a local alias for graph X. The client could then PUT/POST/DELETE that resource. If a client sent a repeated POST request, identifying the same graph URI, then the server could allow this, and return a 303 See Other response rather than a 201.

Its also possible to support a hybrid approach in which a client may PUT to any URI with a base of /graphs but disallow use of graph ids that start with http://. For those URIs, the server could require that a client let it assign the id, supporting the graph parameter as described earlier in this section.

There's no right or wrong way here. The differences fall out of the different ways we can map graph management onto the HTTP protocol. While a lot is fixed (methods, response codes and their semantics) if we are aiming to be RESTful, there are still some degrees of freedom with which to play around with different mappings. The SPARQL 1.1. uniform protocol specification doesn't address this, so perhaps there's room for the community to standardise best practices or conventions. However I think it'd be useful to at least see some informational text in the document.

Conclusions

Named Graphs are an important part of the overall technical framework for managing, publishing and querying RDF and Linked Data, and its important to understand the trade-offs in different approaches to using them. Hopefully this document is a step in the right direction.

If anyone has any strong opinions on how they think Named Graphs should be managed RESTfully, then please feel free to comment on this posting. I'm very interested to hear your thoughts.

One thing that interests me is: how can we use Named Graphs to support publishing of Linked Data? That's something I'll follow up on in a separate post.


20
Oct 09

Surveying and Classifying SPARQL Extensions

I realised recently that, while a lot of work has been done on creating and exploring interesting extensions to the SPARQL query language, there has yet to be a systematic survey of the range of different extensions that are currently implemented in various RDF triplestores. Or if there has been a survey, then I’ve clearly missed it.

In order to get a better idea of what kinds of extensions are available I’ve set myself the task of surveying those currently implemented. I intend to write-up and share the results of that work through this blog.

Rationale

I think that pulling together a list of extensions is a useful activity which should:

  • Help researchers and implementors to have a clearer view of existing work, thereby encouraging further experimentation
  • Promote convergence on a core set of useful extensions that could be implemented across a number of triplestores.
  • Help users to have a clearer understanding of what SPARQL extensions are currently supported in particular triplestores, letting them make informed decisions about which extensions to use when writing and sharing queries

It looks like the SPARQL Working Group may well be adding a standard library of extension functions into the next revision of the query language so the timing of this work should help contribute to that effort. However I’m looking beyond their immediate goals and hope to encourage the implementor community to explore models simple to the EXSLT effort which has been successful in creating a set of community-designed extensions for XSLT transformations. I see no reason why the same process can’t be applied to SPARQL extensions.

Clarity of which extensions are portable across triplestores is important to allow users to experiment with various triplestore implementations and services. If data is going to be truly portable, then this will be an important consideration.

With that in mind I’ve begun digging into the available documentation for a number of different triplestores. I’ve decided to organize my work by surveying each of the three different types of SPARQL extension.

Types of SPARQL Extension Function

Its possible to extend the SPARQL query language in any of the following three ways:

  • Extension Functions
  • Property Functions (aka “Magic Predicates”)
  • Language Extensions

Lets look at each of these in turn.

Extension Functions

Extension Functions are explicitly described by the current SPARQL specification under the banner of “extensible value testing“. The standard library of extensions that may be added to SPARQL 1.1 will fall into this category. Extension Functions are simple function calls that can be used within a FILTER in a SPARQL query to carry out some specific extra logic that cannot be handled by matching triple patterns. Examples of extension functions include substring testing, string concatenation, date tests, etc.

The specification indicates that these extension functions should have a unique URI, allowing them to be globally identified. Few engines are publishing useful information at these URIs, but this seems like it would be a useful thing to do. These URIs should be grounded in the web too.

Property Functions

Property Functions (aka “Magic Predicates”, or “Magic Properties”) are extensions to the triple matching process that is carried out when a SPARQL query is executed. This means that property functions don’t appear in a FILTER expression like an extension function. They instead appear within the graph pattern of the query. Unlike extension functions which have a syntax like a conventional functional call, property functions use turtle syntax and appear, to the untrained eye, as standard triple patterns.

For example, as property function that could split a resource URI into a namespace and a localname might look like this in a SPARQL query:


?uri a rdfs:Class.
?uri ex:splitURI (?namespace ?localname).

In that example the the property function ex:splitURI has as its input each of the URIs that are bound to the ?uri variable, and as its output binds the namespace URI and localname of those URIs to two new variables.

There are other ways to structure the inputs and outputs of a property function, depending on its purpose, but the important things to recognise are that:

  • the property function is written as a conventional triple pattern
  • parameters can be passed from either the subject or object portions of the triple (or potentially both)
  • similarly, output can be bound to variables that appear in either the subject or object portions of the triple
  • one technique for passing multiple parameters or generating multiple output values is to allow specification of an RDF list in the object portion of the triple

Property functions are very powerful as they can allow arbitrary complex logic to be used to extend the triple matching process. One common use is to extend the matching process by calling out to specialised indices or logic, e.g. for full-text indexing or geospatial functions and reasoning.

It is worth noting that Property Functions are not explicitly licensed by the current SPARQL specification. The specification does not describe them at all: they are simply allowed by the fact that they conform to the overall SPARQL grammar.

Testing whether a query uses Property Functions would therefore require a validator (such as the one that Dan Brickley describes here) to either have explicit knowledge of the function, e.g. based on its URI, or for implementors to publish some useful information at those locations so that a validator might determine whether a specific predicate is actually a “real” predicate or an extension through dereferencing the URI. I’m not aware of any implementation that currently does this.

Language Extensions

The final category of SPARQL extensions are extensions to the language itself. This type of extension involves amending the grammar of the language to include new operators, keywords, and types of expression. Examples of this type of extensions include sub-queries and aggregates (e.g. min and max). The forthcoming SPARQL 1.1 specification will standardise these and a few other language extensions that have been commonly implemented.

Arguably, if one changes the grammar of a language then you’re creating a new language: “SPARQL plus some extensions”. So some care needs to be taken with respect to this type of extension if one wants queries to be portable.

In my view while there is plenty of scope for the community to collaborate and converge on common extension of all of the types I’ve described here, the best place for language extensions to be formally ratified and agreed on is through the SPARQL Working Group. I personally don’t expect the Working Group to have to, or want to sign-off on every extension function or property function, but interoperability is ultimately best served by co-ordinating language extensions through the Working Group. Naturally this should happen after the implementor community have had a period of experimentation and research. This is obviously the process that has happened to date, and hopefully this will continue as the language continues to evolve. A bit of collective action ought to help ensure interoperability in other areas.

A Survey

For my survey of SPARQL extensions I’ve decided to tackle things in the order in which I have presented them here: I will first look at Extension Functions, then Property Functions, and then Language Extensions. For the rationale and reasons I’ve already outlined, I think the community is best served by organizing itself around standardising two of those types of extensions. And Extension Functions seem like the lowest hanging fruit.

I’m intending to do the survey in as open a way as possible, and want to ensure that I include as many different implementations as possible. Having said that initially I’m going to impose some editorial control simply to ensure consistency and quality. Implementors feel free to drop me a line providing me with information on your extensions or preferably pointers to the relevant documentation. I’ll also stress that while this survey has obvious relevance for my day job, that this is a personal project so things will progress as quickly as I’m able to find some time to push things forward.

I’m going to send regular status updates to the public-sparql-dev mailing list as that is the correct place for further discussion. I’ll also summarize my findings in further blog posts here. I’ve already begun the process of cataloguing Extension Functions as you can see by my recent email to the mailing list. I still have to include some additional information helpfully provided by OpenLink and to also update the entries for Mulgara to list its support for some of the EXSLT functions.

One other task I have on my list is to help provide some guidance on how implementors should publish information about their SPARQL extensions. It would be useful to have some descriptive metadata for these available from the relevant URIs. I’m intending to spend some time at Vocamp DC pulling together a vocabulary for that purpose. Let me know if you’re attending and want to collaborate.