RDF Data Access Options, or Isn’t HTTP already the API?

This is a follow-up to my blog post from yesterday about RDF and JSON. Ed Summers tweeted to say:

…your blog post suggests that an API for linked data is needed; isn’t http already the API?

I couldn’t answer that in 140 characters, so am writing this post to elaborate a little on the last section of my post in which I suggested that “there’s a big data access gulf between de-referencing URIs and performing SPARQL queries”. What exactly do I mean there? And why do I think that the Linked Data API helps?

Is Your Website Your API?

Most Linked Data presentations that discuss publishing data to the web run through the Linked Data principles. At point three we reach the recommendation that:

“When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL)”

This encourages us to create sites that consist of a mesh of interconnected resources described using RDF. We can “follow our nose” through those relationships to find more information.

This gives us two fundamental data access options:

  • Resource Lookups: by de-referencing URIs we can obtain a (typically) complete description of a resource (a minimal HTTP exchange is sketched below)
  • Graph Traversal: following relationships and recursively de-referencing URIs to retrieve descriptions of related entities; the result is (typically, though not necessarily) reconstituted into a graph on the client
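
To make the Resource Lookup option concrete, here is a minimal sketch of the HTTP exchange involved, reusing the example URI from later in this post; the exact serialisation negotiated doesn’t matter:

    GET /bob HTTP/1.1
    Host: www.example.org
    Accept: application/rdf+xml

    HTTP/1.1 200 OK
    Content-Type: application/rdf+xml

    ...an RDF description of http://www.example.org/bob...

Graph Traversal is then just a matter of repeating that exchange for any interesting URIs mentioned in the description that comes back.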

However, if we take the “Your Website Is Your API” idea seriously, then we should be able to reflect all of the different points of interaction of that website as RDF, not just resource lookups (viewing a page) and graph traversal (clicking around).

As Tom Coates noted back in 2006 in “Native to a Web of Data”, good data-driven websites will have “list views and batch manipulation interfaces”. So we should be able to provide RDF views of those areas of functionality too. This gives us another kind of access option:

  • Listing: ability to retrieve lists/collections of things; navigation through those lists, e.g. by paging; and list manipulation, e.g. by filtering or sorting.

It’s possible to handle much of that by building some additional structure into your dataset, e.g. creating RDF Lists (or similar) of useful collections of resources. But if you bake this into your data then those views will potentially need to be re-evaluated every time the data changes. And even then there is still no way for a user to manipulate the views, e.g. to page or sort them.

So to achieve the most flexibility you need a more dynamic way of extracting and ordering portions of the underlying data. This is the role that SPARQL often fulfils: it provides some really useful ways to manipulate RDF graphs, and you can achieve far more with it than just extracting and manipulating lists of things.
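
As an illustration, here is the kind of query that might sit behind a “people, sorted by name, second page of ten” view; the FOAF vocabulary is assumed purely for the example:

    PREFIX foaf: <http://xmlns.com/foaf/0.1/>

    SELECT ?person ?name
    WHERE {
      ?person a foaf:Person ;
              foaf:name ?name .
    }
    ORDER BY ?name   # sorting
    LIMIT 10         # page size
    OFFSET 10        # skip the first page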

SPARQL also supports another kind of access option that would otherwise require traversing some or all of the remote graph.

Examples would be: “does this graph contain any foaf:name predicates?” or “does anything in this graph relate to http://www.example.org/bob?”. These kinds of existence checks, as well as more complex graph pattern matching, also tend to be the domain of SPARQL queries. It’s more expressive, and potentially more efficient, to use a query language for that kind of question. So this gives us a fourth option:

  • Existence Checks: ability to determine whether a particular structure is present in a graph
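
Both of the example questions above map directly onto SPARQL ASK queries, each sent as a separate request and returning a plain true/false:

    PREFIX foaf: <http://xmlns.com/foaf/0.1/>

    # does this graph contain any foaf:name predicates?
    ASK { ?s foaf:name ?name }

    # does anything in this graph relate to http://www.example.org/bob?
    ASK { ?s ?p <http://www.example.org/bob> }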

Interestingly, though, these are not often the kinds of questions that you can “ask” of a website. There’s no real correlation with typical web browsing features, although searching comes close for simple existence checks.

Where the Linked Data API fits in

So there are at least four kinds of data access option. I doubt whether it’s an exhaustive list, but it’s a useful starting point for discussion.

SPARQL can handle all of these options and more. Its graph pattern matching features, and its provision of four query types, let us perform any of these kinds of interaction. For example, a common way of implementing Resource Lookups over a triple store is to use a DESCRIBE or a CONSTRUCT query.
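
For instance, either of the following would retrieve a description of a resource; the second form spells out exactly which triples to return:

    # let the store decide what a useful description contains
    DESCRIBE <http://www.example.org/bob>

    # or construct the description explicitly
    CONSTRUCT { <http://www.example.org/bob> ?p ?o }
    WHERE { <http://www.example.org/bob> ?p ?o }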

However, the problem, as I see it, is that when we resort to writing SPARQL graph patterns in order to request, say, a list of people, then we’ve kind of stepped around HTTP. We’re no longer specifying and refining our query by interacting with web resources via parameterised URLs; we’re tunnelling the request for what we want in a SPARQL query sent to an endpoint.

From a hypermedia perspective it would be much better if there were a way to handle the “Listing” access option using something better integrated with HTTP. It also happens that this might be easier for the majority of web developers to get to grips with, because they would no longer have to learn SPARQL.

This is what I meant by a “RESTful API” in yesterday’s blog post. In my mind, “Listing things” sits in between Resource Lookups and Existence Checks or complex pattern matching in terms of access options.

It’s precisely this role that the Linked Data API is intended to fulfil. It defines a way to dynamically generate lists of resources from an underlying RDF graph, along with ways to manipulate those collections of resources, e.g. by sorting and filtering. It’s possible to use it to define a number of useful list views for an RDF dataset that nicely complements the relationships present in the data. It’s actually defined in terms of executing SPARQL queries over that graph, but this isn’t obvious to the end user.
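
To give a flavour of what that looks like, a Linked Data API deployment might expose URLs along these lines (the hostname, path and surname property here are invented for the example; _page and _sort are among the reserved query parameters the specification defines):

    http://api.example.org/people                  # a list of people
    http://api.example.org/people?_page=2          # the second page of that list
    http://api.example.org/people?_sort=name       # the list sorted by name
    http://api.example.org/people?surname=Smith    # the list filtered by a property value

Each of those URLs identifies a web resource in its own right, so the paging, sorting and filtering all happen through ordinary HTTP interactions.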

These features are complemented by the definition of simple XML and JSON formats, in addition to the RDF serialisations that the API supports. This is really intended to encourage adoption by making it easier to process the data using non-RDF tools.
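
Purely to give the flavour, and emphatically not the exact format the specification defines, a JSON view of a list might look something like this (all names and values invented):

    {
      "result": {
        "items": [
          { "_about": "http://www.example.org/bob",   "name": "Bob" },
          { "_about": "http://www.example.org/alice", "name": "Alice" }
        ]
      }
    }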

So, Isn’t HTTP the API?

Which brings me to the answer to Ed’s question: isn’t HTTP the API we need? The answer is yes, but we need more than just HTTP: we also need well-defined media types.

Mike Amundsen has created a nice categorisation of media types and a description of different types of factors they contain: H Factor.

Section 5.2.1.2 of Fielding’s dissertation explains that:

“Control data defines the purpose of a message between components, such as the action being requested or the meaning of a response. It is also used to parameterize requests and override the default behavior of some connecting elements.”

As it stands today neither RDF nor the Linked Data API specification ticks all of the H Factor boxes. What we’ve really done so far is define how to parameterise some requests, e.g. to filter or sort based on a property value, but we’ve not yet defined that in a standard media type; the API configuration captures a lot of the requisite information but isn’t quite there.

That’s a long rambly blog post for a Friday night! Hopefully I’ve clarified what I was referring to yesterday. I absolutely don’t want to see anyone define an API for RDF that steps around HTTP. We need something that is much more closely aligned with the web. And hopefully I’ve also answered Ed’s question.

11 thoughts on “RDF Data Access Options, or Isn’t HTTP already the API?”

  1. I must confess that I’m not fond of SPARQL.

    Well, I should paraphrase it: I have two sets of feelings on SPARQL, one quite distinct from, and much stronger than, the other.

    The first is the syntax itself: it’s similar-enough to SQL to make SQL people go “oh, this looks familiar!”, and then get horribly unstuck when they realise it works really very differently indeed.

    However, that’s a minor quibble. SPARQL’s here, it’s useful, it’s being actively worked on, and it’s integrated into lots of stuff. On that front, I largely put up and shut up save for some limited circles.

    The bigger complaint is of the gulf you’ve described in this and the previous blog post. People who aren’t part of the Linked Data “community” see those two words (or some approximate synonyms) and leap to conclusions: that they need to figure out RDF (probably), that they need to stuff everything they have in one of these newfangled triplestores (maybe), they need to expose a SPARQL endpoint (quite often not), which means learning SPARQL (again, quite often not) and figuring out how it’s all supposed to work together. To the web developer thinking “hey, this Linked Data stuff sounds pretty cool, maybe my site and its data could join in!”, it all seems awfully complicated and off-putting.

    Now, I’ll grant, this isn’t SPARQL’s fault per se, and so my statement above was a touch inaccurate, but invariably SPARQL ends up being considered to be part of the Linked Data “package”.

    To my mind, that’s crazy.

    The situation as it stands is this: there are some good, well-designed sites out there which expose RDF RESTfully, work sanely and it’s possible to do some clever stuff with them. There are some “lumps” of data out there which exist only as tarballs for the most part. There are some other lumps which are accessible with SPARQL and not much beyond. There are also a handful of useful human-facing consuming applications.

    The fact that there are relatively few sites exposing data well, and that there are even fewer consuming applications, isn’t a technical problem. I mean, it’s true, there will be technical problems along the way — and figuring out how to let clients do the equivalent of filling in an HTML form to filter a large dataset is certainly part of that — but right now, the biggest problem faced by Linked Data [in my *very* humble opinion] is a PR problem.

    Linked Data _looks_ complicated. RDF isn’t the easiest thing in the world (and I’ve run across some truly horrible attempts to explain it), but triplestores and SPARQL and such — neat enabling technology — are so close to the fore that it all ends up looking even more complicated than it is.

    So I’m left with a sense that, yes… a standardised way of filtering and querying which isn’t as overarching and… (I’m going to regret saying this) not as much of a kludge as writing a script which exposes your SQL database to the world is definitely useful and will be needed, yet right now people can’t seem to get their heads around plain-old-HTTP-as-the-API as it is, and perhaps the big conundrum which needs the attention of smart people is walking before running.

    So, er, yeah — sorry. Bit of a rant, that, and none of it should be taken as criticism, because I do fundamentally agree with you, and both of these posts have been excellent.

  2. Just a clarification, when many people are talking about an “RDF API” they actually mean codified RDF Interfaces, on a par with Array, String, DOMElement and DOMNode – to enable interoperability between libraries & tools – not a web service type API or a transfer protocol for linked data. The two are quite different.

    That aside, fully agree with both yourself and Mike Amundsen, would be interesting to get some requirements on what hypermedia semantics / factors would be required of a hyper-media-type for RDF.

  3. Nathan,

    Yes, there is a difference, and in my previous post I highlighted the importance of both of those things.

    I regret using the ambiguous term API in that post. People do use the term API to refer to both programming language interfaces and Web interfaces. This piece is intended to clarify one aspect of what I meant, which is the hypermedia aspects. I’ve not fleshed out my thinking around language interfaces, but agree with Jeni’s point that exploring path languages is a fruitful direction.

    Cheers,

    L.

    Thanks very much for the thoughtful response to my hastily-tapped tweet. I like the catch-phrase “Your website is your API”; it’s kind of like the elevator pitch of RESTful web development, I think. But I also think “website” is kind of a nebulous term, one that isn’t really part of the terminology of Web standards, architecture, etc. So its utility kind of breaks down once you get past a superficial look at how data should be made available on the web.

    It sounds like we are in agreement about HTTP being the API that Linked Data should use. But you are suggesting that the application/rdf+xml, application/sparql-query and application/sparql-results+xml media-type definitions haven’t been specified well enough, or aren’t useful enough, to enable Linked Data applications to find broader adoption?

    I feel similarly to Mo (above) in that I’m not a big fan of SPARQL, but I think it has its place. Much like you describe the DOM in your previous post: it’s a standardized way of interacting with an RDF graph, much like a developer would use the DOM to work with an XML/HTML document. Some of the interactions like update are still a bit underspecified, but people are working on that, right?

    As Mo also said above, people who are new to the Linked Data space are often given the impression that they need to ‘triplify’ their data, by pouring whatever data they have into a triple store, and then using SPARQL to query it to generate web pages and data views. Do you think the Web would’ve taken off if anyone who wanted to create some HTML had had to use an RDBMS and SQL to publish their homepage or what have you? I think not. The beautiful thing about the Web is that you could generate that HTML however you wanted, and it could link to other resources and websites that were generated some completely different way. Hypertext is the engine of application state.

    Frankly, I see SPARQL as an implementation detail. You may generate your RDF data views using SPARQL to query some triplestore…or you may do it some other way, or even manually add the assertions as RDFa to a static HTML document.

    I think your post is on to something: for Linked Data to be widely adopted we need to show how application/rdf+xml and (an official) text/turtle can be useful representations to serve up on the Web. And I also think we need to help the JSON community deal with making their data more suitable for hypermedia.

    Personally I’d like to see a standard JSON serialization format for RDF that would allow me to not scratch my head when I wanted to make idiomatic JSON available that expressed typed links to other resources. I’d like to see a JSON serialization for RDF that allowed me to read in some RDF and work with it naturally using the List and Dictionary data structures available to me in my computer language of choice. Perhaps this isn’t possible, because of all that RDF brings to the table. But I think Atom has already done it for XML, so it ought to be possible 🙂

  5. This is in line with some things I’ve been thinking recently, about the drawbacks of data access using SPARQL queries when compared to RESTful data retrieval interfaces. I’ve come up with four things:

    1. It’s specific to the Semantic Web community and has a steep learning curve for newcomers and outsiders.

    2. Formulating well-formed queries and interpreting results presents a technical burden for the client. There are few relevant tools and libraries available, and integration with standard platforms and technology stacks is challenging.

    3. It exposes the underlying technology, which causes tighter coupling. A switch to a non-RDF datastore would almost certainly require an incompatible change to the published interface.

    4. It requires the client to acquire and handle metadata about the kinds of information available in order to form specific queries. This couples client code to the details of the RDF schema in the datastore.

    I think most of these are touched on or covered by your posts, especially 1 & 2. Would you agree?

  6. I don’t agree that SPARQL over HTTP isn’t a Web API. A Web API is a RESTful web service, and that is exactly what a SPARQL endpoint with an HTTP interface is.
    Yes, it tunnels the query to a SPARQL server. Other Web APIs do the same: they take, e.g., a query parameter, pass it to the server, and answer with whatever the server gives back. I don’t agree that it would be helpful to build abbreviations just to lose the “tunnelling effect” and move to the easier “parameterised” form of querying. If you have complex APIs, it follows that you have complex queries. Anything else would cost you power. Take, for example, search-engine APIs, which are also RESTful ( http://www.elasticsearch.org/guide/reference/api/search/ ).
    Now, at the end of 2012, there are specifications for how to use SPARQL as a RESTful service ( http://www.w3.org/TR/sparql11-update/ ). Nowadays most triple stores implement this RESTful layer.

    I agree with you totally, though, that SPARQL is just a nice-to-have for doing some extra fancy things. For most people, just having a dereferenceable HTTP URI that provides RDF (as RDF/XML, Turtle, N-Triples or, as most web developers will be quite happy with, JSON-LD) is good enough.
