

3
Dec 10

RDF Data Access Options, or Isn’t HTTP already the API?

This is a follow-up to my blog post from yesterday about RDF and JSON. Ed Summers tweeted to say:

…your blog post suggests that an API for linked data is needed; isn’t http already the API?

I couldn’t answer that in 140 characters, so am writing this post to elaborate a little on the last section of my post in which I suggested that “there’s a big data access gulf between de-referencing URIs and performing SPARQL queries”. What exactly do I mean there? And why do I think that the Linked Data API helps?

Is Your Website Your API?

Most Linked Data presentations that discuss the publishing of data to the web typically run through the Linked Data principles. At point three we reach the recommendation that:


“When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL)”

This has encouraged us to create sites that consist of a mesh of interconnected resources described using RDF. We can “follow our nose” through those relationships to find more information.

This gives us two fundamental data access options:

  • Resource Lookups: by dereferencing URIs we can obtain a (typically) complete description of a resource
  • Graph Traversal: following relationships and recursively de-referencing URIs to retrieve descriptions of related entities; this is (typically, but not necessarily) reconstituted into a graph on the client
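
As a rough illustration of these two options, here is a minimal Python sketch. It assumes a hypothetical `FAKE_WEB` dictionary standing in for real HTTP requests, with made-up example.org URIs and triples held as plain tuples; a real client would dereference each URI and parse the RDF that comes back.

```python
from collections import deque

# Hypothetical stand-in for the web: maps a URI to the triples found
# in the document you get back when you dereference it.
FAKE_WEB = {
    "http://example.org/alice": [
        ("http://example.org/alice", "foaf:name", "Alice"),
        ("http://example.org/alice", "foaf:knows", "http://example.org/bob"),
    ],
    "http://example.org/bob": [
        ("http://example.org/bob", "foaf:name", "Bob"),
        ("http://example.org/bob", "foaf:knows", "http://example.org/alice"),
    ],
}


def lookup(uri, fetch=FAKE_WEB.get):
    """Resource Lookup: dereference one URI and return its description."""
    return list(fetch(uri) or [])


def traverse(start, max_hops=2, fetch=FAKE_WEB.get):
    """Graph Traversal: follow links breadth-first and merge the
    descriptions into a single graph (a set of triples) on the client."""
    graph = set()
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        uri, hops = queue.popleft()
        for s, p, o in lookup(uri, fetch):
            graph.add((s, p, o))
            # Assumption: any object that looks like an HTTP URI is a link.
            is_link = isinstance(o, str) and o.startswith("http")
            if is_link and o not in seen and hops < max_hops:
                seen.add(o)
                queue.append((o, hops + 1))
    return graph
```

The `max_hops` limit is there because "following your nose" has no natural stopping point; every client has to decide how far to wander.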

However, if we take the “Your Website Is Your API” idea seriously, then we should be able to reflect all of the different points of interaction of that website as RDF, not just resource lookups (viewing a page) and graph traversal (clicking around).

As Tom Coates noted back in 2006 in “Native to a Web of Data”, good data-driven websites will have “list views and batch manipulation interfaces”. So we should be able to provide RDF views of those areas of functionality too. This gives us another kind of access option:

  • Listing: ability to retrieve lists/collections of things; navigation through those lists, e.g. by paging; and list manipulation, e.g. by filtering or sorting.

It’s possible to handle much of that by building some additional structure into your dataset, e.g. creating RDF Lists (or similar) of useful collections of resources. But if you bake this into your data then those views will potentially need to be re-evaluated every time the data changes. And even then there is still no way for a user to manipulate the views, e.g. to page or sort them.

So to achieve the most flexibility you need a more dynamic way of extracting and ordering portions of the underlying data. This is the role that SPARQL often fulfills: it provides some really useful ways to manipulate RDF graphs, and you can achieve far more with it than just extracting and manipulating lists of things.

SPARQL also supports another kind of access option that would otherwise require traversing some or all of the remote graph.

One example would be: “does this graph contain any foaf:name predicates?” or “does anything in this graph relate to http://www.example.org/bob?”. These kinds of existence checks, as well as more complex graph pattern matching, also tend to be the domain of SPARQL queries. It’s more expressive and potentially more efficient to just use a query language for that kind of question. So this gives us a fourth option:

  • Existence Checks: ability to determine whether a particular structure is present in a graph
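
In SPARQL the first of those questions would be an ASK query along the lines of `ASK { ?s foaf:name ?o }`. To show what such a check boils down to, here is a small Python sketch over an in-memory list of triples. The sample `GRAPH` and the function names are hypothetical, and `None` plays the role of a query variable.

```python
# Hypothetical sample graph held as (subject, predicate, object) tuples.
GRAPH = [
    ("http://www.example.org/alice", "foaf:name", "Alice"),
    ("http://www.example.org/alice", "foaf:knows", "http://www.example.org/bob"),
]


def exists(graph, s=None, p=None, o=None):
    """Return True if any triple matches the pattern; None is a wildcard."""
    return any(
        (s is None or s == ts)
        and (p is None or p == tp)
        and (o is None or o == to)
        for ts, tp, to in graph
    )


def relates_to(graph, uri):
    """Does anything in the graph mention this URI as subject or object?"""
    return exists(graph, s=uri) or exists(graph, o=uri)
```

Without a query interface, a client can only answer these questions by first fetching some or all of the graph, which is exactly the cost a server-side check avoids.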

Interestingly, though, these are not often the kinds of questions that you can “ask” of a website. There’s no real correlation with typical web browsing features, although searching comes close for simple existence checks.

Where the Linked Data API fits in

So there are at least four kinds of data access option. I doubt whether the list is exhaustive, but it’s a useful starting point for discussion.

SPARQL can handle all of these options and more. The graph pattern matching features, and the provision of four query types, let us perform any of these kinds of interaction. For example, a common way of implementing Resource Lookups over a triple store is to use a DESCRIBE or a CONSTRUCT query.

However the problem, as I see it, is that when we resort to writing SPARQL graph patterns in order to request, say, a list of people, then we’ve kind of stepped around HTTP. We’re no longer specifying and refining our query by interacting with web resources via parameterised URLs, we’re tunnelling the request for what we want in a SPARQL query sent to an endpoint.
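
To make the contrast concrete, here is a hedged Python sketch that builds both kinds of request for "page N of a list of people". The endpoint and list URLs are hypothetical, and the `page`, `pageSize` and `sort` parameter names are purely illustrative rather than taken from any specification.

```python
from urllib.parse import urlencode

ENDPOINT = "http://example.org/sparql"  # hypothetical SPARQL endpoint
LIST_URL = "http://example.org/people"  # hypothetical list resource


def sparql_request_url(page, page_size):
    """Tunnel the whole request inside a query sent to one endpoint URL."""
    query = (
        "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"
        "SELECT ?person ?name WHERE { ?person a foaf:Person ; foaf:name ?name }\n"
        f"ORDER BY ?name LIMIT {page_size} OFFSET {page * page_size}"
    )
    return ENDPOINT + "?" + urlencode({"query": query})


def list_request_url(page, page_size, sort="name"):
    """Refine the request by parameterising the URL of a list resource."""
    params = {"page": page, "pageSize": page_size, "sort": sort}
    return LIST_URL + "?" + urlencode(params)
```

In the first case every request goes to the same endpoint and only the opaque query string changes. In the second, each page of the list is a web resource in its own right, with a URL that can be linked to, cached and bookmarked.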

From a hypermedia perspective it would be much better if there were a way to handle the “Listing” access option using something that was better integrated with HTTP. It also happens that this might actually be easier for the majority of web developers to get to grips with, because they no longer have to learn SPARQL.

This is what I meant by a “RESTful API” in yesterday’s blog post. In my mind, “Listing things” sits in between Resource Lookups and Existence Checks or complex pattern matching in terms of access options.

It’s precisely this role that the Linked Data API is intended to fulfil. It defines a way to dynamically generate lists of resources from an underlying RDF graph, along with ways to manipulate those collections of resources, e.g. by sorting and filtering. It’s possible to use it to define a number of useful list views for an RDF dataset that nicely complement the relationships present in the data. It’s actually defined in terms of executing SPARQL queries over that graph, but this isn’t obvious to the end user.
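
The kind of list generation described here can be sketched in a few lines of Python. This is not the Linked Data API's own algorithm (which is defined in terms of SPARQL), just an in-memory illustration of select, filter, sort and page; the `list_view` function, its parameters and the `PEOPLE` data are all hypothetical.

```python
# Hypothetical dataset held as (subject, predicate, object) tuples.
PEOPLE = [
    ("ex:carol", "rdf:type", "foaf:Person"),
    ("ex:carol", "foaf:name", "Carol"),
    ("ex:alice", "rdf:type", "foaf:Person"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:bob", "rdf:type", "foaf:Person"),
    ("ex:bob", "foaf:name", "Bob"),
    ("ex:acme", "rdf:type", "foaf:Organization"),
    ("ex:acme", "foaf:name", "Acme"),
]


def list_view(triples, rdf_type, sort_by=None, where=None, page=0, page_size=10):
    """Return one page of subject URIs of the given type."""
    # Group the triples into one property dictionary per subject.
    resources = {}
    for s, p, o in triples:
        resources.setdefault(s, {}).setdefault(p, []).append(o)
    # Select: keep only resources of the requested type.
    items = [
        (s, props) for s, props in resources.items()
        if rdf_type in props.get("rdf:type", [])
    ]
    # Filter: every property/value pair in `where` must be present.
    for prop, value in (where or {}).items():
        items = [(s, props) for s, props in items if value in props.get(prop, [])]
    # Sort: by the smallest value of the sort property, then by URI.
    if sort_by:
        items.sort(key=lambda item: (min(item[1].get(sort_by, [""])), item[0]))
    else:
        items.sort(key=lambda item: item[0])
    # Page: slice out the requested window.
    start = page * page_size
    return [s for s, _ in items[start:start + page_size]]
```

Because the list is computed on request, it never goes stale in the way that a list baked into the dataset can.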

These features are complemented by simple XML and JSON formats, which sit alongside the RDF serializations that it supports. These formats are really intended to encourage adoption by making it easier to process the data using non-RDF tools.

So, Isn’t HTTP the API?

Which brings me to the answer to Ed’s question: isn’t HTTP the API we need? The answer is yes, but we need more than just HTTP: we also need well-defined media types.

Mike Amundsen has created a nice categorisation of media types and a description of different types of factors they contain: H Factor.

Section 5.2.1.2 of Fielding’s dissertation explains that:


Control data defines the purpose of a message between components, such as the action being requested or the meaning of a response. It is also used to parameterize requests and override the default behavior of some connecting elements.

As it stands today neither RDF nor the Linked Data API specification ticks all of the H Factor boxes. What we’ve really done so far is define how to parameterise some requests, e.g. to filter or sort based on a property value, but we’ve not yet defined that in a standard media type; the API configuration captures a lot of the requisite information but isn’t quite there.

That’s a long rambly blog post for a Friday night! Hopefully I’ve clarified what I was referring to yesterday. I absolutely don’t want to see anyone define an API for RDF that steps around HTTP. We need something that is much more closely aligned with the web. And hopefully I’ve also answered Ed’s question.


2
Dec 10

RDF and JSON: A Clash of Model and Syntax

I had been meaning to write this post for some time. After reading Jeni Tennison’s post from earlier this week I had decided that I didn’t need to, but Jeni and Thomas Roessler suggested I publish my thoughts. So here they are. I’ve got more things to say about where efforts should be expended in meeting the challenges that face us over the next period of growth of the semantic web, but I’ll keep those for future posts.

Everyone agrees that a JSON serialization of RDF is a Good Thing. And I think nearly everyone would agree that a standard JSON serialization of RDF would be even better. The problem is no-one can agree on what constitutes a good JSON serialization of RDF. As the RDF Next Working Group is about to convene to try to define a standard JSON serialization, now is a very good time to think about what it is we really want them to achieve.

RDF in JSON is RDF in XML all over again

There are very few people who like RDF/XML. Personally, while it’s not my favourite RDF syntax, I’m glad it’s there for when I want to convert XML formats into RDF. I’ve even built an entire RDF workflow that began with the ingestion of RDF/XML documents; we even validated them against a schema!

There are several reasons why people dislike RDF/XML.

Firstly, there is a mis-match in the data models: serialization involves turning a graph into a tree. There are many different ways to achieve that so, without applying some external constraints, the output can be highly variable. The problem is that those constraints can be highly specific, so are difficult to generalize. This results in a high degree of syntax variability of RDF/XML in the wild, and that undermines the ability to use RDF/XML with standard XML tools like XPath, XSLT, etc. They (unsurprisingly) operate only on the surface XML syntax not the “real” data model.

Secondly, people dislike RDF/XML because of the mis-match in (loosely speaking) the native data types. XML is largely about elements and attributes whereas RDF has resources, properties, literals, blank nodes, lists, sequences, etc. And of course there are those ever present URIs. This leads to additional syntax short-cuts and hijacking of features like XML Namespaces to simplify the output, whilst simultaneously causing even more variability in the possible serializations.

Thirdly, when it comes to parsing, RDF/XML just isn’t a very efficient serialization. It’s typically more verbose and can involve much more of a memory overhead when parsing than some of the other syntaxes.

Because of these issues, we end up with a syntax which, while flexible, requires some profiling to be really useful within an XML toolchain. Or you just ignore the fact that it’s XML at all and throw it straight into a triple store, which is what I suspect most people do. If you do that then an XML serialization of RDF is just a convenient way to generate RDF data from an XML toolchain.

Unfortunately when we look at serializing RDF as JSON we discover that we have nearly all of the same issues. JSON is a tree, so we have the same variety of potential options for serializing any given graph. The data types are also still different: key-value pairs, hashes, lists, strings, dates (of a form!), etc. versus resources, properties, literals, etc. While there is potential to use more native datatypes, the practical issues of repeatable properties, blank nodes, etc. mean that a 1:1 mapping isn’t feasible. Lack of support for anything like XML Namespaces means that hiding URIs is also impossible without additional syntax conventions.

So, ultimately, both XML and JSON are poor fits for handling RDF. I think most people would agree that a specific format like Turtle is much easier to work with. It’s also better as a starting point for learning RDF because most of the syntax is re-used in SPARQL. That’s why standardising Turtle, ideally extended to support Named Graphs, needs to be the first item on the RDF Next Working Group’s agenda.
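
The graph-to-tree problem is easy to demonstrate. The following Python sketch serializes the same three (made-up) triples in two equally defensible ways; the function names and the `"id"` key are my own illustrative choices, not drawn from any of the existing proposals.

```python
import json

# A tiny hypothetical graph held as (subject, predicate, object) tuples.
TRIPLES = [
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:name", "Bob"),
]


def flat_by_subject(triples):
    """One top-level key per subject; links stay as plain identifiers."""
    doc = {}
    for s, p, o in triples:
        doc.setdefault(s, {}).setdefault(p, []).append(o)
    return doc


def nested_from(root, triples, seen=None):
    """A tree rooted at one resource; linked resources are inlined as
    child objects unless that would create a cycle."""
    seen = (seen or set()) | {root}
    subjects = {s for s, _, _ in triples}
    node = {"id": root}
    for s, p, o in triples:
        if s != root:
            continue
        if o in subjects and o not in seen:
            child = nested_from(o, triples, seen)
        else:
            child = o
        node.setdefault(p, []).append(child)
    return node


flat = json.dumps(flat_by_subject(TRIPLES), sort_keys=True)
nested = json.dumps(nested_from("ex:alice", TRIPLES), sort_keys=True)
```

Both documents carry the same statements, but code written against one shape will not work against the other, and neither of them can tell you on its own whether `"ex:bob"` is a URI or a string literal.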

What do we actually want?

What purpose are we trying to achieve with a JSON serialization of RDF? I’d argue that there are several goals:

  1. Support for scripting languages: Provide better support for processing RDF in scripting languages
  2. Creating convergence: Build some convergence around the dizzying array of existing RDF in JSON proposals, to create consistency in how data is published
  3. Gaining traction: Make RDF more acceptable for web developers, with the hope of increasing engagement with RDF and Linked Data

I don’t think that anyone considers a JSON serialization of RDF as a better replacement for RDF/XML. I think everyone is looking to Turtle to provide that.

I also don’t think that anyone sees JSON as a particularly efficient serialization of RDF, particularly for bulk loading. It might be, but I think N-Triples (a subset of Turtle) fulfills that niche already: it’s easy to stream and to process in parallel.

Let’s look at each of those goals in turn.

Support for scripting languages

Unarguably it’s much, much easier to process JSON than RDF/XML in scripting languages like Javascript, Ruby and PHP.

Parser support for JSON is ubiquitous as it’s the syntax du jour, just as XML was when the RDF specifications were being written. Typically JSON parsing is much more efficient. That’s especially true when we look at Javascript in the browser.

From that perspective RDF in JSON is an instant win as it will simplify consumption of Linked Data and the results of SPARQL CONSTRUCT and DESCRIBE queries. There are other issues with getting wide-spread support for RDF across different programming languages, e.g. proper validation of URIs, but fast parsing of the basic data structure would be a step in the right direction.

Creating Convergence

I think I’ve seen about a dozen or more different RDF in JSON proposals. There’s a list on the ESW wiki and some comparison notes on the Talis Platform wiki, but I don’t think either is complete. If I get the chance I’ll update them. The sheer variety confirms my earlier points about the mis-matches between models: everyone has their own conception of what constitutes a useful JSON serialization.

Because there are fewer syntax options in JSON, the proposals run the full spectrum from capturing the full RDF model but making poor use of JSON syntax, through to making good use of JSON syntax but at the cost of either ignoring aspects of the RDF model or layering additional syntax conventions on top of JSON itself. As an aside, I find it interesting that so many people are happy with subsetting RDF to achieve this one goal.

The thing we should recognise is that none of the existing RDF in JSON formats are really useful without an accompanying API. I’ve used a number of different formats and no matter what serialization I’ve used I’ve ended up with helper code that simplifies some or all of the following:

  • Lookup of all properties of a single resource
  • Mapping between URIs and short names (e.g. CURIES or locally defined keys) for properties
  • Mapping between conventions for encoding particular datatypes (or language annotations) and native objects in the scripting language
  • Cross-referencing between subjects and objects; and vice-versa
  • Looking up all values of a property or a single value (often the first)

In addition, if I’m consuming the results of multiple requests then I may also end up with a custom data structure and code for merging together different descriptions, even if it’s just an array of parsed JSON documents and code to perform the above lookups across that collection.

So, while we can debate the relative aesthetics of different approaches, I think that debate focuses attention on the wrong areas. What we should really be looking at is an API for manipulating RDF. One that will work in Javascript, Ruby or PHP. While I acknowledge the lingering horror of the DOM, I think the design space here is much simpler. Maybe I’m just an optimist!

If we take this approach then what we need is a JSON serialization of RDF that covers as much of the RDF model as possible and, ideally, is already as well supported as possible. From what I’ve seen the RDF/JSON serialization is actually closest to that ideal. It’s supported in a number of different parsing and serialising libraries already and only needs to be extended to support blank nodes and Named Graphs, which is trivial to do. While it’s not the prettiest serialization, given a vote, I’d look at standardising that and moving on to focus on the more important area: the API.
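
For illustration, here is roughly what that helper code tends to look like, sketched in Python over documents in the RDF/JSON shape (subject, then predicate, then a list of value objects with `value` and `type` keys). The `Description` class and its method names are hypothetical; this is the sort of thing everyone ends up writing, not a proposal for the API itself.

```python
class Description:
    """Hypothetical helper wrapping one or more parsed RDF/JSON documents."""

    def __init__(self, prefixes=None):
        self.data = {}                  # subject -> predicate -> [value objects]
        self.prefixes = prefixes or {}  # short name -> namespace URI

    def merge(self, rdf_json):
        """Merge another parsed document, skipping duplicate values."""
        for s, props in rdf_json.items():
            for p, values in props.items():
                target = self.data.setdefault(s, {}).setdefault(p, [])
                for v in values:
                    if v not in target:
                        target.append(v)

    def expand(self, name):
        """Map a short name like 'foaf:name' to a full URI, if we can."""
        prefix, sep, local = name.partition(":")
        if sep and prefix in self.prefixes:
            return self.prefixes[prefix] + local
        return name

    def properties(self, subject):
        """All properties of a single resource."""
        return self.data.get(self.expand(subject), {})

    def values(self, subject, prop):
        """All values of a property, as plain strings."""
        found = self.properties(subject).get(self.expand(prop), [])
        return [v["value"] for v in found]

    def first(self, subject, prop, default=None):
        """A single value of a property (the first), or a default."""
        found = self.values(subject, prop)
        return found[0] if found else default

    def subjects_with(self, prop, value):
        """Cross-reference from an object back to its subjects.
        Note: the value is also prefix-expanded, which is a simplification."""
        prop, value = self.expand(prop), self.expand(value)
        return [
            s for s, props in self.data.items()
            if any(v["value"] == value for v in props.get(prop, []))
        ]


# Two hypothetical responses describing the same resource.
DOC1 = {"http://example.org/alice": {
    "http://xmlns.com/foaf/0.1/name": [{"value": "Alice", "type": "literal"}]}}
DOC2 = {"http://example.org/alice": {
    "http://xmlns.com/foaf/0.1/name": [{"value": "Alice", "type": "literal"}],
    "http://xmlns.com/foaf/0.1/knows": [
        {"value": "http://example.org/bob", "type": "uri"}]}}

desc = Description({"foaf": "http://xmlns.com/foaf/0.1/",
                    "ex": "http://example.org/"})
desc.merge(DOC1)
desc.merge(DOC2)
```

With that in place `desc.first("ex:alice", "foaf:name")` hides both the URIs and the value objects, which is really all most scripts want.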

Gaining Traction

Which brings me to the last use case. Can we create a JSON serialization of RDF that will help Linked Data and RDF get some traction in the wider web development community?

The answer is no.

If you believe that the issues with gaining adoption are purely related to syntax then you’re not listening to the web developer community closely enough. While a friendlier syntax may undoubtedly help, an API would be even better. The majority of web developers these days are very happy indeed to work with tools like jQuery to handle client-side scripting. A standard jQuery extension for RDF would help adoption much more than spending months debating the best way to profile the RDF model into a clean JSON serialization.

But the real issue is that we’re asking web developers to learn not just new syntax but also an entirely new way to access data: we’re asking them to use SPARQL rather than simple RESTful APIs.

While I think SPARQL is an important and powerful tool in the RDF toolchain I don’t think it should be seen as the standard way of querying RDF over the web. There’s a big data access gulf between de-referencing URIs and performing SPARQL queries. We need something to fill that space, and I think the Linked Data API fills that gap very nicely. We should be promoting a range of access options.

I have similar doubts about SPARQL Update as the standard way of updating triple stores over the web, but that’s the topic of another post.

Summing Up

As the RDF Next Working Group gets underway I think it needs to carefully prioritise its activities to ensure that we get the most out of this next phase of development and effort around the Semantic Web specifications. It’s particularly crucial right now as we’re beginning to see the ideas being adopted and embraced more widely. As I’ve tried to highlight here, I think there’s a lot of value to be had in having a standard JSON serialization of RDF. But I don’t think that there’s much merit in attempting to create a clean, simple JSON serialization that will meet everyone’s needs.

Standardising Turtle and an API for manipulating RDF data has more value in my view. RDF/JSON as a well implemented specification meets the core needs of the semantic web developer; a simple scripting API meets the needs of everyone else.


6
Apr 10

Linked Data Patterns: a free book for practitioners

A few months ago Ian Davis and I were chatting about some new approaches to helping practitioners climb the learning curve around Linked Data, RDF and related technologies. We were both keen to help communicate the value of Linked Data, share knowledge amongst practitioners, and to encourage the community to converge on best practices. We kicked around a number of different ideas in this vein.

For example, Ian was keen to provide guidance as to how to mix and match different vocabularies to achieve a particular goal, like describing a person or a book. Having a ready reference containing recipes for these common tasks would address a number of goals. He’s ended up exploring that idea further in the recently released Schemapedia. If you’ve not seen it yet, then you should take a look. It provides a really nice way to navigate through RDF vocabularies and explore their intersections.

The other thing that we discussed was Design Patterns. I’ve been a Design Pattern nut for some time now. Discovering them was something of a rite of passage for me during my Master’s dissertation. I’d spent weeks revising and honing a design for the distributed system I was building, only to discover that what I’d produced was already documented as a design pattern in an obscure corner of the research literature. While I’d clearly reinvented the wheel, the discovery not only provided external validation for what I’d produced, but also neatly illustrated the benefit of using design patterns to share knowledge and experience within a community. Knowing when to apply particular patterns is a key skill for any developer, and the terms are a part of the design vocabulary we all share.

I suggested to Ian that we explore writing some patterns for Linked Data. Patterns for assigning identifiers, modelling data, as well as application development. We experimented with this for a while but ended up parking the discussion for a few months whilst other priorities intervened.

I recently revived the project. It’s pretty clear to me that there’s still a big skills gap between experienced practitioners and those seeking to apply the technology. I think the current situation is reminiscent of the move of OO programming from the research lab out into the developer community; design patterns played a key role there too.

Ian and I have decided to share this with the community as an on-line book, a pattern catalogue that covers a range of different use cases. We started out with about half a dozen patterns, but over the last few weeks I’ve expanded that figure to thirty. I’ve still got a number on my short-list (more than a dozen, I think) but it’s time to start sharing this with the community. The work won’t ever be complete as the space is still unfolding; it will just get refined over time.

You can read the book online at http://patterns.dataincubator.org.

The work is licensed under a Creative Commons Attribution license so you’re free to use it as you see fit, but please attribute the source. If you want to download it, then there’s a PDF, and an EPUB too. We’re using DocBook for the text so there will be a number of different access options.

I’ll stress that this is a very early draft, so be gentle. But we’d love to hear your comments.


28
Mar 10

Enhanced Descriptions: “Premium Linked Data”

I’ve had several conversations recently with people who are either interested in, or actually implementing, Linked Data, and are struggling with some important questions:

  • How much data should I give away?
  • If I wanted to charge for more than just the basic data, then how would I handle that?

My usual response to the first of those questions is: “as much as you feel comfortable with”. There’s still so much data that’s not yet visible or accessible in machine-readable formats that any progress is good progress. Let’s get more data out there now. More is better.

It usually doesn’t take long to get to the second question. If you’ve spent time evangelising to people about the power and value of data, and particularly their data, then it’s natural for them to begin thinking about how it can be monetized.

Scott Brinker has done a good job of summarising a range of options for Linked Data business models. I’ve chipped into that discussion already. Instead what I wanted to briefly discuss here is some of the mechanics of implementing access to what we might call “premium Linked Data”, or as I’ll refer to it “Enhanced Descriptions”.

Premium Linked Data

It’s possible to publish Linked Data that is entirely access controlled. Access might be limited to users behind the firewall (“Enterprise Linked Data”) or only to authorised paying customers. As a paid up customer you’d be given an entry point into that Linked Data and would supply appropriate credentials in order to access it.

This data isn’t going to be something you’d discover on the open web. There are many different authentication models that could be used to mediate access to this “Dark Data”. The precise mechanisms aren’t that important and the right one is likely to vary for different industries and use cases. Although I think there’s a strong argument in using something that dove-tails nicely with HTTP and web infrastructure in general.

What interests me more is the scenario in which a data publisher might be exposing some public data under a liberal open license, but also wants to make available some “premium” metadata, i.e. some value-added data that is only available to paid-up customers. In this scenario it would be useful to be able to link together the open and closed data, allowing a user agent to detect that there is extra value hidden behind some kind of authentication barrier. I think this is likely to become a very common pattern as it aids discovery of the value-added material. Essentially it’s the existing pattern for access controlling content that we have on the web of documents.

It’s the mechanics of implementing this public/private scenario that have cropped up in my recent conversations.

Enhanced Descriptions

When I dereference the URI of a resource I will typically get redirected to a document that describes that resource. This document might contain data like this (in Turtle):


ex:document
  foaf:primaryTopic ex:thing.

ex:thing
  rdfs:label "Some Thing".

i.e. the document contains some data about the resource, and there’s a primary topic relationship between the document and the resource.

If we want to point to additional RDF documents that also describe this resource, or related data, then we can use an rdfs:seeAlso link:


ex:document
  foaf:primaryTopic ex:thing.

ex:thing rdfs:label "Some Thing";
  rdfs:seeAlso ex:otherDocument.

We can use the rdfs:seeAlso relationship to point to additional documents either within a specific dataset or in other locations on the web. Those documents provide useful annotations about a resource.

An “Enhanced Description” will contain additional value-added data about a resource. We could just refer to this document using an rdfs:seeAlso link. But if we do that then a user agent can’t easily distinguish between an arbitrary rdfs:seeAlso link and one that refers to some additional data. We could instead use an additional relationship, a specialisation of rdfs:seeAlso, that can be used to disambiguate between the relationships. I’ve defined just such a predicate: ov:enhancedDescription.


ex:document
  foaf:primaryTopic ex:thing.

ex:thing rdfs:label "Some Thing";
  rdfs:seeAlso ex:otherDocument;
  ov:enhancedDescription ex:premiumDocument.

By using a separate document to hold the value-added annotations we have the opportunity for user agents to identify those documents (via the predicate) and to also be challenged for credentials when they retrieve the URI (e.g. with an HTTP 401 response code).
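
A user agent following this pattern might behave something like the Python sketch below. It assumes that the `ov:` prefix expands to the OpenVocab namespace shown in the code, that triples are held as tuples of full URIs, and that `http_get` is a hypothetical callable returning a `(status, body)` pair; none of these details is prescribed by the pattern itself.

```python
RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"
# Assumption: ov: is the OpenVocab terms namespace.
OV_ENHANCED = "http://open.vocab.org/terms/enhancedDescription"


def classify_links(triples, resource):
    """Separate ordinary 'see also' links from enhanced descriptions."""
    links = {"see_also": [], "enhanced": []}
    for s, p, o in triples:
        if s != resource:
            continue
        if p == RDFS_SEE_ALSO:
            links["see_also"].append(o)
        elif p == OV_ENHANCED:
            links["enhanced"].append(o)
    return links


def fetch_enhanced(uri, http_get, credentials=None):
    """Try to retrieve an enhanced description.
    `http_get(uri, credentials)` is a hypothetical callable that returns
    a (status, body) pair; a 401 means we were challenged for credentials."""
    status, body = http_get(uri, credentials)
    if status == 200:
        return ("ok", body)
    if status == 401:
        return ("credentials_required", None)
    return ("error", None)


# Hypothetical data mirroring the Turtle example above.
DOC = [
    ("http://example.org/thing", RDFS_SEE_ALSO, "http://example.org/otherDocument"),
    ("http://example.org/thing", OV_ENHANCED, "http://example.org/premiumDocument"),
]
```

The useful property is that the agent knows, before making any request, which link leads to the premium material, so it can decide whether to prompt for credentials or simply skip it.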

It also means data publishers can safely dip a toe in the open data waters, but leave richer descriptions protected but still discoverable behind an access control layer.

Another Approach?

Interestingly I discovered earlier today that OpenCalais returns a “402 Payment Required” status code for some documents.

To see this in practice visit their description of IBM and try accessing the last of the owl:sameAs links. I’m guessing they’re using a similar technique to the one I’ve outlined here. But the key difference is that rather than use separate documents, they’ve decided to create new URIs for the access controlled version of the Linked Data. It would be nice if someone out there could confirm that.

Assuming I’ve interpreted what they’re doing correctly, I think this approach has some failings. Firstly it creates extra URIs that aren’t really needed. I’m not sure that we really need more URIs for things; a pattern in which publishers have two URIs (public and private) for each resource isn’t going to help matters.

Secondly, just like using a generic “see also” relation, using owl:sameAs means it’s impossible to distinguish the resource that provides access to premium data from the other equivalent resources that exist on the web, without doing some fragile URI matching.

Apologies to the OpenCalais team if I’ve misunderstood the mechanism they’re using. I’ll happily publish a correction, but regardless, I’m intrigued by the 402 status code! :)

Summary

In my view, the “Enhanced Description” approach is a simple-to-implement pattern. It’s one that I’ve been recommending to people recently but I’ve not seen documented anywhere, so thought I’d write it up.

I’d be interested to hear from others that have either implemented the same mechanism, or like OpenCalais are using other schemes.


16
Dec 09

Annotated Data

One of the things I’ve always liked about the Semantic Web vision is the idea that “Anyone can say Anything, Anywhere” (hereafter: The AAA Principle): that I can publish data about anything, and that this data can link to and annotate data that other people are publishing elsewhere. I’ve been wondering recently whether we’ve spent a lot of time focusing on the publishing of data and not enough on annotation. Some of this thinking is potentially heretical so I’m hoping for an interesting debate!

Before I leap into the heresy, let’s review the key steps of publishing Linked Data:

  1. Use URIs as names for things
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL)
  4. Include links to other URIs, so that they can discover more things.

The dominant publishing pattern for Linked Data is for people to mint new URIs for their resources in a domain that they control. We then make links to other sources by using them as the object of statements in our data; owl:sameAs links are a special case of linking that asserts equality between the subject and object of that specific statement. Through this approach we tick off all of the Linked Data publishing steps.

Some people have argued that maybe we can drop the requirement of using RDF & SPARQL and still have “linked data”. I don’t agree with that, largely because the term already has a precise definition and so muddying it doesn’t really help the discussion. Publishing of data using HTTP URIs, using formats that natively define a linking mechanism, is to my mind simply “RESTful data publishing”. I’ve already recently referred to this as “web integrated data”. I mention this because it’s an approach to data publishing that only uses three of the four Linked Data publishing guidelines.

What would happen if we chose to follow some other subset of the guidelines? In fact, what if we didn’t assign URIs to things, or publish data at those URIs, and instead just published RDF to the web?

If we want to take advantage of The AAA Principle then technically we don’t need to assign URIs to things. Or rather, to be precise, we don’t need to assign new URIs to things. We can simply reuse someone else’s URI; no need to mint a new one. We also don’t need to publish data at those URIs: we just need to make sure that the data is linked into the growing web of data and is therefore discoverable. We can do this and still use/publish RDF. Let’s refer to this form of publishing as “Annotated Data”, to distinguish it from Linked Data and Web Integrated Data.

Annotation is about publishing additional data about things that are already in the web. For that simple use case the need to deploy a Linked Data publishing framework is potentially overkill: publishing a document to a web server is all the machinery I need. Obviously by using someone else’s URIs I’m buying into the longevity of that URI space and the meaning of those identifiers. This may not be the right thing for some applications, but for many common use cases it may be good enough. Also, over time, as we get more hubs in the web of data, certain URI spaces are going to become much more stable because people will need them to be so in order to be reliable platforms upon which applications can be constructed. To put that another way: if we’re too fearful about relying on other people’s identifiers then we’ve got bigger problems.

Clearly if we’re just publishing RDF documents which contain statements about other people’s URIs then we can’t publish data at those URIs. So how will our annotations be found? How will it become part of the web of data? This is actually not that different to the current situation. Any given RDF data set may have links to a small number of other data sets, but it will never comprehensively have links to all possible related datasets. That level of co-ordination just isn’t achievable. It may also not be desirable: there may be valid reasons why I don’t want to have reciprocal links to everyone who links to me, e.g. spam or other untrusted data sources. The solution here is that services like sameas.org or sindice let us search and locate documents that refer to a specific resource, or other resources that have declared an equivalence. This same solution works for publishing Annotated Data: if we can ping a service or crawler that will index the content of our document then this small additional part can be linked into the whole. The current document web is not fully linked, so there’s no reason to expect the web of data to be either — there will always be the need for bridging/linking services.

What I’m describing here is broadly what we used to do in the early days of FOAF: we just published RDF documents with rdfs:seeAlso links and crawled them to compile data. This scruffy, lo-fi approach to the web of data was based on the assumption that having strong identifiers for things (particularly people) may not scale or be socially acceptable. It was also based on having more flexible notions of data merging; identification by description (“smushing”) gave us a little more leeway. Now we promote use of strong identifiers and strong notions of equality using owl:sameAs. This is clearly progress, as evidenced by the much larger collections of data we’ve created. But there are concerns about whether owl:sameAs may be too formal for lightweight Linked Data integration. Perhaps we could see these approaches as opposite ends of the spectrum, and be willing to explore more of the middle-ground?
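
To make the “smushing” idea a little more concrete, here is a minimal sketch in Python. It assumes each crawled description is a plain dictionary of property names to sets of values, and that foaf:mbox_sha1sum is the property we treat as identifying. The function name and the sample data are invented for illustration; a real crawler would work over proper RDF graphs.

```python
from collections import defaultdict


def smush(descriptions, key="foaf:mbox_sha1sum"):
    """Merge descriptions that share a value for an identifying property.

    Each description is a dict of property -> set of values. Descriptions
    that lack the identifying property are returned unmerged.
    """
    merged = defaultdict(lambda: defaultdict(set))
    unmerged = []
    for desc in descriptions:
        ids = desc.get(key)
        if not ids:
            unmerged.append(desc)
            continue
        # Simplification: use the first identifying value as the merge key.
        ident = sorted(ids)[0]
        for prop, values in desc.items():
            merged[ident][prop] |= set(values)
    return [dict(d) for d in merged.values()] + unmerged


crawled = [
    {"foaf:mbox_sha1sum": {"abc123"}, "foaf:name": {"Alice"}},
    {"foaf:mbox_sha1sum": {"abc123"}, "foaf:homepage": {"http://example.org/alice"}},
    {"foaf:name": {"Bob"}},
]
people = smush(crawled)
```

The two documents that mention the same mailbox hash collapse into one description, without either publisher having minted or agreed on a URI for the person.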

Some questions that occur to me are:

  • Why not encourage people to reuse strong identifiers rather than create new ones? This reduces the need for owl:sameAs linking, and makes it even easier to merge data.
  • Can smushing and approaches to using rdfs:seeAlso be more widely promoted/discussed as an approach to linking/fusion?
  • Can we create simple data annotation tools that let people contribute to the web of data without requiring that they follow all of the Linked Data principles?

The notion of Annotated Data I’ve described in this post is an attempt to start that conversation. Because it lowers the bar to contribution, it may be easier to move people up the “on ramp” to contributing to the web of data. And arguably as the web of data grows, increasingly what people and organizations will be doing is annotating existing resources rather than creating new ones.

As a concrete use case, why not encourage publishers to simply publish RDF documents listing the foaf:topics of their content, but using dbpedia, or Freebase, or OpenCalais URIs as the topic URIs? This is simpler than publishing full Linked Data, is lower cost, and is fairly trivial to do using RDFa. They might later want to adopt more of the Linked Data publishing principles if they want more control over their URI schemes or are prepared to invest deeper in the technology.
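
As a rough illustration of how little machinery this needs, the following Python sketch writes out a small Turtle document that annotates a page with foaf:topic statements pointing at someone else’s URIs. The function name, the page URI and the choice of topics are all made up for the example; the only assumptions about the vocabulary are the FOAF namespace and the foaf:topic property.

```python
def topic_annotations(page_uri, topic_uris):
    """Return a Turtle document stating the foaf:topic values of a page."""
    if not topic_uris:
        raise ValueError("at least one topic URI is needed")
    # One subject, one predicate, and a comma-separated list of objects.
    objects = ",\n    ".join(f"<{uri}>" for uri in topic_uris)
    return (
        "@prefix foaf: <http://xmlns.com/foaf/0.1/> .\n"
        "\n"
        f"<{page_uri}> foaf:topic\n"
        f"    {objects} .\n"
    )


turtle = topic_annotations(
    "http://publisher.example.org/articles/42",
    [
        "http://dbpedia.org/resource/Semantic_Web",
        "http://dbpedia.org/resource/Linked_data",
    ],
)
```

The resulting text can be saved as a static file on any web server; nothing in it requires the publisher to mint URIs for the topics themselves.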

Heresy or just good use of the full range of hypertext publishing mechanisms we have in RDF? Let me know your thoughts.


20
Nov 09

Web Integrated Data

Last Friday I spoke at the Open Knowledge Foundation Open Data & The Semantic Web event. I was giving the opening talk of the day and thought that I’d take the opportunity to lay out a view that I’ve been meaning to articulate for some time: that integrating data with the web maximises its utility. Moving from data dumps, through APIs, and to Linked Data we maximize utility by reducing the amount of effort required to interact with data.

While there’s clearly still a lot of work to do around creating ways to visualise and explore Linked Data, the simple utility of being able to browse a dataset means that we move beyond publishing for a developer audience to publishing for anyone who can wield a browser. This is the angle to the Semantic Web vision that is most often overlooked in my opinion.

Developers often claim that “I can do the same thing using technology X, so why use technology Y”. In this early adopter phase of the Semantic Web it’s perfectly valid and important to critique the technology; to measure its ease of use and benefits for developers. But for me the end game is to move to a world where anyone can easily do complex manipulations on data, without resorting to writing code, because there’s enough machine support to make it achievable. That’s what standard vocabularies and a common data model enable. And it’s a natural part of the evolution towards increasingly declarative ways of manipulating information.

I’ll do a proper write-up of the presentation some other time, but for now here are the slides:


23
Oct 08

Explaining REST and Hypertext: Spam-E the Spam Cleaning Robot

I’m going to add to Sam Ruby’s amusement and throw in my attempt to explicate some of Roy Fielding’s recent discussion of what makes an API RESTful. If you’ve not read the post and all the comments then I encourage you to do so: there are some great tidbits in there that have certainly given me pause for thought.
The following attempts to illustrate my understanding of REST. Perhaps bizarrely, I’ve chosen to focus more on the client than on the design of the server, e.g. what resources it exposes, etc. This is because I don’t think enough focus has been placed on the client, particularly when it comes to the hypermedia constraint. And I think that often, when we focus on how to design an “API”, we’re glossing over some important aspects of the REST architecture which includes, after all, other types of actors, including both clients and intermediaries.
I’ve also deliberately chosen not to draw much on existing specifications; again, it’s too easy to muddy the waters with irrelevant details.
Anyway, I’m well prepared to stand corrected on any or all of the below. Will be interested to hear if anyone has any comments.
Let’s imagine there are two mime types.
The first is called application/x-wiki-description. It defines a JSON format that describes the basic structure of a Wiki website. The format includes a mixture of simple data items, URIs and URI templates that collectively describe:

  • the name of the wiki
  • the email address of the administrator
  • a link to the Recent Changes resource
  • a link to the Main page
  • a link to the license statement
  • a link to the search page (as a URI template, that may include a search term)
  • a link to a parameterized RSS feed (as a URI template that may include a date)
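
To give a feel for what such a document might look like, here is a small Python sketch. The JSON field names, the example.org URIs and the expand helper are all invented for this illustration, since the format itself is imaginary. The helper only handles the simplest kind of URI template, a {name} placeholder replaced with a percent-encoded value.

```python
import json
import re
from urllib.parse import quote

# A hypothetical application/x-wiki-description document.
DESCRIPTION = json.loads("""
{
  "name": "Example Wiki",
  "admin_email": "admin@wiki.example.org",
  "recent_changes": "http://wiki.example.org/recent",
  "main_page": "http://wiki.example.org/Main",
  "license": "http://wiki.example.org/license",
  "search_template": "http://wiki.example.org/search?q={term}",
  "feed_template": "http://wiki.example.org/feed?since={date}"
}
""")


def expand(template, **values):
    """Fill each {name} placeholder with a percent-encoded value."""
    return re.sub(
        r"\{(\w+)\}",
        lambda match: quote(str(values[match.group(1)]), safe=""),
        template,
    )


search_uri = expand(DESCRIPTION["search_template"], term="cheap pills")
feed_uri = expand(DESCRIPTION["feed_template"], date="2008-10-23")
```

The point of the templates is that a client never has to guess at the URL structure of the wiki: it is told where the holes are and fills them in.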

Another mime type is application/x-wiki-page-versions. This is another JSON based format that describes the version history of a wiki page. The format is an ordered collection of links. Each resource in that list is a prior version of the wiki page; the most recent page is first in the list.
Spam-E is a little web robot that has been programmed with the smarts to understand several mime types:

  • application/x-wiki-description
  • application/x-wiki-page-versions
  • RSS and Atom
  • XHTML

Spam-E also understands a profile of XHTML that defines two elements: one that points to a resource capable of serving wiki descriptions, another that points to a resource that can return wiki page version descriptions.
Spam-E has internal logic that has been designed to detect SPAM in XHTML pages. It also has a fully functioning HTTP client. And it also has been programmed with logic appropriate to processing those specific media types.
Initially, when starting, Spam-E does nothing. It waits to receive a link, e.g. via a simple user interface. It’s in a steady state waiting for input.
Spam-E then receives a link. The robot immediately dereferences the link. It does so by submitting a GET request to the URL, and includes an Accept header:

Accept: application/x-wiki-description;q=1.0, application/x-wiki-page-versions;q=0.9, application/xhtml+xml;q=0.8, application/atom+xml;q=0.5, application/rss+xml;q=0.4

This clearly states Spam-E’s preference to receive specific mime-types.
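
As a sketch of how those preferences work on both sides of the conversation, the Python below builds an Accept header from a list of (media type, q-value) pairs and shows how a server might pick the best representation it has on offer. It uses the two imaginary media type names defined earlier; the function names are made up, and the parsing is deliberately naive (no wildcards, no other parameters).

```python
PREFERENCES = [
    ("application/x-wiki-description", 1.0),
    ("application/x-wiki-page-versions", 0.9),
    ("application/xhtml+xml", 0.8),
    ("application/atom+xml", 0.5),
    ("application/rss+xml", 0.4),
]


def accept_header(prefs):
    """Serialise (media type, q) pairs into an Accept header value."""
    return ", ".join(f"{media_type};q={q:.1f}" for media_type, q in prefs)


def parse_accept(header):
    """Return (media type, q) pairs, most preferred first."""
    parsed = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        q = 1.0  # a missing q parameter means full preference
        for field in fields[1:]:
            if field.startswith("q="):
                q = float(field[2:])
        parsed.append((fields[0], q))
    return sorted(parsed, key=lambda pair: pair[1], reverse=True)


def choose(header, available):
    """Pick the client's most preferred type that the server can serve."""
    for media_type, q in parse_accept(header):
        if q > 0 and media_type in available:
            return media_type
    return None


header = accept_header(PREFERENCES)
chosen = choose(header, {"application/rss+xml", "application/xhtml+xml"})
```

Here a server that can only offer XHTML or RSS ends up returning XHTML, which matches what happens to Spam-E on its first request.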
In this instance it receives an XHTML document in return. Not ideal, but Spam-E knows how to handle it. After parsing it, it turns out that this is not a specific profile of XHTML that Spam-E understands, so it simply extracts all the anchor elements from the file and uses them to widen its search for wiki spam. Another way to say this is that Spam-E has changed its state to one of searching. This state transition has been triggered by following a link, then receiving and processing a specific mimetype. This is “hypermedia as the engine of application state” in action.
Spam-E performs this dereference-parse-traverse operation several times before finding an XHTML document that conforms to the profile it understands. The document contains a link to a resource that should be capable of serving a wiki description representation.
Spam-E is now in discovery mode. Spam-E uses an Accept header of application/x-wiki-description when following the link and is returned a matching representation. Spam-E parses the JSON and now has additional information at its disposal: it knows how to search the wiki, how to find the RSS feed, how to contact the wiki administrator, etc.
Spam-E now enters Spam Detection mode. It requests, with a suitable Accept header, the recent changes resource, stating a preference for Atom documents. It instead gets an RSS feed, but that’s fine because Spam-E still knows how to process that. For each entry in the feed, Spam-E requests the wiki page, using an Accept header of application/xhtml+xml.
Spam-E now tries to find if there is spam on the page by applying its local spam detection logic. In this instance Spam-E discovers some spam on the page. It checks the XHTML document it was returned and discovers that it conforms to a known profile and that embedded in a link element is a reference to the “versions” resource. Spam-E dereferences this link using an Accept header of application/x-wiki-page-versions.
Spam-E, who is now in Spam Cleaning mode, fetches each version in turn and performs spam detection on it. If spam is found, then Spam-E performs a DELETE request on the URI. This will remove that version of the wiki page from the wiki. Someone browsing the original URI of the page will now see an earlier, spam free version.
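The cleaning step might be sketched like this in Python. To keep the example runnable without a network, the HTTP operations are passed in as plain functions; fetch, delete and is_spam are hypothetical stand-ins for a GET, a DELETE and the robot’s local spam detection logic, and the URIs are invented.

```python
def clean_versions(version_uris, fetch, is_spam, delete):
    """Check each version (most recent first) and delete the spammy ones."""
    removed = []
    for uri in version_uris:
        if is_spam(fetch(uri)):
            delete(uri)  # stands in for an HTTP DELETE on the version URI
            removed.append(uri)
    return removed


# An in-memory stand-in for the wiki, keyed by version URI.
wiki = {
    "http://wiki.example.org/Main?v=3": "BUY CHEAP PILLS",
    "http://wiki.example.org/Main?v=2": "Welcome to the wiki. BUY CHEAP PILLS",
    "http://wiki.example.org/Main?v=1": "Welcome to the wiki.",
}
versions = list(wiki)  # the ordered list from the page-versions document

removed = clean_versions(
    versions,
    fetch=lambda uri: wiki[uri],
    is_spam=lambda text: "CHEAP PILLS" in text,
    delete=lambda uri: wiki.pop(uri),
)
```

Note that the loop only ever uses URIs it was handed in the versions document; it never constructs one.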
Once it has finished its cycle of spam detection and cleaning, Spam-E reverts to search mode until it runs out of new URIs.
There are several important points to underline here:
Firstly, at no point did the authors of Spam-E have to have any prior knowledge about the URL structure of any site that the robot might visit. All that Spam-E was programmed with was logic relating to some defined media types (or extension points of a media type in the case of the XHTML profiles) and the basic semantics of HTTP.
Secondly, no one had to publish any service description documents, or define any API end points. No one had to define what operations could be carried out on specific resources, or what response codes would be returned. All information was found by traversing links and by following the semantics of HTTP.
Thirdly, the Spam-E application basically went through a series of state transitions triggered by what media types it received when requesting certain URIs. The application is basically a simple state machine.
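That state machine can be written down as a simple transition table. The Python below is only a sketch of my description above: the state names and the event labels (which stand for “the kind of representation just received”) are my own shorthand, and unknown combinations simply leave the robot in its current state.

```python
# (current state, what was just received) -> next state
TRANSITIONS = {
    ("idle", "xhtml"): "searching",
    ("idle", "xhtml-profile"): "discovery",
    ("searching", "xhtml"): "searching",
    ("searching", "xhtml-profile"): "discovery",
    ("discovery", "wiki-description"): "detecting",
    ("detecting", "feed"): "detecting",
    ("detecting", "page-versions"): "cleaning",
    ("cleaning", "done"): "searching",
}


def run(events, state="idle"):
    """Return the list of states visited while processing the events."""
    trace = [state]
    for event in events:
        state = TRANSITIONS.get((state, event), state)
        trace.append(state)
    return trace


# The walk-through above, expressed as a sequence of received representations.
trace = run([
    "xhtml",             # plain XHTML: start searching
    "xhtml",             # more plain XHTML: keep searching
    "xhtml-profile",     # XHTML in the known profile: discovery
    "wiki-description",  # wiki description: start detecting spam
    "feed",              # RSS or Atom recent changes
    "page-versions",     # version list for a spammy page: cleaning
    "done",              # finished cleaning: back to searching
])
```

Nothing in the table mentions a URL; every transition is driven by the media type of what came back.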
Anyway, hopefully that is a useful example. Again, I’m very happy to take feedback. Comments are disabled on this blog, but feel free to drop me a mail (see the Feedback link).


21
Sep 08

Life With Playstation

Earlier today I was playing with the new Life with PlayStation which is available as a free upgrade to the older “Folding @ Home” application that originally shipped with the PS3.
The new application looks like it is a step towards generalizing the existing interface, which is a “Google Earth-lite” style zoomable, pannable, 3D globe albeit with much less detail than its desktop equivalent. The main new feature is integrated weather reports and news feeds from the capital cities of 60 countries. You can read more about it on the website and watch a video demo.
What intrigued me was the possibility that Sony may decide to open this up further. They’re clearly expecting there to be more “channels”, which is their term for overlays that can be displayed on the globe. At present only the news, plus the older Folding@Home channels are available, but it’d be fantastic if this was opened up to web hackers to allow geo apps to be delivered directly to the Playstation. I’ve done some googling around but there doesn’t seem to be any discussion about how they intend to add new services, or whether there may be a developer kit.
There is a huge amount of creative work going on in the world of geo-hackery that could be re-targetted for delivery to the PS3 if Sony decide to embrace open-ness. Indeed, other than the currently fairly limited resolution of the map and the need for Sony to provide a way to feed content into their system, there seems to be little in the way of further obstacles.
I also noticed that the software license page explains that the application ships with a “simple cross-platform XML parser” and LiteSQL. An even more exciting leap would be to see a sandboxed Javascript engine in there too, but let’s not run before we can walk!


14
Apr 08

Google AppEngine for Personal Web Presence?

Some thinking aloud…
I’ve browsed through the Google App Engine gallery and the applications you can find there at the moment are pretty much what you’d expect: lots of Web 2.0 “share this, share that” sites. These are what you’d expect because firstly they’re the kind of simple application you’d build whilst exploring any new environment. Secondly because they’re exactly the kind of sites that are currently being released every which way you turn.
But for me App Engine is intriguing as it might provide an interesting new perspective on distributing shrink-wrapped packaged software. When Google take the lid off of the number of sign-ups, it’s going to be a simple matter for anyone to have their own App Engine environment. Forget cheap web hosting and the expense and configuration overhead that that entails: just sign up for an App Engine account.
App Engine has the potential to provide an enormous number of people with a well-documented stable environment into which an application can be deployed.
It will be interesting to see if anyone seizes on App Engine as an opportunity to create a simple personal application that combines elements of all of the Web 2.0 favourites: bookmarks, blogging, calendar, photos, travel, and perhaps an OpenId provider. One that makes me the administrator of all of my own data, but doesn’t scrimp on the options for other people to harvest, syndicate and browse what I’m uploading.
At the moment our online identities start out fragmented, because we have to push data into a number of different services. And then we strive for ways to bring that data together and knit it into other sites that we, or our social network, use.
But why not turn this on its head, and seize on App Engine as a way to avoid this early fragmentation and instead start out with a centralized, personal web presence; but one which seamlessly integrates with data in other spaces? The potential is in open data, and services that are built around it. So why aren’t we managing our own open data repositories and letting others offer us services against particular aspects of it?
The App Engine environment doesn’t involve any configuration on behalf of the end user, and I suspect you could probably create an App Engine Deployer using App Engine itself. So sign-up, deployment and upgrades could also be pretty straight-forward. Python seems well suited for creating a simple modular web application that could be extended to cover new areas as users needed.
Instead of using lots of different web applications, we can each have our own modular web application that is intimately linked into the web, and becomes the primary repository for the data you want on the web. Data portability follows from the fact that you’d be the administrator of your own data.
This would also change the nature of the kinds of applications that we’d need elsewhere on the web. Instead of lots of specialist databases, we need more generic services and more community/local/temporary aggregations.


29
Jan 07

Quakr

Quakr is a project to build a 3-dimensional world from user contributed photos, a.k.a. some friends having fun with geek hacking. I see they submitted an abstract to XTech too. The blog links to some interesting experiments mashing up Google Maps with a Flash and VRML viewer.
The Quakr 7D Tiltometer is worth viewing too if only for its sheer Blue Peter stylee “build this at home” excellence.