27
Dec 09

Thoughts on Enterprise Linked Data

There have been a number of discussions about “Enterprise Linked Data” recently, and I took part in a panel on precisely that topic at ESTC 2009. Unfortunately the panel was cut short due to time pressures, so I didn’t get the chance to say everything I’d hoped. In lieu of that debate, here’s a blog post containing a few thoughts on the subject.

When we refer to enterprise use of Linked Data, there are a number of different facets to that discussion which are worth highlighting. In my opinion the issues and justifications relating to each of them are quite different. So different, in fact, that we’re in danger of having a confused debate unless we tease out these different aspects.

Aspects of the Debate

In my view there are three facets to the discussion:

  • Publishing Linked Data, the key question here being: What does an Enterprise have to benefit by publishing Linked Data?
  • Consuming Linked Data: What does an Enterprise have to benefit from consuming Linked Data?
  • Adopting Linked Data: What benefits can an Enterprise gain by deploying Linked Data technologies internally?

I think these facets, whilst obviously closely related, are largely orthogonal. For example, I could see a scenario in which an organization consumed Linked Data but didn’t store or use it as RDF, and just fed it into existing applications. Similarly, businesses could clearly adopt Linked Data as a technology without publishing any data to the web, or consuming any data from it, at all.

These issues are also largely orthogonal to the Open Data discussion: an enterprise might use, consume and publish Linked Data, but this might not be completely open for others to reuse. The data may only be available behind the firewall, amongst authorised business partners, or only available to licensed third parties. So, while the issue as to whether to publish open data is a very important aspect of the discussion, it’s not a defining one.

Here are a few thoughts on each of these different facets.

Publishing Linked Data

So why might an enterprise publish Linked Data? And if that is a worthwhile goal, then is it clear how to achieve it? Let’s tackle the second question first as it’s the simpler.

There is an increasingly large amount of good advice available online, as well as tools and applications, to support the publishing of Linked Data. We’re making good strides towards the important transition of moving Linked Data out of the research area and into the hands of actual practitioners. The How to Publish Linked Data on the Web tutorial is a great resource, but to my mind Jeni Tennison’s recent series on publishing Linked Data is an excellent end-to-end guide full of great practical advice.

We can declare victory when someone writes the O’Reilly book on the subject and does for Linked Data what RESTful Web Services did for REST. (And the two would make great companion pieces.)

But technology issues aside, what are the benefits to an organization in publishing Linked Data? There are several ways to approach answering that question but I think in most discussions Linked Data tends to get compared with Web APIs. The value of creating an API is now reasonably well understood, and many of the benefits that come from opening data through an API also apply to Linked Data.

However, the argument that Linked Data married with a SPARQL endpoint is as easy for developers to use as a Web API is still a little weak at this stage. SPARQL can be off-putting for developers used to simpler, more tightly defined APIs. As a community we ought to consider it as a power tool and look for ways to make it easier to get started with. It’s also worth recognising that a search API is a useful addition to a SPARQL endpoint as part of a Linked Data deployment.

But publishing Linked Data can’t be directly compared to just creating an API, because it’s also largely a pattern for web publishing in general. It’s increasingly easy to instrument existing content management systems to expose RDF(a) and Linked Data. So rather than create a custom API, which will involve expensive development costs, particularly if it’s going to scale, it’s possible to simply expose Linked Data as part of an existing website.

By following the Linked Data pattern for web publishing, in particular the use of strong identifiers, an enterprise can end up with a single point of presence on the web for publishing all of its human and machine-readable data, resulting in a website that is strongly Search Engine Optimised. Search engines can better crawl and index well structured websites and are increasingly ingesting embedded RDFa to improve search results and rankings. That’s a strong incentive to publish Linked Data by itself.

Adopting Linked Data, particularly as part of a reorganization of an existing web presence, could deliver improved search engine rankings and exposure of content whilst saving on the costs of developing and running a custom API. The longer term benefits of being part of the growing web of data can be the icing on the cake.

Consuming Linked Data

Next we can consider why an enterprise might want to consume Linked Data.

To my knowledge organizations are currently only publishing Linked Open Data (albeit with some wide variations in licensing terms), so we’ll skip for the present whether enterprises have an option of consuming non-open Linked Data, e.g. as part of a privately licensed dataset.

The LOD Cloud is still growing and provides a great resource of highly interlinked data. The main issues that face an organization consuming this data are ones of quantity (there’s still a lot more data that could be available); quality (how good is the data, and how well is it modelled); and trust (picking and choosing reliable sources).

To some extent these issues face any organization that begins relying on a third-party API or dataset. However at present a lot of the data in the LOD cloud is still from secondary sources. The same can’t be said for the majority of web APIs, which tend to be published by the original curators of the data.

These issues should resolve themselves over time as more primary sources join the LOD cloud. Because Linked Data is all based on the same data model, bulk loading and merging data from external sources is very simple. This gives enterprises the option of creating their own mirrors of LOD data sources, which will provide some additional reassurances around stability and longevity.
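To make the point about merging concrete, here is a minimal sketch in Python. It assumes triples are modelled as plain tuples and that the URIs are made-up examples; a real deployment would use an RDF library and would need to take care over blank nodes, which this sketch ignores.

```python
# Triples are modelled as plain (subject, predicate, object) tuples.
# All URIs below are hypothetical examples, not real data.
Triple = tuple[str, str, str]


def merge_graphs(*graphs: set[Triple]) -> set[Triple]:
    """Merge any number of graphs by taking the set union of their triples."""
    merged: set[Triple] = set()
    for graph in graphs:
        merged |= graph
    return merged


internal = {
    ("http://example.org/id/product/42", "http://example.org/vocab/name", "Widget"),
}

# A local mirror of an external source that happens to repeat one statement.
external_mirror = {
    ("http://example.org/id/product/42", "http://example.org/vocab/name", "Widget"),
    ("http://example.org/id/product/42", "http://example.org/vocab/madeIn",
     "http://example.org/id/place/bath"),
}

merged = merge_graphs(internal, external_mirror)
print(len(merged))  # 2: the repeated statement collapses into one
```

Because both sources share identifiers and a data model, no mapping step is needed before the union; the duplicate statement simply disappears.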

Linked Data, with its reliance on strong identifiers, is much easier to navigate and process than other sources, even if you’re not storing the results of that processing as RDF. There’s also a much greater chance of serendipity, resulting in the discovery of new data sources and new data items. By contrast, there is virtually no serendipity with Web APIs, as each API needs to be explicitly integrated.
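As a rough illustration of processing Linked Data without storing it as RDF, the sketch below flattens triples into plain records for an existing application, then lists the URIs that are referenced but not yet described, which are the natural places to look next. The data and the function names are hypothetical, and telling URIs from literals by their prefix is a simplification.

```python
from collections import defaultdict


def to_records(triples):
    """Group triples by subject into plain dicts keyed by predicate."""
    grouped = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        grouped[subject][predicate].append(obj)
    return {subject: dict(props) for subject, props in grouped.items()}


def undiscovered_links(triples):
    """Return URIs that appear as objects but are not described as subjects."""
    subjects = {s for s, _, _ in triples}
    objects = {o for _, _, o in triples}
    return sorted(
        o for o in objects
        if o.startswith(("http://", "https://")) and o not in subjects
    )


# Hypothetical example data.
triples = [
    ("http://example.org/id/office/bath",
     "http://example.org/vocab/name", "Bath Office"),
    ("http://example.org/id/office/bath",
     "http://example.org/vocab/locatedIn", "http://example.org/id/place/bath"),
]

records = to_records(triples)
to_follow = undiscovered_links(triples)
```

Here `records` is an ordinary dictionary that existing code can consume, while `to_follow` holds the identifiers that could be dereferenced to discover more data.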

But this benefit is only going to become evident if we continue to put effort into helping (enterprise) developers understand how to consume Linked Data. How to do so, e.g. as part of existing frameworks or using new data integration patterns, is another area that needs more attention. The Consuming Linked Data tutorial at ISWC 2009 was a good step in that direction, although the message needs to be circulated more widely, outside of the core semantic web community.

In my opinion it will be easier for enterprises to consume Linked Data if they first begin to publish it. By publishing data they are putting their identifiers out into the wild. These identifiers become points for annotation and reuse by the community, creating liminal zones from which the enterprise can harvest and filter useful data. This is a benefit that I think is unique to Linked Data, as with a Web API the end results are typically mashups or widgets displayed in a third-party application; these are just new silos one step removed from the data publisher.

Adopting Linked Data

Finally, what value could be gained if an organization adopts Linked Data internally as a means to manage and integrate data behind the firewall?

The issues and potential benefits here are largely a mixture of the above, except that there are few or no issues with trust, as all of the data comes from known sources. In a typical enterprise environment, Linked Data as an integration technology will be compared to a wider range of systems, ranging from integrated developer tools through to middleware systems. There’s a reason why SOAP-based systems are still widely used in enterprise IT: most organizations aren’t (yet?) internally organized as if they were true microcosms of the web.

It’s interesting to see that Linked Data can potentially provide a means for solving many of the issues that Master Data Management is trying to address. Linked Data encourages strong identifiers; clean modelling; and linking to, rather than replicating, data. These are core issues for data consolidation within the enterprise. Coupled with the ability to link out to data that is part of the LOD Cloud, or published by business partners, Linked Data has the potential to provide a unifying infrastructure for managing both internal and external data sources.

It’s worth noting, however, that semantic technologies in general, e.g. document analysis, entity extraction, reasoning and ontologies, seem to be much more widely deployed in enterprise systems than Linked Data. This is no doubt in large part because the advantages of those technologies may currently be much more easily articulated, as they’re more easily packaged into a product.

Summary

In this post I wanted to tease out some of the questions that underpin the discussions about enterprise adoption of Linked Data. I’ve presented a few thoughts on those questions and I’d love to hear your opinions.

Along the way I’ve attempted to highlight some areas where we need to focus to help transition from a researcher-led to a practitioner-led community. More data, more documentation, and more tools are the key themes.


5
Nov 09

Describing SPARQL Extension Functions

At the end of my recent post on Surveying and Classifying SPARQL Extensions I noted that I wanted to help encourage implementors to publish useful documentation about their SPARQL Extensions. If you’re interested in the current state of that survey then you can check out my current spreadsheet listing known extension functions. There are more to add there, but it’s a good summary of the current state of play.

At VoCamp DC last week I did some work on designing a small vocabulary for describing SPARQL Extensions. The first draft of this is online here: SPARQL Extension Descriptions. There’s a little bit of background on the VoCamp wiki too, if you want to see my working :) .

Here’s an example of the vocabulary in use, describing some extensions to the ARQ SPARQL Engine:


# Prefix declarations for sed:, dc:, foaf: and rdfs: are omitted for brevity.
<http://jena.hpl.hp.com/ARQ/function> a sed:FunctionLibrary;
  dc:title "ARQ Function Library";
  dc:description "A collection of SPARQL extension functions implemented by the ARQ engine";
  foaf:homepage <http://jena.sourceforge.net/ARQ/library-function.html>;
  sed:includes <http://jena.hpl.hp.com/ARQ/function#sha1sum>.

<http://jena.hpl.hp.com/ARQ/function#sha1sum>
  a sed:ScalarFunction;
  rdfs:label "sha1sum";
  dc:description "Calculate the SHA1 checksum of a literal or URI.";
  sed:includedIn <http://jena.hpl.hp.com/ARQ/function>.

<http://jena.hpl.hp.com/ARQ#self> a sed:SparqlProcessor;
  foaf:homepage <http://jena.hpl.hp.com/ARQ>;
  rdfs:label "ARQ";
  sed:implementsLibrary <http://jena.hpl.hp.com/ARQ/function>.

Ideally what should happen is that every URI associated with a filter function or property function should be dereferenceable, and that terms from this vocabulary be used to describe those functions. There’s a lot more detail that could be included, but I suspect this is sufficient to cover the primary use cases, i.e. documentation and validation.
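To sketch the validation use case: once descriptions like the ones above have been gathered, a client can check whether a given processor supports the functions a query relies on. The Python below is only an illustration. The two dictionaries are hand-written stand-ins for the sed:includes and sed:implementsLibrary statements in the example, and the function names are hypothetical; nothing here actually dereferences a URI.

```python
# Hand-written stand-ins for data gathered from extension descriptions.
# library URI -> functions it includes (mirrors sed:includes)
LIBRARY_INCLUDES = {
    "http://jena.hpl.hp.com/ARQ/function": {
        "http://jena.hpl.hp.com/ARQ/function#sha1sum",
    },
}

# processor URI -> libraries it implements (mirrors sed:implementsLibrary)
PROCESSOR_IMPLEMENTS = {
    "http://jena.hpl.hp.com/ARQ#self": {
        "http://jena.hpl.hp.com/ARQ/function",
    },
}


def supported_functions(processor):
    """Collect every function included in the libraries a processor implements."""
    functions = set()
    for library in PROCESSOR_IMPLEMENTS.get(processor, set()):
        functions |= LIBRARY_INCLUDES.get(library, set())
    return functions


def unsupported(processor, used_functions):
    """Return the functions used by a query that the processor does not describe."""
    return sorted(set(used_functions) - supported_functions(processor))


missing = unsupported(
    "http://jena.hpl.hp.com/ARQ#self",
    [
        "http://jena.hpl.hp.com/ARQ/function#sha1sum",
        "http://example.org/fn#other",  # hypothetical, undescribed function
    ],
)
```

With the data above, `missing` contains only the undescribed example function, which is the kind of warning a query validator could surface before the query is ever sent.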

The draft SPARQL 1.1 Service Description specification does cover some of this ground, but falls short in a few places, and I think some of what I’ve described here could usefully be folded into that specification without greatly extending its scope. But that’s a matter for the Working Group to decide.

One specific issue is that the specification doesn’t currently recognise “functional predicates” (to use Lee Feigenbaum’s preferred term; others include “property functions” and “magic properties”) as a distinct class of extensions. They clearly exist, so I think we should have a means to describe them. In fact arguably they are the most important class of SPARQL extensions that need describing.

Filter functions are relatively well understood and can clearly be identified based on where they appear in a query. Language extensions will generate a parser error if an endpoint doesn’t support them, so will easily be caught. Functional predicates, however, use the existing Turtle triple pattern syntax but typically trigger custom logic in the SPARQL processor, rather than actually appearing as triples within the dataset. Without the ability to dereference their URIs and identify them as functional predicates, a SPARQL engine that doesn’t support them will simply treat them as ordinary triple patterns and fail silently, rather than complaining that the extension is not supported.

The following example query illustrates this:


PREFIX list: <http://jena.hpl.hp.com/ARQ/list#>
PREFIX func: <http://jena.hpl.hp.com/ARQ/function#>
PREFIX dc: <http://purl.org/dc/terms/>
PREFIX ex: <http://example.org/vocab/>

SELECT ?doc ?contributor WHERE {
   ?doc dc:modified ?modified.
   ?doc ex:authors ?authorList.
   # Functional predicate: looks like an ordinary triple pattern
   ?authorList list:member ?author.
   # Language extension
   LET ( ?contributor := ?author )
   # Filter function
   FILTER ( ?modified < func:now() )
}

The above query contains three extensions: a language extension (LET); a filter function (func:now()); and a functional predicate (list:member). Without prior knowledge of that predicate, or the ability to dereference its URI, there’s no way to know that list:member is an extension rather than an ordinary predicate that the query author is attempting to match against triples in the dataset.
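The silent failure can be sketched in a few lines of Python. This is a toy model, not how ARQ or any other engine is implemented: the "engine" just matches a predicate against a set of tuples, the data is made up, and `FUNCTIONAL_PREDICATES` is a hypothetical stand-in for what a client would learn by dereferencing extension URIs.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
LIST_MEMBER = "http://jena.hpl.hp.com/ARQ/list#member"

# A one-item RDF list; note that no triple uses list:member as its predicate.
dataset = {
    ("_:list1", RDF + "first", "http://example.org/people/alice"),
    ("_:list1", RDF + "rest", RDF + "nil"),
}


def match(triples, predicate):
    """Naive matching: treat the predicate as an ordinary triple pattern."""
    return [(s, o) for s, p, o in triples if p == predicate]


# An engine without the extension raises no error; it just finds nothing.
naive_results = match(dataset, LIST_MEMBER)

# Hypothetical stand-in for URIs known to be described as functional predicates.
FUNCTIONAL_PREDICATES = {LIST_MEMBER}


def check_predicates(predicates, supported):
    """Raise an error, rather than failing silently, for unsupported extensions."""
    unsupported = [
        p for p in predicates
        if p in FUNCTIONAL_PREDICATES and p not in supported
    ]
    if unsupported:
        raise ValueError("Unsupported functional predicates: " + ", ".join(unsupported))
```

Here `naive_results` is simply an empty list, which is indistinguishable from a query that legitimately matched nothing. Calling `check_predicates([LIST_MEMBER], supported=set())` raises an error instead, and that is only possible because the predicate has been identified as an extension.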

I’d like to urge all implementors to consider making their extension URIs dereferenceable. The schema I’ve drafted is very lightweight so shouldn’t be difficult to support. I’m also very happy to take comments on its design. I’m intending it as a starting point for others to build upon.


27
Oct 08

Cross Pollination

I binged on TED talks whilst travelling over to the ISWC 2008 conference. One of those that I enjoyed the most was “Design and the elastic mind” by Paola Antonelli. Who doesn’t get a kick out of seeing some great design concepts?
One item that caught my attention was Antonelli’s reference to a regular “salon” that brought together designers and scientists in order to explore common ground and share ideas.
As the power of what is possible on the web increases, it strikes me that we need a bit more of this kind of cross-pollination between development and design, in order to encourage a bit more lateral thinking and a fuller exploration of the potential, and maybe kick us all in some new directions.
Looks like I’m not the only one thinking this: Tim Bray is encouraging folk to branch out and Ian Dickinson wants to be a “devsigner” when he grows up.
I think this is particularly true in the Semantic Web space. I’ve yet to see a really striking semantic web application that isn’t essentially a clone of an existing service, or that really does justice to the data. Are there exciting, challenging, or innovative user interfaces that I’ve missed? Parallax is great, but what else is there? What needs to happen to encourage more innovation?
I can remember a couple of years back when all of a sudden there were information architects and interaction designers at conferences like XTech, when it became clear that there were a lot of synergies between open data publishing and good (website) design. How long before this happens at Semantic Web conferences? There are a couple of papers on this topic at ISWC, and a workshop next year. But what else can we do? How do we foster some good cross-pollination?


13
Dec 06

Messages From the Future

On the Web, you need to be able to process messages from the future.

Interesting post from Mark Baker about XML validation and web services:
Validation considered harmful


2
Aug 06

Blackberries from the Garden

Blackberries from the Garden,
originally uploaded by ldodds.

Last year it was a bumper crop of raspberries. We had some, but not many blackberries. This year the raspberries didn’t do as well (although we still had heaps), but the blackberries have really done well. Over 2 kilos we’ve had out of the patch so far, and there are still more ripening. Must be all this sun.

Quite pleased. Although it doesn’t require much gardening skill!


30
Aug 05

Lunch Hour Game

Our daily office random lunch hour discussion veered into the topic of reality TV today, namely: what new shows could we make up? Come on, you’ve all done it!
Here are my contributions:
1950s Wife Swap: Like Wife Swap except you exchange spouses with a family from the 1950s. Hilarity ensues. Note: idea slightly limited by need for time travel and/or availability of character actors.
Ready, Steady, Survive!: Ray Mears takes a number of well-known cooks into the wilderness and then presents them with 5 random ingredients harvested from Nature. The winner is the chef to make the best dish out of the available bush tucker.
Habitat Swap: Presented by David Attenborough and Davina McCall this show selects two animals and forces them to swap habitats for a week. The viewers get to follow the travails of the beasts as they attempt to evolve within a week. The winner is presented with a wildlife preservation order. First guests are a red ant and a black ant.
Call Yourself a Pharaoh?: Sarah Beeny presents this show following the efforts of several tyrants to construct massive monuments and/or tombs using a thousand slaves each. Beeny provides constructive advice on managing a large scale project, e.g. transportation of massive stone blocks, costing the plaster work required for a pyramid, etc.
Any better than that?


4
Jul 05

The fruit of our labours

The fruit of our labours,
originally uploaded by ldodds.

Testing out the Flickr photo blogging feature. Thought I’d show off the massive crop of raspberries and tayberries we picked this weekend, after I’d finished landscaping the new patio. Good to get out and do something non-geeky for a change.
The kids enjoyed the fruit picking, and I’m looking forward to the blackberries ripening. Shouldn’t be too long now. Raspberry-based recipes gratefully received!


10
Mar 05

My First Computer

Sinclair ZX Spectrum
A scan of the promotional flier for the Sinclair ZX Spectrum that I carried round for months prior to my parents buying me a 48K Spectrum for Christmas.
Click through to the larger image to read the marketing text. Here’s some extracts:
“Professional power — personal computer price!”
“Your ZX Spectrum comes with a mains adaptor and all the necessary leads to connect to most cassette recorders and TVs (colour or black and white)”
“…later this year there will be Microdrives for massive amounts of extra on-line storage, plus an RS232/network interface board”
“Sound — BEEP command with variable pitch and duration”
“High speed LOAD & SAVE — 16K in 100 seconds via cassette, with VERIFY and MERGE for programs and separate data files.”
I learnt to program from those handy Spectrum BASIC manuals mentioned in the advert, supplemented with weekly doses of Input Magazine; never did get the hang of assembly or machine code though. Not beyond a few peeks and pokes lifted from the ever trusty Crash magazine, covers of which (along with C&VG) still adorn some of my old school books lurking in the attic.


3
Sep 04

Public Collections of RDF

Bob DuCharme is looking for public collections of RDF.
He’s compiled an initial list and is looking for further examples of, ideally large, data sets.


14
Jul 04

Yep That’s Me

A view of my del.icio.us bookmarks:
extisp.icio.us – ldodds
Pretty accurate with respect to my interests these days. The Java/Speech tag is overblown though just because I’ve not marked other Java related pages.
It’s just a damn shame I can’t make it to FOAFCamp or the FOAF Workshop. Family holidays and work deadlines have crowded out my schedule.
Link courtesy of Many-to-Many.