Programming


6
Apr 10

Linked Data Patterns: a free book for practitioners

A few months ago Ian Davis and I were chatting about some new approaches to helping practitioners climb the learning curve around Linked Data, RDF and related technologies. We were both keen to help communicate the value of Linked Data, share knowledge amongst practitioners, and to encourage the community to converge on best practices. We kicked around a number of different ideas in this vein.

For example, Ian was keen to provide guidance as to how to mix and match different vocabularies to achieve a particular goal, like describing a person or a book. Having a ready reference containing recipes for these common tasks would address a number of goals. He’s ended up exploring that idea further in the recently released Schemapedia. If you’ve not seen it yet, then you should take a look. It provides a really nice way to navigate through RDF vocabularies and explore their intersections.

The other thing that we discussed was Design Patterns. I’ve been a Design Pattern nut for some time now. Discovering them was something of a rite of passage for me during my Master’s dissertation. I’d spent weeks revising and honing a design for the distributed system I was building, only to discover that what I’d produced was already documented as a design pattern in an obscure corner of the research literature. While I’d clearly reinvented the wheel, the discovery not only provided external validation for what I’d produced, but also neatly illustrated the benefit of using design patterns to share knowledge and experience within a community. Knowing when to apply particular patterns is a key skill for any developer, and the terms are a part of the design vocabulary we all share.

I suggested to Ian that we explore writing some patterns for Linked Data: patterns for assigning identifiers, modelling data, and application development. We experimented with this for a while but ended up parking the discussion for a few months whilst other priorities intervened.

I recently revived the project. It’s pretty clear to me that there’s still a big skills gap between experienced practitioners and those seeking to apply the technology. I think the current situation is reminiscent of the move of OO programming from the research lab out into the developer community; design patterns played a key role there too.

Ian and I have decided to share this with the community as an on-line book, a pattern catalogue that covers a range of different use cases. We started out with about half a dozen patterns, but over the last few weeks I’ve expanded that figure to thirty. I’ve still got a number on my short-list (more than a dozen, I think) but it’s time to start sharing this with the community. The work won’t ever be complete, as the space is still unfolding; it will just get refined over time.

You can read the book online at http://patterns.dataincubator.org.

The work is licensed under a Creative Commons Attribution license so you’re free to use it as you see fit, but please attribute the source. If you want to download it, then there’s a PDF, and an EPUB too. We’re using DocBook for the text so there will be a number of different access options.

I’ll stress that this is a very early draft, so be gentle. But we’d love to hear your comments.


20
Oct 09

Surveying and Classifying SPARQL Extensions

I realised recently that, while a lot of work has been done on creating and exploring interesting extensions to the SPARQL query language, there has yet to be a systematic survey of the range of different extensions that are currently implemented in various RDF triplestores. Or if there has been a survey, then I’ve clearly missed it.

In order to get a better idea of what kinds of extensions are available I’ve set myself the task of surveying those currently implemented. I intend to write up and share the results of that work through this blog.

Rationale

I think that pulling together a list of extensions is a useful activity which should:

  • Help researchers and implementors to have a clearer view of existing work, thereby encouraging further experimentation
  • Promote convergence on a core set of useful extensions that could be implemented across a number of triplestores
  • Help users to have a clearer understanding of what SPARQL extensions are currently supported in particular triplestores, letting them make informed decisions about which extensions to use when writing and sharing queries

It looks like the SPARQL Working Group may well be adding a standard library of extension functions into the next revision of the query language, so the timing of this work should help contribute to that effort. However, I’m looking beyond their immediate goals and hope to encourage the implementor community to explore models similar to the EXSLT effort, which has been successful in creating a set of community-designed extensions for XSLT transformations. I see no reason why the same process can’t be applied to SPARQL extensions.

Clarity of which extensions are portable across triplestores is important to allow users to experiment with various triplestore implementations and services. If data is going to be truly portable, then this will be an important consideration.

With that in mind I’ve begun digging into the available documentation for a number of different triplestores. I’ve decided to organize my work by surveying each of the three different types of SPARQL extension.

Types of SPARQL Extension

It’s possible to extend the SPARQL query language in any of the following three ways:

  • Extension Functions
  • Property Functions (aka “Magic Predicates”)
  • Language Extensions

Let’s look at each of these in turn.

Extension Functions

Extension Functions are explicitly described by the current SPARQL specification under the banner of “extensible value testing”. The standard library of extensions that may be added to SPARQL 1.1 will fall into this category. Extension Functions are simple function calls that can be used within a FILTER in a SPARQL query to carry out some specific extra logic that cannot be handled by matching triple patterns. Examples of extension functions include substring testing, string concatenation, date tests, etc.

The specification indicates that these extension functions should have a unique URI, allowing them to be globally identified. Few engines are publishing useful information at these URIs, but this seems like it would be a useful thing to do. These URIs should be grounded in the web too.
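
As a rough illustration of the idea, and not how any particular engine implements it, the Python sketch below keeps a registry of extension functions keyed by URI and applies one as a FILTER over a set of query solutions. The `http://example.org/sparql/fn#` namespace, the two function names and the `apply_filter` helper are all assumptions made up for this example.

```python
from typing import Callable, Dict, Iterable, Iterator

# Hypothetical namespace for the function URIs; invented for this sketch.
EX = "http://example.org/sparql/fn#"

# Registry mapping each function URI to a callable returning True or False.
EXTENSION_FUNCTIONS: Dict[str, Callable[..., bool]] = {
    EX + "contains": lambda text, fragment: fragment in text,
    EX + "startsWith": lambda text, prefix: text.startswith(prefix),
}


def apply_filter(
    solutions: Iterable[dict], function_uri: str, variable: str, *args: str
) -> Iterator[dict]:
    """Keep only the solutions for which the named function returns True."""
    function = EXTENSION_FUNCTIONS[function_uri]  # KeyError if unsupported
    for solution in solutions:
        if function(solution[variable], *args):
            yield solution


# Solutions as they might come back from matching the triple patterns.
solutions = [
    {"title": "Linked Data Patterns"},
    {"title": "Unit Testing PL/SQL"},
]

# Roughly the effect of a FILTER that calls the "contains" function on ?title.
matches = list(apply_filter(solutions, EX + "contains", "title", "Patterns"))
```

Because the function is looked up by its URI, a query that names a function the engine doesn’t know fails at the lookup. That is the portability problem in miniature.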

Property Functions

Property Functions (aka “Magic Predicates”, or “Magic Properties”) are extensions to the triple matching process that is carried out when a SPARQL query is executed. This means that property functions don’t appear in a FILTER expression like an extension function. They instead appear within the graph pattern of the query. Unlike extension functions, which have a syntax like a conventional function call, property functions use Turtle syntax and appear, to the untrained eye, as standard triple patterns.

For example, a property function that could split a resource URI into a namespace and a localname might look like this in a SPARQL query:


?uri a rdfs:Class.
?uri ex:splitURI (?namespace ?localname).

In that example the property function ex:splitURI has as its input each of the URIs that are bound to the ?uri variable, and as its output binds the namespace URI and localname of those URIs to two new variables.
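
To make the input and output binding concrete, here is a minimal Python sketch of what an engine might do when it meets that triple pattern. It assumes the split happens at the last `#` or `/` in the URI, which is only one plausible rule; `split_uri` and `bind_split_uri` are hypothetical names for this example rather than part of any real triplestore.

```python
from typing import Iterable, Iterator, Tuple


def split_uri(uri: str) -> Tuple[str, str]:
    """Split a URI into (namespace, localname) at the last '#' or '/'.

    Assumed rule for this sketch; a real property function may differ.
    """
    cut = max(uri.rfind("#"), uri.rfind("/"))
    if cut == -1:
        return "", uri  # no separator found, so treat it all as localname
    return uri[: cut + 1], uri[cut + 1 :]


def bind_split_uri(solutions: Iterable[dict]) -> Iterator[dict]:
    """For each solution binding ?uri, add ?namespace and ?localname."""
    for solution in solutions:
        namespace, localname = split_uri(solution["uri"])
        yield {**solution, "namespace": namespace, "localname": localname}


# Solutions as they might come back from matching "?uri a rdfs:Class".
classes = [
    {"uri": "http://xmlns.com/foaf/0.1/Person"},
    {"uri": "http://www.w3.org/2000/01/rdf-schema#Class"},
]
results = list(bind_split_uri(classes))
```

Each incoming solution is extended rather than filtered, which is the key difference from an extension function used in a FILTER.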

There are other ways to structure the inputs and outputs of a property function, depending on its purpose, but the important things to recognise are that:

  • the property function is written as a conventional triple pattern
  • parameters can be passed from either the subject or object portions of the triple (or potentially both)
  • similarly, output can be bound to variables that appear in either the subject or object portions of the triple
  • one technique for passing multiple parameters or generating multiple output values is to allow specification of an RDF list in the object portion of the triple

Property functions are very powerful as they can allow arbitrarily complex logic to be used to extend the triple matching process. One common use is to extend the matching process by calling out to specialised indices or logic, e.g. for full-text indexing or geospatial functions and reasoning.

It is worth noting that Property Functions are not explicitly licensed by the current SPARQL specification. The specification does not describe them at all: they are simply allowed by the fact that they conform to the overall SPARQL grammar.

Testing whether a query uses Property Functions would therefore require a validator (such as the one that Dan Brickley describes here) to either have explicit knowledge of the function, e.g. based on its URI, or for implementors to publish some useful information at those locations so that a validator might determine whether a specific predicate is actually a “real” predicate or an extension through dereferencing the URI. I’m not aware of any implementation that currently does this.

Language Extensions

The final category of SPARQL extensions is extensions to the language itself. This type of extension involves amending the grammar of the language to include new operators, keywords, and types of expression. Examples of this type of extension include sub-queries and aggregates (e.g. min and max). The forthcoming SPARQL 1.1 specification will standardise these and a few other language extensions that have been commonly implemented.

Arguably, if you change the grammar of a language then you’re creating a new language: “SPARQL plus some extensions”. So some care needs to be taken with this type of extension if you want queries to be portable.

In my view, while there is plenty of scope for the community to collaborate and converge on common extensions of all of the types I’ve described here, the best place for language extensions to be formally ratified and agreed on is through the SPARQL Working Group. I personally don’t expect the Working Group to have to, or want to, sign off on every extension function or property function, but interoperability is ultimately best served by co-ordinating language extensions through the Working Group. Naturally this should happen after the implementor community have had a period of experimentation and research. This is obviously the process that has happened to date, and hopefully this will continue as the language continues to evolve. A bit of collective action ought to help ensure interoperability in other areas.

A Survey

For my survey of SPARQL extensions I’ve decided to tackle things in the order in which I have presented them here: I will first look at Extension Functions, then Property Functions, and then Language Extensions. For the reasons I’ve already outlined, I think the community is best served by organizing itself around standardising two of those types of extensions. And Extension Functions seem like the lowest hanging fruit.

I’m intending to do the survey in as open a way as possible, and want to ensure that I include as many different implementations as possible. Having said that, initially I’m going to impose some editorial control simply to ensure consistency and quality. Implementors, feel free to drop me a line providing me with information on your extensions or, preferably, pointers to the relevant documentation. I’ll also stress that while this survey has obvious relevance for my day job, this is a personal project, so things will progress as quickly as I’m able to find some time to push things forward.

I’m going to send regular status updates to the public-sparql-dev mailing list as that is the correct place for further discussion. I’ll also summarize my findings in further blog posts here. I’ve already begun the process of cataloguing Extension Functions as you can see by my recent email to the mailing list. I still have to include some additional information helpfully provided by OpenLink and to also update the entries for Mulgara to list its support for some of the EXSLT functions.

One other task I have on my list is to help provide some guidance on how implementors should publish information about their SPARQL extensions. It would be useful to have some descriptive metadata for these available from the relevant URIs. I’m intending to spend some time at Vocamp DC pulling together a vocabulary for that purpose. Let me know if you’re attending and want to collaborate.


15
Jan 09

Interesting Papers from CIDR 2009

CIDR 2009 looks like it was an interesting conference, with a lot of very interesting papers covering a whole range of data management and retrieval issues. The full list of papers can be browsed online, or downloaded as a zip file. There’s plenty of good stuff in there, ranging from the energy costs of data management and forms of query analysis and computation on “big data” to discussions on managing inconsistency in distributed systems.
Below I’ve pulled out a few of the papers that particularly caught my eye. You can find some other picks and summaries on the Data Beta blog: part 1, and part 2.
Requirements for Science Databases and SciDB, from Michael Stonebraker et al, presents the results of a requirements analysis covering the data management needs of scientific researchers in a number of different fields. Interestingly, it seems that for none of the fields covered, which include astronomy, oceanography, biology, genomics and chemistry, is a relational structure a good fit for the underlying data models used in the data capture or analysis. In most cases an array-based system is most suitable, while for biology, chemistry and genomics in particular a graph database would be best; semantic web folk take note. The paper goes on to discuss the design of SciDB, which will be an open source array-based database suitable for use in a range of disciplines.
The Case for RodentStore, an Adaptive, Declarative Storage System, Cudre-Mauroux et al, introduces RodentStore, an adaptive storage system that can be used at the heart of a number of different data management solutions. The system provides a declarative storage algebra that allows a logical schema to be mapped to a specific physical disk layout. This is interesting as it allows greater experimentation within the storage engine, allowing exploration of how different layouts may be used to optimise performance for specific applications and datasets. The system supports a range of different structures, including multi-dimensional data, and the authors note that the system can be used to manage RDF data.
Principles for Inconsistency proposes some approaches for cleanly managing inconsistency in distributed applications, providing some useful additional context and implementation experience for those wrapping their heads around the notion of eventual consistency. I’m not sure that I’d follow all of these principles, mainly due to the implementation and/or storage overheads, but there’s a lot of good common sense here.
Harnessing the Deep Web: Present and Future, Madhavan et al, describes some recent work at Google to explore how to begin surfacing “Deep Web” information and data into search indexes. The Deep Web is defined by them as pages that are currently hidden behind search forms and that are not currently accessible to crawlers through other means. The work essentially involved discovering web forms, analysing existing pages from the same site in order to find candidate values to fill in fields in those forms, then automatically submitting the forms and indexing the results. The authors describe how this approach can be used to help answer factual queries, and is already in production on Google. This probably explains the factual answers that are appearing on search results pages. The approach is clearly in line with Google’s mission to do as much as possible with statistical analysis of document corpora; there’s very little synergy with other efforts going on elsewhere, e.g. linked data. There is reference to how understanding the semantics of forms, in particular the valid range of values for a field (e.g. a zip code) and co-dependencies between fields, could improve the results, but the authors also note that they’ve achieved a high level of accuracy in automated approaches to identifying common fields such as zip code, etc. A proposed further avenue for research is exploration of whether the contents of an underlying relational database can be reconstituted through automated form submission and scraping of structured data from the resulting pages. Personally I think there are easier ways to achieve greater data publishing on the web! The authors reference some work on a search engine specifically for data surfaced in this way, called Web Tables, which I’ve not looked at yet.
DBMSs Should Talk Back Too, Yannis Ioannidis and Alkis Simitsis, describes some work to explore how database query results and queries themselves can be turned into human-readable text (i.e. the reverse of a typical natural-language query system), arguing that this provides a good foundation for building more accessible data access mechanisms, as well as allowing easier summarisation of what a query is going to do, in order to validate it against the user’s expectations. The conversion of queries to text was less interesting to me than the exploration of how to walk a logical data model to generate text. I’ve very briefly explored summarising data in FOAF files, in order to generate an audible report using a text-to-speech engine, and so it was interesting to me to see that the authors were using a graph-based representation of the data model to drive their engine. Class and relation labelling, with textual templates, are a key part of the system, and it seems much of this would work well against RDF datasets.
SocialScope: Enabling Information Discovery on Social Content Sites, Amer-Yahia et al, is a broad paper that introduces SocialScope, a logical architecture for managing, analysing and presenting information derived from social content graphs. The paper introduces a logical algebra for describing operations on the social graph, e.g. producing recommendations based on analysis of a user’s social network; introduces a categorisation for types of content present in the social graph and means for managing it; and also discusses some ways to present results of searches against the content graph (e.g. for travel recommendations) using different facets and explanations of how recommendations are derived.


14
Jun 05

IANAL

In an attempt to put my various projects into the Public Domain, it seems that I’ve caused some confusion.
All I want to do is the following: label my code as being in the Public Domain, but require that people at least acknowledge the fact that they’re using something I wrote. I’d prefer it if people didn’t take anything I wrote and make a quick buck out of it, but I’m not averse to my code being bundled in a payware application. But that’s a nice-to-have; I basically just want to give stuff away.
This led me to start adding Creative Commons licences to my work. The Attribution-ShareAlike licence seemed to exactly cover my requirements. Previously I’d either not included a licence, or labelled it as “Public Domain”. But I’d seen some code I wrote used verbatim with someone else’s name on it and that naturally upset me. I won’t go into details about who or where, but how hard is it to add an @author tag to Java source (or better yet, leave the one that’s already in there)?
So anyway, the CC licence seemed to fit. However when the Jaikoz developers contacted me a few months ago about reusing my MusicBrainz API they weren’t sure whether they could, as their application is payware. I said they could.
This week Henning Koch emailed me under similar confusion: would his application have to be similarly licensed? I didn’t think so, and that certainly wasn’t my intention. Koch pointed me at the CC FAQ entry that I’d stupidly overlooked:
CC licenses are not written for software. They should not be used for software…
But which one of the many licences should I use? Why does it have to be so difficult to give stuff away? I know creating new open source licences is discouraged but, to be frank, it’s not that easy to pick and choose, and I’m not sure I want to wade through endless legal documents: I want to give stuff away, but be acknowledged. That’s it.
Why does open sourcing software have to be so difficult? It seems to me that the Creative Commons folk could help clean up this mess. They’re wading into scientific research, so why not software?
A nice example of how broken software licensing is, is this summary of the Creative Commons licences from a Debian perspective. Conclusion: they’re not free. Debian has a reputation for being particularly prescriptive, but this seems a little barking.
I guess the answer is either RTFL (Read the F’ing Licence), or just switch back to a plain “This work is in the Public Domain” statement.


28
Apr 04

Programmers Are Interesting

Another great article from Sean McGrath: The mysteries of flexible software. Bang on the money.
I don’t know how many times I’ve encountered software (and yes, some of my own devising) that has all sorts of wonderful flexibility but in all the wrong places. Time spent factoring applications into layer cakes, introducing endless layers of abstraction, may have some benefits, but exactly how often do you go through an application and rip out the entire persistence layer? And when you do, what’s the biggest hurdle: changing the code, or the data migration and testing involved to guarantee that you’ve not broken any of the data integrity? Exactly how often do you swap in and out XML parsers and XSLT engines?
I’ve been a keen advocate of design patterns for some time, but it’s easy to get carried away: achieving a particular design pattern (or set of patterns) becomes a requirement in itself and that in all likelihood isn’t going to affect the success of a product. The “Just Do It” aspect to XP is one obvious reaction to that experience. Renewed interest in more flexible, easy to author languages like Python is perhaps another.
Abstraction ought to be a defense mechanism. If a particular area of code or functionality is continually under change, then introduce some abstraction to help manage that change. Trust your ability to refactor. Don’t over-architect too early.


5
Dec 03

Unit Testing PL/SQL

For my sins I’ve been writing a bit of PL/SQL recently. It’s been nearly 4 years since I had to do that in anger and predictably I’ve forgotten way more than I remember. At the time I was responsible for redesigning the database for a Laboratory Information Management System used by researchers at Pfizer looking for new drugs. After redesigning the data model I had to write code to port from one to the other. That was a lot of code, and required a lot of testing. Fun project though, and an interesting application.
Of course now I know all about test driven development and the first thing that occurred to me was: “how do I test this stuff?”
