March 29, 2009

Talis Connected Commons launches

Yesterday I attended OKCon, the Open Knowledge Conference. The conference, which was attended by around 70 people (by my rough count), brought together a wide variety of people to present on topics from knowledge transfer for sustainable development through to linked data and the semantic web: a really broad set of issues that ranged from the social to the technical. While I'm not sure that the mix always worked, I came away having learnt about a number of interesting projects. I also noticed a definite theme centred on the need for easier publishing and sharing of data and information between development projects.

Which is why I was pleased to be able to announce at the end of my talk a new initiative from Talis called the Talis Connected Commons. We've been working on this plan for a while, so it was great to be able to finally share the details publicly. The essence of the scheme is that you can now host public domain data in the Talis Platform for free, and immediately use the existing Platform services to interact with that data. That covers everything from simple data access and search features through to a SPARQL endpoint, with outputs in a range of formats including RDF/XML, RSS and JSON.
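
As a rough illustration of what querying a SPARQL endpoint over HTTP involves, here is a minimal Python sketch that builds, but does not send, a query request. The endpoint URL is a placeholder rather than a real Platform address, and selecting the output format through the Accept header is an assumption; a given store may choose formats differently.

```python
from urllib.parse import urlencode, urlsplit, parse_qs
from urllib.request import Request

# Hypothetical endpoint, used purely for illustration
ENDPOINT = "http://example.org/store/sparql"

def build_sparql_request(endpoint, query, accept="application/sparql-results+json"):
    """Build (but do not send) an HTTP GET request carrying a SPARQL query."""
    # The query text travels in the "query" parameter of the URL
    url = endpoint + "?" + urlencode({"query": query})
    # Assumption: the output format is requested via the Accept header
    return Request(url, headers={"Accept": accept})

query = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 10"
request = build_sparql_request(ENDPOINT, query)

# Decode the URL again to confirm the query survived encoding
print(parse_qs(urlsplit(request.full_url).query)["query"][0])
print(request.get_header("Accept"))
```

Passing the request to urllib.request.urlopen would actually send it; that step is left out so the sketch runs without a network connection.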

There are a couple of quite reasonable conditions that apply. Firstly, the data has to be truly in the public domain, which means using one of the currently defined open data licences (i.e. CC0 or the Open Data Commons PDDL). Secondly, there's an upper limit on storage: 50 million triples and 10GB of supporting content. But that's plenty of room in which to host some interesting data.

Personally I think this is great news for open data projects as it means that there is an immediately available infrastructure and API into which you can pour your data. And, importantly, retrieve it again afterwards; there are plenty of ways to get data into and out of the Platform. This means that the focus can be on the data collection and publishing, which is where it should be.

There should be a lot of useful and interesting data sets that can be published in this way. I'm expecting that the scheme should be of immediate interest to people working with public sector information and around publishing of open scientific data. For more information on the scheme check out the homepage and the detailed FAQ.

It's great to be working for a company that takes open data this seriously. The scheme is a concrete sign of its commitment to helping build a truly open data commons. (We're hiring, btw.)

Posted by ldodds at 03:17 PM | Feedback? | Semantic Web

February 03, 2009

New SPARQL Working Group Charter

I was really pleased to see that the charter for the SPARQL Working Group is now public. Having made heavy use of SPARQL in a number of personal and commercial projects, I've hit a number of pain points which, to date, I've only been able to address by resorting to vendor-specific extensions. The key ones for me have been support for aggregates and the handling of collections and containers. These are all on the list of candidate issues for the Working Group to address, so this is good news as far as I'm concerned.
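
To illustrate why collections are awkward: an RDF collection is stored as a chain of rdf:first and rdf:rest statements, so reading every member means following that chain to the end, which a fixed-size graph pattern can't do for a list of unknown length. Here is a minimal Python sketch of that walk over an in-memory set of triples. The ex: data and the function name are made up for the example.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDF_FIRST = RDF + "first"
RDF_REST = RDF + "rest"
RDF_NIL = RDF + "nil"

def collection_members(triples, head):
    """Return the members of the RDF collection starting at `head`, in order."""
    # Index the first/rest links by list node for quick lookup
    first = {s: o for s, p, o in triples if p == RDF_FIRST}
    rest = {s: o for s, p, o in triples if p == RDF_REST}
    members = []
    seen = set()  # guards against malformed, cyclic lists
    node = head
    while node != RDF_NIL and node in first and node not in seen:
        seen.add(node)
        members.append(first[node])
        node = rest.get(node, RDF_NIL)
    return members

# Illustrative data: a two-item author list hanging off a paper
triples = {
    ("ex:paper", "ex:authors", "_:l1"),
    ("_:l1", RDF_FIRST, "ex:alice"),
    ("_:l1", RDF_REST, "_:l2"),
    ("_:l2", RDF_FIRST, "ex:bob"),
    ("_:l2", RDF_REST, RDF_NIL),
}
print(collection_members(triples, "_:l1"))  # ['ex:alice', 'ex:bob']
```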

Recently I've been thinking about ways to improve DESCRIBE queries, e.g. by adding a "USING" keyword to specify the description algorithm. Like others, I've found Concise Bounded Descriptions and similar algorithms to be extremely useful when building semantic web applications. Being able to select amongst those algorithms from within a query would be a Good Thing. As improvements to DESCRIBE may also be on the Working Group's agenda, this might save me a bit of work!
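
For readers who haven't met Concise Bounded Descriptions, the core idea can be sketched in a few lines of Python: take every statement whose subject is the resource, then keep expanding through any blank-node objects. This is a simplified sketch rather than a full implementation; it leaves out the reification step of the published definition, and it assumes blank nodes are written as strings starting with "_:".

```python
def is_blank(node):
    # Assumption for this sketch: blank nodes are strings starting with "_:"
    return isinstance(node, str) and node.startswith("_:")

def concise_bounded_description(triples, resource):
    """Collect statements about `resource`, expanding through blank-node objects."""
    by_subject = {}
    for s, p, o in triples:
        by_subject.setdefault(s, []).append((s, p, o))
    description = set()
    visited = set()
    pending = [resource]
    while pending:
        subject = pending.pop()
        if subject in visited:
            continue
        visited.add(subject)
        for triple in by_subject.get(subject, []):
            description.add(triple)
            # Only blank nodes are followed; named resources end the walk
            if is_blank(triple[2]):
                pending.append(triple[2])
    return description

# Illustrative data only
triples = [
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "ex:address", "_:a1"),
    ("_:a1", "ex:city", "Bath"),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:name", "Bob"),
]
cbd = concise_bounded_description(triples, "ex:alice")
print(len(cbd))  # 4: Bob's name is left out because ex:bob is not a blank node
```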

Of the other example topics listed in the charter document, the exploration of an XML syntax for SPARQL left me shrugging my shoulders, while the insert/update/delete capability gives me a small bit of concern: it's obviously an essential part of the language and the issue does need to be addressed, but there's more than one way to do it. Have the various options been explored? Could SPARQL support more than one update mechanism, just as it supports multiple query forms? After all, all of those forms have valid use cases and, to my mind, there are some useful trade-offs in the different approaches to handling updates too. However, I do acknowledge that SPARQL shouldn't become too unwieldy, so maybe this is just wish fulfilment on my part.

Oh, and I still think that query by reference is a good idea.

While we're going to have to wait over a year (based on current estimates) for an updated recommendation, I hope that we'll start to see some alignment between SPARQL implementations before then. Especially around the extension mechanisms and filters.

But this is one area where the semantic web community could really be doing more to help itself: the lack of, say, support for querying collections, would be less of an issue if the community co-ordinated on defining how these extensions might work; have a standard namespace for them; create implementations for the common processors; etc. This worked perfectly well for the XSLT community which addressed similar issues through the EXSLT project. With a degree of community spirit we could have more portable SPARQL queries now, and also encourage some further exploration around extensions, without having to wait for the rubber stamp of the working group. I've been wondering how best to try and foster that. If you're interested then drop me a mail or a tweet (@ldodds).

Posted by ldodds at 09:58 AM | Feedback? | Semantic Web

January 15, 2009

Interesting Papers from CIDR 2009

CIDR 2009 looks like it was an interesting conference, with a lot of very interesting papers covering a whole range of data management and retrieval issues. The full list of papers can be browsed online, or downloaded as a zip file. There's plenty of good stuff in there, ranging from the energy costs of data management, through forms of query analysis and computation on "big data", to discussions on managing inconsistency in distributed systems.

Below I've pulled out a few of the papers that particularly caught my eye. You can find some other picks and summaries on the Data Beta blog: part 1, and part 2.

Requirements for Science Databases and SciDB, from Michael Stonebraker et al, presents the results of a requirements analysis covering the data management needs of scientific researchers in a number of different fields. Interestingly, it seems that for none of the fields covered, which include astronomy, oceanography, biology, genomics and chemistry, is a relational structure a good fit for the underlying data models used in the data capture or analysis. In most cases an array-based system is most suitable, while for biology, chemistry and genomics in particular a graph database would be best; semantic web folk take note. The paper goes on to discuss the design of SciDB, which will be an open source array-based database suitable for use in a range of disciplines.

The Case for RodentStore, an Adaptive, Declarative Storage System, Cudre-Mauroux et al, introduces RodentStore, an adaptive storage system that can be used at the heart of a number of different data management solutions. The system provides a declarative storage algebra that allows a logical schema to be mapped to a specific physical disk layout. This is interesting as it allows greater experimentation within the storage engine, making it possible to explore how different layouts may be used to optimise performance for specific applications and datasets. The system supports a range of different structures, including multi-dimensional data, and the authors note that the system can be used to manage RDF data.

Principles for Inconsistency proposes some approaches for cleanly managing inconsistency in distributed applications, providing some useful additional context and implementation experience for those wrapping their heads around the notion of eventual consistency. I'm not sure that I'd follow all of these principles, mainly due to the implementation and/or storage overheads, but there's a lot of good common sense here.

Harnessing the Deep Web: Present and Future, Madhavan et al, describes some recent work at Google to explore how to begin surfacing "Deep Web" information and data into search indexes. The Deep Web is defined by them as pages that are currently hidden behind search forms and that are not currently accessible to crawlers through other means. The work essentially involved discovering web forms, analysing existing pages from the same site in order to find candidate values to fill in fields in those forms, then automatically submitting the forms and indexing the results. The authors describe how this approach can be used to help answer factual queries, and is already in production on Google. This probably explains the factual answers that are appearing on search results pages. The approach is clearly in line with Google's mission to do as much as possible with statistical analysis of document corpora; there's very little synergy with other efforts going on elsewhere, e.g. linked data. There is reference to how understanding the semantics of forms, in particular the valid range of values for a field (e.g. a zip code) and co-dependencies between fields, could improve the results, but the authors also note that they've achieved a high level of accuracy in automated approaches to identifying common fields such as zip code, etc. A proposed further avenue for research is exploration of whether the contents of an underlying relational database can be reconstituted through automated form submission and scraping of structured data from the resulting pages. Personally I think there are easier ways to achieve greater data publishing on the web! The authors reference some work on a search engine specifically for data surfaced in this way, called Web Tables, which I've not looked at yet.
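
The form-submission step can be pictured with a small Python sketch: given a form's action URL and some candidate values per field, enumerate the URLs a crawler might fetch. This is only an illustration of the general idea, not the authors' implementation; the function name, the example.org form and the candidate values are all invented, and it assumes a form submitted by GET.

```python
from itertools import product
from urllib.parse import urlencode

def candidate_form_urls(action_url, field_candidates, limit=100):
    """Build GET URLs for combinations of candidate field values."""
    names = sorted(field_candidates)  # fixed field order keeps output stable
    urls = []
    for values in product(*(field_candidates[n] for n in names)):
        if len(urls) >= limit:
            break  # cap the number of submissions generated per form
        urls.append(action_url + "?" + urlencode(list(zip(names, values))))
    return urls

# Hypothetical form with two fields and a handful of candidate values
urls = candidate_form_urls(
    "http://example.org/search",
    {"zip": ["94043", "10001"], "type": ["house"]},
)
for url in urls:
    print(url)
```

The cap matters because the number of combinations grows multiplicatively with each extra field, which is one reason knowing the valid values for a field helps.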

DBMSs Should Talk Back Too, Yannis Ioannidis and Alkis Simitsis, describes some work to explore how database query results and queries themselves can be turned into human-readable text (i.e. the reverse of a typical natural-language query system), arguing that this provides a good foundation for building more accessible data access mechanisms, as well as allowing easier summarisation of what a query is going to do, in order to validate it against the user's expectations. The conversion of queries to text was less interesting to me than the exploration of how to walk a logical data model to generate text. I've very briefly explored summarising data in FOAF files, in order to generate an audible report using a text-to-speech engine, and so it was interesting to me to see that the authors were using a graph-based representation of the data model to drive their engine. Class and relation labelling, with textual templates, are a key part of the system, and it seems much of this would work well against RDF datasets.
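
A minimal Python sketch of how labels plus textual templates might be applied to RDF-style triples follows. It is not the system described in the paper; the labels, templates, data and function names are all invented for illustration.

```python
# Hypothetical labels for resources and sentence templates for relations
LABELS = {"ex:alice": "Alice", "ex:bob": "Bob", "ex:acme": "Acme Ltd"}
TEMPLATES = {
    "foaf:knows": "{subject} knows {object}.",
    "ex:worksFor": "{subject} works for {object}.",
}

def label(node, labels):
    # Fall back to the raw identifier when no label is known
    return labels.get(node, node)

def describe(triples, labels, templates):
    """Turn each triple with a known template into a sentence."""
    sentences = []
    for s, p, o in triples:
        template = templates.get(p)
        if template is None:
            continue  # relations without a template are skipped
        sentences.append(
            template.format(subject=label(s, labels), object=label(o, labels))
        )
    return " ".join(sentences)

triples = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "ex:worksFor", "ex:acme"),
    ("ex:alice", "ex:shoeSize", "7"),
]
print(describe(triples, LABELS, TEMPLATES))
# Alice knows Bob. Alice works for Acme Ltd.
```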

SocialScope: Enabling Information Discovery on Social Content Sites, Amer-Yahia et al, is a broad paper that introduces SocialScope, a logical architecture for managing, analysing and presenting information derived from social content graphs. The paper introduces a logical algebra for describing operations on the social graph, e.g. producing recommendations based on analysis of a user's social network; introduces a categorisation for types of content present in the social graph and means for managing it; and also discusses some ways to present results of searches against the content graph (e.g. for travel recommendations) using different facets and explanations of how recommendations are derived.

Posted by ldodds at 11:52 AM | Feedback? | Programming