December 24, 2004

IngentaConnect RSS Feeds

It's with some pleasure that I'm able to announce a little Xmas present from the technology team at Ingenta to the academic and RDF communities: another batch of RSS feeds from IngentaConnect.

To be precise: in excess of 20,000 new RSS feeds containing the latest table of contents data for the academic journals that are still being actively loaded into our databases. Like our friends at Nature, our feeds are available as RSS 1.0, with Dublin Core and Prism metadata at the item level.

Read on for the technical details, including a seasonal(?!) "Easter Egg".

The feeds are clearly advertised from the journal home pages, using the well-known orange RSS icon. Auto-discovery links are also included to make one-click subscriptions possible. If you look at a journal home page, e.g. the Journal of Consciousness Studies (JCS), you'll also see a link to the "archival" feeds that we've been publishing for a few months now: an RSS feed of the most recent issues of that journal. Again, the feeds are RSS 1.0 + Dublin Core + PRISM, and publishing them at both levels allows us to support varied uses of the data, e.g. import into library systems as well as current awareness applications. I know Richard Cameron has been waiting to add our TOC feeds to CiteULike, so I'll be interested to see how people use them.

If the homepage doesn't sport an RSS icon then it's quite likely that we're no longer hosting the most recent content of the journal. This happens sometimes for contractual reasons, but also because journals do fold from time to time (or become merged with other titles). Not all publishers have yet bought into the merits of metadata syndication in this form, and so in one or two places I've been cautious and blocked some titles. This is also the only reason why abstracts are not yet present in the feeds, but this is something I am pursuing further.

Because of the way the feeds are constructed, there is actually an RSS view of every issue on our site; it's just that the advertised feeds use a symbolic pointer to the latest issue. Continuing the above example, you'll see that the JCS feed is at:

http://api.ingentaconnect.com/content/imp/jcs/latest?format=rss

However if you take an issue level URL such as:

http://www.ingentaconnect.com/content/imp/jcs/2004/00000011/00000001

you can extract RSS by simply tweaking the URL to point to api.ingentaconnect.com and adding a format=rss parameter, like so:

http://api.ingentaconnect.com/content/imp/jcs/2004/00000011/00000001?format=rss
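The same tweak is easy to script. What follows is a minimal Python sketch, not anything from the Ingenta codebase: the function name `to_api_url` is made up for illustration, and it assumes the only changes needed are the ones described above (swap the host, add a format parameter).

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

API_HOST = "api.ingentaconnect.com"


def to_api_url(page_url: str, fmt: str = "rss") -> str:
    """Point a site URL at the API host and request the given format."""
    parts = urlsplit(page_url)
    # Keep any existing query parameters, but make sure "format" appears once.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "format"]
    query.append(("format", fmt))
    return urlunsplit((parts.scheme, API_HOST, parts.path,
                       urlencode(query), parts.fragment))


issue = "http://www.ingentaconnect.com/content/imp/jcs/2004/00000011/00000001"
print(to_api_url(issue))
```

Run against the issue URL above, this should print the api.ingentaconnect.com address shown in the text.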

The api.ingentaconnect.com domain is the location from which we're going to be building out our public web services. Seems easier to manage that way.

With an RDF/RSS view of any of our issues available so easily, this release constitutes more than just a bunch of RSS feeds: it actually exposes the majority of our database to the web as RDF. Hence the "Xmas present" to the RDF community. High on my TODO list for next year is to make this RDF view more explicit by adding a plain RDF view of the site (basically the same metadata shorn of the RSS trappings). FOAF support is another avenue I'm actively exploring.

The "Easter Egg" I added to the site is, ahem, an OPML export of the top-level browse function. Specifically the alphabetic, subject area, and title keyword search functionality. This will allow users to quickly subscribe to whole sections of the catalogue.

For example, if you browse the Nursing section of the site and do a "View Source" you'll spot an OPML "auto-discovery" link. The use of quotes there is to indicate that, as ever, there seems to be little standardisation in this area of OPML usage. The OPML link is as follows:

http://api.ingentaconnect.com/browsing/BrowseBySubject?j_subject=220&format=opml&j_pagesize=-1

i.e. again change the sub-domain to "api" and add a format parameter, this time requesting an OPML view. The page size indicator is a hack to request all items in the category. If it's omitted you'll just get the first page's worth of results, sorted by journal name.
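Once you have the OPML, pulling out the feed addresses for bulk subscription takes only a few lines. This Python sketch rests on an assumption: that each journal appears as an outline element carrying an xmlUrl attribute, which is the common OPML convention but is not confirmed here for this export. The sample document and its example.org URLs are invented for illustration.

```python
import xml.etree.ElementTree as ET


def feed_urls_from_opml(opml_text: str) -> list[str]:
    """Return the xmlUrl of every outline element that has one."""
    root = ET.fromstring(opml_text)
    return [o.attrib["xmlUrl"] for o in root.iter("outline")
            if "xmlUrl" in o.attrib]


# Hypothetical sample; the layout of the real export may differ.
SAMPLE = """<opml version="1.1">
  <head><title>Nursing</title></head>
  <body>
    <outline text="Example Journal A" type="rss"
             xmlUrl="http://example.org/feeds/a.rss"/>
    <outline text="Example Journal B" type="rss"
             xmlUrl="http://example.org/feeds/b.rss"/>
    <outline text="A folder with no feed of its own"/>
  </body>
</opml>"""

for url in feed_urls_from_opml(SAMPLE):
    print(url)
```

Outlines without an xmlUrl (folders, for instance) are simply skipped.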

Expect more features in this area over the coming months, including feeds of new journals, and RDF Blogroll support; I've had a "Planet" aggregator running against a slice of our feeds for a few weeks now without any problems.

Lastly (and maybe this should have been first), if you spot any problems then please report them to me at ldodds@ingenta.com. I'll endeavour to turn fixes round as fast as possible; just please allow us some time given the season, holiday periods, etc. I'm aware of one niggle already and I'm certain there are others.

Enjoy, and happy holidays.

Posted by ldodds at 08:23 PM

December 13, 2004

FOAF-a-Matic now available in Dutch and Chinese

I'm pleased to announce that the FOAF-a-Matic is now available in two additional languages: Traditional Chinese and Dutch. (Those are the direct links; content negotiation will give you a suitable version automatically if you use the usual link.)

I actually got sent two Chinese translations so I should thank both Ilya Eric Lee and Chientai Chen. Ilya's was the first I received so I opted for a "first-come-first-published" policy and that's the translation that is currently live.

I'd also like to thank Ben Dunselman for contributing the Dutch translation.

(I should also apologise to all three for the length of time it's taken for me to get these live.)

So that brings the grand total up to 12 languages. Excellent. In case anyone is taking requests I'd really like to see Russian and Arabic translations.

Posted by ldodds at 08:40 PM

December 09, 2004

Slug: A Simple Semantic Web Crawler

Back in March I was tinkering with writing a Scutter. I'd never written a web crawler before, so was itching to give it a go as a side project. I decided to call it Slug because I was pretty sure it'd end up being slow and probably icky; crafting a decent web crawler is an art in itself.

I got as far as putting together a basic framework that did the essential stuff: reading a scutter plan, fetching the documents using multi-threaded workers, etc. But I ended up getting sucked into a work project that ate up all my time so didn't get much further with it.

Anyway, because the world is obviously sorely in need of another half-finished Scutter implementation, I've spent a few hours this evening tidying up some of the code so that it's suitable for sharing.

If you're just interested in the code, then let's get the links out of the way first:

The code is published under a Creative Commons Attribution-ShareAlike licence.

To run the code using the supplied batch file (sorry, don't have access to a *nix box at the moment to add a shell script) do the following from the directory into which you unpack the zip:


slug -mem memory.rdf -workers 10 -plan sample-plan.rdf

This will kick off a scutter with 10 worker threads, as well as telling it where to find its memory and new scutter plan.

As Slug is basically a prototype it doesn't do anything clever with what it finds. It simply GETs every URL from its RDF scutter plan and writes a copy of the original RDF file to the filesystem, which it then parses with Jena to find any seeAlso's. The new URLs it finds as a result are then added to its ongoing list of tasks. And so on ad infinitum: it'll just keep on sliming its way across the semantic web until you kill it. You can merrily Ctrl-C the process as there's a shutdown hook registered that'll ensure the process tidies up after itself.

The reason it doesn't add the triples directly to a triple store is because I wanted to be able to collect a chunk of RDF files locally for processing in different ways, e.g. to test out smushing algorithms, look for common authoring mistakes, etc. By default these files are stored in a slug-cache directory under your user home -- but you can override that with the -cache parameter.
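Judging by the localCopy value in the memory extract further down, the cache layout looks like host name followed by URL path. A small Python sketch of that mapping, with an invented function name and no claim to match Slug's actual logic (it ignores query strings, ports and URLs ending in a slash):

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def cache_path(url: str, cache_dir: str) -> PurePosixPath:
    """Map a URL to <cache_dir>/<host>/<path>."""
    parts = urlsplit(url)
    # Drop the leading "/" so the URL path nests under the host directory.
    return PurePosixPath(cache_dir) / parts.netloc / parts.path.lstrip("/")


print(cache_path("http://heddley.com/edd/foaf.rdf", "/home/user/slug-cache"))
```

The cache directory here is just a placeholder for wherever your slug-cache ends up.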

The one novel thing it does do (at least as far as I'm aware) is to use the ScutterVocab to record what it did when. This is what gets stored in the memory. Here's an extract from the example included in the distribution:


<scutter:Representation>
<scutter:source rdf:resource="http://heddley.com/edd/foaf.rdf"/>
<scutter:origin rdf:resource="http://ldodds.com/ldodds-knows.rdf"/>
<scutter:origin rdf:resource="http://rdfweb.org/people/danbri/rdfweb/webwho.xrdf"/>
<scutter:origin rdf:resource="http://www.simonstl.com/foaf.rdf"/>
<scutter:fetch>
<scutter:Fetch>
<dc:date>2004-12-09T21:57:03+0000</dc:date>
<scutter:status>200</scutter:status>
<scutter:contentType>application/rdf+xml</scutter:contentType>
<scutter:lastModified>Mon, 05 Jul 2004 13:52:28 GMT</scutter:lastModified>
<scutter:etag>"adc416-2741-40e95d1c"</scutter:etag>
<scutter:rawTripleCount>164</scutter:rawTripleCount>
</scutter:Fetch>
</scutter:fetch>
<scutter:localCopy>...\ldodds\slug-cache\heddley.com\edd\foaf.rdf</scutter:localCopy>

<scutter:origin rdf:resource="http://www.ldodds.com/ldodds-knows.rdf"/>
<scutter:origin rdf:resource="http://eikeon.com/foaf.rdf"/>
<scutter:fetch rdf:nodeID="A164"/>
<scutter:latestFetch rdf:nodeID="A164"/>
<scutter:origin rdf:resource="http://www.grorg.org/dean/foaf.rdf"/>
<scutter:origin rdf:resource="http://www.wachob.com/foaf.rdf"/>
<scutter:origin rdf:resource="http://weblog.greenpeace.org/foaf.rdf"/>
<scutter:origin rdf:resource="http://chimpen.com/foaf.rdf"/>
...
</scutter:Representation>

The source property indicates the source URL of the Representation, and the origin properties indicate references to it from elsewhere.

The Scutter stores the results of its GET in a Fetch resource that includes details such as date of fetch, HTTP response codes, Last-Modified and ETag headers (Slug supports Conditional-GET behaviour), and the number of triples in the file. If Slug encountered an error then a Reason is recorded too -- it'll also avoid refetching that URL. See the ScutterVocab page for more details.
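For anyone unfamiliar with Conditional GET, the idea is to send the previously recorded ETag and Last-Modified values back to the server, which answers 304 Not Modified if nothing has changed. A Python sketch of the header-building half, assuming the last Fetch has been read into a plain dict keyed like the properties in the extract (the dict shape and function names are illustrative, not Slug's):

```python
def conditional_headers(last_fetch: dict) -> dict:
    """Build request headers from the validators recorded on the last fetch."""
    headers = {}
    if last_fetch.get("etag"):
        headers["If-None-Match"] = last_fetch["etag"]
    if last_fetch.get("lastModified"):
        headers["If-Modified-Since"] = last_fetch["lastModified"]
    return headers


def needs_reparse(status: int) -> bool:
    """A 304 response means the cached copy is still current."""
    return status != 304


# Values taken from the memory extract above.
previous = {
    "etag": '"adc416-2741-40e95d1c"',
    "lastModified": "Mon, 05 Jul 2004 13:52:28 GMT",
}
print(conditional_headers(previous))
```

A URL that has never been fetched has no validators, so the function returns an empty dict and the request goes out as an ordinary GET.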

That's pretty much it. No fancy crawling strategies, no loop detection, no clever handling of HTML responses to look for referenced metadata, and no LiveJournal avoidance tactics. If you want to do something more clever with it though, then the framework is reasonably extensible:

For example if you want to put the triples directly into a triple store, then just add a new Consumer implementation. The DelegatingConsumerImpl I'm already using can create a simple pipeline for handling results of a GET.

Or if you want to add on a user interface then there are hooks for that too, look at the Controller and Monitor interfaces. There are methods there for monitoring how many threads are active, and dynamically adjusting the number of workers.

But if you're just interested in analysing links between resources on the semantic web, getting estimates of numbers of triples, or analysing the RDF that's out there to look for common authoring mistakes, etc., then just collecting data in Slug's memory and offline cache may be sufficient for your needs.
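To illustrate the Consumer pipeline idea mentioned above, here is a Python sketch of the delegating pattern. Slug's real interfaces are Java and their method signatures aren't shown in this post, so the `consume(url, body)` method and the two stand-in consumers are invented purely for illustration.

```python
from typing import Iterable, Protocol


class Consumer(Protocol):
    # Hypothetical method name; Slug's real Java interface is not shown here.
    def consume(self, url: str, body: str) -> None: ...


class DelegatingConsumer:
    """Pass each fetched document to every delegate, in order."""

    def __init__(self, delegates: Iterable[Consumer]) -> None:
        self.delegates = list(delegates)

    def consume(self, url: str, body: str) -> None:
        for delegate in self.delegates:
            delegate.consume(url, body)


class CacheWriter:
    """Stand-in for the step that keeps a local copy of each file."""

    def __init__(self) -> None:
        self.saved: dict[str, str] = {}

    def consume(self, url: str, body: str) -> None:
        self.saved[url] = body


class TripleStoreLoader:
    """Stand-in for a new consumer that would load triples into a store."""

    def __init__(self) -> None:
        self.loaded: list[str] = []

    def consume(self, url: str, body: str) -> None:
        self.loaded.append(url)


cache, store = CacheWriter(), TripleStoreLoader()
pipeline = DelegatingConsumer([cache, store])
pipeline.consume("http://example.org/foaf.rdf", "<rdf:RDF/>")
print(sorted(cache.saved), store.loaded)
```

The point of the pattern is that adding triple store support means writing one more consumer and adding it to the list, without touching the fetching code.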

Anyway, if you do find this useful, or want help getting it up and running and/or integrated into your own applications then please feel free to get in touch.

At the moment I'm noodling with an alternate version which uses asynchronous messaging via JMS as the basic Scutter kernel. Matt Biddulph's Crawling the Semantic Web paper mentions using asynchronous messaging to provide co-ordination between a Scutter and applications interested in RDF data, so I may have a crack at something in that vein.

Posted by ldodds at 11:38 PM