Development and publication of XML vocabularies is simply web publishing, so web publishing tools and techniques can be adapted to this area.

Problem Statement

  • No search engines equivalent to Google
  • No directories
  • Lots of schema repositories, but no comparison of vocabularies produced by separate communities.
  • Lack of editorial input in vocabulary descriptions
  • No usage statistics -- is a vocabulary declining in use or growing? Is there room for a Netcraft equivalent?

The good news is that most of the information is already available, hence this project to tie it together.

Data Model

RDF. Two classes, Document and Namespace, and a number of relations from Document to Namespace:

  • quotes -- quoted in document, but not declared or used
  • declares -- as per NS spec
  • uses -- used, but not as root
  • usesAsRoot -- uses as root element, and possibly others

In future will add: isSchemaFor and isTransformationFor

Small number of properties: Document has only 2 (wellformedxml and lastvisit), Namespace has none.

Simple RDF description (Schema), no containers, properties included directly in their classes.
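
As an illustration, the schema described above can be written out as plain triples. This is only a sketch under stated assumptions: the class, relation and property names come from these notes, but the `EX` base URI, the `schema_triples` helper and the choice of `rdfs:Literal` as the range of the two Document properties are hypothetical and not taken from the project.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
# Hypothetical base URI for the vocabulary; the notes do not give one.
EX = "http://example.org/nsdirectory#"

CLASSES = ("Document", "Namespace")
# Relations pointing from a Document to a Namespace.
RELATIONS = ("quotes", "declares", "uses", "usesAsRoot")
# The only two properties; both belong to Document, Namespace has none.
DOCUMENT_PROPERTIES = ("wellformedxml", "lastvisit")


def schema_triples():
    """Return the schema as (subject, predicate, object) tuples."""
    triples = [(EX + c, RDF + "type", RDFS + "Class") for c in CLASSES]
    for name in RELATIONS:
        triples += [
            (EX + name, RDF + "type", RDF + "Property"),
            (EX + name, RDFS + "domain", EX + "Document"),
            (EX + name, RDFS + "range", EX + "Namespace"),
        ]
    for name in DOCUMENT_PROPERTIES:
        triples += [
            (EX + name, RDF + "type", RDF + "Property"),
            (EX + name, RDFS + "domain", EX + "Document"),
            # Assumption: both properties hold simple literal values.
            (EX + name, RDFS + "range", RDFS + "Literal"),
        ]
    return triples
```

No containers are involved: every property hangs directly off its class, which keeps queries over the data simple.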

Information Gathering

Uses Larbin, an open source crawler written in C++.

Handles link detection and duplicate management. Callbacks are invoked when a page has been retrieved.

Docs are first processed as text, with regexps spotting xmlns, so the approach works even if a doc is not well-formed. A test parse is then done with libxml; if it succeeds, a finer analysis is done with an XSLT transformation using libxslt.
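
The two-stage analysis can be sketched roughly as follows. This is not the project's code: Python's standard `xml.etree.ElementTree` stands in for libxml, and a small loop stands in for the XSLT step. The `analyse` function, the regular expression, and the rule that a document which fails to parse only "quotes" its namespaces are all assumptions made for illustration.

```python
import io
import re
import xml.etree.ElementTree as ET

# Matches xmlns="..." and xmlns:prefix='...' anywhere in raw text.
XMLNS_RE = re.compile(r"""xmlns(?::[\w.-]+)?\s*=\s*(["'])(.*?)\1""")


def _ns_of(name):
    """Extract the namespace URI from an ElementTree '{uri}local' name."""
    return name[1:].split("}", 1)[0] if name.startswith("{") else None


def analyse(text):
    """Return (wellformed, {namespace_uri: relation}) for one document."""
    # Stage 1: plain-text scan, which works even on broken markup.
    spotted = {m.group(2) for m in XMLNS_RE.finditer(text) if m.group(2)}

    # Stage 2: test parse, recording declarations and element/attribute use.
    # Simplification: the text is re-encoded as UTF-8 before parsing.
    declared, used, root_ns, seen_root = set(), set(), None, False
    try:
        source = io.BytesIO(text.encode("utf-8"))
        for event, payload in ET.iterparse(source, events=("start-ns", "start")):
            if event == "start-ns":
                prefix, uri = payload
                if uri:
                    declared.add(uri)
                continue
            names = [payload.tag, *payload.attrib]
            used.update(ns for ns in map(_ns_of, names) if ns)
            if not seen_root:
                root_ns, seen_root = _ns_of(payload.tag), True
        wellformed = True
    except ET.ParseError:
        # Assumption: without a successful parse, everything is only "quoted".
        wellformed, declared, used, root_ns = False, set(), set(), None

    # Report the strongest relation that applies to each namespace.
    relations = {}
    for ns in spotted | declared | used:
        if ns == root_ns:
            relations[ns] = "usesAsRoot"
        elif ns in used:
            relations[ns] = "uses"
        elif ns in declared:
            relations[ns] = "declares"
        else:
            relations[ns] = "quotes"
    return wellformed, relations
```

Running the text scan first means that a page which fails the parse still contributes its namespaces, which is what makes sources such as HTML mail archives usable.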

When a namespace is found, an RDF document is generated and stored, containing the RDF description of the document and its relations to the namespaces discovered in it.
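
A generated RDF document might look something like the output of the sketch below. The element and property names follow the data model in these notes, but the `EX` namespace URI, the `document_to_rdfxml` function and the exact RDF/XML layout are assumptions; the real crawler's output format is not shown in the notes.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Hypothetical namespace for the Document/Namespace vocabulary.
EX = "http://example.org/nsdirectory#"

ET.register_namespace("rdf", RDF)
ET.register_namespace("ex", EX)


def document_to_rdfxml(doc_uri, wellformed, lastvisit, relations):
    """Serialise one crawled document and its namespace relations.

    `relations` maps a namespace URI to a relation name such as "uses".
    """
    root = ET.Element(f"{{{RDF}}}RDF")
    doc = ET.SubElement(root, f"{{{EX}}}Document", {f"{{{RDF}}}about": doc_uri})
    ET.SubElement(doc, f"{{{EX}}}wellformedxml").text = "true" if wellformed else "false"
    ET.SubElement(doc, f"{{{EX}}}lastvisit").text = lastvisit
    for ns_uri, relation in sorted(relations.items()):
        # Each relation is a property of the document pointing at the namespace.
        ET.SubElement(doc, f"{{{EX}}}{relation}", {f"{{{RDF}}}resource": ns_uri})
    for ns_uri in sorted(relations):
        # Namespace nodes carry a type only; they have no properties of their own.
        ET.SubElement(root, f"{{{EX}}}Namespace", {f"{{{RDF}}}about": ns_uri})
    return ET.tostring(root, encoding="unicode")
```

For example, `document_to_rdfxml("http://example.org/a.xml", True, "2002-05-20", {"urn:a": "uses"})` yields one `ex:Document` node with two literal properties and one `ex:uses` link, followed by an `ex:Namespace` node for `urn:a`.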

Docs can then be loaded in an RDF database.

Because the crawler can handle documents that are not well-formed, it can even spot namespaces in mailing list archives (e.g. XML-DEV).

Note: backlinks from XML Schema documents? Standard means for publishing these? Can't all be captured by a web crawler.

RDF Database

No description of the namespaces themselves; a batch job needs to be run to add types to the documents.

Uses 4Suite and Versa

Don't add description directly to avoid redundancy...?

Statistics

Only very small sample used initially.

  • Overall stats and Top 10 Namespaces

Very few documents are namespace-aware (~3%). Of these only 1% are well-formed.

XHTML 1.0 (2.1%), MS Office (0.3%), HTML 4.0, VML, RDF, XLink, MS Word, XSLT, Saxon, Uuid (another MS Office namespace; 0.1%)
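
Figures of this kind could be derived with a simple tally, sketched below. The `namespace_stats` function and its input shape are hypothetical, and the sketch assumes that the per-namespace percentages are taken over all crawled documents, which the notes do not state explicitly.

```python
from collections import Counter


def namespace_stats(records, top=10):
    """Summarise crawl results.

    `records` is an iterable of (wellformed, namespace_uris) pairs,
    one pair per crawled document.
    """
    total = aware = aware_and_wellformed = 0
    counts = Counter()
    for wellformed, namespaces in records:
        total += 1
        unique = set(namespaces)
        if unique:
            aware += 1
            if wellformed:
                aware_and_wellformed += 1
            # Count each namespace once per document.
            counts.update(unique)

    def pct(n):
        return 100.0 * n / total if total else 0.0

    return {
        "total": total,
        "namespace_aware": aware,
        "aware_and_wellformed": aware_and_wellformed,
        # Assumption: percentages are relative to all documents crawled.
        "top": [(ns, pct(n)) for ns, n in counts.most_common(top)],
    }
```

Repeating the same tally on later crawls would show whether a given namespace is growing or declining in use.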

  • Detailed Statistics
  1. Only half of XHTML docs quote NS -- high because docs aren't well-formed.
  2. None of the MS Office documents are well-formed.
  3. HTML 4.0 namespace was often referred to before publication of XHTML 1.0 -- namespace leaks (not excluded in XSLT) so can determine how a document was constructed.
  4. RDF namespace occurrences are mainly leaks.

Would be useful to do this to track usage.

Directory

Could publish data as a directory, e.g. as RDDL.

Could usefully adorn the basic stats with human commentary or additional resources such as stylesheets, schemas, etc.

Note: there's a lot of XML which won't necessarily be harvestable in this way, e.g. SOAP, OAI-PMH.

Could also add news, e.g. xmlhack XLink channel.

Standard search engine would give useful results without additional work.

Could use a Topic Map to relate resources, perhaps leveraging the OASIS XMLVoc TC work.

This page (revision-1) was last changed on 21-Aug-2002 18:24 by unknown

Referenced by
XMLEurope2002