Development and publication of XML vocabularies is simply web publishing, so web publishing tools and techniques can be adapted to this area.
Problem Statement
- No search engines equivalent to Google
- No directories
- Lots of schema repositories, but no comparison of vocabularies produced by disjoint communities.
- Lack of editorial input in vocabulary descriptions
- No usage statistics -- is a vocabulary declining in use or growing? Is there room for a Netcraft equivalent?
The good news is that most of the information is already available, hence this project to tie it together.
Data Model
RDF. Two classes, Document and Namespace, and a number of relations:
- quotes -- quoted in document, but not declared or used
- declares -- as per NS spec
- uses -- used, but not as root
- usesAsRoot -- uses as root element, and possibly others
In future will add: isSchemaFor and isTransformationFor
Small number of properties: Document has only 2 (wellformedxml and lastvisit), Namespace has none.
Simple RDF description (Schema), no containers, properties included directly in their classes.
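As a rough illustration of the model above, the two classes and four relations could be written as plain Python types. This is only a sketch: the class layout, the `Statement` type, the example URL and the timestamp format are assumptions, not part of the project's RDF Schema.

```python
from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    QUOTES = "quotes"            # URI quoted in the document, not declared or used
    DECLARES = "declares"        # xmlns declaration as per the Namespaces spec
    USES = "uses"                # used, but not as root
    USES_AS_ROOT = "usesAsRoot"  # namespace of the root element


@dataclass(frozen=True)
class Namespace:
    uri: str  # a Namespace has no properties of its own


@dataclass(frozen=True)
class Document:
    url: str
    wellformedxml: bool
    lastvisit: str  # timestamp format is an assumption


@dataclass(frozen=True)
class Statement:
    # Hypothetical helper: one Document-to-Namespace relation.
    document: Document
    relation: Relation
    namespace: Namespace


doc = Document("http://example.org/page.html", True, "2002-05-21T10:00:00Z")
xhtml = Namespace("http://www.w3.org/1999/xhtml")
statements = [
    Statement(doc, Relation.DECLARES, xhtml),
    Statement(doc, Relation.USES_AS_ROOT, xhtml),
]
```

Properties sit directly on the classes, mirroring the "no containers" choice.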
Information Gathering
C++ open source crawler called Larbin
Handles link detection and duplicate management. Callbacks fire when a page has been retrieved.
Docs first processed as text, with regexps spotting xmlns, so can be used if not well-formed. Then test parse done with libxml, if successful then a finer analysis done with an XSLT using libxslt.
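The two-pass idea (text scan first, parse second) might look like the sketch below. It swaps in Python's standard `re` and `xml.etree.ElementTree` for the regexps, libxml and libxslt used by the project. The function names and the reading of "uses" as "any namespace other than the root's" are assumptions. Detecting "quotes" is left out, since it needs a list of known namespace URIs.

```python
import re
import xml.etree.ElementTree as ET

# Matches xmlns="..." and xmlns:prefix="..." in raw text, well-formed or not.
XMLNS_DECL = re.compile(r"""\bxmlns(?::[\w.-]+)?\s*=\s*(["'])(.*?)\1""")


def declared_in_text(text):
    """Namespace URIs found by pattern matching alone; no parse is needed."""
    return {m.group(2) for m in XMLNS_DECL.finditer(text) if m.group(2)}


def namespace_of(name):
    """ElementTree writes qualified names as '{uri}local'."""
    return name[1:].split("}", 1)[0] if name.startswith("{") else None


def analyse(text):
    result = {"wellformedxml": True, "declares": declared_in_text(text),
              "usesAsRoot": set(), "uses": set()}
    try:
        root = ET.fromstring(text)  # the test parse
    except ET.ParseError:
        result["wellformedxml"] = False
        return result  # keep the text-level findings only
    root_ns = namespace_of(root.tag)
    if root_ns:
        result["usesAsRoot"].add(root_ns)
    # Finer analysis: namespaces of every element and attribute name.
    for element in root.iter():
        for name in [element.tag, *element.attrib]:
            ns = namespace_of(name)
            if ns and ns != root_ns:
                result["uses"].add(ns)
    return result
```

A page that fails the parse still yields its declared namespaces, which is what lets the crawler report on tag soup.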
When NS found, RDF document is generated and stored with the RDF description of the document and the relations to the namespaces discovered in the document.
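A minimal sketch of that generation step, assuming N-Triples as the output syntax (chosen here for brevity; the notes do not say which serialization the project used). The vocabulary URI `http://example.org/ns-survey#` and the function name `describe` are hypothetical. Only properties and relations are written, with no type statements.

```python
VOCAB = "http://example.org/ns-survey#"  # hypothetical vocabulary URI
XSD_BOOLEAN = "http://www.w3.org/2001/XMLSchema#boolean"


def literal(value):
    """Quote a string as an N-Triples literal (backslash and quote escaped)."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def describe(url, wellformed, lastvisit, relations):
    """relations maps a relation name (e.g. 'declares') to namespace URIs."""
    flag = str(wellformed).lower()
    lines = [
        f'<{url}> <{VOCAB}wellformedxml> "{flag}"^^<{XSD_BOOLEAN}> .',
        f"<{url}> <{VOCAB}lastvisit> {literal(lastvisit)} .",
    ]
    # One triple per document-to-namespace relation, in a stable order.
    for relation, uris in sorted(relations.items()):
        for uri in sorted(uris):
            lines.append(f"<{url}> <{VOCAB}{relation}> <{uri}> .")
    return "\n".join(lines) + "\n"
```

Sorting keeps the output stable between crawls, so unchanged pages produce identical descriptions.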
Docs can then be loaded in an RDF database.
Because the crawler can handle documents that are not well-formed, it can even spot namespaces in mailing list archives (e.g. XML-DEV).
Note: backlinks from XML Schema documents? Standard means for publishing these? Can't all be captured by a web crawler.
RDF Database
No description of namespaces; a batch job is needed to add types to the documents.
Uses 4Suite and Versa
Don't add description directly to avoid redundancy...?
Statistics
Only very small sample used initially.
- Overall stats and Top 10 Namespaces
Very few are namespace aware (~3%). Of these only 1% are well-formed.
XHTML 1.0 (2.1%), MS Office (0.3%), HTML 4.0, VML, RDF, XLink, MS Word, XSLT, Saxon, Uuid (another MS Office namespace; 0.1%)
- Detailed Statistics
- Only half of XHTML docs quote NS -- high because docs aren't well-formed.
- None of the MS Office documents are well-formed.
- HTML 4.0 namespace was often referred to before publication of XHTML 1.0 -- namespace leaks (not excluded in XSLT) so can determine how a document was constructed.
- RDF stuff mainly leaks
Would be useful to do this to track usage.
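Figures like the top 10 above amount to counting, for each namespace, the share of crawled documents that mention it. A sketch of that calculation follows. The function name and input shape are assumptions, and the project itself ran its queries against the RDF database rather than in Python.

```python
from collections import Counter


def top_namespaces(documents, n=10):
    """documents: one set of namespace URIs per crawled page (empty if none).

    Returns (uri, percentage of all documents) pairs, most frequent first.
    """
    documents = list(documents)
    total = len(documents)
    if total == 0:
        return []
    counts = Counter(uri for namespaces in documents for uri in namespaces)
    return [(uri, 100.0 * hits / total) for uri, hits in counts.most_common(n)]
```

Running the same count over successive crawls would show whether a vocabulary is growing or declining.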
Directory
Could publish data as a directory, e.g. as RDDL.
Could usefully adorn the basic stats with human commentary or additional resources such as stylesheets, schemas, etc.
Note: there's a lot of XML which won't necessarily be harvestable in this way, e.g. SOAP, OAI-PMH.
Could also add news, e.g. xmlhack XLink channel.
Search
Standard search engine would give useful results without additional work.
Could use a Topic Map to relate resources, perhaps leveraging the OASIS XMLVoc TC work.
