Kluwer's new approach to creating back-of-book indexes

Some comments on standards, tools.

Still mainly an SGML shop.

Legal information review: statutes/sections; jurisprudence; journal articles; books and loose-leaf publications (section-wise commentary)

Classical indexing method

  • one indexer generates the index for an entire book
  • no restriction on choice of terms
  • limited reuse of terms
  • no restrictions placed by system
  • no restriction on the internal relations in an index
  • see also references

Result: no way to join indexes, as needed when publishing electronically (CD ROM, Internet)

Solutions:

Could use Topic Maps, but didn't use them because they're most applicable when there is no control over the indexes.

In this situation, can adopt a centralised approach instead.

  • Central word list
  • Relations in central list cause 'see' and 'see also' references to be generated
  • Thesaurus
  • Terms can be combined into strings.

General thesaurus relationships: narrower, broader, etc.
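
A minimal sketch of how relations in a central list could drive reference generation. The term names, the relation codes (USE, RT, NT, BT) and the rule that USE yields 'see' while every other relation yields 'see also' are assumptions for illustration, not details from the talk.

```python
from collections import defaultdict

# Hypothetical central word list: each entry is (term, relation, target).
# "USE" marks a non-preferred synonym; RT/NT/BT are related/narrower/broader.
RELATIONS = [
    ("tenancy", "USE", "lease"),
    ("lease", "RT", "rent"),
    ("lease", "NT", "sublease"),
    ("sublease", "BT", "lease"),
]


def cross_references(relations):
    """Derive 'see' and 'see also' references from thesaurus relations."""
    refs = defaultdict(lambda: {"see": set(), "see also": set()})
    for term, relation, target in relations:
        if relation == "USE":
            # A non-preferred term sends the reader to the preferred one.
            refs[term]["see"].add(target)
        else:
            # Assumption: every other relation becomes a 'see also' pointer.
            refs[term]["see also"].add(target)
    # Drop empty reference kinds and sort targets for stable output.
    return {
        term: {kind: sorted(targets) for kind, targets in kinds.items() if targets}
        for term, kinds in refs.items()
    }


if __name__ == "__main__":
    for term, kinds in sorted(cross_references(RELATIONS).items()):
        for kind, targets in kinds.items():
            print(f"{term}: {kind} {', '.join(targets)}")
```

Because the references are derived rather than typed in by the indexer, changing a relation in the central list changes every generated index that uses the term.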

Keyword strings

Often a single term is not enough; multiple terms are more accessible to the user.

  • main term -- from thesaurus
  • zero or more connected strings
    • connector (by, of)
    • keyword string -- nesting of keyword string

A connected string is logically connected to the main term through its connector.
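
The structure above could be modelled roughly as follows. This is a sketch only: the class names, field names and the example string are hypothetical, and rendering with single spaces is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ConnectedString:
    connector: str            # e.g. "by", "of"
    string: "KeywordString"   # the nested keyword string


@dataclass
class KeywordString:
    main_term: str            # taken from the thesaurus
    connected: List[ConnectedString] = field(default_factory=list)

    def render(self) -> str:
        """Flatten the nested string into readable text, keeping order."""
        parts = [self.main_term]
        for item in self.connected:
            parts.append(f"{item.connector} {item.string.render()}")
        return " ".join(parts)

    def terms(self) -> List[str]:
        """All thesaurus terms used in this string, outermost first."""
        found = [self.main_term]
        for item in self.connected:
            found.extend(item.string.terms())
        return found


if __name__ == "__main__":
    example = KeywordString(
        "termination",
        [ConnectedString("of", KeywordString(
            "lease", [ConnectedString("by", KeywordString("landlord"))]))],
    )
    print(example.render())
    print(example.terms())
```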

New concept

  • build thesaurus of related terms
  • build keyword string
  • link to information objects
  • to make an index for a set of objects, extract all linked keyword strings
  • now have all used terms, and hence search entries, and related search entries
  • duplicate each keyword string under each of its keywords

Result is a Keyword Out of Context index (KWOC)
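
The generation step can be sketched as below. The link records, object ids and function name are hypothetical; the sketch assumes each link already carries the rendered keyword string and the list of thesaurus terms it contains.

```python
from collections import defaultdict

# Hypothetical links: (keyword string, thesaurus terms in it, linked object id).
LINKS = [
    ("termination of lease by landlord",
     ["termination", "lease", "landlord"], "art-12"),
    ("termination of lease by landlord",
     ["termination", "lease", "landlord"], "case-7"),
    ("rent of premises", ["rent", "premises"], "art-3"),
]


def build_kwoc(links):
    """File every keyword string under each keyword it contains."""
    index = defaultdict(lambda: defaultdict(set))
    for string, terms, obj in links:
        for term in terms:
            # The whole string is repeated under each of its keywords.
            index[term][string].add(obj)
    # Sort headings, strings and locators so the output is reproducible.
    return {
        term: {s: sorted(objs) for s, objs in sorted(strings.items())}
        for term, strings in sorted(index.items())
    }


if __name__ == "__main__":
    for term, strings in build_kwoc(LINKS).items():
        print(term)
        for string, objs in strings.items():
            print(f"    {string}: {', '.join(objs)}")
```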

Pros

  • automatic index generation
  • index merging
  • reuse of terms and kw strings
  • reuse of indexes (regeneration)
  • cost effective when used frequently (ROI)
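
Index merging becomes straightforward once all indexes draw on the same terms and keyword strings. A sketch, assuming each index is a mapping of term to keyword string to list of locators (that shape and the locator format are assumptions, not the deployed system's):

```python
def merge_indexes(*indexes):
    """Merge indexes that share one vocabulary into a single index."""
    merged = {}
    for index in indexes:
        for term, strings in index.items():
            entry = merged.setdefault(term, {})
            for string, locators in strings.items():
                # Identical strings under the same term collapse into one
                # entry whose locators are the union of both sources.
                combined = set(entry.get(string, [])) | set(locators)
                entry[string] = sorted(combined)
    return merged


if __name__ == "__main__":
    book1 = {"lease": {"termination of lease": ["book1-p12"]}}
    book2 = {
        "lease": {"termination of lease": ["book2-p40"],
                  "lease of land": ["book2-p3"]},
        "rent": {"rent": ["book2-p9"]},
    }
    print(merge_indexes(book1, book2))
```

With freely chosen terms, as in the classical method, the two "lease" entries would have no guarantee of matching at all.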

Cons

  • high initial costs -- thesauri and kw strings
  • ordered, nested strings complicate the system
  • indexer feels restricted
  • only applicable with a tool: a 'string management system'

Management System

  • link occurrence to kw string
  • create new kw strings
  • add terms
  • add relations between terms
  • editorial approach for terms and relations (states, workflows)
  • simultaneous editing
  • granular locking
  • standards, minimise programming overhead

Need an editorial tool because relationships are complex and verbose, etc.

A better approach than adding markup to content is to separate out the different aspects.

Thesaurus Model

Problems with the usual hierarchical XML model for thesauri: need multiple parallel hierarchies. Can be done, but very verbose (have to duplicate structures).

Instead, use a flattened structure with ID/IDREF constraints: a normalized model that's very close to a DB schema.
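
A flattened thesaurus might look something like the sketch below. The element and attribute names are hypothetical (the actual DTD was not shown in these notes). Python's `xml.etree.ElementTree` does not validate against a DTD, so the sketch checks the references by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical flat model: terms carry ids, relations point at them.
# One term can sit in several hierarchies without duplicating structure.
THESAURUS_XML = """
<thesaurus>
  <term id="t1">lease</term>
  <term id="t2">sublease</term>
  <term id="t3">contract</term>
  <relation type="narrower" from="t1" to="t2"/>
  <relation type="broader" from="t1" to="t3"/>
</thesaurus>
"""


def load_thesaurus(xml_text):
    """Return (terms by id, resolved relations), rejecting dangling refs."""
    root = ET.fromstring(xml_text)
    terms = {t.get("id"): t.text for t in root.findall("term")}
    relations = []
    for rel in root.findall("relation"):
        source, target = rel.get("from"), rel.get("to")
        for ref in (source, target):
            if ref not in terms:
                raise ValueError(f"dangling reference: {ref}")
        relations.append((terms[source], rel.get("type"), terms[target]))
    return terms, relations


if __name__ == "__main__":
    _, relations = load_thesaurus(THESAURUS_XML)
    for source, kind, target in relations:
        print(f"{source} --{kind}--> {target}")
```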

When it comes to extending the model, problems with DTD because of limitations of ID/IDREF constraints -- db better here

Keyword String Model

Again, normalization led to using ID/IDREFs

Need ordering in strings, which requires adding sequence numbers and ORDER BY. Insertions then become trickier. As does retrieval, which needs a recursive SQL statement.
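
To illustrate the kind of recursive statement involved, here is a sketch using SQLite from Python. The table and column names are hypothetical, and the zero-padded path used for ordering is just one possible technique, not necessarily what was used in the system described.

```python
import sqlite3

SCHEMA = """
CREATE TABLE kw_string (id INTEGER PRIMARY KEY, main_term TEXT NOT NULL);
CREATE TABLE connected (
    parent_id INTEGER NOT NULL REFERENCES kw_string(id),
    seq       INTEGER NOT NULL,  -- explicit order among siblings
    connector TEXT NOT NULL,
    child_id  INTEGER NOT NULL REFERENCES kw_string(id),
    PRIMARY KEY (parent_id, seq)
);
"""

# Walk the nesting recursively; the zero-padded path gives pre-order output.
WALK = """
WITH RECURSIVE walk(id, connector, path) AS (
    SELECT id, '', '' FROM kw_string WHERE id = ?
    UNION ALL
    SELECT c.child_id, c.connector, w.path || printf('%04d.', c.seq)
    FROM walk AS w JOIN connected AS c ON c.parent_id = w.id
)
SELECT w.connector, k.main_term
FROM walk AS w JOIN kw_string AS k ON k.id = w.id
ORDER BY w.path
"""


def render(conn, string_id):
    """Rebuild the readable keyword string for one root id."""
    rows = conn.execute(WALK, (string_id,)).fetchall()
    return " ".join(f"{connector} {term}".strip() for connector, term in rows)


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO kw_string VALUES (?, ?)",
        [(1, "termination"), (2, "lease"), (3, "landlord"), (4, "premises")],
    )
    # Rows are inserted out of order on purpose: only seq defines the order.
    conn.executemany(
        "INSERT INTO connected VALUES (?, ?, ?, ?)",
        [(1, 2, "by", 3), (2, 1, "of", 4), (1, 1, "of", 2)],
    )
    return render(conn, 1)


if __name__ == "__main__":
    print(demo())
```

Inserting a new connected string between two siblings means renumbering `seq` for the ones that follow, which is the "trickier insertion" noted above.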

XML superior for nested, ordered data.
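
By contrast, a nested XML representation carries order and nesting in the document itself. A sketch with hypothetical element and attribute names:

```python
import xml.etree.ElementTree as ET

# Hypothetical nested form: document order is the string order, so no
# sequence numbers are needed and nesting is expressed directly.
KW_XML = """
<kwstring term="termination">
  <connected connector="of">
    <kwstring term="lease">
      <connected connector="of"><kwstring term="premises"/></connected>
    </kwstring>
  </connected>
  <connected connector="by"><kwstring term="landlord"/></connected>
</kwstring>
"""


def render(node):
    """Render a <kwstring> element and its nested strings in document order."""
    parts = [node.get("term")]
    for connected in node.findall("connected"):
        child = connected.find("kwstring")
        parts.append(f"{connected.get('connector')} {render(child)}")
    return " ".join(parts)


if __name__ == "__main__":
    print(render(ET.fromstring(KW_XML)))
```

The cost is that the term names are repeated inline, so this form gives up the normalization that the ID/IDREF model provides.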

KWOC Index

'classic' DTD

Content not maintained directly: generated from other 2 models.

5 Ways to Implement

  • db

Hard for hierarchy/order to be grasped by a relational db developer.
Complex programming for recursion.
Tricky to produce a usable interface.

But best for modelling system

  • XML

Modelling issues -- integrity constraints
No granular locking (at least in deployed system)
ID/IDREFs broken when checking out fragments, therefore need additional layer

  • hybrid db and XML

Needs lots of tuning of XML editor to interface with db.

  • ISO TM
  • XTM

Returned to Topic Map model.

Several layers: model, thesaurus, kw strings, object, linking layer

End up with same model at the end.

Needed TMQL (query language) and TMCL (constraint language), but these were not yet developed, so couldn't have a fully standards-compliant app.

ISO TMs allow meaningful element names.
XTM has only topic elements, so has the same problem with nesting/order as the db.

Conclusions

db superior for avoiding redundancy, but mismatch with XML tools, and not good at hierarchy/order.

Mixing hierarchy and relations means no simple application in either SQL or XML. Will always be a more complex solution.

XML with element locking might alleviate some problems, but redundancy/normalisation was the killer.

mailto:dgerth@kluwer.nl

This page (revision-1) was last changed on 21-Aug-2002 18:23 by unknown
Referenced by
XMLEurope2002