Streams, Pools and Reservoirs

I often find it useful to try on different metaphors for application design and architecture. The stock set of patterns used in a particular domain is always very useful for communicating the design and intent of a complex application, but I find that experimenting with different approaches is a useful exercise that often helps spark a bit of creativity.
I think this is increasingly true at the moment. When I look around at some of the ongoing debates, it seems we’re trying to grind more and more meaning out of the same terms and concepts, arguing over the finer points of definitions, or coining some fairly horrible neologisms; “prosumer” and (I must admit) GGG spring to mind.

Trying out more intuitive concepts can also help when the discussion is not confined to a technical audience.

Anyway, while digging through some old notes to find some work I did on Scutter (web crawler) configuration for Danny and Sam, I came across the following, which I thought I’d post to clear out the ol’ blog tumbleweeds.

The following attempts to provide a more natural (in the strictest sense) and dynamic conceptualization of web crawling and data aggregation. In this model there are three basic concepts: Streams, Pools and Reservoirs.

Stream

A stream is a flow of data. A stream may be generated by a web crawler (i.e. a Scutter) by crawling a number of sources; it may be generated by the originating site(s) itself, e.g. as a syndication feed. A stream may also be created by the collective users of a system.

Streams are therefore the channels along which data flows.

Pool

A pool is a collection of data that is fed by one or more Streams. Just like a mill-pond is created for a specific purpose, a pool is generally created to fulfill a specific need, driving a single application.

A pool may be a permanent feature of the information landscape or it may be more ephemeral. For example a search engine or blog aggregator creates a permanent pool of documents, whereas a web cache, or the working set of data inside an interactive application like the Tabulator, is a temporary data Pool.

Reservoir

A reservoir is a larger Pool of data. Like a pool, a reservoir is fed by streams. But unlike a pool, a reservoir supports a community: the data it contains may be used for a number of different purposes. The community gains the benefit of the shared infrastructure which holds the data, feeds the reservoir, and provides access to the surrounding community.
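
To make the three definitions a little more concrete, here is a rough Python sketch. It assumes a stream can be treated as a plain iterable of items. Every name in it (Item, crawl, Pool, Reservoir, feed, draw) is made up for illustration and does not come from any existing library or from the Scutter work itself. The crawler is a stand-in that fetches nothing.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical names throughout: an illustration, not an existing API.
Item = Dict[str, str]
Stream = Iterable[Item]  # a stream is just a flow of data items


def crawl(sources: List[str]) -> Iterator[Item]:
    """Stand-in for a Scutter: yields one item per source, fetches nothing."""
    for url in sources:
        yield {"source": url, "kind": "crawled"}


class Pool:
    """A collection fed by one or more streams, built for a single purpose."""

    def __init__(self, purpose: str) -> None:
        self.purpose = purpose
        self.items: List[Item] = []

    def feed(self, *streams: Stream) -> None:
        # Drain each stream into the pool.
        for stream in streams:
            self.items.extend(stream)


class Reservoir(Pool):
    """A larger pool whose data is shared by a community of applications."""

    def __init__(self) -> None:
        super().__init__(purpose="shared")

    def draw(self, wanted: Callable[[Item], bool]) -> List[Item]:
        # Each application draws out only the data it cares about.
        return [item for item in self.items if wanted(item)]


# A syndication feed is also a stream; here it is just a list.
feed_stream = [{"source": "http://example.org/feed", "kind": "syndicated"}]

pool = Pool(purpose="blog aggregator")
pool.feed(crawl(["http://example.org/a", "http://example.org/b"]), feed_stream)

reservoir = Reservoir()
reservoir.feed(pool.items)
crawled_only = reservoir.draw(lambda item: item["kind"] == "crawled")
```

The only structural difference between the two classes is draw: a Pool serves the one application it was built for, while a Reservoir lets any number of applications pull out their own view of the shared data.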

I think most existing collections of data on the web fall into the “Pool” category. For example, for me, Flickr is a Pool because it’s primarily driving a single main application, and also because the data it contains is limited to a specific domain (people and photos). Flickr obviously also generates Streams of data (e.g. RSS feeds, or API accesses) which drive other applications. But it would be safe to say that most Flickr apps and, in my opinion, most Web 2.0 apps, are “Stream-oriented” in that they tend to tap into the data flows but don’t create their own pools (or at least few that are permanently available).

There are also some Reservoirs. The one that is the closest fit for how I’ve been thinking about this is the Talis platform. That platform provides individuals with the ability to create and maintain their own Pools of data, while the means to access all of the public data stored in these Pools makes the platform a Reservoir of useful information upon which a variety of applications can draw. To stretch the metaphor to its fullest, the Talis platform provides some useful internet plumbing.

Services like Yahoo Pipes offer other complementary functionality but the storage capability is an important one.

One area that I think is under-specified at the moment is how to describe the construction of a Pool, i.e. how to create or tap into a number of Streams, and how to process those Streams in order to create a data aggregation. Which brings me back to those notes I was hunting for. More on that another time.
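
In the meantime, here is a purely hypothetical Python sketch of what such a description might look like: a small declarative structure naming the Streams to tap and the processing steps to apply, plus a function that turns it into a Pool. The format, the stream names and the step names are all assumptions made up for this example; they are not taken from the Scutter configuration notes.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Item = Dict[str, str]

# Hypothetical description: which streams to tap and how to process them.
pool_description = {
    "streams": ["crawl", "feed"],
    "steps": ["drop_untitled", "dedupe"],
}

# Stand-in streams; real ones would crawl sites or read syndication feeds.
STREAMS: Dict[str, List[Item]] = {
    "crawl": [{"id": "1", "title": "A"}, {"id": "2", "title": ""}],
    "feed": [{"id": "1", "title": "A"}, {"id": "3", "title": "C"}],
}


def drop_untitled(items: Iterable[Item]) -> Iterator[Item]:
    """Filter out items that have an empty title."""
    return (item for item in items if item["title"])


def dedupe(items: Iterable[Item]) -> Iterator[Item]:
    """Keep only the first item seen for each id."""
    seen = set()
    for item in items:
        if item["id"] not in seen:
            seen.add(item["id"])
            yield item


STEPS: Dict[str, Callable[[Iterable[Item]], Iterator[Item]]] = {
    "drop_untitled": drop_untitled,
    "dedupe": dedupe,
}


def build_pool(description: Dict[str, List[str]]) -> List[Item]:
    """Merge the named streams, then apply each processing step in order."""
    items: Iterable[Item] = (
        item for name in description["streams"] for item in STREAMS[name]
    )
    for step in description["steps"]:
        items = STEPS[step](items)
    return list(items)


pool = build_pool(pool_description)
```

Keeping the description as plain data, separate from the code that runs it, means the same Pool could in principle be rebuilt elsewhere or shared with a non-technical audience.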