Nicholas Carroll

Doug Engelbart's Open Hyperdocument System proposes Dynamic Knowledge Repositories (DKRs). Among other features, DKRs will need powerful information retrieval capabilities. (Since someone will abbreviate this process by the end of the week, we might as well do so right now, and call it DKR/IR.)

The explosion of the World Wide Web has, one hopes, taught most thoughtful observers something that was known to information science people decades ago: information retrieval does not scale well. Increasing a search base by an order of magnitude can create a whole new game. In fact, decreasing a search base by an order of magnitude can create a whole new game.

Over the last few years of e-commerce web site architecture, I have spent several thousand hours studying electronic search from various directions: how search software works, how users think, and how to structure search logic and data structures to meet users in the middle. While my mathematics are occasionally naive, these thoughts come from fairly hard lessons in what works and what does not.

Why am I writing this? There will be several small papers flowing off this one. Frankly, I don't expect many OHS developers to read them all. The point of the exercise is to get IR requirements on the radar screen for those who are doing deep thinking about lower-level architecture. When finished, I'll write a synopsis.

(Note: "information retrieval," as IR people use it, means finding information. Actually retrieving the information onto your computer screen is called "information access." An unfortunate choice of terms, but there we are.)
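A toy calculation makes the scaling point concrete. It is an illustration only, not a measurement: it simply assumes a common query term that matches 1% of documents, and shows that what changes with scale is the size of the result set, not just the cost of matching.

    # Toy illustration of IR scaling. Assumption (mine, for illustration
    # only): a common query term matches 1% of the documents in the base.
    MATCH_RATE = 0.01

    for n_docs in (10_000, 100_000, 1_000_000, 10_000_000, 100_000_000):
        matches = int(n_docs * MATCH_RATE)
        print(f"{n_docs:>11,} documents -> {matches:>9,} hits for one common term")

    # At 10,000 documents a user can skim 100 hits. At 100,000,000 documents
    # the same query returns a million hits, and ranking -- not matching --
    # becomes the whole problem: a different game, not just a slower one.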
The Basis of WWW Information Retrieval

Very recently, in a galaxy which now seems far away, high-tech startup companies were going to make the entire Internet searchable with "95% relevance." Or was it 98%? In either case, the standard claim was "We're already at 70%!" Then they hit what a leading information science researcher once called "the slippery slope to artificial intelligence." Algorithms (the word has lost most of its meaning in the last few years) lost much of their sheen. Natural Language Processing (NLP) is likewise tarnished, and most NLP operations now have a few dozen workers in the back room, tweaking searches that the computers could not understand. Likewise for the extremely lowbrow AI of autoresponders: the more astute organizations have discovered that the proper response to "Stop sending catalogs!" is not "Thank you! Your catalog is on the way!"

Present Search Tools

Search tools existed long before the Internet spawned the WWW. Of the major types (hierarchies, keywords, and databases), it is the first two that have appeared most often on the Web. And since the Web is familiar to most readers, I'll start there:

- Hierarchies (directories)
- Keywords (search engines)

Database searches have not made a big entry onto the Web. I imagine this is because a) it requires at least a little user sophistication to write a query, no matter how good the interface is, and b) web sites are typically built with a mixture of design, Perl, and Java skills; when database knowledge is involved, it is usually buried at the back end. (Note: many search engines use tabling methods to speed search, but there is no database mindset at the user level.)

Current Incarnations on the Web

Today we can see several variants:

- Companies that kept hammering at the mathematics, Google.com probably being the most successful.
- Those who turned their backs on the mathematics and turned instead to the manipulation of data structures, indexing, and cataloging, variously calling it "content synthesis," "knowledge integration," and other terms. These are typically startups subcontracting to major organizations. AskJeeves.com is the most visible of the consumer-oriented ones.
- A third group moved towards established data schemes, adopting classification systems such as Library of Congress cataloging. I have been through this, and it rarely works. Catalogers become rigid over time, and gradually fall out of touch with users. Amazon.com relies heavily on catalogers, and their search suffers heavily as a result.
- A fourth and much smaller group has synthesized mathematics with their own sorting and cataloging methods. NorthernLight.com is a leading example.
- A fifth group of more complex mutations, which I won't be dealing with right now, includes search algorithms, editors, user voting, and subject-clumping schemes such as "themes" or topic maps. There are dozens (hundreds?) of these. Alexa.com and Oingo.com are two of the more prominent.

Note: most directories now have a search engine tacked on as an additional resource, and likewise most search engines index the ODP for its more precise cataloging.

The Smaller DKR

In structuring DKR information retrieval, I favor a variant of the fourth approach: synthesizing good algorithms with good data structures, and then allowing some user access to both. A sketch of what that synthesis looks like follows; some opinions about the path to quality DKR/IR follow in the papers described in the next section.
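To make the fourth approach concrete, here is a minimal sketch in Python. It illustrates the general technique only: the tiny corpus, the "category" facet, and the scoring are my own assumptions, not the internals of NorthernLight or any other engine.

    from collections import defaultdict

    # A minimal sketch of synthesizing algorithms with data structures:
    # a keyword inverted index (the mathematics side) combined with a
    # human-assigned category facet (the cataloging side). The corpus,
    # categories, and scoring are illustrative assumptions only.

    docs = {
        1: {"text": "augmenting human intellect with hypertext", "category": "hypertext"},
        2: {"text": "database query optimization for hypertext storage", "category": "databases"},
        3: {"text": "hypertext systems and information retrieval", "category": "hypertext"},
    }

    # Build the inverted index: each term maps to the set of documents
    # containing it. (Real engines add stemming, weights, and positions.)
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for term in doc["text"].split():
            index[term].add(doc_id)

    def search(terms, category=None):
        # Rank documents by how many query terms they contain; the optional
        # category argument exposes the cataloged facet to the user.
        scores = defaultdict(int)
        for term in terms:
            for doc_id in index.get(term, set()):
                scores[doc_id] += 1
        hits = [(doc_id, score) for doc_id, score in scores.items()
                if category is None or docs[doc_id]["category"] == category]
        return sorted(hits, key=lambda hit: -hit[1])

    print(search(["hypertext", "retrieval"]))                        # keyword only
    print(search(["hypertext", "retrieval"], category="hypertext"))  # keyword + catalog

Allowing "some user access to both" then means surfacing the category filter (and perhaps the term weighting) in the interface, rather than burying them at the back end as most web sites do.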
The Layers That Affect Information Retrieval

Moving towards the point, this is the first of a few short papers on retrieving information from DKRs. This "layers" description is a top-down user perspective; it does not precisely reflect software architecture. I'm addressing these in no particular order. Finished papers are linked.
References

Toward High-Performance Organizations: A Strategic Role for Groupware, Douglas C. Engelbart, June 1992.
The Invisible Substrate of Information Science, Marcia J. Bates.
Human, Database, and Domain Factors in Content Indexing and Access to Digital Libraries and the Internet, Marcia J. Bates.
The Memory Palace of Matteo Ricci, Jonathan Spence.
Retrieval Structure Manipulations.
Name Spaces As Tools for Integrating the Operating System Rather Than As Ends in Themselves, Hans Reiser.
Please send comments to Nicholas Carroll.
Email: ncarroll@hastingsresearch.com

http://www.hastingsresearch.com/net/02-dkr-ir-intro.shtml
© 1999-2002 Hastings Research, Inc. All rights reserved.