The first wave of the dot-com revolution failed for numerous reasons.
Greed, arrogance, and ignorance were among the human failings. The dot-coms failed
equally in technical execution, and perhaps the largest single technical failing was
poor information retrieval: people simply couldn’t find what they wanted. In this
paper Marcia J. Bates of UCLA’s information science department delineates some of those
errors.
Nicholas Carroll
After the Dot-Bomb: Getting Information Retrieval Right This Time
Marcia J. Bates, UCLA Department of Information Studies
Date: August 5, 2002
Modified: N/A
Introduction
Using Old-Fashioned Hierarchical Classifications
Succumbing To The "Ontology" Fallacy
Using Standard Dictionaries Or Roget’s Thesaurus For Information Retrieval
Ignoring the Bradford Distribution
Ignoring Size-Sensitivity of Information Retrieval Databases
Getting Human Content Processing Wrong
Ignoring Information Expertise
Conclusion
Introduction
At the height of the 1990’s information technology bubble, an information broker,
researching a question for a client, called me and explained that her client was having a
dispute with another dot-com company over which company had been the first to invent the
idea of "push" technology, i.e., automatically sending information to people in
interest areas they had designated in advance. The goal of the query was to determine that
no third party had had the idea earlier.
I explained to the broker that the idea of "push" technology was first called
"selective dissemination of information," or SDI, and, to my knowledge, had first
been proposed in 1961 (yes, 1961) in an article in the journal American
Documentation by an IBM computer scientist by the name of H.P. Luhn (1961). He worked
out the idea in considerable detail; the only key difference was that the old mainframe
computer would spit out informative postcards to be mailed to customers, rather than sending
the information online, since there was no "online" to use in those days.
I have had many experiences like this one since the Internet burst on the scene in
the 1990's. I have watched as hundreds of millions of dollars have been invested to
re-invent the wheel, often badly. Everybody understands and takes for granted that
there is an expertise needed for the application and use of technology. Unfortunately,
many Web entrepreneurs fail to recognize that there is a parallel expertise needed about
information: collecting it, organizing it, embedding it successfully in information
systems, presenting it intelligently in interfaces, and providing search capabilities that
effectively exploit the statistical characteristics of information and human information
seeking behavior.
"Content" has been treated like a kind of soup that "content providers"
scoop out of pots and dump wholesale into information systems. But it does not work that way.
Good information retrieval design requires just as much expertise about information and
systems of information organization as it does about the technical aspects of systems.
It is also the case that a lot of what one naturally assumes about how people need,
search for, and retrieve information, is wrong; the truth is counter-intuitive.
How people cope with not knowing, with trying to find out, and how they use information
resources, is a complicated and subtle business (Bates, 1998). Likewise, the good information
system solutions for enabling people to search and retrieve information effectively are also
counter-intuitive. Good systems don’t work the way one would assume. Had the
dot-com businesses consulted the research in information science on SDI, they would
have learned that SDI was largely unsuccessful, except in certain specific situations
(Packer and Soergel, 1979). It comes as no surprise, then, that "push" technology
has also largely failed.
In the Internet gold rush, the Web entrepreneurs and the venture capitalists who funded
them all had the same conventional and mistaken ideas about how information
retrieval works. So they made a wonderful match. The company founders and their financial
backers shared a vision for their Internet companies that was wrong-headed and
unproductive in many ways, but, crucially, was the same vision.
In fact, there was an "information industry" already around decades before
the 1990’s. These were the companies that developed and published giant databases
of patent and legal information, biological, chemical and other scientific and humanities
information resources, newspaper and government information databases, etc. Companies with
names like Chemical Abstracts, Infotrac, Inspec, Lexis-Nexis, and Engineering Index.
These organizations had learned the hard way about information systems and information
retrieval. When the dot-com newcomers came along in the 1990's, the established companies
were not about to give away their hard-earned knowledge to the new kids on the block.
The new companies probably would not have listened anyway.
Likewise, in the 1970’s and 1980’s, librarians had also created
multi-million-item online public access library catalogs, when online access was a
brand-new concept, and had developed a tremendous amount of expertise about how to handle
large, messy databases of textual information. In fact, the largest of these catalog databases,
the Online Computer Library Center’s "WorldCat" database, holds over
47 million records from 41,000 libraries worldwide
(http://www.oclc.org/about/).
Yet it has been almost an article of faith in the Internet culture that librarians have
nothing to contribute to this new age.
This author has been researching and consulting in information retrieval system design
for decades (see http://www.gseis.ucla.edu/faculty/bates/).
Described below are some "pet peeves," some problem areas identified in the design
of Web information retrieval to date. These problems are accompanied by suggested solutions,
or, at least, directions to go in to develop solutions for the next round of Web information
retrieval development.
Using Old-Fashioned Hierarchical Classifications
When classifications are used in Internet databases, it is hierarchical classifications
that are almost invariably used. These are in the conventional "tree" shape, a broad
area subdivided, then subdivided again and again, with each possible category contained
within the one above. Librarians invented a better kind of classification decades ago,
called faceted classification. It is too involved to explain in this brief article, but a
good analogy is to say that faceted classification is to hierarchical classification as
relational databases are to hierarchical databases. Most system designers would not
dream of using hierarchical files these days, so why are hierarchical classifications of
information content still being used?
Librarians implemented some faceted classifications during the twentieth century, but the
technology to exploit faceting fully for online systems has become available only recently.
Thus the theory as described in information organization textbooks is generally not fully
adapted to the new technology, but easily can be. See, for example, Rowley and Farrow (2000)
or Ellis and Vasconcelos (1999). A brief comparison of the two types of classification schemes
is provided in Bates (1988).
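As a rough illustration of the difference, the following Python sketch describes each resource along several independent facets and retrieves by any combination of them, rather than filing each resource in a single slot of a tree. The records, the facet names, and the helper functions facet_search and facet_counts are all invented for this example; they are assumptions, not part of any existing system or of the textbooks cited above.

```python
from collections import Counter

# Hypothetical records: each resource is described along independent facets.
RECORDS = [
    {"title": "Dickinson poem list", "subject": "Emily Dickinson",
     "resource_type": "texts", "period": "19th century"},
    {"title": "Dickinson homestead photos", "subject": "Emily Dickinson",
     "resource_type": "images", "period": "19th century"},
    {"title": "Dickinson criticism bibliography", "subject": "Emily Dickinson",
     "resource_type": "bibliography", "period": "19th century"},
    {"title": "Whitman poem list", "subject": "Walt Whitman",
     "resource_type": "texts", "period": "19th century"},
]


def facet_search(records, **wanted):
    """Return titles of records matching every requested facet value."""
    return [
        r["title"]
        for r in records
        if all(r.get(facet) == value for facet, value in wanted.items())
    ]


def facet_counts(records, facet):
    """Count how many records carry each value of one facet."""
    return Counter(r[facet] for r in records if facet in r)


# Any facet, or any combination of facets, can be the entry point.
images_of_dickinson = facet_search(
    RECORDS, subject="Emily Dickinson", resource_type="images"
)
all_text_collections = facet_search(RECORDS, resource_type="texts")
```

In a strict hierarchy, the designer would have to decide in advance whether "subject" sits above "resource_type" or the reverse. In the faceted sketch, neither is privileged, which is the sense of the relational-database analogy above.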
Succumbing To The "Ontology" Fallacy
The hot new term in information organization is "ontology." Everybody’s
inventing, and writing about, ontologies, which are classifications, lists of indexing terms,
or concept term clusters (Communications of the ACM, 2002). But here’s the problem:
"Ontology" is a term taken from philosophy; it refers to the philosophical issues
surrounding the nature of being. If you name a classification or vocabulary an
"ontology" then that says to the world that you believe that you are describing
the world as it truly is, in its essence, that you have found the universe’s one true
nature and organization. But, in fact, we do not actually know how things "really"
are. Put ten classificationists (people who devise classifications) in a room together and
you will have ten views on how the world is organized.
Librarians had to abandon this "one true way" approach to classification in the
early twentieth century. As many are (re-)discovering today, information indexing and
description need to be adjusted and adapted to a myriad of different circumstances. Why,
then, use the misleading term "ontology"?
Apart from philosophical issues, there is another, more important reason to abandon use of
the term. Recorded information does not work the same way the natural world does. Information
is a representation of something else. A book, or a Web site, can mix and match informational
topics any way its developer feels like doing. There’s no such thing as a creature that
is half squirrel and half cat, but there are many mixes of half-squirrel/half-cat
topics in information resources and Web sites. Methods of information indexing have to
recognize what’s distinctive to information, as opposed to classifications of nature,
and design the systems accordingly.
For example, one fan of the poet Emily Dickinson creates a Web site that contains a
one-paragraph biography of her, along with a list of every poem she ever wrote. Another
fan of Dickinson devotes his site to images of the house and community where Dickinson lived.
Still another has collected a bibliography of every book or article written about Dickinson
and her poetry. Elsewhere on the Web are sites that group Dickinson with other nineteenth
century poets or other women poets or other American poets. Beyond just using her name, how
can these sites be usefully indexed so people can find the angle they want to explore about
the poet?
Long-term solutions to the problems of indexing the Web will probably involve multiple
overlapping methods of classifying and indexing knowledge, so that people coming from every
possible angle can find their way to resources that work best for them. Instead of calling it
an "ontology," label the system of description what it really is: a
classification, thesaurus, set of concept clusters, or whatever (see also Soergel, 1999).
Using Standard Dictionaries Or Roget’s Thesaurus For Information Retrieval
These days, many information retrieval research experiments and commercial applications
are being developed that are based on the sensible-seeming assumption that if we want
people to be able to retrieve text, we should build into the system a standard dictionary,
or a Roget’s-type thesaurus (Bartlett’s Roget’s Thesaurus, 1996), or
an experimental mapping of vocabulary such as Wordnet
(http://www.cogsci.princeton.edu/~wn/).
Linguists are particularly prone to this fallacy. Linguists know the most about languages,
and so they assume, quite reasonably, that they should make the decisions about what
linguistic resources to use for information retrieval experiments.
However, linguists are not experts in information retrieval. Through decades of
experimentation, the IR community has learned how ineffectual such conventional dictionary
and thesaurus sources are for real-world information retrieval. Instead, another type of
thesaurus has been developed specifically for the case where the purpose is information
description for later retrieval. These IR thesauri number in the hundreds, and have been
developed for virtually every kind of subject matter. Many "ontologists" are truly
re-inventing the wheel; an already-developed thesaurus for their subject matter
may be hiding in the stacks of their local university library.
Information retrieval thesauri have a different internal logical structure, and contain
words and phrases that are designed to be effective in information retrieval. Take a look at
any one of these IR thesauri, and the differences from basic dictionaries and Roget’s
will be immediately evident. Examples:
- Art and Architecture Thesaurus, (1994)
- Ei Thesaurus (engineering), (1992- )
- Legislative Indexing Vocabulary, (Loo, 1998)
- Los Angeles Times Thesaurus (news), (1987- )
- Thesaurus of Psychological Index Terms, (2001)
There is also a thesaurus for use with free-text searching, where there may or may not
be formal indexing vocabulary ("controlled vocabulary") assigned to the records.
Knapp’s The Contemporary Thesaurus of Search Terms and Synonyms (2000) was
developed by a search expert over decades of experience, and with lots of input from other
searchers.
These are the kinds of thesauri that should be used to index and retrieve from online
information resources.
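To give a feel for that different internal structure, here is a minimal Python sketch, assuming the kind of relationships commonly found in IR thesauri: a preferred term, the synonyms it is "used for," and broader, narrower, and related terms. The entries and the function names build_entry_index and expand_query are hypothetical, made up for illustration, and are not taken from any of the thesauri listed above.

```python
# Hypothetical thesaurus entries, keyed by preferred (indexing) term.
THESAURUS = {
    "Automobiles": {
        "used_for": ["Cars", "Motor cars"],
        "broader": ["Motor vehicles"],
        "narrower": ["Electric automobiles"],
        "related": ["Traffic"],
    },
    "Electric automobiles": {
        "used_for": ["Electric cars"],
        "broader": ["Automobiles"],
        "narrower": [],
        "related": [],
    },
}


def build_entry_index(thesaurus):
    """Map every preferred term and synonym (lowercased) to its preferred term."""
    index = {}
    for preferred, entry in thesaurus.items():
        index[preferred.lower()] = preferred
        for synonym in entry["used_for"]:
            index[synonym.lower()] = preferred
    return index


def expand_query(term, thesaurus):
    """Resolve a searcher's word to the preferred term plus its narrower terms.

    Words the thesaurus does not know are returned unchanged.
    """
    preferred = build_entry_index(thesaurus).get(term.lower())
    if preferred is None:
        return [term]
    return [preferred] + list(thesaurus[preferred]["narrower"])


search_terms = expand_query("cars", THESAURUS)
```

A general dictionary or Roget's-style thesaurus does not tell the system which of several synonyms the indexers actually used; the "used for" mapping in this sketch is what lets a searcher's everyday word reach the indexed records.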
Ignoring the Bradford Distribution
We might call this the "You can’t mess with Mother Nature" Principle.
As they grow in size, databases and other bodies of information follow something called the
Bradford Distribution pretty much no matter what you do. In other words, all sorts of
things related to information do not conform to the standard Gaussian or "normal"
distribution, but rather to the Bradford Distribution. Frequencies of popular queries to a
Web search engine, rates of assignment of indexing terms or classification categories to
documents or sites, sizes of retrieval sets, etc., all conform to Bradford.
There are numerous sources that will explain the mathematics of Bradford (Bookstein, 1990;
Brookes, 1977; Chen and Leimkuhler, 1986). In ordinary language, Bradford distributions do not
have the conventional bulge in the middle, but instead have very long tails. For instance,
typically there will be a few topics that are requested by huge numbers of people at a Web
site, and a huge number of topics requested very little or not at all. Likewise, for retrieval
sets, instead of most of the retrievals containing a middling number of "hits,"
some will contain huge numbers of hits and others few, with not very many retrievals
producing middling numbers.
This Bradford distribution (related to the "Pareto Distribution" in economics)
is extremely robust, and virtually impossible to defeat. Systems have to be designed to
work with the Bradford Distribution, rather than trying to fight it. See discussion
in Bates (1998) and references within that article.
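The shape described above can be seen in a few lines of Python. This is only a sketch on synthetic numbers: it uses a simple 1/rank (Zipf-like) curve as a stand-in for a Bradford-type distribution (the relationship between the two is the subject of the Chen and Leimkuhler reference), and the topic count, the top request count, and the function names are assumptions chosen for illustration.

```python
from statistics import mean, median


def zipf_like_counts(n_topics, top_count):
    """Synthetic request counts: the topic at rank r gets top_count / r requests."""
    return [top_count / rank for rank in range(1, n_topics + 1)]


def share_of_top(counts, fraction):
    """Fraction of all requests that go to the most popular `fraction` of topics."""
    ordered = sorted(counts, reverse=True)
    k = max(1, round(len(ordered) * fraction))
    return sum(ordered[:k]) / sum(ordered)


# Assumed figures: 1,000 topics, the most popular requested 10,000 times.
counts = zipf_like_counts(1000, 10000.0)
top_one_percent_share = share_of_top(counts, 0.01)

print(f"Top 1% of topics receive {top_one_percent_share:.0%} of requests")
print(f"Mean requests per topic:   {mean(counts):.1f}")
print(f"Median requests per topic: {median(counts):.1f}")
```

On this synthetic curve, a small handful of topics draws a large share of all requests, and the mean sits well above the median. A design that budgets for the "average" topic or the "average" retrieval set will therefore fit almost none of the actual cases.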
Ignoring Size-Sensitivity of Information Retrieval Databases
Every type of indexing vocabulary or classification has an explicit or implicit structure,
and that structure works well with only certain sized databases. That cute little classification
scheme you devise when you have 1,000 records will be driving you crazy with its inadequacies
by the time there are 10,000 records. The indexing vocabulary that was good for a
one-million-item database bogs down at five million items, and so on.
The long-term history of libraries and online databases reflects this
size-sensitivity problem in slow motion, as it were. Library cataloging systems that
worked well in the early nineteenth century have had to be drastically modified every few
decades since then to deal with the consequences of growth in the resource base.
After World War II, when scientific research was growing rapidly, and scientific literature
was exploding in quantity, whole new systems of information access had to be devised to handle
it: that is, new intellectual systems, not only technological improvements.
On the Web, this explosion in growth is happening in months, not years or decades.
The smart information developer must anticipate growth and design for all
planned size levels of the database from the beginning. Otherwise, you are always scrambling
and always behind the curve. I have repeatedly seen dot-coms assume that they can start
with some simple little classification or index vocabulary and worry about growth later.
The trouble is, the growth comes in a few weeks! By then, a commitment has been made to the
earlier, small system. No one wants to re-index existing records, yet the fuller
development or modification of the indexing system requires re-indexing for clarity for users
and indexers alike. Eventually, the classification or other metadata system becomes a
hodgepodge of work-arounds and bad solutions; see also Bates (1998).
Often, the chief product a company has to offer to its Web site users is some form of
indexed information. Yet, figuring out how to optimize the indexing and retrieval of that
information is the last thing that is attended to during the ramp-up to going online. If you
believe your information resource will grow, then design for growth from the beginning.
Otherwise, trust me: It will get worse.
Getting Human Content Processing Wrong
Do you want to keep human indexing costs down? Then pay attention to the design of the
indexing support system. Many Web sites today offer information that is in some part indexed
or categorized by human beings. Needless to say, this is the most expensive part of many
operations, and the point where efficiency produces the highest payoff. Efforts to improve
processing efficiency may be limited simply to pressuring indexers to work faster. But more
can be done than that to help the human indexers.
Think of the indexing support system as a separate information system, with its own
requirements and users. What the indexers need in order to find their way around your system
of indexing vocabulary or categories is different from what the system end users need to find
their way efficiently around a body of information. Often it is the indexing support software
that makes indexers inefficient, not the people themselves. It is important to keep the
cognitive processing load of the indexers moderate, so that they feel neither bored nor
overloaded. That, in turn, requires segmenting the indexing process into easily manageable
parts, with support from the indexing system at key points.
For example, suppose you have a 5,000-term indexing vocabulary. Instead of just listing
it in alphabetical order for your staff, create groups of related terms (concept clusters) on
broad-concept screens. Then, instead of having to move back and forth through a lengthy
alphabetical listing, the indexer can see, on the screen at once, all the terms likely to be
relevant to the record in hand. Now the indexer does not have to think up half a dozen
possible terms, and look them each up separately to identify the best one, but can instead,
at a glance, determine the right term and assign it quickly. In sum, study indexing itself as
a process, with "users" (indexers) who need to be accommodated for best performance
and satisfaction, then design a targeted indexing support system.
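The concept-cluster idea can be sketched in Python as follows. The vocabulary, the cluster names, and the function build_concept_screens are hypothetical, invented for illustration; a real indexing support system would of course draw on the organization's own vocabulary and screen design.

```python
from collections import defaultdict

# Hypothetical slice of an indexing vocabulary: (term, broad concept) pairs.
VOCABULARY = [
    ("Sonnets", "Poetry"),
    ("Biography", "Life and times"),
    ("Elegies", "Poetry"),
    ("Correspondence", "Life and times"),
    ("Hymn meter", "Poetry"),
    ("Homes and haunts", "Life and times"),
]


def build_concept_screens(vocabulary):
    """Group terms by broad concept so related terms appear together on one screen."""
    screens = defaultdict(list)
    for term, concept in vocabulary:
        screens[concept].append(term)
    # Alphabetize within each screen; the grouping, not the alphabet, does the work.
    return {concept: sorted(terms) for concept, terms in screens.items()}


screens = build_concept_screens(VOCABULARY)
for concept in sorted(screens):
    print(f"{concept}: {', '.join(screens[concept])}")
```

With a single alphabetical list, "Elegies" and "Sonnets" sit far apart and the indexer must recall each candidate from memory. On a broad-concept screen they appear side by side, so choosing between them becomes a matter of recognition rather than recall.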
Ignoring Information Expertise
Many Web companies, in the process of developing an information-providing site, assemble
a powerful team of technology, content, and graphic design experts. Programmers come from the
top schools and companies, Ph.D. experts in the subject matter are brought on board, and top
graphic designers are hired to present a gorgeous interface to the Web user. But even though
the purpose of the site is to present information, or "content," to the world, no
one who knows anything about information is brought on board.
Understanding content is not the same as understanding information; see also Bates (1999).
The information specialist (the person who creates classifications, designs metadata
protocols, crafts search capabilities for information system users, and designs systems
specifically for information retrieval) has an entirely different expertise
than either the content expert or the programmer. All these individuals have to work together
to produce a good system; see also Bates (2002). But if the information expert is left out,
the resulting system will be good in every way except at providing information!
Conclusion
In sum, the following improvements in Web information retrieval design are recommended for
the process of recovering from the late-1990's "dot-bomb":
- Use faceted classifications, rather than hierarchical.
- Develop an understanding of what distinguishes information classifications and
vocabularies from the physical-world equivalents, and stop using the misleading term
"ontology."
- Use the many vocabularies specifically designed for information retrieval, rather than
general English language vocabularies.
- Understand and work with the underlying statistical characteristics of information in
designing information retrieval. Failing to understand these factors simply leads to
sub-optimal systems.
- Recognize that systems of information description are extremely size-sensitive.
Design for all anticipated database size ranges from the beginning.
- Be kind to your indexers: Design a targeted indexing-support system specifically for
your human staff, and you will save much staff time.
- If you develop a site with any information retrieval component at all, then hire
information expertise.
Inevitably, when authors are allowed to add keywords, a keyword field becomes a thesaurus.
Thus there should probably be a hard limit to additional characters (total number of
characters controllable by the DKR administrator). Otherwise authors will quickly become
keyword spammers.
About the Author
Marcia J. Bates is Professor in the Department of Information Studies at the University of
California at Los Angeles. She has consulted and published widely in her specialties of
subject access, user-centered design of information retrieval systems, and information
seeking behavior. She is a Fellow of the American Association for the Advancement of Science,
and has twice won the "Best Journal of the American Society for Information Science Paper
of the Year Award."
Web: Other papers by Marcia Bates (http://www.gseis.ucla.edu/faculty/bates/)
E-mail: mjbates@ucla.edu
References
Art and Architecture Thesaurus, 1994. New York: Oxford University Press.
Bartlett's Roget’s Thesaurus, 1996. Boston: Little, Brown.
Marcia J. Bates, 1999. "The Invisible Substrate of Information Science," Journal of the American Society for Information Science, volume 50, number 12 (October), pp. 1043-1050.
Marcia J. Bates, 1998. "Indexing and Access for Digital Libraries and the Internet: Human, Database, and Domain Factors," Journal of the American Society for Information Science, volume 49, number 13 (November), pp. 1185-1205.
Marcia J. Bates, 1988. "How to Use Controlled Vocabularies More Effectively in Online Searching," Online, volume 12, number 6 (November), pp. 45-56.
Marcia J. Bates, 2002. "The Cascade of Interactions in the Digital Library Interface," Information Processing and Management, volume 38, number 3 (May), pp. 381-400.
Abraham Bookstein, 1990. "Informetric Distributions, Part I: Unified Overview," Journal of the American Society for Information Science, volume 41, number 5 (July), pp. 368-375.
B.C. Brookes, 1977. "Theory of the Bradford Law," Journal of Documentation, volume 33, number 3 (September), pp. 180-209.
Y.S. Chen and Ferdinand F. Leimkuhler, 1986. "A Relationship between Lotka’s Law, Bradford’s Law, and Zipf’s Law," Journal of the American Society for Information Science, volume 37, number 5 (September), pp. 307-314.
Communications of the ACM, 2002. Special issue: "Ontology Applications and Design," volume 45, number 2 (February), pp. 39-65.
Ei Thesaurus, 1992- . Hoboken, N.J.: Engineering Information, Inc.
David Ellis and Ana Vasconcelos, 1999. "Ranganathan and the Net: Using Facet Analysis to Search and Organize the World Wide Web," Aslib Proceedings, volume 51, pp. 3-10.
Sara D. Knapp, 2000. The Contemporary Thesaurus of Search Terms and Synonyms: A Guide for Natural Language Computer Searching. Second edition. Phoenix, Ariz.: Oryx Press.
Shirley Loo, (compiler), 1998. Legislative Indexing Vocabulary: The CRS Thesaurus. 22nd edition. Washington, D.C.: Library Services Division, Congressional Research Service, Library of Congress.
Los Angeles Times Thesaurus, 1987- . Los Angeles: Los Angeles Times Editorial Library.
H.P. Luhn, 1961. "Selective Dissemination of New Scientific Information with the Aid of Electronic Processing Equipment," American Documentation, volume 12, number 2 (April), pp. 131-138.
Katherine H. Packer and Dagobert Soergel, 1979. "The Importance of SDI for Current Awareness in Fields with Severe Scatter of Information," Journal of the American Society for Information Science, volume 30, number 3 (May), pp. 125-135.
Jennifer Rowley and John Farrow, 2000. Organizing Knowledge: An Introduction to Managing Access to Information. Aldershot, Hampshire, U.K.: Ashgate.
Dagobert Soergel, 1999. "The Rise of Ontologies or the Reinvention of Classification," Journal of the American Society for Information Science, volume 50, number 12 (October), pp. 1119-1120.
Thesaurus of Psychological Index Terms, 2001. Ninth edition. Washington, D.C.: American Psychological Association.
Wordnet at http://www.cogsci.princeton.edu/~wn/, accessed 30 May 2002.