Nicholas Carroll
Date: 2/7/2001
Modified: 7/22/2001
This paper is an offshoot of Retrieving Information from Dynamic Knowledge Repositories - Overview, 02-dkr-ir-intro.shtml.
Until such time as artificial intelligence is perfected, dynamic knowledge repositories will benefit from "adding in" human intelligence to the metadata. Authors or creators of objects are probably the cheapest and most available source of metadata, and they are often quite familiar with their intended audience, whether their information is about cooking, flying, or flying saucers. (No, authors are not always familiar with their intended audience.) The last chance to capture this knowledge as metadata is usually when the author uploads the object to the Dynamic Knowledge Repository (DKR).
I suggest that the Open Hyperdocument System (OHS) should have strong capabilities for authors to enter metadata. These are a few thoughts on standard "fields" that might be worth including in OHS data structures. The list is not remotely exhaustive. However, most of the fields have similar meaning across cultures.
I'm not suggesting that the creators of documents should be forced to fill out each and every field. Nor am I advocating any particular data structure, simply that the data structure should accommodate this sort of indexing. (When I refer to "fields," I'm using the term for convenience, not as a prescription.)
What I will say is that dumping this metadata into a single "field" will not work. I am all for unifying namespaces in theory. Yet in practice, I've always produced better information retrieval with a judicious mix of a unified namespace and a few carefully-thought-out separate fields. If the OHS is going to have top-notch information retrieval, by whatever scheme, users will have to be able to write precise queries, including or excluding certain fields, whether it's done by RDBMS or code gymnastics. With the current quality of math-based search, I suggest a mix as a starting point, because among other points it is a lot easier to unify namespaces later on than it is to fragment them.
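As a rough illustration of "including or excluding certain fields," here is a minimal Python sketch. It is not a proposed OHS data structure: the names Record, FIELDS, and search are hypothetical, the five fields are just a sample, and plain substring matching stands in for whatever retrieval scheme the OHS might eventually use.

```python
from dataclasses import dataclass

# Hypothetical sample of separate metadata fields; every one may be left blank.
FIELDS = ("title", "author", "organization", "abstract", "keywords")


@dataclass
class Record:
    title: str = ""
    author: str = ""
    organization: str = ""
    abstract: str = ""
    keywords: str = ""


def search(records, term, include=FIELDS, exclude=()):
    """Return records whose searched fields contain `term` (case-insensitive).

    With the defaults every field is searched, which behaves like a
    unified namespace. `include` narrows the search to the named fields;
    `exclude` removes fields from it.
    """
    searched = [name for name in include if name not in exclude]
    needle = term.lower()
    return [
        rec for rec in records
        if any(needle in getattr(rec, name).lower() for name in searched)
    ]
```

With made-up records, a search for "steel" over every field would match both a document titled "Steel pipe fittings" and a document written by someone named Steel; passing exclude=("author",) keeps only the first, and include=("author",) keeps only the second. A single merged field could not make that distinction.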
Author
Title of Author
Alias
Professional/Personal
Organization
Title of Work
Level of Work
Abstract
Keywords
Attitude
Source of Link
Geographic Location
Language
Date
The Wild Blue Yonder
Notes
Author
Authorship is a hugely important key to grouping documents by subject matter, quality, and level of knowledge. Author obviously implies subject matter; different people are knowledgeable about different things (see Professional/Personal further down).
Author also tells us about the quality of the knowledge, a perception we come to individually, or by word of mouth. It is tempting to say that "level" should be decided by the readers, not the author. I'm going to digress from my short-and-snappy style for a moment here, to point out that reader opinions are questionable. Here are some of the problems:
- With a large user group, voting is usually a disaster, as the authors who pander to the masses outrank those with valuable knowledge.
- With a smaller and usually more specialized group, voting invites experts to slander their peers.
- And in either case, the merely popular can outshine the truth.
The system that does work, surprisingly well in many cases, is the one used by Zagat restaurant guides: user comments, editorial selection. The words are from the users, the choice of comment is the community manager's. The editor gets to choose, but not speak; the users can say anything, but cannot choose.
Level is a hugely overlooked element of authorship. No matter how lucid a writer may be, they inevitably address the bulk of their writing to a certain level of sophistication. The best calculus teacher on Earth may be incomprehensible to a third-grader, and the best third-grade teacher is useless to a calculus student. Thus author name is an important variable for extracting information relevant to the user's current level of sophistication. In a way, indexing by author creates a multi-leveled Help file, and an end to the Microsoft "useless after 3 months" style of help. (Multi-leveled help files were part of Doug Engelbart's original specifications for NLS, but the spec was waylaid.)
Title of Author
Position, rank, and honors would be useful as a searchable field. People in class- or rank-sensitive organizations remember these things extremely well.
Alias
Some people don't want to attach their names to published documents. Many Americans would be puzzled by such an attitude. People living in totalitarian regimes will need no explanation. To those who don't understand, let me put it this way: like it or not, some people refuse to use their real name, and if you don't allow them to use an alias, the world will never have their knowledge.
Professional/Personal
On the Internet people publish documents about their family, dogs, and vacations as well as their area of expertise. A professional/personal switch would later allow users to filter out the personal data, or vice-versa.
Organization
The organization an author works for, or is affiliated with.
Title of Work
Title of the document or object.
Level of Work
Beginner, intermediate, advanced. Where the author's name may imply level, this states it.
Abstract
The contents of the document described in one brief paragraph.
Keywords
User choice of keywords is the salvation and bane of searchable data structures. While spammers abuse them out on the Web, there are excellent reasons to have a keywords field in a DKR, including:
- alternate or incorrect spellings of author name
- alternate or incorrect spellings of important words
- related concepts
Inevitably, when authors are allowed to add keywords, a keyword field becomes a thesaurus. Thus there should probably be a hard limit to additional characters (total number of characters controllable by the DKR administrator). Otherwise authors will quickly become keyword spammers.
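One way such a hard limit might be enforced is sketched below in Python. The function name trim_keywords, the default of 200 characters, and the choice to drop whole keywords rather than cut one in half are all assumptions for illustration; the limit itself is meant to be whatever the DKR administrator sets.

```python
def trim_keywords(keywords, max_chars=200):
    """Keep whole keywords, in the author's order, within a character budget.

    `max_chars` stands in for the administrator-controlled limit. One
    separator character is counted between consecutive keywords. Trimming
    stops at the first keyword that would exceed the budget.
    """
    kept = []
    used = 0
    for keyword in keywords:
        keyword = keyword.strip()
        if not keyword:
            continue  # ignore blank entries
        cost = len(keyword) + (1 if kept else 0)
        if used + cost > max_chars:
            break
        kept.append(keyword)
        used += cost
    return kept
```

Stopping at the first keyword that does not fit, instead of squeezing in shorter ones further down the list, rewards authors who put their most important keywords first.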
Attitude
My study of internet search phrasing suggests that a huge proportion of searches are conducted with a bias, whether emotional or rational. People are looking for pro, or con, or comparisons. Those searching for contradictory or negative information often resort to adding "sucks" to the search phrase. However, a political paper entitled "Why Proposition 44-R is a Bad Idea" is unlikely to contain the word sucks.
Attitude could probably be expressed by a fairly small set of pre-defined choices, such as:
pro / con / tangential / vs / review / comparison
I suspect that, given the opportunity, pro or con authors would use such an indexing option with glee. Review, vs, or comparison authors would use the indexing option out of professional diligence. I'm not sure how to make this one work across language barriers.
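A small pre-defined set like this maps naturally onto an enumerated type. The Python sketch below is only an illustration: the names Attitude and filter_by_attitude are hypothetical, and documents are reduced to (title, attitude) pairs, with None for an author who left the field blank.

```python
from enum import Enum


class Attitude(Enum):
    """The pre-defined choices listed above; the stored values are assumptions."""
    PRO = "pro"
    CON = "con"
    TANGENTIAL = "tangential"
    VS = "vs"
    REVIEW = "review"
    COMPARISON = "comparison"


def filter_by_attitude(docs, attitude):
    """Return the titles of documents tagged with `attitude`.

    `docs` is an iterable of (title, Attitude-or-None) pairs. Untagged
    documents (None) never match.
    """
    return [title for title, tagged in docs if tagged is attitude]
```

A searcher looking for negative material could then ask for Attitude.CON and find "Why Proposition 44-R is a Bad Idea" without the word "sucks" appearing anywhere in the paper.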
Source of Link
Indexing the human who first brought a document to your attention can be an invaluable piece of retrieval data. This is immediately apparent watching a Pine user searching email; they routinely search by name of sender. One can see the opposite pole watching Win/Mac newbies vainly search countless old emails for a bit of information. It is equally telling to watch a sophisticated computer user on a Win/Mac machine; they promptly open the consolidated email files in any available text editor, and search the whole file for the sender's name.
In DKRs the value is not quite so obvious. Who cares which stranger posted a link? But most DKRs become familiar to their regular users, and thus one remembers "... that link to elephants' dietary habits? Ah! George posted that one!"
Geographic Location
Encouraging indexing by country-state-city is a no-brainer. There are some details, though, like provinces, cantons, and counties. These need to be addressed, but are fairly trivial.
(People living in large countries may not see the importance of searching by country of origin. Well-travelled Europeans or Southeast Asians need no explanation. When you live in Luxembourg or Singapore, many of your acquaintances are "foreign.")
Language
Indexing by language is also a no-brainer. For most people, shifting to another language is a huge jump in frame of reference, very easily remembered, and thus an excellent indexing criterion.
Filtering by language does more than shift the mind; it allows one to pull up local information. E.g., a search phrase of |France Loire vacation| could pull up either English- or French-language pages, as the words are spelled identically in both languages. Without language filtering, a searcher who strictly wants pages written in French, assuming they know search syntax, has to search |France Loire vacation +avec|, on the assumption that the common French-only word "avec" will be found in 100% of the French-language pages and 0% of the English ones. (This is actually a fairly good assumption. But few searchers are that sophisticated.)
Date
Date is especially valuable for searching email. Imagine a fairly typical salesman, with 3,000 emails archived in various directories. He is looking for an email he received a few months ago about technical details of a new type of steel pipe fittings. Unfortunately, he works for a steel pipe fitting company, so 2,500 emails contain that text string.
However, memory's associations being odd things, he remembers that he printed out the email and sat on the couch to read it. His office has no couch; ergo, he was at home. So a search for "steel pipe fittings" in emails received on weekends in the last 4 months promptly reduces the search from 2,500 emails to 20.
Date should be searchable by morning-afternoon-evening-night; day of week; date of month; month; season; year; decade. Range alone is not adequate.
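The salesman's query can be sketched in Python by deriving searchable facets from a single timestamp. Everything here is an assumption made for illustration: the function names, the hour boundaries for morning/afternoon/evening/night, the northern-hemisphere seasons, and the representation of an email as a dict with "body" and "received" keys.

```python
from datetime import datetime

WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")

# Assumed northern-hemisphere seasons, keyed by month number.
SEASONS = {12: "winter", 1: "winter", 2: "winter",
           3: "spring", 4: "spring", 5: "spring",
           6: "summer", 7: "summer", 8: "summer",
           9: "autumn", 10: "autumn", 11: "autumn"}


def date_facets(ts):
    """Break one timestamp into the separately searchable parts listed above."""
    if 5 <= ts.hour < 12:
        part = "morning"
    elif 12 <= ts.hour < 17:
        part = "afternoon"
    elif 17 <= ts.hour < 21:
        part = "evening"
    else:
        part = "night"
    return {
        "part_of_day": part,
        "day_of_week": WEEKDAYS[ts.weekday()],
        "day_of_month": ts.day,
        "month": ts.month,
        "season": SEASONS[ts.month],
        "year": ts.year,
        "decade": ts.year // 10 * 10,
        "is_weekend": ts.weekday() >= 5,  # Saturday or Sunday
    }


def weekend_hits(emails, phrase, since):
    """Emails containing `phrase`, received on a weekend, on or after `since`."""
    needle = phrase.lower()
    return [
        mail for mail in emails
        if needle in mail["body"].lower()
        and mail["received"] >= since
        and date_facets(mail["received"])["is_weekend"]
    ]
```

The `since` argument is the range part of the query; the weekend test is the part that a range alone cannot express.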
The Wild Blue Yonder
Specialty DKRs will need specialty fields. Consider these few criteria for indexing people:
Order of birth - Indonesians use nouns such as Nyomen (second-born?) as names. Thus they can remember "the second-born guy I met last week."
Appearance - hair color is not a good identifier in China, where black hair is common. In Sweden black hair is extremely memorable.
Past that, even the personal classification schemes run into infinity. Musicians remember each other by the instruments they play (e.g. "a keyboard man"). Shoeshine boys describe customers by the shoes they wear. Auto mechanics discuss customers by the cars they drive.
So the structure needs to be extensible.
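One simple way to get that extensibility is an open-ended mapping next to the standard fields, sketched here in Python. The names PersonEntry, extra, and find are hypothetical, and exact-match lookup is only the simplest possible assumption about how specialty fields would be searched.

```python
from dataclasses import dataclass, field


@dataclass
class PersonEntry:
    """A standard field plus an open-ended bag of specialty fields."""
    name: str
    extra: dict = field(default_factory=dict)  # e.g. {"instrument": "keyboard"}


def find(entries, **criteria):
    """Return entries whose specialty fields match every given criterion.

    An entry that lacks one of the requested fields simply does not match.
    """
    return [
        entry for entry in entries
        if all(entry.extra.get(key) == value for key, value in criteria.items())
    ]
```

A musicians' DKR could then store {"instrument": "keyboard"} and an auto shop's DKR {"car": "pickup truck"} without either one changing the shared structure, and find(entries, instrument="keyboard") would retrieve "a keyboard man."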
Notes
1. While we all look forward to a world of valid XML documents, I think it would be foolish to force too much structure on OHS users. Users really don't like to be bullied, and where a corporation may be able to bully its employees into filling out each and every data field, a software system that has to spread through user appreciation should be friendlier. Metadata will always have fields left blank; it's the nature of the beast.
2. I'm presuming that DKR catalogers, working at a lower level than authors, will also have access to all these fields, and perhaps more.
===============================
References
Authorship Provisions in Augment, Douglas C. Engelbart
http://www.bootstrap.org/oad-2250.htm
Yahoo! Cataloging the Web, Anne Callery
http://www.library.ucsb.edu/untangle/callery.html
This appears to be a shortened and perhaps bowdlerized version of the original. I may still have the original in my own archives.