
 

The Anti-Thesaurus Part 2:

Expansion of Proposal for Increasing Search Relevance

 

By Nicholas Carroll
Date: December 19, 2001
Modified: N/A

Please read the first Anti-Thesaurus metatag paper before this one; it gives the nutshell of the proposal.

Note: response to the first paper has been so overwhelming that I would like to remind readers that the Anti-Thesaurus is a proposal for a new metadata tag. No such tag is currently indexed by any of the major search engines.

What's the Word?
By popular demand, expressed in numerous emails, I’ve changed the word from the ugly "nonwords" to the much snappier "exwords", for "excluded words." My thanks for the feedback.


Technical and Information Retrieval Issues
Philosophical Issues
Business Issues
FAQs
What Exwords Can Accomplish, In Greater Depth - Two Examples
Conclusion

Technical and Information Retrieval Issues

Will It Work?
It did work. I’ve already tried it in a company searchbase.

Which leads to "will it scale?"

I’ll get to that.

A Note On Proper Use
Proper use of any metadata, in my opinion, flows from experience. The experience can be rooted in any number of disciplines, but regardless of the source, it is clear to me that someone who cannot read and understand the patterns in web site log files, including search string syntax, has no business monkeying with search engine rankings.

(I know of no single discipline or degree that automatically qualifies a person to choose metadata destined for Internet search. That includes information science, for which I have high regard.)

Nor can one just hand the job over to the professionals: search engine optimizers. Odd but true: because SEO companies operate as consultants, they rarely get to learn a company and its customers from end-to-end. The clients have their notions, such as "get me to number one for yadda-yadda", and that’s all they’ll pay for. End of billable hours; end of the learning curve. By the time the optimizer has learned a bit about selling tennis rackets, they’re off to a new client who wants to sell furniture. They see the patterns, but they’re never paid to develop the depth. (In-house SEOs, by contrast, learn the business in depth, but never get the broad exposure to searchers needed to develop their intuitive pattern recognition.)

In fact, hands-on learning with no technical background whatsoever has made more than one business owner a master of search engine optimization. The legendary Jim Rhodes, one of the founders of the SEO profession, was entirely self-taught, and cut his teeth promoting the web site of a hotel in London. As far as I know, all it took him was genius and infinite tenacity.

Misuse of Metadata
1997-1998 was perhaps the heyday of meta keyword excitement, in which newbies were running around copying keywords from highly ranked web pages, and then pasting them into their own pages, in the hopes of raising their pages in the search engines’ indexes. The press wrote about "keyword jacking." Lawyers thundered about trademark, and filed lawsuits over, yes, "keyword infringement."

One of the more bizarre incidents was the pink poodle grooming episode. AltaVista had put up an instructional page on the proper use of the meta keywords tag, for a fictitious dog grooming parlor:

<meta name="keywords" content="grooming pink poodles">

Hundreds of webmasters throughout the English-speaking world diligently copied the example into their own site’s HEAD element, with the result that searching AltaVista (or any search engine indexing meta keywords) was soon returning hundreds of web sites for grooming pink poodles. (I’m sure many webmasters just forgot to go back and add the proper keywords. And some were unclear on the concept.)

(One of the last remaining examples on the web: http://www.fkkt.th.com/ Check the document source HTML HEAD element for the meta description and keywords. If you want to give the webmaster a wake-up call, go to AltaVista first, search "grooming pink poodles", and click through from AltaVista.)

I saw other hilarious abuses, like the arthritis web site that keyword jacked an actor in California, and rocketed to the top of three search engines for her searches. (Why? Who knows!)

And in the end: it didn’t make much difference, one way or the other, no matter whose keywords you "jacked." Keywords had already become the starting point of newbie search engine spammers, and all the search engines were downgrading their weight.

Potential Misuse For an Exwords Tag
Just as problems can arise when Meta Keywords are used without proper knowledge, misuse of a Meta Exwords tag could end with unfortunate results. A couple of obvious examples come to mind:

Getting your exwords from the micro-marketers (marcom department) and mindlessly adding them to the tag, without running them by people who understand logs, behavior, and patterns. Talk about tunnel vision! Outliers (those little dots on a distribution graph that lie outside the main grouping) are hugely important to the inquiring mind, and evolution itself. After all, in the age of dinosaurs, dinosaurs were the mainstream, and proto-human lifeforms were the outliers. (If you don’t like outliers, then take a look in the mirror; you used to be one.)

Using CSS or XML, or any tool for mass-formatting web pages, seems like an excellent candidate for disaster. It’s not hard to envision a computer manufacturer producing an XML template for a particular sub-site devoted to promoting their support for the Boy Scouts, "exwording" the word "computers" – and then inadvertently spreading the template throughout the company web site. Whoops....

Abuse / Security
I note Usenet is already buzzing with thoughts about exploiting an exwords tag. Ah, inquiring minds! However:

The potential for abuse would rely largely on the naiveté of search engine administrators, and at the better search engines, the admins are no longer naive. (Paranoid, perhaps. But remember the adage that even paranoids can have enemies – in the case of search engine admins, several million enemies.)

I spent relatively little time on technically manipulating search engines, as I had decided fairly early on that quality content was the best way to gain rankings for the long term.

Nevertheless, I spent a fair bit of time with the great technical index spammers, and I keep up with some of the literature. My gut sense is that there is at least one great hack out there – at least one way to take the search engines to the cleaners, but good. Front-page news for the technical media.

I also sense that the hack would be very, very short-lived. "Exwords" being a narrowly defined type of metadata, they would not offer the variety of spamming opportunities available by manipulating a web page’s entire content. Most of the garden-variety possibilities for spamming the search engine indexes could be stopped with some fairly simple algorithms.

(Clearly anyone successfully cracking a web site could potentially use an "exwords" tag to effectively delist their competitor. I think that problem belongs to web server systems administrators – who are already responsible for handling everything from doorknob-rattling to DDoS attacks. Altering a web site’s exwords would be of dubious effect anyway; the sys admins would quite likely have the original metadata restored even before the next search engine spider arrived. If not, there's always the panic-stricken email to the search engines: "We were cracked! Please crawl again!")

Implementation
An anti-thesaurus would involve no elegant coding at all. It would simply allow web site owners and webmasters to specify search words and phrases under which they no longer wished to be found.

The coding is all on the search engines’ end, and it’s pretty simple coding at that. Chris Sherman observed in a Search Engine Watch article that it’s essentially a matter of giving web site owners the option of adding Boolean NOT for certain words. That’s exactly how it was first implemented. Along with the database’s thesaurus field there was an anti-thesaurus field. Just as administrators could add to the thesaurus, they could add terms to the anti-thesaurus. There was some tweaking with namespace unification – I had all relevant fields except the anti-thesaurus consolidated into one field for search purposes – but this was about it, for a three-term search:

Find word1 AND word2 AND word3 in KEYWORDS_UNIFIED

Return matching record numbers to memory

Go to first record

Search KEYWORDS_ANTI field for word1

IF found, delete record number from returns and search next record; IF not found, search KEYWORDS_ANTI ("K_A") field for word2

etc., etc., until all the records had their K_A fields searched, and had been left in the final list of records to be returned, or dropped from the list.

Fairly crude processing, but fast enough with an adequate RDBMS. (A search engine would probably want somewhat more elegant code.)
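
The steps above can be sketched in Python. This is only an illustration under stated assumptions, not the original searchbase code: the in-memory RECORDS dictionary stands in for the RDBMS, the two sample records are invented, and the function name search is hypothetical. Only the field names KEYWORDS_UNIFIED and KEYWORDS_ANTI come from the description above.

```python
# Hypothetical in-memory stand-in for the searchbase; the records are invented.
RECORDS = {
    1: {  # a bed and breakfast that mentions helicopter tours once
        "KEYWORDS_UNIFIED": {"hawaii", "bed", "breakfast", "helicopter", "tours"},
        "KEYWORDS_ANTI": {"helicopter", "helicopters"},
    },
    2: {  # a helicopter tour operator with no exwords
        "KEYWORDS_UNIFIED": {"hawaii", "helicopter", "tours"},
        "KEYWORDS_ANTI": set(),
    },
}


def search(terms, records=RECORDS):
    """Return ids of records matching every term, minus exworded records."""
    wanted = {term.lower() for term in terms}
    # Find word1 AND word2 AND ... in KEYWORDS_UNIFIED.
    matches = [
        record_id
        for record_id, record in records.items()
        if wanted <= record["KEYWORDS_UNIFIED"]
    ]
    # Drop a matching record if any search term appears in its KEYWORDS_ANTI.
    return [
        record_id
        for record_id in matches
        if not (wanted & records[record_id]["KEYWORDS_ANTI"])
    ]
```

With these sample records, a search for "hawaii helicopter tours" returns only the tour operator, while "hawaii bed breakfast" still returns the bed and breakfast. The set intersection replaces the word-by-word loop of the pseudocode but gives the same keep-or-drop decision for each record.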

Will It Scale?
In one sense that’s a non-question. After all, the bulk of the work would fall on individual web site owners. To the owner of a bed and breakfast in Hawaii, who is tired of inquiries about helicopter tours – just because their web site mentions helicopter tours one time – it’s quite simple. Add "helicopter, helicopters" to the exwords tag. Now your web page content still promises that you will arrange helicopter rides for your guests, but you no longer get visitors who are looking for helicopter tours instead of bed and breakfasts.

In that sense it would scale easily – because there is no scaling. Exwords are applied to particular pages, small websites, or particular sectors of websites. It’s a hand-coded thing, for the most part.

If you’re thinking of adding exwords tags to all the pages of www.hp.com – no, it won’t scale, not without real professionals doing the work. Be prepared to hire an information professional (information science, information architect, or top SEO) for at least six months.

What Exwords Can Accomplish – a Walk-Through
I originally created an exwords field in a searchbase to nullify searches that were clearly pointless. Finesse came later.

Repeat: "clearly" pointless. Clarity is in the mind of the beholder – and obviously some respondents to the first paper do not suffer clarity – but people who actually pore through web site logs should have no trouble understanding this: an anti-thesaurus has its limits.

Since three separate commentators came up with the "asparagus spears" and "Britney Spears" example, I'll use that:

You are an asparagus farmer, selling asparagus. Every day your web site gets 20 visitors looking for asparagus, and 200 looking for Britney. If search engines recognized an exwords tag, you would simply add:
<meta name="exwords" content="britney">
to your pages. In a few months, once all the search engine indexes were updated, you would no longer be getting any visitors looking for Britney Spears – just asparagus.

That’s about as clear-cut a need as there can be.
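
On the search engine side, reading such a tag would be a small job. The following Python sketch shows one way a spider might collect exwords from a page. Since the tag is only a proposal, everything here is an assumption: the class name ExwordsParser, the helper extract_exwords, and the choice of comma-separated terms (borrowed from the meta keywords convention).

```python
from html.parser import HTMLParser


class ExwordsParser(HTMLParser):
    """Collect terms from <meta name="exwords" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.exwords = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Attribute values can be None when a tag has a bare attribute.
        if (attributes.get("name") or "").lower() != "exwords":
            return
        content = attributes.get("content") or ""
        # Assumption: terms are comma-separated, as in meta keywords.
        for term in content.split(","):
            if term.strip():
                self.exwords.add(term.strip().lower())


def extract_exwords(html):
    """Return the set of exwords declared in an HTML document."""
    parser = ExwordsParser()
    parser.feed(html)
    return parser.exwords
```

Feeding it the asparagus farmer's page would yield the single term "britney", which the engine could then store alongside the page's index entry.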

There are more complex sets of judgements. I’ve included two examples near the end of this page, with sample keyword logs from a couple of this site’s pages.

What Exwords Cannot Accomplish
The toughest possible example of excluding irrelevant searches may be a punk band web site. Such sites routinely include obscene words in song lyrics. A Slashdot post about my original paper was both comic and frustrating. I later heard from the webmaster, who was eager to reduce sex-searching visitors, but to no avail. Nor do I have any idea how to solve his problem. When it comes to sex, the variation in search phrases is absolutely mind-boggling.

I noted this some years ago while watching real-time web searches. There is an unfortunate trinity between porn-seekers, misspelling, and punk band sites. Where punk bands intentionally misspell, either their names or in lyrics, porn-seekers seemingly cannot spell. In the vastness of cyberspace, indexed by search engines, their paths cross, with bizarre search returns.

Sites dealing with magic suffer the same problem to a lesser degree. I recall watching a realtime search on one of the metaspy tools some years ago, when web traffic was light enough that you could see all the searches of a particular visitor. The sequence went:

First search: blac majic

20 seconds later: black majic

2 minutes later: black magic
(Went and found a dictionary? No... that searcher probably didn’t own a dictionary. Might have called a friend.)

I searched for punk bands named Black Magic, found none, and concluded the searcher was indeed searching for "black magic." Heaven help the magicians’ supply shop that had the word "black" on their page.

Blogs are another tough issue. This is not because they suffer too many porn hits (though some do, like the meryl.net blog owner who is tired of people looking for meryl streep nude). But blogs are wide-ranging, many-faceted creations. I’m not sure where you would begin exwording, or whether you would want to. It would be somewhat akin to applying exwords to Usenet. Surely the gods wanted Usenet to remain Chaos, or they would not have created alt. groups.

Substitutes That Won’t Work
Some inquiries have asked whether there are immediate ways to implement the concept, without the actual exwords metadata tag. Yes, there are hundreds – but aside from that being another paper in itself, all of the tactics require a thorough knowledge of current search engine protocols. Sticking with the simpler ones:

Some emails have asked about tactics like repeating "sex" 100 times in a row in the keywords metatag. Yes, that will get a page delisted for "sex", alright, on many search engines. It may also get the page, or the entire web site, delisted period – just bounced right off a search engine’s index.

Or repeating "sex" 100 times in the body, with white font on white background, so the spiders can read it, but the viewers can’t? Same thing. The search engine administrators stopped seeing the humor in that one a long time ago.

Most of the equally obvious tactics have long ago been penalized by the search engine administrators, and could cause a site to be completely delisted from most search engines.

Simple Substitutes That Will Work
You are looking at one. I’ve put the keyword logs described further down this page on separate web pages, and listed those pages in Hastings’ robots.txt file. Otherwise this paper would have been getting thousands of irrelevant hits a year, and polluting the search engine indexes with yet more irrelevant content.

It didn’t take much time. It’s good ethics. It’s also good business. (I’ll get back to business further on.)
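
For anyone wanting to check that such a robots.txt entry does what they intend, here is a minimal Python sketch using the standard library's robots.txt parser. The file contents below are an assumption, not a copy of the Hastings robots.txt; the /net/sub/ path is simply taken from the log-page URLs cited in this paper, and the crawler name ExampleBot is made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; the real file may differ.
ROBOTS_TXT = """\
User-agent: *
Disallow: /net/sub/
"""


def is_crawlable(url, robots_txt=ROBOTS_TXT, agent="ExampleBot"):
    """Report whether a well-behaved spider may fetch the given URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

Under these assumed rules, the paper itself stays crawlable while the keyword-log pages under /net/sub/ are skipped by spiders that honor robots.txt.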

Another is to rewrite the content to eliminate the phrases or words that draw irrelevant searches. Aside from being labor-intensive, there are several good reasons why this is a lousy solution. For one, it corrupts the original version. Second, it means webmasters would be editing content which they may not have written, probably without the authors’ permission. Third, it smacks of censorship, even if it is self-censorship. While this solution may be simple at first blush, it’s a pretty poor substitute for an exwords tag.


Philosophical Issues

Serendipity Lost?
This was a question that came from Cory Doctorow on boingboing.com (and a few people who seemed more interested in random surfing than serendipity): what will happen to serendipity in search, if everyone starts excluding their web pages from every searcher who appears irrelevant to their organizations/personal goals?

It’s a legitimate question, with three immediate answers:

First, most web sites won’t use a meta exwords tag anyway. As previously said, they won’t hear about it, understand it, or get around to it. We can be assured of irrelevant search returns for years to come.

Second, irrelevant returns from search engines are a pretty poor tool for producing serendipity. Even to a seriously lateral thinker (a "baroque mind," if you will), most irrelevant returns are just garbage in. A program called "Mindfisher" came along well before the Web, with the exact purpose of creating serendipity, and did a much better job than bad search returns. (Which is hardly surprising; search engines are not trying to deliver serendipity.)

In fact, there are business opportunities here:
www.tangential.com – what a great researcher does
www.verytangential.com – what a very great (or lousy) researcher does
www.utterlyrandom.com – what people with spare time do

Third, it could not happen anyway, for several reasons:

1. Expert search sites such as Alltheweb, Google, and Northern Light would probably add an Exwords On/Off switch to their advanced search options – an exceedingly simple piece of coding for people who have largely won the pornography wars.

2. Directories such as Looksmart, the Open Directory Project, and Yahoo don’t index content anyway, much less metadata.

3. Regardless of how popular exwords screening might be with search engine users, some search engines would never get around to implementing it.

4. There will always be new startups, because search is cool. My bookmarks are stuffed with search engine startups based on topic mapping, taxonomies, voting, ontologies, and NLP. Some of them will survive and flourish. And each of them will give different search returns.

5. Exwords won’t affect whether humans post a link to a blog, or send a friend a link.

Conclusion: it takes actual mass censorship to eliminate serendipity on the Internet. In the relatively free nations, serendipity will remain.


Business Issues

There have been numerous posts on various discussion groups in reaction to the first anti-thesaurus paper, asking, "What web site would want less traffic?"

Perhaps this belonged in the Philosophy section, but it was raised as a business question. Peter Suber, editor of the Free Online Scholarship Newsletter, answered it best: "Why turn away visitors, even if they're visiting in error? For the same reason we put up free content: to be helpful."

Still, to the More-Traffic-Is-Always-Good School of Thought, that begs the question. So, back to "What web site would want fewer hits?"

The answer is: most web sites that are serious about business.

Sticking with Britney Spears, imagine that your asparagus web site is pulling 100,000 visitors per year looking for Britney. What will you do with these hits? The dot-com madness is over, and selling "eyeballs" isn't what it used to be. Are you going to bash off an email to her agent? It can take half a day to track down an agent even if you know the music industry. Why is the agent interested in a lousy 100k visitors? Britney can make the front page of People magazine by changing her hair color.

I oversee one free legal information web site (not a banner ad to be seen). Some of the topic pages rank at the top of all the major search engines, which must drive law firms bonkers. I was approached by one legal services company that was selling the exact service described on my page, asking me for a link to their site. Curious, I asked them what the clickthroughs were worth. The marketing guy told me that they had no need of my clickthroughs, because their company had been featured in the Wall St. Journal. Then he offered to "trade links", between my highly-ranked, high-traffic, non-profit site, and his unfindable purely-for-profit web site. Sort of a "we need you but we don’t need you" relationship. End of that "relationship" – though it wasted a couple of hours of my time.

"Selling eyeballs" made a few instant billionaires before the dot-com bubble burst. Now the Web is back to normal – and now selling advertising is a business, a whole different kettle of fish from dot-comming. It takes time, money, and work to sell advertising space. And that is true regardless of whether your ad space is valuable or worthless. The dollars are not chasing the Internet ad space anymore.


FAQs

The "What If?" Department

"Dear Nicholas:
    What if I delist the word ’helicopters’ from my bed and breakfast site, and then a billionaire goes to Google and searches bed and breakfast Hawaii helicopter tours ?? Omigawd!!!
    Signed,
         Worried in Hawaii."

Ladies and gentlemen: life is uncertain. You take your chances. That is why I say that someone who does not understand their subject, their visitors, and their web site logs, really has no business monkeying with an exwords tag – or metadata in general. (Or even writing content....)

Likewise I say that – if you’re wired for it, and have patience – you can learn by doing.

Quibbles Department
"Google has already perfected search!"
    I'm sure Google is flattered by this view. But apparently the Google developers differ, since they recently added a feedback mechanism asking users to rate the quality of individual searches.

"The logs in the appendix don’t list search engine source or number of visitors!"
    Correct. That’s intentional. The purpose of this paper is to get the basic idea across – not to provide material for debating the relative merits of search engines.


What Exwords Can Accomplish, In Greater Depth - Two Examples

Below are examples of search words and phrases used to reach a couple of Hastings pages.

The first is a list of search terms used to reach a fairly obscure non-commercial technical paper on this site called Technical and Human Considerations In Creating Precision Hyperlinks.

Note that before even starting to make judgements about what constitutes a "pointless" search, I would have to ask "How much information is there already on the Web about this subject?" If the answer is, "lots" – then that raises a second question, "What is the quality of our information compared to other available web pages?" If there is lots of information available, and the relevance of mine is relatively low, I would be more liberal in adding exwords to the metadata.

However, in the case of this paper, the answer would be: little to none. The subject is so obscure that the paper is almost canonical, especially considering that it links to some of the few other Web pages on the subject. This in itself would prompt me to be sparing in the number of exwords I added to metadata.

Incoming search phrases are grouped by decreasing relevance (the groupings represent my best guesses, not certainty). In any case the reader will certainly get a sense of my doubts about assigning "% relevance" to search returns. Obviously this is opinion, not clairvoyance. However, it is opinion based on personally analyzing several gigabytes of log data over the last few years, as well as crunching another few terabytes for patterns.

The paper: http://www.hastingsresearch.com/net/01-precision-hyperlinks.shtml
The search terms and phrases used to reach it: http://www.hastingsresearch.com/net/sub/AT-example-1.shtml

Now, that was an example of a tightly focused technical paper, and still it draws a certain percentage of visitors who want to be elsewhere.

Here is an example of an essay that runs wild and free, pulling thoughts from technology, business, religion, magic, psychology, history, military history, and the dot-bombs: Taunting the Gods of Business, at http://www.hastingsresearch.com/whitepapers/taunting.shtml
The search terms and phrases used to reach it: http://www.hastingsresearch.com/net/sub/AT-example-2.shtml

Skipping the numerical analysis of what should be "exworded" – since I haven’t included hit counts – a couple of obvious patterns nevertheless jump out of the terms list. (I haven’t broken down the searches as with the previous paper, because I’m short on time.)

First, there is no way to classify such a paper in a hierarchy; it has too many facets, and could easily be classified as:
dot-com failure analysis
internet business models
analysis of Amazon.com
business promotion
hype limitations
religion hype

among numerous other legitimate categories for which it provides significant amounts of content.

So for a paper like this, a good search engine "does it better." Yet, since a search engine can roughly be thought of as an "infinite categories" search system, visitors are also finding it by inappropriate facets like:
religion
canada
everest
hardware
taunts

The last four are clear candidates for exwording.

The second obvious pattern is the variety of phrasings visitors might use even though they are seeking the same information. (More on that available at Indexing and Access For Digital Libraries and the Internet: Human, Database, and Domain Factors, Marcia J. Bates. Use your Find tool to locate the section called "Multiple Terms of Access".)

In my experience, due to the multiplicity of terms used, it is a lot easier to choose exwords than "ex-phrases." In the asparagus/Britney example, one word is the source of the asparagus grower’s problem, and one exword pretty much solves the grower’s problem: Britney.

For Taunting the Gods of Business, "Ponzi" and "paperclips" are some fairly obvious candidates for an exwords tag.

Moving into phrases, judgement becomes more questionable, and definitely more time-consuming.

chinese business gods is, I suspect, an irrelevant search. With the exception of Monkey, the gods in that paper are Greek, Norse, or of my own creation.

Likewise paperclip manufacturers. The only reason people come to this web page is because paperclip manufacturers aren’t online. Maybe a true "Web citizen" would patiently assemble a list of paperclip manufacturers, and post it on their site. I would just exword the search.

But how about dot com business plan generator? One might guess what the searcher wanted, some sort of tool for churning out a business plan, perhaps to show VCs. But the visitor might indeed learn something useful. This could be serendipity. Leave it alone.


Conclusion

I wrote the first paper to briefly state something that seemed pretty straightforward and self-evident. (It still does, and I'm still puzzled by the level of response.)

However, I am far from proposing that people should spend weeks poring over logs, in a vain effort to "perfect" the traffic to their site. To the contrary, it’s just a 30/1 rule: 30% of your unwanted hits could be coming from a quite obvious 1% of the search terms. Exword those 1% – and move on to better things. If we clean our own doorsteps, the Internet will be a better place.
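
Finding that obvious handful of terms does not require weeks of work. As a rough Python sketch, assuming you have already tallied unwanted hits per search term from your logs (the function name top_offenders and its input format are hypothetical, and the 30% default only mirrors the rule of thumb above):

```python
from collections import Counter


def top_offenders(unwanted_hits, share=0.30):
    """Return the fewest terms that together cover `share` of unwanted hits.

    unwanted_hits maps each search term to its count of unwanted visits.
    """
    total = sum(unwanted_hits.values())
    picked, covered = [], 0
    # Walk terms from most to least frequent until the target share is met.
    for term, hits in Counter(unwanted_hits).most_common():
        if covered >= share * total:
            break
        picked.append(term)
        covered += hits
    return picked
```

For the asparagus farmer, a tally dominated by "britney" would return just that one term: exword it, and move on.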

I’ll leave the last word to Search Engine Watch:
"Carroll’s proposal is an interesting read. Whether anything comes of it is another matter, but it’s certainly an idea worthy of considering by standards committees and search engines alike."

###

Confession: I have a job, and promoting metadata standards isn’t it. So far I’ve managed to answer all emails about the original paper, but can’t promise to keep up. My hours are pretty well accounted for, and then some. If I inadvertently overlooked your email, my apologies. If the subject is still timely, please send again.

Responses to the first paper, both positive and negative, seem to assume that I’m an information science or retrieval professional by education. Hardly. I studied MIS – not IS – and classes seemed to be about everything but finding useful knowledge. So I learned information retrieval by doing it. When the software industry adamantly refused to provide me with the tools I needed to run businesses, both in financial and market analysis, I started writing my own information retrieval and presentation packages – like many other people, (re)inventing RDBMS variants like star schemas, and other OLAP precursors. When the web took off, I discovered log files, and started crunching them by the thousands of hours. Only then did I start reading information science.

Please send comments to Nicholas Carroll


http://www.hastingsresearch.com/net/07-anti-thesaurus-part2.shtml
© 1999-2010 Hastings Research, Inc. All rights reserved.