
 

Technical and Human Considerations
In Creating Precision Hyperlinks

 

Nicholas Carroll
Date: December 22, 2000
Modified: January 1, 2001

These are some questions – and a few answers – about ways to structure OHS hyperlinks. This is very loosely based on previous writing by Doug Engelbart and John Rothermel on the NLS/Augment systems, and my own studies in market systems.

A central feature of the Open Hyperdocument System and its DKRs (Dynamic Knowledge Repositories) is granular addressability – the ability to point to a particular spot in a document or other object. The view is another feature that might be specified within a link. And "human- and machine-readable" is a needed feature if links are to be transmitted accurately. In that light, I wandered over most of the standard ASCII character set (staying within the decimal 32–127 range), wondering what a "precision hyperlink" might look like. (Note: URL extensions such as I'm describing are sometimes called "fragment identifiers.")
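
For orientation, here is a minimal sketch – in Python, purely as an illustration, with a hypothetical URL – of how a URL parser already separates the fragment identifier from the rest of the address:

    # Split a hypothetical URL into its parts; the fragment is the
    # "granular address" discussed above.
    from urllib.parse import urlparse

    parts = urlparse("http://www.example.com/papers/precision.html#section-4")

    print(parts.scheme)    # http                     (transmission protocol)
    print(parts.netloc)    # www.example.com          (domain name)
    print(parts.path)      # /papers/precision.html   (directory and file)
    print(parts.fragment)  # section-4                (the spot within the document)

Anything an OHS link adds – view specs, finer-grained nodes – has to survive this same kind of mechanical dissection.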

Precision hyperlink goals:

Technical
1. Machine-readable. The computers won’t choke on them.
2. Provides granular addressability within documents.
3. Defines presentation format of the information (embed in current window, or spawn new window; define size and shape; etc., etc.)
4. URLs that have a reasonable hope of travelling uncorrupted through all manner of copy and paste operations, servers, and applications, including email clients and WYSIWYG editors.

Human
5. Human-readable. URLs can be accurately passed from human to human without a computer transmission, by writing or speaking.
6. Memorable. URLs can be recognized later, even if not bookmarked.
7. Descriptive. URLs clue you in to the document's content.

  • The Special Characters
  • Letters and Numbers
  • Hypothetical URL Formats
  • Notes

The Special Characters

Which characters should be used in precision hyperlinks? Looking at the special characters available on a U.S. keyboard: ! " # $ % & ' ( ) * + , - . / @ : ; < = > ? [ \ ] ^ _ ` { | } ~

These Already Have Definite Meanings

# (pound) – HTML anchor. Any reason it shouldn't serve as an OHS node delimiter?

% (percent sign) – character code delimiter

@ ("at") – email

. (period) – URL delimiter

Poor Candidates

, (comma) – a commonly used field delimiter

~ (tilde) – hard to read on bad monitors. The presence of a tilde also implies a low-quality file that could go 404 some day.

/ (forward slash) – universally recognized as the server flag for a subdirectory.

\ (backslash) – likewise a subdirectory separator.

- (hyphen) – hard to recognize; users variously describe it as "dash," "minus," and "what's that mark between the letters?"

_ (underscore) – would certainly need a new name. Furthermore, underscores disappear when URLs are underlined as links on HTML pages, leaving the eye unsure whether it is seeing an underscore or an ill-advised space (%20).

| (pipe) – a commonly used Unix field delimiter. Using it in a URL would wreak havoc with server logs.

? (question mark) – commonly used to call .cgi scripts; different servers react to the character in different ways, depending on its placement in a URL.

! (exclamation point) – easily mistaken for 1, i, or l.

$ (dollar sign) – may create difficulties for Perl scripts. I would appreciate feedback on this.

< and > (less-than and greater-than) – browsers tend to treat these as delimiters for HTML tags, and will often hide whatever falls between them. The problem is replicated when HTML-aware email clients hide them – sometimes displaying the contents as a URL itself.

& (ampersand) – commonly used in dynamically generated URLs, and generally server- and client-compatible. However, difficulties arise when URLs containing ampersands are coded into HTML pages with WYSIWYG editors, particularly MS FrontPage, as these programs are prone to entifying "&" into "&amp;" or rendering it numerically as "&#038;". And browsers themselves may try to entify any sequence of characters beginning with an ampersand.
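
To make the ampersand problem concrete, here is a small sketch (Python, used only for illustration; the URL is hypothetical) of a URL being entified and then repaired:

    # An editor may turn "&" into "&amp;" (or "&#038;"); the damaged URL
    # must be unescaped before it will resolve correctly.
    import html

    original = "http://www.example.com/page.cgi?doc=42&view=full"

    entified = original.replace("&", "&amp;")    # what a WYSIWYG editor may emit
    numeric  = original.replace("&", "&#038;")   # the numeric rendering of the same entity

    assert html.unescape(entified) == original
    assert html.unescape(numeric)  == original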

Better Candidates?

= and + have possibilities. They are certainly understandable, and both are routinely used in dynamically generated URLs, apparently with no server- or client-side problems.

= and + are routinely used to separate .cgi GET form information. Does this lead to problems?
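
For reference, this is how = and + already show up when a standard library builds a GET query string (Python shown only as an illustration; the parameter names are hypothetical):

    # urlencode() joins name=value pairs with "&" and encodes spaces as "+";
    # parse_qs() reverses the process.
    from urllib.parse import urlencode, parse_qs

    query = urlencode({"q": "precision hyperlinks", "view": "v42"})
    print(query)            # q=precision+hyperlinks&view=v42
    print(parse_qs(query))  # {'q': ['precision hyperlinks'], 'view': ['v42']}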

Search Engines Create Additional Limitations

For example, the Lycos search engine documentation states:

"Please do not submit webpages with these symbols in the URL: ampersand (&), percent sign (%), equals sign (=), dollar sign ($) or question mark (?). Our spider does not recognize them."

In short, URLs containing those symbols will not be indexed. While I believe Lycos is more discriminating than some other search engines, the "?" makes most search engine spiders burp. Getting the full story on all major search engines could be a substantial task.
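
As a trivial illustration, a link-checking script could at least flag URLs containing the symbols Lycos lists (a Python sketch; the function name and test URLs are made up):

    # Flag URLs containing the symbols the Lycos documentation warns about.
    LYCOS_UNSAFE = set("&%=$?")

    def spider_safe(url):
        return not any(ch in LYCOS_UNSAFE for ch in url)

    print(spider_safe("http://www.example.com/book/001-intro.html"))        # True
    print(spider_safe("http://www.example.com/search.cgi?q=ohs&view=v42"))  # False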

Letters and Numbers

Letters

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z

While using special characters primarily raises technical questions, the issues in using letters and numbers are almost entirely human.

The harshest test for spoken communication of URLs is reading them over the telephone. Americans have trouble enunciating "m" and "n", Canadians have trouble with "f" and "s". Spanish speakers cannot distinguish between "b" and "v" (in fact, in Spanish there is no distinction). It gets worse across borders, much worse. From the tongue of non-native speakers, letters quickly become gibberish. That's why military radio operators use "tango" and "x-ray".

It may be a moot point, though. As of Nov. 8, 2000, with Verisign’s announced plan to accept Chinese, Japanese, and Korean characters in domain names, the notion of "human readable" may have collapsed into "human readable within a culture."

Still, this may be adequate. As the W3C document "Character Model for the World Wide Web" (http://www.w3.org/TR/charmod/) points out, it seems lunacy to force Chinese internet users to generate Latin characters from their Chinese keyboards, when 99.99% of a Chinese-language document's accesses will be by fluent Chinese speakers.

Note: There is a huge thrash in the offing as everyone comes to realize that a) the world’s combined character sets (over 170,000 characters) won’t even fit in Unicode, and b) the vast majority of Oriental characters can’t possibly be rendered in the 10-pixel height of a typical browser’s URL box. See "Why Unicode Won’t Work on the Internet" on this web site.

So far, writing letters has been simple, since URLs have been restricted to the English ASCII characters. The most likely confusion was between "i" and "j". I cannot predict what the addition of Oriental (and later, Arabic) characters will do. However, the addition of Spanish and Portuguese characters will quickly cause confusion. (E.g., a Spanish company might have www.pena.com, or www.peña.com, depending on when it registered the DN.)

Numbers

0 1 2 3 4 5 6 7 8 9

Arabic numerals, understood around the globe, are far more universal than alpha characters.

It has long been noted in marketing studies that in speaking, numbers are transmitted more accurately than letters. There is massive research on this from phone companies in all countries. Most North American and Western European phone companies also seem to have come to the same conclusion about string length: if you're going to string together more than six digits, break them up with separators. (Being about communication practices, not raw memory, this observation has no relation to the psychology finding that "the average human can remember seven digits.")
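
A sketch of that separator practice (Python for illustration only; the group size of three is arbitrary, not a recommendation):

    # Break a long digit string into short groups so it can be spoken
    # and copied accurately.
    def group_digits(digits, size=3, sep="-"):
        return sep.join(digits[i:i + size] for i in range(0, len(digits), size))

    print(group_digits("4155551212"))   # 415-555-121-2 (naive left-to-right grouping)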

Numbers are also more accurately transmitted in writing (confusion between differing renditions of "1" and "7" being the notable exception).

Combinations

The best format is probably mixed alpha-numeric.

For example:
/0573-OHSdev.html
is a relatively compact way to say "the 573rd document, chronologically, in the OHS development discussion." (With the journaling allowing for 10,000 documents in that sequence, from /0000-OHSdev.html to /9999-OHSdev.html.)
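
Generating that sequence is trivial – a sketch (Python, for illustration; the function name is made up):

    # Zero-padded four-digit sequence numbers keep 10,000 documents
    # sorting correctly in any directory listing.
    def ohsdev_filename(n):
        if not 0 <= n <= 9999:
            raise ValueError("sequence number must fit in four digits")
        return "/%04d-OHSdev.html" % n

    print(ohsdev_filename(573))   # /0573-OHSdev.html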

For alpha descriptors, there is plenty of research to draw upon, from stock market ticker symbols such as "XRX" (Xerox Corp.) to military descriptors such as MRE (meal, ready to eat) or BTS (boots, too small). Either of these systems – typically stripping out vowels for compactness – has a lot to teach about retaining meaning when condensing words or phrases.
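
As a rough illustration of that vowel-stripping habit (Python; the rules here – keep the first letter of each word, drop later vowels, cap the length – are merely an example, not a proposed standard):

    def condense(phrase, max_len=6):
        out = []
        for word in phrase.upper().split():
            out.append(word[0] + "".join(c for c in word[1:] if c not in "AEIOU"))
        return "".join(out)[:max_len]

    print(condense("precision hyperlinks"))   # PRCSNH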

(Warehouse inventory systems and library indexing have little to teach about URL creation; while either system demonstrates that humans can memorize thousands of numbers, both systems have substantial visual and locus cues to supplement the number. A warehouse worker not only remembers that number 273xxxxx is 15-inch Bosch windshield wipers, he also remembers that prefix "273" is Bosch parts – which come in yellow packages. The librarian confronted with a Dewey decimal number not only remembers the book title – but also the book's location by aisle and shelf.)

Hypothetical URL Formats

URLs should be short enough to be human-readable, browser-clickable, and dummy-copyable. Workable URLs have typically faced three length constraints (a quick check against them is sketched after the list):

  • 40 characters – screen width of some older text browsers
  • 65 characters – some email clients fail to include "excess" characters in the URL, or line-wrap at 65
  • 80 characters – screen width of some second-generation text browsers
    (I am ignoring WAP, which probably won't be around long in its current form.)
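
The promised check against those limits, as a minimal sketch (Python for illustration; the test URL is one of the book filenames discussed later in this paper):

    # Report how a URL fares against the three length limits listed above.
    LIMITS = {40: "older text-browser screens",
              65: "email client line-wrapping",
              80: "80-column screens"}

    def length_report(url):
        for limit in sorted(LIMITS):
            status = "fits" if len(url) <= limit else "exceeds"
            print("%d chars %s the %d-character limit (%s)"
                  % (len(url), status, limit, LIMITS[limit]))

    length_report("http://www.long-domain-name.com/book/001-descriptive-chapter-name.html")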

All three limitations will doubtless be ignored in the wake of the InterNIC's recent 67-character domain names (formerly a 26-character limit). E.g.:

http://www.67-character-web-addresses-are-available-from-networksolutions.com

Including the "www" prefix, this totals 71 characters.

Consider how frequently people write down 7-digit telephone numbers incorrectly. Further consider human sloth, plus dyslexia, illiteracy, and other human difficulties. This leaves two choices in URL length:

1. The devil with users. A good example of bad linking might be:
www.67-character-utterly-incomprehensible-drivel-web-address.com/about-us ... /board-of-directors/CEO/favorite-ski-resorts.html=v42-a=#Jackson-Hole

2. Keep it down to a sane limit. That way, bozos can hang themselves with absurdly long domain and file names, and the more intelligent web architects can make their URLs as terse as possible. Moving onward with this second choice...

URL length limits are a prime determinant of effective URL structure, which immediately raises the question of directory structuring.

To subdivide or not? Recently confronted with the task of putting a 100-chapter book online, on an existing site, I experimented with several schemes, trying to meet the tech and human requirements specified at the top of this paper. First I tried to follow the book's structure, resulting in something like this (76 characters):

http://www.long-domain-name.com/book/section-1/descriptive-chapter-name.html

However, the indexing scheme proved illusory. Neither the users nor I (the book's author) could see inherent patterns in the organization. Our ability to perceive patterns was no better than the design of the index page – and while a competent book editor can do much to structure a table of contents, there are not that many competent book editors to go around the planet Earth. After a few more tries, I ended up putting the book online thus:

http://www.long-domain-name.com/book/001-descriptive-chapter-name.html
http://www.long-domain-name.com/book/002-second-chapter-name.html
http://www.long-domain-name.com/book/003-third-chapter-name.html

In this case, the system administrator (me) is satisfied with the tidy indexing. The content editor (me) is satisfied with the descriptive file names. The users are only marginally satisfied, as 100 filenames is a long row to hoe. (At least the chapters can be ordered sequentially, regardless of operating system, should a user download them.) So I'll have to write some .cgi to give the users a means of extracting and re-ordering the table of contents according to their whims.
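
That extraction is not hard; here is a sketch in Python (rather than the .cgi I will actually write) of pulling a table-of-contents entry out of each filename:

    # Pull the sequence number and a readable title out of filenames
    # like "001-descriptive-chapter-name.html".
    import re

    FILENAME = re.compile(r"^(\d{3})-(.+)\.html$")

    def toc_entry(filename):
        match = FILENAME.match(filename)
        if not match:
            return None
        number = int(match.group(1))
        title = match.group(2).replace("-", " ").capitalize()
        return number, title

    print(toc_entry("001-descriptive-chapter-name.html"))  # (1, 'Descriptive chapter name')
    print(toc_entry("002-second-chapter-name.html"))       # (2, 'Second chapter name')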

While this was a fairly minor experiment, taking a couple of months of mulling and perhaps ten hours of work, to me it strongly suggests that the future of DKRs lies in well-organized but simple hierarchies at the bare metal – which middleware will then organize in entirely different ways for end users. That is, the brains will be in both the databases and the middleware.

In specifying views and nodes, most of the decisions have already been made, I suspect. The transmission protocol comes first (e.g., http). Then the DN, TLD, and directory structure. The node comes last (e.g., #thisnode). Which leaves the view spec to fit in between the URL proper and the node, as with these (non-working) examples:

[DN.TLD][subdirectory(s)][filename][=viewspec=][#nodename]

www.hastingsresearch.com/net/01-precision-hyperlinks.shtml=v42-a=#4

www.bootstrap.org/ohs/hlinks249.html=v42-a=#4c

It makes sense to have the node specified at the tail end of the URL, since relatively unsophisticated users will be able to change the node without mucking up the URL itself. Presumably by the time a user knows how to specify a view, they are sophisticated enough to modify the innards of a URL.
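
To show that the proposed format parses cleanly, here is a sketch (Python for illustration; the =viewspec= and #node syntax is only the proposal above, not existing browser or server behavior):

    # Split the hypothetical [base][=viewspec=][#nodename] tail apart.
    import re

    PRECISION = re.compile(r"^(?P<base>.*?)(?:=(?P<view>[^=#]+)=)?(?:#(?P<node>.+))?$")

    def parse_precision_link(url):
        m = PRECISION.match(url)
        return m.group("base"), m.group("view"), m.group("node")

    print(parse_precision_link("www.bootstrap.org/ohs/hlinks249.html=v42-a=#4c"))
    # ('www.bootstrap.org/ohs/hlinks249.html', 'v42-a', '4c')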

Notes

I originally thought the view specification might be done with as little as one letter and one number, giving 26 x 10 choices. Augment, however, uses several characters to describe a view – and if the OHS is to be extensible, the spec should allow for at least thousands of views.

###

Related Works

Designing Good URLs to Improve the Users Experience, Joseph Gannon
A simple, well-thought-out paper about keeping it simple. (I suggest ignoring the typographical errors.)
http://www.ganemanrussell.com/newsletter/09012001.html

A List Apart: URLS! URLS! URLS!, Bill Humphries
Further thoughts on good URLs, including an introduction to Apache mod_rewrite.
http://www.alistapart.com/stories/urls/

Original References

Open Hyperdocument System Technology Exploration, John Rothermel
http://www.bootstrap.org/augment-133265.htm#3J6

Augment Differentiators, John Rothermel
http://www.bootstrap.org/augment-133247.htm#18G

An Evaluation of the World Wide Web with respect to Engelbart's Requirements, Daniel W. Connolly
http://www.w3.org/Architecture/NOTE-ioh-arch#human-addr

Please send comments to Nicholas Carroll


http://www.hastingsresearch.com/net/01-precision-hyperlinks.shtml
© 1999 & 2008 Hastings Research, Inc. All rights reserved.