
 

Why Unicode Won’t Work on the Internet:
Linguistic, Political, and Technical Limitations

 

By Norman Goundry
Edited by Nicholas Carroll
Date: June 4, 2001
Modified: N/A
Translations: Russian   Spanish

Summary

Unicode, the semi-commercial equivalent of UCS-2 (ISO 10646-1), has been widely assumed to be a comprehensive solution for electronically mapping all the characters of the world’s languages, being a 16-bit character definition allowing a theoretical total of over 65,000 characters. However, the complete character sets of the world add up to over 170,000 characters. This paper summarizes the political turmoil and technical incompatibilities that are beginning to manifest themselves on the Internet as a consequence of that oversight. (For the more technically inclined: Unicode 3.1 won’t work either.)

Editor’s Note: For Chinese terms, both Wade-Giles and Pinyin romanizations are used, depending on which is better known for the particular word. The backgrounders on Oriental languages and politics are rather thorough; readers concerned with the immediate technical implications of the paper may wish to skip directly to "The Inability of Unicode to Fully Address Oriental Characters".

Backgrounder on Oriental Languages and Characters
The Impact of Western Technology on the Orient
The Inability of Unicode to Fully Address Oriental Characters
Why Unicode 3.1 Does Not Solve the Problem
The Political Significance Of This Expressed In Western Terms
Recent Actions by Verisign
Conclusion

Backgrounder on Oriental Language Characters

China (Chinese)

Chinese is one of the oldest spoken and written languages to be found in use today. Mandarin is spoken by over 1.3 billion people. Both it and the newer, simplified method of writing it used by the people of Mainland China are modifications of a language that has been in continuous use for over two millennia. Many other nations went on to adopt the writing system much as it was first used in China, among them Japan, Korea, Taiwan, and Vietnam. In the first three, Chinese still forms the backbone of all normal writing and speaking.

Wieger’s seminal book about the characters and construction of Chinese, published in 1915, was to become the de facto source against which all others would (and still should) be compared – with several caveats. Amongst these is a noticeable bias on his part against Taoism, which becomes more evident in his analysis of the Tao Tsang (i.e., the Taoist Canon of Official Writings [written ‘DaoZang’ in the PinYin Romanization of Mainland China]).

This was due both to his religious and cultural training as a Jesuit Father in China (while it was in the horrendous process of tearing itself away from its thousands of years as a totalitarian state operated by a hierarchy of emperors and imperial bureaucrats), and also because of the common Western prejudice of the day against Oriental culture and society.

Where this slight appears in the subject at hand is in Wieger’s formalizing, for the first time in popular print, the opinion that there was an enormous number of "superfluous" characters, both unnecessary and a hindrance to the fast march into the modern age with which China was coming to grips (whether it was wanted or not). The fact of the matter is that this bias, and its glaring ignorance of the real value of such a large amount of so-called "redundancy", continues to this very day, and thus continues to be a chafing-point between Orientals and misguided Westerners.

It should also be known that there were more than a few mistakes, some glaringly apparent, some not, which Wieger identified in his book as "excessive multiplication", in which his distaste becomes more clear: "1. Causes of the excessive multiplication of characters... First, the ignorance of scribes who continually brought to light faulty forms which were stupidly reproduced by posterity; then, the need felt to give names to new things. The Empire was growing, learning was spreading; writing had become a public thing; the process hsing-sheng [phonetic complexes, in which one part has a meaning, while the other points out the pronunciation] being an easy one, all took to it. From this disorderly fermentation, without direction, without control, without criticism, sprang together with useful characters, thousands of useless doubles." To give an overview of what he found so horrifyingly chaotic, the varying numbers of characters are as follows:

From around 800 B.C.E. [Before the Common (Christian) Era] until about 300-200 B.C.E., the number of characters in use remained fairly constant, at about 3300 in total. At the end of this era, the number began to grow at a rapid clip, so that one hundred years before the beginning of the first millennium there were about 7380 indexed. This number ballooned to slightly over 10,000 by the first year C.E. As the years went on, more and more characters were added to the total until the great Dictionary of K’ang Hsi (completed in 1716 C.E.) codified the set into the state that is seen today.

Wieger states in his book that this (and therefore the entire sanctioned set cited as the final authority since then) "...contains 40,000 characters that may be plainly divided as follows: 4000 characters in common use; 2000 proper names and doubles of limited use; 34,000 monstrosities of no practical use. We are far from the legendary number of 80,000 usual characters, ascribed to the Chinese language." As far as the count goes, the K’ang Hsi does indeed contain nigh onto 40,000 characters in boldface, but in its explanatory texts given along with each of these characters, and the authorized end-supplement of characters left out during the process of its first printing, there are at least several thousand more, so that it is safe to say that Wieger is incorrect and that the normal count is closer to forty-five to fifty thousand total.

The specific size and content of the Communist-authorized set in use today by the people on the Mainland is very hard to pin down – it seems to vary depending upon the circumstances. A major effort began – after the ouster of the Nationalists to Taiwan – to rationalize and modernize the education of the masses, so that China could begin a real attempt to catch up to the nations of the West. Reform of an overall minimum set of characters, sufficient for most common usage and education to high school level, was brought into effect, and Mandarin as spoken in the north was decreed as the first national tongue. Many characters considered to be too complex to write and remember and a great percentage of duplicates were removed completely, so that the list as taught in schools is slightly in excess of 6,000.

A novel and very effective set of Alphanumerics known as Pinyin Romanization was introduced – this ingenious device being very similar in effect to the Romaji of the Japanese, but with the addition of "Accents" which give the "tones" (Mandarin uses four) so necessary for understanding the meaning of the words themselves. Also, a significant portion of the characters which remain have been subjected to the process of JianHua Hanzi ("Simplified Chinese Characters") so that these also are easier to write.

But this last alteration has had a profound effect upon the several generations of students who have now been taught the new set, to the exclusion of the rest of the characters of the past. JianHua Hanzi might as well be an entirely new written language, for it has the effect of denying access to the thousands of years of literature that preceded the Communist takeover in 1949. This has necessitated the re-writing of standard works, including the core of the old Classics, so that they can be studied – retranslation being the tacit sign that such works are "approved" by the government and thus also have an official approval of the thoughts and concepts found within.

Because this "cutting off" of the works of the past has proven to be so severe and in some cases, professionally embarrassing, the demand of the Chinese government that the new 6,000+ core of JianHua Hanzi be included along with an unsanctioned amount of the older, classic Hanzi characters (right up to the full amount if it is someday deemed necessary) is not unreasonable at all, considering the circumstances. And this brings the added effect that, though the basic core of characters taught in the primary through secondary levels of state education has remained somewhat constant, the very fact that Communist China reserves the right to add or subtract or alter from the K’ang Hsi compilation means that attempts in the West to solidify and index the writing system of China will always be likewise unstable.

Taiwan (Chinese)

Taiwan (formerly Formosa) came into view in 1949, when the Nationalist government of Chiang Kai-shek retreated there from the mainland after defeat by the forces of Mao Zedong [Mao Tse Tung] and the Communists. Once an out-of-the-way producer of agricultural products, Taiwan today has one of the most vibrant economies on earth. Since most of the non-aboriginal people who inhabit the island have come from the nearby province of Fujian in Southeast China, its main attitude remains not only one of total opposition to the Communists and their rule, but also contains a large element of the much older resentment of the takeover of China by the Manchus of the North from the ruling Ming Dynasty of the South in 1644 C.E. Despite this, the official language of Taiwan is Mandarin (because it has been the upper-class language spoken by the government ever since the fall of the Ming, and the subsequent establishment of the nation’s capital at Beijing in the North).

Taiwan continues to have extremely strong and close ties with the "Overseas Chinese", who can be found all over the planet, running extended trade and commerce while still maintaining life-lines to Taiwan and Hong Kong. The view that one cannot cut the formal roots of the past with impunity is fiercely held. Taiwan continues to be a bastion for the legacy of pre-Communist China and its ancient past. As in Korea and Japan, large sections of the population are Buddhist and Confucian in their religious and philosophical outlook. Taiwan has the added condition that even though its major religion is Buddhism, Taoism runs a close second. This means that they wish to be able to access the writings of these three systems, mainly being the "Analects of Confucius" (written down shortly after Confucius’s death in 479 B.C.E.) and its allied corpus of works, the Buddhist Canon (DaZang) being derived from the original Pali Canon written down in the Fifth Century B.C.E., and the Taoist Canon (DaoZang), the writing of which began as early as 300 B.C.E., even though its philosophical roots are much older. (The Taoist Canon alone runs to 1270 volumes of 200 pages of writing and drawings each.)

It is almost a waste of time to say how much impact these three sets of works alone have had upon the past and present makeup of the countries of the Far East. To study these works alone requires the ability to read the classical characters, and that is one of the greatest reasons for the refusal of the Taiwanese to give them up.

Singapore (Chinese)

This tiny country, economically important far beyond its size, uses basically the same system of character writing. Since it uses Mandarin as its official spoken language, the same rules apply to it as to Communist China. In fact, Singapore is the only other country to have allied itself so closely in this way on an independent basis, with the PinYin and JianHua Hanzi being taught in the schools alongside a deep regard for the classical K’ang Hsi-based, full character structure of the past.

Korea (Korean)

"The continued use of Chinese Characters in [the] Japanese and Korean languages has led to a widespread misconception that there is a close relationship among these three languages. A closer look reveals that the similarity ends with the borrowing of characters when no writing system existed and the continued use of ’loan’ words in Korean and Japanese from Chinese.

"Historically, the close cultural association between China and Korea led to the inevitable borrowing of words. However Korean grammar and inflection are totally different from Chinese. In fact Bruce Grant stated in his introduction to A Guide To Korean Characters, "Chinese and English have more in common than do Chinese and Korean. Korean is most likely a member of the Ural-Altaic family of languages and is similar to Japanese; it is interesting to note that Finnish is also a sub-member of the group" [Quoted from Korean With Chinese Characters 1, by Richard B. Rucci]

Note that what is being articulated in the above reference is the use of the spoken language rather than the written ideographics (the regular Chinese "characters", which are in most cases pictographic rather than phonetic, these being called Hancha by the Koreans). The Koreans did create their own phonetic-based written language, Hangul, in 1446 C.E., and it can be considered a most brilliant construction, even to this day. Technically, it was designed from the start to be able to describe any sound the human throat and mouth are capable of producing in speech, and to do this in no more than what can be written with clarity in a 24 X 24 [dot-per-inch] space.

However, up until the most recent of times, about 60% of the total vocabulary was still made up with words borrowed from Chinese. After the release of Korea from Japanese control in 1945, and even more so following the great inflow of things Western brought by the allies during the Korean Conflict, a trend was set which continues to this day, that being the reliance more and more upon the speed and simplicity of the phonetic Hangul.

This recent span of time is only a brief blip in the total existence of Korean writing and literature. Korean writing certainly predates the Japanese use of a formalized system of writing, since the latter learned of the Chinese characters through contact with the Korean court, and with Confucian and Buddhist scholars, slightly before 100 C.E. On the other hand Korea can certainly prove to have been using the ordinary Chinese written language from at least the beginnings of the Warring States Period (403-221 B.C.E.) in China, when the country-wide carnage and destruction forced migration on those who could not (or would not) survive through sheer physical ability and cunning. Many sought refuge in more peaceful climes – the Korean Peninsula being such a haven throughout the many decades of constant fighting.

These days, it is common for newspapers and sub-headings on foreign television to be printed entirely in the phonetic language of Hangul. But in education it still remains that middle school graduates need to be proficient in about 900 Chinese characters, and those going on to completion of high school need to learn another 900, bringing the total to 1800.

Only being literate in use of Hangul is certainly not a full literacy. Korean scholars say that it requires a fluency level much greater than this amount to understand the writing of the past. (This is often thought of as being prior to the 1945 liberation by the Russians in the north and the Western forces in the south from Japanese occupation. More precisely, the past should be considered the time prior to the beginning of that occupation, in 1910, when the use of Korean writing and language were forbidden by Imperial law).

Korean scholars rightfully insist that true literacy is having the ability to read works of all subjects from these writers of the past, and such things generally contain a balance of no more than 30% Hangul to 70% Chinese characters. Colleges and universities have always known this fact, and even nowadays these institutions demand the use of the 70-30 percent split in all writing generated therein. On the other hand, Hancha in newspapers is now officially limited to around the amount that is learned in high school, so that uniformity of understanding can be achieved in the normal populace.

Another area of contention is in the use of names. Even though it is now common to see Hangul used to explain a person’s name, people still take great pride in being able to write their name in the classic way, and this means a more than simple understanding of Hancha (and its attendant use of calligraphy) is necessary to be able to not appear uneducated in such matters.

Japan (Japanese)

Japan is a special case in the use of Han characters, as the use of written language in that country has a level of complexity which even surpasses that of China. In 1946, the newly installed government issued a decree that there would henceforth be an official base of 1,850 Kanji (the Japanese pronunciation for the Chinese Hantzu characters it uses). Known as the Toyo Kanji (that is to say "daily use" Kanji), notable in this decree was the statement that from that time on, the given [i.e., personal] names of all Japanese could only be taken from it and no other source. This was also the approved, limited set of Kanji to be used by the Press. As such a severe change soon proved to be too onerous, the list was subsequently amended a few years later (1951) to allow an additional 92 characters for use in proper names. Also, 28 characters were added into the main body of the 1,850 Toyo Kanji, these being generally used and recognized abbreviations and redundant variants (with exactly 28 characters being accordingly removed from the main body so that the amount of 1,850 could remain as a constant). However, the Toyo Kanji could not hope to also cover the use of family names [i.e., surnames] and place names. These run into the tens of thousands; the different possibilities boggle the mind. Also note that the total of 1,850 characters has recently (1977) been altered again, and now numbers 1,950 characters in total, this being known as Kyoiku Kanji (or "educational" Kanji).

This is only the beginning of what must be one of the most complex and intensive systems of writing in the world. But first a brief historical tour, so that some of the reasons for this underlying entanglement can be understood.

As was the case with Korea, Japan's spoken language was not represented in the earliest form of writing. It was normal Chinese characters (Hantzu) exclusively. Evidence of it being used dates as early as 100 C.E. A bit later on, it was introduced into the country by two Korean scholars, Wang In and Ajikki, who were sent to the imperial court to act as teachers, during the third century C.E. Dictionaries were sent over in 285, so this date can be considered that of the formal introduction of writing and its structure.

Buddhism arrived in 552, and along with it the many texts and tenets of its Canon. Monks were considered to be the same as teachers, and reading and writing were a necessity for further study and enlightenment, with the veneration and respect given to written materials and learning being exceeded only by that of the Koreans who initially gave it to them.

Here the similarity ends. Japan has four different types of writing. There is the original Kanji, and there are two others which are phonetically based, these being Hiragana and Katakana. Also, there is Romaji, which consists of the Latin-based characters we are familiar with in the West. Kanji can be used to form "pictorial" glyphs alongside its use as a source of sounds, much as it is in Chinese. The syllabaries, Hiragana and Katakana, constitute fully functional writing schemes in themselves. Hiragana, which is somewhat cursive, can be used to augment Kanji – in fact, everything in Kanji can be written in Hiragana. Katakana, which is much more fluid in appearance than is Hiragana, is used to write any word which does not have its roots in Kanji, such as the many foreign words and ideas which over the centuries have drifted into general use.

Thus it can be said that Hiragana can form pictures but Katakana can only form sounds, and modern science has borne this out. People with certain brain disorders or actual physical damage can sometimes recognize and function in one and not the other, as these methods operate out of the two different hemispheres. Romaji is used to try and keep the whole written thing from getting out of control, with most Western concepts and necessary words being introduced into the language through this mechanism. After a time these words (even though they will still maintain their "Roman" form for a while longer) will become unrecognizable to the people they were originally borrowed from, such as the phrase, "Personal Computer," which is now "pasokon" or "persacom" in Japan (the latter being common in Nagasaki and adjoining areas).

Before the onslaught of English over the last few decades, it was found that 41% of the words in use in common conversation and writing were based on Chinese (in the form of characters and sounds). As one ascends higher into the realms of government and academia, this percentage increases accordingly. The numbers taught in school are as follows: 881 Kanji are taught in elementary school (46 characters in Grade One, 105 in Grade Two, 187 in Grade Three, 205 in Grade Four, 194 in Grade Five, and 144 in Grade Six). The rest of the 1,950 have to be memorized fully by the time of graduation from high school in Grade Twelve. Please remember that this total is only the legal minimum required threshold to be considered literate. And this is to be absorbed completely, along with a back-breaking load of other subjects.

To be considered a serious reader of the "Classics" of Japanese literary and religious works requires a full knowledge just as deep and as wide as that of the scholars of China. A minimum of 10,000 characters and up is mandatory, and in total can be logically extended to the end of the full Kanji (K’ang Hsi) Dictionary with its 50,000 distinct ideographs.


The Impact of Recent Western Technology On the Orient

More change has taken place in China in the last five years than in the previous fifty, and that fifty contains more change than in the last thousand. This cannot be said of Japan and Korea (only because they started earlier, and thus have achieved a state of frenetic transformation which is ongoing, rather than having just recently been abruptly awoken into a state of complete shock at finding itself running in place, full-bore).

Up until the arrival of the Internet several years ago, using a personal computer in Japan was considered to be the mark of abnormal behavior – in a country which abhorred anything outside the norm. There is an old Japanese saying: "The nail which sticks up gets pounded down". This means that the norm consists of striving to be just like everyone else in society, and not allowing oneself to somehow become an "Individual". It cannot be overstated how deeply ingrained this concept is, even today.

Personal computers were exactly as the name implies: something which was used by oneself alone, and hence segregated one from the rest of the group (consisting of many groups within groups) – and this action would eventually lead one to become an outsider and then even an alien. Cut-off ensued, and this would become a state of gradual decline and eventual exclusion even from one’s own self. Thus the Japanese will traditionally find it hard to do anything which leads to such exclusion, while the Chinese (and to a lesser extent the Koreans) do not suffer from this problem at all. In China, one is always a Chinese unconditionally, having a family and a village, no matter how far away one is, or how many generations one has been away.

But, returning to the problem facing that hypothetical individual in Japan: for a long time, having to use a computer was considered a form of punishment or torture given to those who were damned by their status in the Work Force, or the mark of a lunatic-fringe artist or scientist who had probably already been shunned by others for a long time anyway, before they even got access to their first keyboard.

The arrival of the Internet changed all of this once and for all. The term "Internet" means "that which is interconnected", and that is, of course, completely all right with Japanese society. One could see it develop along with the cellular telephone, which also facilitated "connection" of one to one’s groups in an interlocking way (being based upon a matrix among matrices intersecting the city and country and eventually the entire planet). Just as cell phone use is extremely high in Japan, so is the integration of the PC into nearly every other home. It is cheap, fast, reliable – and it is cozy.

But in the use of this technology, the Japanese suffer from the same problem as do the Chinese and Koreans, to wit: how do you shoe-horn so many characters into an input device (keyboard, tablet, what-have-you) so that you can do what others in the West do with their simple set of alphanumerics we had passed along to us from the Romans? The keyboard was designed for us in the West. So was the standard monitor and the teletype-based printer.

It is no coincidence that these devices are now primarily manufactured in the Orient (with apologies to Hewlett-Packard and their successful line of North-American built printers), and the main reason is that the quality level which most of us would put up with, such as a low-resolution, 40 character-per-line green monitor and a single-pin printer, was totally unusable to people needing generation of high-resolution characters in a vertical mode of 24x24 D.P.I. The same thing goes for the printer. Epson came out with an eight-pin printer so that it could generate Hiragana and Katakana characters in one pass – not so that we could make nicer A's and B's. They also gave the printers "Graphics" modes so that "pictures" (most generally hand-writing in the case of its Asian customers) could be printed.

That capacity for graphics is likewise one of the main reasons why the fax machine so quickly became a common fixture. It could reproduce and transmit the hand-writing of Chinese and Japanese and Korean characters.

Eventually, with much nudging along in the territories of high-resolution color and graphics, there came better input devices such as the scanner (which can be thought of as a fax machine for computers), better output devices such as the inkjet and laser printer, and even bastardized keyboards and software which could generate thousands of characters – if only one can remember each and every one of the input codes. Graphics tablets eased the pain of having to get something into and out of the computer. But none of this is yet fully satisfactory, and perhaps it will remain in this state until the intelligent, voice-understanding "computer" finally comes into our daily lives.



The Inability Of Unicode To Fully Address Oriental Characters

Regardless of all of this, the growth of the World Wide Web is upon us and all others upon this planet. The current philosophy is contained in the belief that "English is the new Lingua Franca of business" – so it just might as well also be the language of everyone who uses the Web.

Let me rephrase that somewhat: English is easily the language of the Web, but not necessarily that of the Internet. The two are not mutually inclusive, as most people assume. This is an unfortunate flaw in Western attitudes. It extends into the basics of the operating system, and has now been allowed to intrude into the structure and tools upon which the Web is built.

ISO and Unicode have attempted to rectify this flaw. As specified, Unicode's stated purpose is to allow a formalized font system to be generated from a list of placement numbers which can articulate every single written language on the planet.

Unfortunately, it cannot, without extensive gymnastics.

The current permutation of Unicode gives a theoretical maximum of approximately 65,000 characters (actually limited to 49,194 by the standard). This at first seemed like more than enough to the brave souls who set up the formal ranging of a very long consecutive string of numbers to which characters of different languages are assigned. It was a good idea, in camera – except to the nations who were not invited to the initial party.
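The arithmetic behind these figures can be checked in a few lines. The following is a minimal sketch in Python, assuming only the numbers quoted in this paper (a 16-bit code and the 49,194 limit attributed to the standard); the variable names are illustrative and not part of any standard.

```python
# Illustrative arithmetic only; the figures are the ones quoted in the paper.
CODE_WIDTH_BITS = 16                      # width of a UCS-2 code value, per the text
theoretical_max = 2 ** CODE_WIDTH_BITS    # every distinct 16-bit value
standard_limit = 49_194                   # limit attributed to the standard in the text

# Positions the 16-bit space holds beyond the quoted limit.
headroom = theoretical_max - standard_limit

print(f"theoretical maximum: {theoretical_max:,}")
print(f"quoted limit:        {standard_limit:,}")
print(f"difference:          {headroom:,}")
```

The "approximately 65,000" in the text is the rounded value of 2 to the 16th power.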

These non-invitees included the groups with the most characters to assign. In fact, these particular rejects were none other than Mainland China, Taiwan, Korea, and Japan.

The reaction was predictable, and in my view justified. Mainland China has insisted that all of its normal, official 6,000 characters be included, along with the many "simplified" variations, plus the rest of the older, classic K’ang Hsi set of 40,000+ characters. This alone is enough to take up almost all of the space allotted in the entire Unicode/UCS-2 spectrum.

Then Taiwan and the overseas Chinese (of whom there are 125 million, generally well placed and well-educated people) stated that they had the rights to their own complete set of K'ang Hsi characters – all of them in their original complex forms. This was an addition of another 50,000 characters, and they could not use the same numbering as those assigned over to the Communists on the Mainland.

Between the two groups, there was now the need to generate over 90,000 individual numbered placements. Japan complained and said that it was no less an owner of its own characters (including "Kokuji", which are characters which appear to be Chinese-derived, but are actually uniquely Japanese), and so there should be another block set up for them. And since this could theoretically include all of the characters used up until now, another 40,000+ placements would be needed. And finally, not to be left out of the circle of legitimate claimants, Korea, because of its own set of past and present circumstances, asked for its full measure too.

These are just some of the many reasons the amount needed to satisfy such requirements could very easily be taken to a total of over 170,000 characters, if every one of the nations listed above continues to push their written language rights to the maximum – and there is absolutely no reason to expect any change in their desire to do so.
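The tally behind that total can be sketched as follows. This is a rough Python illustration using only the round figures given above; the dictionary keys are labels chosen for this example, and the Korean figure is not stated in the text, so it is derived here as whatever remainder the 170,000 total implies rather than as a known quantity.

```python
# Rough tally of the claims described in the text. All numbers are the paper's
# round figures; Korea's share is not stated and is left as None.
claims = {
    "Mainland China and Taiwan combined": 90_000,  # "over 90,000"
    "Japan": 40_000,                               # "another 40,000+"
    "Korea": None,                                 # "its full measure", no figure given
}
PAPER_TOTAL = 170_000          # "over 170,000 characters"
UCS2_MAX = 2 ** 16             # positions available in a 16-bit code

stated_subtotal = sum(n for n in claims.values() if n is not None)
# Assumption: whatever is left of the paper's total after the stated claims.
implied_remainder = PAPER_TOTAL - stated_subtotal

print(f"stated claims:     {stated_subtotal:,}")
print(f"implied remainder: {implied_remainder:,}")
print(f"exceeds 16 bits:   {stated_subtotal > UCS2_MAX}")
```

On these figures, even the claims with a stated size come to roughly twice what a single 16-bit space can hold.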

Editor’s Notes:

1. As best as I can tell – questioning some of the pioneers in ARPAnet and transmission protocols – the astute ones were fully aware of the need to eventually accommodate Oriental characters, as much as 30 years ago. The trouble was, they would ask one Chinese or Japanese or Korean – and that person, looking at the character set of their own language, would assure them that Unicode would suffice. It is only when you get all the nationalities in the same room that the problem becomes manifest. And with the Internet, we’re now all "in the same room."

2. A further source of oversight comes from the tendency of many Westerners to dismiss older Oriental characters as "classic," when in fact they are still in use for precisely that reason – reading classic literature.


Why Unicode 3.1 Does Not Solve the Problem

Unicode recently announced version 3.1, which – breaking out of the two "Plane Zero" octets they had originally allowed themselves in version 3.0, with 49,194 characters – would add another two octets and another 44,946 characters to the scheme, for a grand total of 94,140.

This still falls woefully short of the 170,000+ characters needed.
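The shortfall can be worked out directly. A minimal sketch in Python, assuming the version 3.0 and 3.1 counts quoted above and taking the 170,000 requirement as a floor; the names are illustrative.

```python
# Character counts as quoted in the text for Unicode 3.0 and the 3.1 additions.
unicode_3_0 = 49_194
added_in_3_1 = 44_946
required = 170_000             # the paper's estimate, treated as a minimum

total_3_1 = unicode_3_0 + added_in_3_1
shortfall = required - total_3_1
coverage = total_3_1 / required

print(f"Unicode 3.1 total: {total_3_1:,}")
print(f"shortfall:         {shortfall:,}")
print(f"coverage:          {coverage:.0%}")
```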

Clearly, 32 bits (4 octets) would have been more than adequate if they were a contiguous block. Indeed, "18 bits wide" (262,144 variations) would be enough to address the world’s characters if it were a contiguous block.

But two separate 16-bit blocks do not solve the problem at all.
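The bit-width reasoning above can be verified in the same way. The sketch below, in Python, assumes the paper's 170,000 figure and its description of version 3.1 as two separate 16-bit blocks; `bits_needed` is a helper written for this example, not a library function.

```python
def bits_needed(count: int) -> int:
    """Smallest number of bits that can give `count` items distinct codes."""
    if count < 1:
        raise ValueError("count must be positive")
    # Codes 0 .. count-1 must fit; use at least one bit even for a single item.
    return max(1, (count - 1).bit_length())

REQUIRED = 170_000                     # the paper's estimate

min_bits = bits_needed(REQUIRED)
contiguous_18 = 2 ** 18                # the "18 bits wide" space named in the text
two_16_bit_blocks = 2 * (2 ** 16)      # two separate 16-bit blocks, per the text

print(f"minimum width:      {min_bits} bits")
print(f"18-bit space:       {contiguous_18:,}")
print(f"two 16-bit blocks:  {two_16_bit_blocks:,}")
print(f"two blocks suffice: {two_16_bit_blocks >= REQUIRED}")
```

Taken at face value, the paper's numbers need one bit more than two 16-bit blocks together provide, which is why 18 contiguous bits is the smallest width it names as sufficient.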


The Political Significance Of This Expressed In Western Terms

To express it in Western terms, how would English-speakers like it if they were suddenly restricted to an alphabet which is missing five or six of its letters because they could be considered "similar" (such as "M" and "N" sounding and looking so much like each other) and too "complex" ("Q" and "X" – why, they are nothing more than a fancier "C" and a "Z")? One could further the analogy by saying English should give up about three out of every four words that are found in the English language, on the grounds that they are redundant, too arcane, or merely superfluous, and that modern speech neither needs nor uses them. This would be the end of both the Bible and Shakespeare.

One must further consider remaining animosities stemming from centuries of warfare. In this sense the Orient is little different from Europe; the furor that has arisen over the EC [European Community] changing to a common currency (the Euro) would be nothing compared to the uproar that would ensue if the French were compelled to use the German alphabet, or the English compelled to use a French alphabet. Nor would the issue be purely emotional. Such changes would be more than an annoyance; indeed, they would be a threat to one’s very language and way of thought.

The analogy can easily be taken further, if one considers the political tensions in recent years as various nations were denied (and sometimes later granted) membership in the EC. In a similar vein, to have your language left out of the Internet is definitely a case of being "denied membership."


Recent Actions by Verisign

Verisign recently opened a Pandora’s Box when the company stated that it was taking orders for URLs in the language particular to those countries which either desire or demand to work in a written set other than Latin1.

The company backed away somewhat at the howls of fear and anger from those who know this cannot possibly work without causing great distress to those who have to manage and work the World Wide Web.

Also, some of the countries reject this as impertinence on the part of Verisign, considering it an insult to their efforts at maintaining the sovereignty of the state. China is one major country that has come out and said so, rejecting such attempts as meddling in its internal affairs. Perhaps they are right.

The same truism can be applied not just to URLs but to the Internet itself. There are no proper tools coming from the West to allow Webs that work internationally, or browsers that really are transparent and seamless in everyday usage for this segment of the future. Ask anyone who has to use one and wants to do anything other than what can be generated with pseudo-ASCII (such as French, or German, or Albanian), or who needs characters which are vertically aligned and which must occupy a minimum of 32 x 32 dots each. To continue to believe that interfacing the World Wide Web can be done with ASCII-dependent browsers and – even more importantly – ASCII-dependent servers is naive.
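
What "ASCII-dependent" means in practice can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper `can_encode` and the three sample strings are hypothetical, and the two codecs stand in for software that assumes plain ASCII or Latin-1 text.

```python
def can_encode(text: str, codec: str) -> bool:
    """Return True if every character of `text` is representable in `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Illustrative samples (assumptions, not taken from the paper):
samples = {
    "English": "plain text",
    "French": "fran\u00e7ais",       # contains U+00E7, c with cedilla
    "Chinese": "\u4e2d\u6587",       # two Han characters
}

for label, text in samples.items():
    results = {codec: can_encode(text, codec) for codec in ("ascii", "latin-1")}
    print(label, results)
```

Under these assumptions the French sample fails in ASCII but survives in Latin-1, while the Chinese sample cannot be represented in either.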



Conclusion

UCS-2 (with 2-octet blocks per character) indeed seems to be the most straightforward system for character use (and the one which follows Unicode's original intentions most faithfully) – excepting that, as previously stated, it has far too short an overall address length to encompass all known characters of all known languages.

Shifting attention to other Unicode-certified methodologies for doing the same things, there are UTF-8, UTF-16, and UTF-32. To quote Unicode’s paper: "Different encoding forms of Unicode are useful in different system environments. For example, UTF-32 is somewhat simpler in usage than UTF-16, but in almost all cases occupies twice the storage. A common strategy is to have internal string storage use UTF-16 or UTF-8, but to use UTF-32 for individual character datatypes."
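
As a rough illustration of the storage trade-off described in that quotation, the following Python sketch encodes three sample characters in each form and counts the bytes. The sample characters and the helper name `encoded_sizes` are assumptions chosen for illustration; the codecs are the ones shipped with Python.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes `text` occupies in each encoding form.

    The big-endian codec variants are used so that no byte-order mark
    is prepended and counted.
    """
    return {
        "UTF-8": len(text.encode("utf-8")),
        "UTF-16": len(text.encode("utf-16-be")),
        "UTF-32": len(text.encode("utf-32-be")),
    }


# Illustrative samples (assumptions, not taken from the paper):
samples = {
    "Latin letter U+0041": "\u0041",
    "Han character U+4E2D": "\u4e2d",
    "Han character U+20000 (outside Plane Zero)": "\U00020000",
}

for label, ch in samples.items():
    print(label, encoded_sizes(ch))
```

In this sketch the Plane Zero characters take two octets in UTF-16 and four in UTF-32, which matches the "twice the storage" remark; the character outside Plane Zero takes four octets in all three forms.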

This is fine; in fact most computer applications operate in such a fashion already, and did so before Unicode. The problem is that – even in the simple explanation of what is overtly a simple problem – no fewer than three separate codifying formulas are brought to bear to answer it. One can easily formulate new standards using 4-octet blocks (ad infinitum) – but piggybacking them on top of Unicode 3.1 simply exacerbates the complexity of font mapping, just as Unicode 3.1 has increased the complexity of UCS-2.

So this, in a nutshell, is the politically explosive future we now face.

The Chinese have an ancient expression: "Nothing is more powerful than an idea whose time has come."

The time has come. The question is now: what will this idea grow up to be?

###


Norman Goundry is a computer programmer, translator, and reference writer specializing in rare Taoist religious texts and medical works. He can usually be found buried deep in the restricted-entry catacombs of the Asian Studies Department of U. of British Columbia, working with the rare Taoist Canon texts found therein. He expresses this personal experience with the limits of Unicode: "I have recently had to design a single proprietary font consisting of over 50,000 individual Han Complex Characters as per those given in the Kang Hsi Dictionary of 1710 for my own hand-programmed translation interface, because of the constant frustration over not having a particular character available for use when it is needed. I looked carefully at Unicode and then rejected it, because it does not to my knowledge contain even one single full representative font indexing of the characters needed for spanning any of the above mentioned groups."


Translations

Russian translation by Donna Barrier at science.eduboard.com.

Spanish language translation by Maria Ramos at www.webhostinghub.com.


References

Chinese Characters, by Dr. L. Wieger, S.J.

Korean With Chinese Characters 1, by Richard B. Rucci

The Modern Reader's Japanese-English Character Dictionary,
by Andrew Nathaniel Nelson, Ph.D.
Charles E. Tuttle Company, Tokyo (1962)

Emperor Kang-Hsi's Character Dictionary,
(full revision of the original of 1716 – in Chinese only)
Yih Mei Book Company, Hong Kong

The Basic English-Chinese / Chinese-English Dictionary
by Peter M. Bergman
Signet-New American Library Press, New York (1980)

The World Chinese-English / English-Chinese Dictionary
New Arts Company, Hong Kong


http://www.hastingsresearch.com/net/04-unicode-limitations.shtml
Copyright © 2001 Norman Goundry. All rights reserved.