Unicode

Outline

Introduction
What Unicode Includes
Unicode Encodings
Sources of Information
Tools

ASCII is by far the most commonly used character encoding because it suffices for normal English text and English has long been the dominant (natural) language used on computers. As other languages came into use on computers, other sets of characters, with different encodings, came into existence. Indeed, there is usually more than one encoding for a particular writing system. All in all, there are hundreds of different character encodings.

This proliferation of character encodings causes a lot of problems. If you receive a document from someone else, your software may not be able to display it, print it, or edit it. You may not even be able to tell what language or writing system it is in. And if you need to use multiple writing systems in the same document, matters become much worse. Life would be much simpler if there were a single, universal encoding that covered all of the characters in all of the writing systems in use.

Unicode is a character encoding standard developed by the Unicode Consortium to fulfill this need. It attempts to include in a single encoding, using a single sequence of numbers, all of the characters in all of the writing systems that anyone is likely to want to use. Some aspects of Unicode have come in for criticism, and there are some alternative proposals, but at least for now it is by far the most widely adopted universal encoding.

What Unicode Includes

The current version of the Unicode standard contains almost all of the writing systems currently in use, plus a few extinct systems, such as Linear B. More writing systems will be added in the future. The following chart lists the character ranges that have thus far been defined. The name of the range links to the appropriate section of the Unicode standard on the Unicode web site. These are PDF files. The beginning of the range is a link to an HTML chart.

0000	007F	Basic Latin
0080	00FF	C1 Controls and Latin-1 Supplement
0100	017F	Latin Extended-A
0180	024F	Latin Extended-B
0250	02AF	IPA Extensions
02B0	02FF	Spacing Modifier Letters
0300	036F	Combining Diacritical Marks
0370	03FF	Greek/Coptic
0400	04FF	Cyrillic
0500	052F	Cyrillic Supplement
0530	058F	Armenian
0590	05FF	Hebrew
0600	06FF	Arabic
0700	074F	Syriac
0750	077F	Undefined
0780	07BF	Thaana
07C0	08FF	Undefined
0900	097F	Devanagari
0980	09FF	Bengali/Assamese
0A00	0A7F	Gurmukhi
0A80	0AFF	Gujarati
0B00	0B7F	Oriya
0B80	0BFF	Tamil
0C00	0C7F	Telugu
0C80	0CFF	Kannada
0D00	0D7F	Malayalam
0D80	0DFF	Sinhala
0E00	0E7F	Thai
0E80	0EFF	Lao
0F00	0FFF	Tibetan
1000	109F	Myanmar
10A0	10FF	Georgian
1100	11FF	Hangul Jamo
1200	137F	Ethiopic
1380	139F	Undefined
13A0	13FF	Cherokee
1400	167F	Unified Canadian Aboriginal Syllabics
1680	169F	Ogham
16A0	16FF	Runic
1700	171F	Tagalog
1720	173F	Hanunoo
1740	175F	Buhid
1760	177F	Tagbanwa
1780	17FF	Khmer
1800	18AF	Mongolian
18B0	18FF	Undefined
1900	194F	Limbu
1950	197F	Tai Le
1980	19DF	Undefined
19E0	19FF	Khmer Symbols
1A00	1CFF	Undefined
1D00	1D7F	Phonetic Extensions
1D80	1DFF	Undefined
1E00	1EFF	Latin Extended Additional
1F00	1FFF	Greek Extended
2000	206F	General Punctuation
2070	209F	Superscripts and Subscripts
20A0	20CF	Currency Symbols
20D0	20FF	Combining Diacritical Marks for Symbols
2100	214F	Letterlike Symbols
2150	218F	Number Forms
2190	21FF	Arrows
2200	22FF	Mathematical Operators
2300	23FF	Miscellaneous Technical
2400	243F	Control Pictures
2440	245F	Optical Character Recognition
2460	24FF	Enclosed Alphanumerics
2500	257F	Box Drawing
2580	259F	Block Elements
25A0	25FF	Geometric Shapes
2600	26FF	Miscellaneous Symbols
2700	27BF	Dingbats
27C0	27EF	Miscellaneous Mathematical Symbols-A
27F0	27FF	Supplemental Arrows-A
2800	28FF	Braille Patterns
2900	297F	Supplemental Arrows-B
2980	29FF	Miscellaneous Mathematical Symbols-B
2A00	2AFF	Supplemental Mathematical Operators
2B00	2BFF	Miscellaneous Symbols and Arrows
2C00	2E7F	Undefined
2E80	2EFF	CJK Radicals Supplement
2F00	2FDF	Kangxi Radicals
2FE0	2FEF	Undefined
2FF0	2FFF	Ideographic Description Characters
3000	303F	CJK Symbols and Punctuation
3040	309F	Hiragana
30A0	30FF	Katakana
3100	312F	Bopomofo
3130	318F	Hangul Compatibility Jamo
3190	319F	Kanbun (Kunten)
31A0	31BF	Bopomofo Extended
31C0	31EF	Undefined
31F0	31FF	Katakana Phonetic Extensions
3200	32FF	Enclosed CJK Letters and Months
3300	33FF	CJK Compatibility
3400	4DBF	CJK Unified Ideographs Extension A
4DC0	4DFF	Yijing Hexagram Symbols
4E00	9FAF	CJK Unified Ideographs
9FB0	9FFF	Undefined
A000	A48F	Yi Syllables
A490	A4CF	Yi Radicals
A4D0	ABFF	Undefined
AC00	D7AF	Hangul Syllables
D7B0	D7FF	Undefined
D800	DBFF	High Surrogate Area
DC00	DFFF	Low Surrogate Area
E000	F8FF	Private Use Area
F900	FAFF	CJK Compatibility Ideographs
FB00	FB4F	Alphabetic Presentation Forms
FB50	FDFF	Arabic Presentation Forms-A
FE00	FE0F	Variation Selectors
FE10	FE1F	Undefined
FE20	FE2F	Combining Half Marks
FE30	FE4F	CJK Compatibility Forms
FE50	FE6F	Small Form Variants
FE70	FEFF	Arabic Presentation Forms-B
FF00	FFEF	Halfwidth and Fullwidth Forms
FFF0	FFFF	Specials
10000	1007F	Linear B Syllabary
10080	100FF	Linear B Ideograms
10100	1013F	Aegean Numbers
10140	102FF	Undefined
10300	1032F	Old Italic
10330	1034F	Gothic
10380	1039F	Ugaritic
10400	1044F	Deseret
10450	1047F	Shavian
10480	104AF	Osmanya
104B0	107FF	Undefined
10800	1083F	Cypriot Syllabary
10840	1CFFF	Undefined
1D000	1D0FF	Byzantine Musical Symbols
1D100	1D1FF	Musical Symbols
1D200	1D2FF	Undefined
1D300	1D35F	Tai Xuan Jing Symbols
1D360	1D3FF	Undefined
1D400	1D7FF	Mathematical Alphanumeric Symbols
1D800	1FFFF	Undefined
20000	2A6DF	CJK Unified Ideographs Extension B
2A6E0	2F7FF	Undefined
2F800	2FA1F	CJK Compatibility Ideographs Supplement
2FA20	DFFFF	Unused
E0000	E007F	Tags
E0080	E00FF	Unused
E0100	E01EF	Variation Selectors Supplement
E01F0	EFFFF	Unused
F0000	FFFFD	Supplementary Private Use Area-A
FFFFE	FFFFF	Unused
100000	10FFFD	Supplementary Private Use Area-B

Each block of 65,536 codepoints is referred to as a plane. Planes are numbered beginning at 0. Plane 0, codepoints 0x0000 through 0xFFFF, is known as the Basic Multilingual Plane or BMP because it contains the great majority of characters in current use for the world's languages.
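Since each plane holds 0x10000 (65,536) codepoints, the plane to which a codepoint belongs is just the codepoint divided by 0x10000. Here is a minimal sketch in Python; the function name plane_of is ours and is used only for illustration.

```python
def plane_of(codepoint):
    """Return the plane number of a codepoint (plane 0 is the BMP)."""
    if not 0 <= codepoint <= 0x10FFFF:
        raise ValueError("outside the Unicode codespace")
    return codepoint >> 16  # same as codepoint // 0x10000


print(plane_of(0x0BA4))   # Tamil letter ta, in the BMP
print(plane_of(0x10024))  # Linear B qe, outside the BMP
```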

Most of the Unicode ranges represent a single writing system. However, this is not always the case. In some cases Unicode lumps together several writing systems. For example, what it calls the Canadian Aboriginal Syllabics is not a single writing system. It is actually the union of the Cree writing system, the Inuktitut writing system, several variants used for languages such as Slave, Dogrib, and Dene Souline (Chipewyan), and the historically related but quite different Carrier writing system. Bengali and Assamese are combined since they differ only in the use of an additional character in Assamese and in the shapes of one letter. The Chinese characters used for Chinese, Japanese, Korean and Vietnamese are combined into a single set referred to as "CJK characters".

Languages written in a combination of writing systems, such as Japanese, which is typically written in a mixture of Chinese characters, hiragana, and katakana, will make use of multiple ranges. However, a language need not make use of multiple writing systems for it to draw characters from multiple Unicode ranges. A text in a language written in a non-Roman writing system will almost always contain characters from at least two ranges. This is because whitespace characters such as space and line feed, the Arabic numerals, and European punctuation are widely used. These characters, which are included in the Basic Latin range, are not repeated in the other ranges. For example, here is a bit of what we would think of as pure Tamil text: இல்லையே, இது வரைக்கும் பேசவேயில்லையே. However, it actually contains several characters from the Basic Latin range because the spaces and punctuation are Basic Latin. Here is a listing of the character ranges in this text:

Here is a listing of the individual characters:

Offset	UTF-32	Range and Name
0	0x00B87	TAMIL LETTER I
1	0x00BB2	TAMIL LETTER LA
2	0x00BCD	TAMIL SIGN VIRAMA
3	0x00BB2	TAMIL LETTER LA
4	0x00BC8	TAMIL VOWEL SIGN AI
5	0x00BAF	TAMIL LETTER YA
6	0x00BC7	TAMIL VOWEL SIGN EE
7	0x0002C	BASIC LATIN COMMA
8	0x00020	BASIC LATIN SPACE
9	0x00B87	TAMIL LETTER I
10	0x00BA4	TAMIL LETTER TA
11	0x00BC1	TAMIL VOWEL SIGN U
12	0x00020	BASIC LATIN SPACE
13	0x00BB5	TAMIL LETTER VA
14	0x00BB0	TAMIL LETTER RA
15	0x00BC8	TAMIL VOWEL SIGN AI
16	0x00B95	TAMIL LETTER KA
17	0x00BCD	TAMIL SIGN VIRAMA
18	0x00B95	TAMIL LETTER KA
19	0x00BC1	TAMIL VOWEL SIGN U
20	0x00BAE	TAMIL LETTER MA
21	0x00BCD	TAMIL SIGN VIRAMA
22	0x00020	BASIC LATIN SPACE
23	0x00BAA	TAMIL LETTER PA
24	0x00BC7	TAMIL VOWEL SIGN EE
25	0x00B9A	TAMIL LETTER CA
26	0x00BB5	TAMIL LETTER VA
27	0x00BC7	TAMIL VOWEL SIGN EE
28	0x00BAF	TAMIL LETTER YA
29	0x00BBF	TAMIL VOWEL SIGN I
30	0x00BB2	TAMIL LETTER LA
31	0x00BCD	TAMIL SIGN VIRAMA
32	0x00BB2	TAMIL LETTER LA
33	0x00BC8	TAMIL VOWEL SIGN AI
34	0x00BAF	TAMIL LETTER YA
35	0x00BC7	TAMIL VOWEL SIGN EE
36	0x0002E	BASIC LATIN FULL STOP
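A listing much like the one above can be produced with Python's standard unicodedata module. This is only a sketch, and the function name describe is ours. Note that Python reports the character names as they stand in the Unicode database, so the Basic Latin characters come out as COMMA, SPACE and FULL STOP, without the range prefix shown in the table.

```python
import unicodedata


def describe(text):
    """Return a list of (offset, code, name) tuples, one per character."""
    rows = []
    for offset, ch in enumerate(text):
        name = unicodedata.name(ch, "<unnamed>")  # fallback for unnamed codes
        rows.append((offset, f"0x{ord(ch):05X}", name))
    return rows


sample = "இல்லையே, இது வரைக்கும் பேசவேயில்லையே."
for offset, code, name in describe(sample):
    print(offset, code, name, sep="\t")
```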

Many languages written in extended versions of the Roman alphabet will also draw characters from several ranges. The Basic Latin range includes the basic twenty-six letters with no diacritics. If a language uses accents or other diacritics, or if it includes additional characters, it will draw characters from the Latin-1 Supplement or one of the two Latin Extended ranges. For example, Polish makes use of most of the ordinary Roman letters as well as characters such as Ł, which belongs to the Latin Extended-A range.

The Private Use Areas allow for the inclusion of non-standard characters in Unicode text. Any group of people can agree to use a certain encoding for a certain set of characters and exchange documents using them without fear of conflict between standard Unicode characters and their private character set. If a document contains characters in one of these ranges, one will not be able to display them or manipulate them intelligently unless one knows what they are. However, software processing such a document can simply be told to ignore characters in Private Use Areas.

One current use of the Private Use Areas is for writing systems that may well eventually be included in the Unicode standard but have not yet been included. For example, yudit supports the Hungarian runes and the Klingon alphabet, both encoded in the Private Use Area. Both of these writing systems may eventually be included in the standard. Another use for the Private Use Areas is for writing systems so obscure that they may never be included in the standard.

Unicode Encodings

UTF-32

Unicode was originally intended to use two bytes, that is, 16 bits, to represent each character. That would be sufficient for 65,536 characters. Although this may seem like a lot, it isn't really quite enough, so full Unicode makes use of 32 bits, that is, four eight-bit bytes. That's enough for 4,294,967,296 characters. In fact, although a 32-bit representation is used, the current standard actually calls for the use of only 21 bits; the high 11 bits are always 0. Twenty-one bits provide 2,097,152 slots, of which the standard uses only the 1,114,112 from 0x000000 through 0x10FFFF, which should still be plenty. Text encoded in this version of Unicode is said to be in UTF-32.

UTF-16

When it was first realized that more than 65,536 characters might be needed, an attempt was made to expand the character space while keeping what was basically a two-byte encoding. The result was UTF-16. UTF-16 adds a complication: surrogate pairs. The ranges 0xD800-0xDBFF, the High Surrogate Area, and 0xDC00-0xDFFF, the Low Surrogate Area, do not directly represent characters. Instead, pairs of values, one a high surrogate, the other a low surrogate, together encode a character. The low ten bits of the high surrogate are concatenated with the low ten bits of the low surrogate, yielding a 20-bit number, which is added to 0x10000 to give the code of the character. Such surrogate pairs can encode 1,048,576 additional characters. UTF-16 can therefore encode a total of 65,536 - 2,048 + 1,048,576 or 1,112,064 characters. The characters in the BMP are represented by two bytes; characters outside the BMP are represented by four bytes.
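The surrogate arithmetic is easy to get wrong by hand, so here is a small sketch of it in Python. The 20-bit number carried by the pair is an offset from 0x10000, the first code beyond the BMP. The function names to_surrogates and from_surrogates are ours, not part of any library.

```python
def to_surrogates(codepoint):
    """Split a codepoint above 0xFFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only codepoints outside the BMP use surrogates")
    offset = codepoint - 0x10000       # a 20-bit number
    high = 0xD800 | (offset >> 10)     # top ten bits
    low = 0xDC00 | (offset & 0x3FF)    # bottom ten bits
    return high, low


def from_surrogates(high, low):
    """Recombine a surrogate pair into the codepoint it encodes."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + (((high & 0x3FF) << 10) | (low & 0x3FF))


high, low = to_surrogates(0x10024)     # Linear B qe
print(hex(high), hex(low))
print(hex(from_surrogates(high, low)))
```

The result can be checked against Python's own utf-16-be codec, which should produce the same two 16-bit values for the same character.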

UTF-8

One problem with UTF-32 is that every character requires four bytes, that is, four times as much space as ASCII and other single-byte encodings require. In order to save space, a compressed form known as UTF-8 is usually used to store and exchange text. UTF-8 uses from one to four bytes to represent a character. It is cleverly arranged so that ASCII characters take up only one byte. Since the first 128 Unicode characters are the ASCII characters, in the same order, a UTF-8 file containing nothing but ASCII characters is identical to an ASCII file. Other characters take up more space, depending on how large the UTF-32 code is. Here are the encodings of some of the characters shown above. The 0x indicates that these are hexadecimal (base 16) values.

UTF-32	UTF-8	Name
0x00041	0x41	Latin capital letter a
0x00570	0xD5 0xB0	Armenian small letter ho (հ)
0x00BA4	0xE0 0xAE 0xA4	Tamil letter ta (த)
0x04E09	0xE4 0xB8 0x89	Chinese digit 3 (三)
0x10024	0xF0 0x90 0x80 0xA4	Linear B qe (𐀤)

The first byte of the UTF-8 encoding of a character contains the information about how many additional bytes are used to encode it. If the high bit of the first byte is 0, the character is an ASCII character and no additional bytes are used to encode it. If the high bit is 1, at least one additional byte is part of the encoding. The number of adjacent bits set starting with the high bit is the total number of bytes used to encode the character. For example, if the top three bits are 110, the character is encoded using two bytes. The first byte therefore consists of from zero to six 1s followed by a 0. The remaining bits can be either 1 or 0 and contribute to the encoding of the character.

The following chart shows how characters in different ranges are encoded in UTF-8. The letter n represents a bit that contributes directly to the encoding; it can be either 0 or 1.

UTF-32 Code	UTF-8 Encoding
Range	Byte 1	Byte 2	Byte 3	Byte 4	Byte 5	Byte 6	Bits	Slots
00000000 - 0000007F	0nnnnnnn						7	128
00000080 - 000007FF	110nnnnn	10nnnnnn					11	2,048
00000800 - 0000FFFF	1110nnnn	10nnnnnn	10nnnnnn				16	65,536
00010000 - 001FFFFF	11110nnn	10nnnnnn	10nnnnnn	10nnnnnn			21	2,097,152
00200000 - 03FFFFFF	111110nn	10nnnnnn	10nnnnnn	10nnnnnn	10nnnnnn		26	67,108,864
04000000 - 7FFFFFFF	1111110n	10nnnnnn	10nnnnnn	10nnnnnn	10nnnnnn	10nnnnnn	31	2,147,483,648
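The chart can be turned into code almost row by row. The following Python sketch follows the six-row scheme exactly as charted; the function name utf8_encode and the layout of the rows list are ours. Python's built-in encoder accepts only codes up to 0x10FFFF and so never produces the five- and six-byte forms.

```python
def utf8_encode(code):
    """Encode one code value as bytes using the six-row scheme in the chart."""
    if code < 0 or code > 0x7FFFFFFF:
        raise ValueError("code out of range for the six-byte scheme")
    if code < 0x80:
        return bytes([code])                # 0nnnnnnn
    # (upper limit of range, first-byte prefix, number of continuation bytes)
    rows = [(0x7FF, 0xC0, 1), (0xFFFF, 0xE0, 2), (0x1FFFFF, 0xF0, 3),
            (0x3FFFFFF, 0xF8, 4), (0x7FFFFFFF, 0xFC, 5)]
    for limit, prefix, extra in rows:
        if code <= limit:
            break
    tail = []
    for _ in range(extra):
        tail.append(0x80 | (code & 0x3F))   # 10nnnnnn, lowest six bits first
        code >>= 6
    return bytes([prefix | code] + tail[::-1])


for code in (0x41, 0x570, 0xBA4, 0x4E09, 0x10024):
    print(hex(code), " ".join(f"0x{b:02X}" for b in utf8_encode(code)))
```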

How this works is most easily seen by examining specific examples. Here are the bit patterns of the same characters as above. In the UTF-8 encoding a hyphen separates the bits that directly contribute to the encoding from the preceding bits.

Name	UTF-32	UTF-8
Latin capital letter a	00000000000000000000000001000001	0-1000001
Armenian small letter ho	00000000000000000000010101110000	110-10101 10-110000
Tamil letter ta	00000000000000000000101110100100	1110-0000 10-101110 10-100100
Chinese digit 3	00000000000000000100111000001001	1110-0100 10-111000 10-001001
Linear B qe	00000000000000010000000000100100	11110-000 10-010000 10-000000 10-100100

For example, take Chinese digit 3, encoded as 1110-0100 10-111000 10-001001. Stripping off the bits at the beginning that are not directly part of the encoding, we obtain: 0100 111000 001001. Concatenating these and padding out to 32 places by adding 0s on the left, we obtain: 00000000000000000100111000001001, which is the UTF-32 encoding.
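The same stripping and concatenating can be written as a short Python function. This is a sketch of the procedure just described, not a production decoder; the name utf8_decode_one is ours.

```python
def utf8_decode_one(data):
    """Decode the character at the start of data; return (code, byte count)."""
    first = data[0]
    ones = 0                                 # count the leading 1 bits
    while ones < 8 and first & (0x80 >> ones):
        ones += 1
    if ones == 0:
        return first, 1                      # plain ASCII
    if ones == 1 or ones > 6:
        raise ValueError("not a valid first byte")
    code = first & (0x7F >> ones)            # bits after the 1...10 prefix
    tail = data[1:ones]
    if len(tail) != ones - 1:
        raise ValueError("truncated sequence")
    for byte in tail:
        if (byte & 0xC0) != 0x80:
            raise ValueError("expected a continuation byte")
        code = (code << 6) | (byte & 0x3F)   # strip the 10 marker
    return code, ones


code, length = utf8_decode_one(bytes([0xE4, 0xB8, 0x89]))  # Chinese digit 3
print(f"{code:032b}", hex(code), length)
```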

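Although in principle we could encode 4,294,967,296 different characters in 32 bits, the six-byte form of UTF-8 shown above provides only 2,216,757,376 slots, the sum of the Slots column. (Strictly speaking, not all of these are distinct characters: each longer form could also represent codes that already have a shorter form, so the highest code that can be reached in six bytes is 0x7FFFFFFF.) This is unlikely to be a problem in practice. Indeed, since Unicode codes stop at 0x10FFFF, the current definition of UTF-8 allows no more than four bytes. But if we really did need more than 2,216,757,376 slots, we could use a seventh byte, with the first byte set to 11111110. This would give us 36 useful bits, for an additional 68,719,476,736 slots, for a total of 70,936,234,112. This is considerably more than can be represented in UTF-32.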

Notice that it is not only the first byte in the UTF-8 encoding of a character whose upper bits play a special role. The top two bits of all non-initial bytes must be 10. This seems to be a waste, since it means that each additional byte only contributes six bits rather than eight. The reason for doing this is that it allows us to locate ourselves in a stream of UTF-8 encoded characters:

If the high bit is 0, the byte stands on its own and encodes an ASCII character.
If the high bits are 11, the byte is the first byte of a character encoded as two or more bytes. The number of adjacent 1s at the high end, minus one, is the number of following bytes that contribute to this character.
If the high bits are 10, the byte is a continuation byte. The beginning of the character can be found by scanning at most five bytes back to the first byte whose high bits are 11. The beginning of the next character can be found by scanning forward until the next byte whose high bit is 0 or whose high bits are 11 is encountered.
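These three rules are all that is needed to find character boundaries from an arbitrary position in a byte stream. Here is a sketch in Python; the function names char_start and next_char_start are ours, and the limit of five bytes follows the six-byte scheme described above.

```python
def is_continuation(byte):
    """True if the byte has the form 10nnnnnn."""
    return (byte & 0xC0) == 0x80


def char_start(data, pos):
    """Index of the first byte of the character that contains data[pos]."""
    start = pos
    # Step back over continuation bytes, at most five of them.
    while start > 0 and pos - start < 5 and is_continuation(data[start]):
        start -= 1
    if is_continuation(data[start]):
        raise ValueError("no first byte found within five bytes")
    return start


def next_char_start(data, pos):
    """Index of the first byte of the next character after data[pos]."""
    nxt = pos + 1
    while nxt < len(data) and is_continuation(data[nxt]):
        nxt += 1
    return nxt


data = "a\u0570\u0ba4".encode("utf-8")   # 61 | D5 B0 | E0 AE A4
print(char_start(data, 5), next_char_start(data, 1))
```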

Let's consider what it would be like if we used an encoding scheme like UTF-8, in that we use the first byte of a sequence to tell us how many more bytes contribute to that character, but in which we don't mark continuation bytes by setting their high bits to 10. Since we don't need to distinguish between the leading sequences 10 and 11, we can also modify our rule for encoding the number of bytes in a character. The number of adjacent 1s at the high end of the first byte of a character will now give us the total number of additional bytes needed to complete the character. So if a byte has high bit 0, as before, that byte is a complete character in itself. If the high bits are 10, this will now mean that a total of two bytes are used. If the high bits are 110, this will now mean that a total of three bytes are used, and so forth.

Suppose that someone transmits the Japanese word くるま /kuruma/ "wheel". In UTF-32 this is encoded as 0x304F 0x308B 0x307E. In UTF-8, this becomes:

0xE3	0x81	0x8F	0xE3	0x82	0x8B	0xE3	0x81	0xBE
11100011	10000001	10001111	11100011	10000010	10001011	11100011	10000001	10111110

In our UTF-8-like system with no continuation marker, it will be encoded like this:

0xC0	0x30	0x4F	0xC0	0x30	0x8B	0xC0	0x30	0x7E
11000000	00110000	01001111	11000000	00110000	10001011	11000000	00110000	01111110

Now, suppose that in the course of transmission the third byte is lost. A program reading the UTF-8 input will detect an error because the first byte tells it the second and third bytes must have 10 as the two highest bits. As soon as it reads the third byte (that is, the original fourth byte), whose high two bits are 11, it knows that a byte is missing. Furthermore, it knows which character is damaged and can go on to read the next character. It knows that the byte it has just read (0xE3) begins a new character since its high bits are 11. The result will therefore be ?るま, where ? stands for the unknown, damaged character. In fact, no matter which byte is lost, the damaged character will be detected and the program will be able to resynchronize and locate the next character.

Suppose that the same error, loss of the third byte, happens when we are using our pseudo-UTF-8 system. We will have started off by reading the first byte, 0xC0, which will have told us that we need two more bytes to complete the character. Since the original third byte has been lost, these will be the original second and fourth, 0x30 and 0xC0. The fact that the high bits of the fourth byte are 11 does not signal an error in this system since a continuation byte is permitted to have this pattern. The first character will therefore be taken to be 0011000011000000 = 0x30C0, which is ダ (katakana /da/). The next byte is 0x30. Since its high bit is 0, no additional bytes are needed, and it will be taken to be the digit 0. The next byte is 0x8B. The leading 10 tells us that this is the first of a two-byte sequence. We strip the two high bits of the first byte and concatenate the two, yielding: 00101111000000 = 0x0BC0. This is the Tamil diacritic for the vowel /i:/. The last two bytes each have a leading 0 and so stand alone. They are the digit 0 and ~. No error is detected by the program, which instead of the intended くるま produces ダ0ீ0~. A human being reading this will of course recognize it as garbled. However, he or she will have no idea what may have been intended, whereas someone who understands Japanese may well be able to guess the missing character in ?るま. Furthermore, if the error can be detected by a computer program, it may be possible to correct it immediately, whereas a human being may not look at the text until much later.
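Both halves of this thought experiment can be tried out in Python. The sketch below drops the third byte from each encoding of くるま, taking ま to be U+307E as in the UTF-8 bytes above. For real UTF-8 it relies on Python's built-in decoder, which with errors="replace" substitutes U+FFFD for the damaged character. The decoder for the marker-less scheme, pseudo_decode, is our own invention for the sake of the illustration.

```python
def pseudo_decode(data):
    """Decode the marker-less scheme: the number of leading 1s in the
    first byte is the number of additional bytes, each taken whole."""
    out = []
    i = 0
    while i < len(data):
        first = data[i]
        extra = 0
        while extra < 8 and first & (0x80 >> extra):
            extra += 1
        if extra > 6:
            raise ValueError("first byte has no room for data bits")
        value = first & (0x7F >> extra)      # bits after the 1...10 prefix
        for byte in data[i + 1:i + 1 + extra]:
            value = (value << 8) | byte      # all eight bits contribute
        out.append(chr(value))
        i += 1 + extra
    return "".join(out)


word = "\u304f\u308b\u307e"                  # kuruma
utf8 = word.encode("utf-8")
damaged_utf8 = utf8[:2] + utf8[3:]           # lose the third byte
print(damaged_utf8.decode("utf-8", errors="replace"))

pseudo = bytes([0xC0, 0x30, 0x4F, 0xC0, 0x30, 0x8B, 0xC0, 0x30, 0x7E])
damaged_pseudo = pseudo[:2] + pseudo[3:]     # lose the third byte
print(pseudo_decode(damaged_pseudo))
```

The first print shows the replacement character followed by the two intact kana; the second shows the silently garbled text.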

One source of resistance to using UTF-8 in some countries is that it seems to privilege English and other languages that can be written using only the ASCII characters. English only takes one byte per character in UTF-8, while most of the languages of India, for instance, require three bytes per character. By the standards of today's computer processors, storage devices and transmission systems, text files are so small that it really doesn't matter, so I don't think that this is a practical concern. It's more a matter of pride and politics.

If we don't need the extinct writing systems and other fancy stuff outside of the Basic Multilingual Plane, we could all be equal and use UTF-16. English and some other languages would take twice as much space to represent, but other languages would take the same space that they do in UTF-8 or even take up less space. At least from the point of view of those of us who aren't English imperialists, this might not be a bad idea, if not for the fact that UTF-8 has another big advantage over UTF-16: UTF-8 is independent of endianness.

Whenever a number is represented by more than one byte, the question arises as to the order in which the bytes are arranged. If the most significant bits come first, that is, are stored at the lowest memory address or at the first location in the file, the representation is said to be big-endian. If the least significant bits come first, the representation is said to be little-endian.

There is a third arrangement that is historically important because it was used on the Digital Equipment PDP-11 series. In PDP-endian order, the most-significant byte is the second byte, the next most significant byte is the first byte, the third most significant byte is the fourth byte, and the least significant byte is the third byte. In other words, it is "big-endian" in the sense that the first two bytes are more significant than the second two bytes, but "little-endian" in the internal ordering of the two halves.

Consider the following sequence of four bytes. The first row shows the bit pattern. The second row shows the interpretation of each byte separately as an unsigned integer.

bit pattern	00001101	00000110	10000000	00000011
decimal value	13	6	128	3

Here is how this four byte sequence is interpreted as an unsigned integer under the three ordering conventions.

Little-Endian	(3 * 256 * 256 * 256) + (128 * 256 * 256) + (6 * 256) + 13	58,721,805
Big-Endian	(13 * 256 * 256 * 256) + (6 * 256 * 256) + (128 * 256) + 3	218,529,795
PDP-Endian	(6 * 256 * 256 * 256) + (13 * 256 * 256) + (3 * 256) + 128	101,516,160

These orderings are often described in terms of the sequence of bytes, from least significant to most significant, like this:

Little-Endian 1 2 3 4

Big-Endian 4 3 2 1

PDP-Endian 3 4 1 2
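The three interpretations can be computed in Python, following the definitions given above: big-endian takes the first byte as the most significant, little-endian takes it as the least significant, and PDP-endian puts the more significant half first but stores each half low byte first. int.from_bytes is a standard Python method; the variable names are ours.

```python
raw = bytes([13, 6, 128, 3])

little = int.from_bytes(raw, "little")   # first byte least significant
big = int.from_bytes(raw, "big")         # first byte most significant

# PDP order: two 16-bit halves, the more significant half first,
# each half stored with its low byte first.
high_half = int.from_bytes(raw[0:2], "little")
low_half = int.from_bytes(raw[2:4], "little")
pdp = (high_half << 16) | low_half

for label, value in (("Little-Endian", little),
                     ("Big-Endian", big),
                     ("PDP-Endian", pdp)):
    print(f"{label}\t{value:,}")
```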

Most computers these days are little-endian since the Intel and AMD processors that most PCs use are little-endian. Digital Equipment machines from the VAX through the current Alpha series are also little-endian. On the other hand, most RISC-based processors, such as the SUN SPARC and the PowerPC, as well as the IBM 370 and Motorola 68000 series, are big-endian. A program that determines the byte order of the machine on which it is run can be downloaded here.
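Checking the byte order of a machine takes only a few lines of Python. This is a sketch of one way to do it, not the program referred to above: it packs the integer 1 in the machine's native order and looks at which end the 1 lands on. The function name native_byte_order is ours; struct and sys are standard modules.

```python
import struct
import sys


def native_byte_order():
    """Return 'little' or 'big' for the machine running this code."""
    packed = struct.pack("=I", 1)   # "=" means native byte order
    return "little" if packed[0] == 1 else "big"


print(native_byte_order())
print(sys.byteorder)                # Python's own report, for comparison
```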

UTF-16 and UTF-32 are subject to endianness variation. If I write something in UTF-16 on a little-endian machine and you try to read it on a big-endian machine, it won't work. For example, suppose that I encode the Armenian character հ ho on a little-endian machine. The first byte will have the bit pattern 01110000, conventionally interpreted as 112. The second byte will have the bit pattern 00000101, conventionally interpreted as 5. That's because the UTF-32 code, 0x570 = 1392, is equal to (5 * 256) + 112. Remember, on a little-endian machine, the first byte is the least significant one. On a big-endian machine, this sequence of two bytes will be interpreted as (112 * 256) + 5 = 28,677 = 0x7005, since the first byte, 112, is the most significant on a big-endian machine. Well, 0x7005 isn't the same character as 0x570. It falls in the CJK Unified Ideographs range, so it is a Chinese character. So, if you use UTF-16 you have to worry about byte order. UTF-8, on the other hand, is invariant under changes in endianness. That is a big enough advantage that most people will probably continue to prefer UTF-8.
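Python's codecs let us act out this mix-up without needing two machines. The sketch below writes ho in little-endian UTF-16 and then reads the same two bytes as big-endian UTF-16; the codec names utf-16-le and utf-16-be are standard, and the variable names are ours.

```python
ho = "\u0570"                            # Armenian small letter ho

le_bytes = ho.encode("utf-16-le")        # as written on a little-endian machine
print(list(le_bytes))                    # the two byte values, in file order

misread = le_bytes.decode("utf-16-be")   # as read with the opposite byte order
print(hex(ord(misread)))

# UTF-8 is a sequence of single bytes, so there is no byte order to get wrong.
print(list(ho.encode("utf-8")))
```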

Sources of Information

The fullest information is found in the Unicode standard. This is available on the Unicode Consortium web site [http://www.unicode.org], in print form, and on CD. Two files that can be obtained from the web site or the CD are often useful. The file UnicodeData.txt contains details of most characters. It is a plain text file in which, for the most part, each line contains information about one character. Each such line contains a series of fields separated by semi-colons. The first field is the code, in hexadecimal; the second field is the name of the character. The other fields contain additional information of various sorts.
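Because the format is so regular, picking the code and name out of a line of UnicodeData.txt is straightforward. Here is a sketch in Python. The function name parse_unicode_data_line is ours, and the sample line is our transcription of the entry for capital A, included so that the example runs without the file.

```python
def parse_unicode_data_line(line):
    """Split one line into (code, name, list of the remaining fields)."""
    fields = line.rstrip("\n").split(";")
    code = int(fields[0], 16)      # first field: the code, in hexadecimal
    name = fields[1]               # second field: the character name
    return code, name, fields[2:]


# Sample line (an assumption: transcribed by hand, not read from the file).
sample = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
code, name, rest = parse_unicode_data_line(sample)
print(hex(code), name)
```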

UnicodeData.txt is intended primarily to be read by machines. Another file, NamesList.txt, contains a subset of the information in UnicodeData.txt, omitting details primarily of use to computer programs, reformatted to be more readable by human beings. This is the best place to look for a character by name.

Both of these files omit character-by-character descriptions for the Chinese characters. This information is kept in a separate file, Unihan.txt [zip compressed version], since it is voluminous (25 MB uncompressed, 5 MB compressed) and not needed by many users. This file does not give simple descriptions of the Chinese characters comparable to those for other characters; for the most part, the information given consists of cross references to various reference works.

Tools

A useful tool for dealing with Unicode is yudit, a Unicode text editor. If supplied with the appropriate fonts (sources for which are listed on the yudit website), yudit can display UTF-8 text. You can edit the displayed text, and you can enter text in several ways. By using a keymap you can type in a romanization and have the text appear in whatever writing system you choose. Numerous keymaps are supplied with yudit, but it is not difficult to write your own if necessary. If you know the code for the character you want to enter, you can enter it by its numerical code. yudit also recognizes Chinese characters drawn with the mouse. If you move the mouse over a character and left-click, yudit will display the corresponding character code.

Here is a screenshot of yudit displaying a sampling of writing systems.

Sometimes it is useful to find out about the content of a document for which you do not have the necessary fonts, which is in a writing system that you do not understand, where you want to look at characters that are not directly visible, or where you want information about exactly how the text is encoded. Two programs useful for such purposes are unidesc and uniname, both of which can be downloaded here. unidesc reports the character ranges to which different portions of the text belong. It can also identify Unicode encodings flagged by magic numbers. uniname prints the byte offset of each character, its hex code value, its UTF-8 encoding, and its name.

A convenient tool for converting from one Unicode encoding to another is uniconv, which comes with the yudit editor. uniconv can convert from one Unicode encoding to another, or between Unicode and various other encodings. In addition to a number of built-in encodings, uniconv can use keymaps created for use with yudit. For example, if you have a keymap that allows you to enter text into yudit in romanization and have it appear in a non-Roman writing system, uniconv will use the same keymap to convert text from that romanization to Unicode or another encoding. The GNU program iconv can also convert between Unicode encodings and between Unicode and numerous other character encodings.
