If you encounter a display problem in which a single Unicode character appears as two empty square boxes, this article explains how to fix it. Specifically, the problem occurs when your Unicode character is greater than U+FFFF (65535 decimal) and your Unicode font can handle only characters with values below 64K (U+FFFF or less).
Unicode characters can range in scalar value from 0 to over a million (U+10FFFF). Characters of 64K and above (greater than U+FFFF, or 65535) are called supplementary Unicode characters. The entire range of Unicode characters is divided into 17 blocks of 64K values each. Each block is referred to as a plane and is numbered starting from 0 as follows:
Plane 0: contains U+0000 thru U+FFFF. This is commonly known as the Basic Multilingual Plane (BMP).
Plane 1: contains U+10000 thru U+1FFFF.
Plane 2: contains U+20000 thru U+2FFFF. The majority of characters in this plane are Unicode Extension B (Hán/Korean/Japanese/Nôm).
Plane 3: contains U+30000 thru U+3FFFF.
...............
Plane 16: contains U+100000 thru U+10FFFF.
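The plane layout above can be sketched in a few lines of Python. This is only an illustration, not part of WinVNKey; the function names plane_of and is_supplementary are made up for this example.

```python
def plane_of(code_point: int) -> int:
    """Return the Unicode plane (0 to 16) that contains the given code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    # Each plane holds 0x10000 (65,536) values, so the plane number is
    # whatever remains after dropping the low 16 bits.
    return code_point >> 16


def is_supplementary(code_point: int) -> bool:
    """True for code points above U+FFFF, i.e. outside the BMP (Plane 0)."""
    return plane_of(code_point) > 0


for cp in (0x0041, 0xFFFF, 0x10000, 0x2AB45, 0x10FFFF):
    print(f"U+{cp:04X} is in plane {plane_of(cp)}")
```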
Note that all characters in Plane 0 can be represented as a single 16-bit value. Characters in other planes are greater than U+FFFF and can be represented either as a single 32-bit value (UCS-4) or as a pair of 16-bit values (UTF-16). In the latter representation the pair is commonly called a surrogate pair, which consists of a high-order 16-bit surrogate followed by a low-order 16-bit surrogate.
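To make the UTF-16 representation concrete, here is a minimal Python sketch of how a supplementary character is split into its two surrogates. The function name to_surrogate_pair is chosen for this example only. The arithmetic follows the standard UTF-16 rule: subtract 0x10000, put the upper 10 bits in the high surrogate (starting at 0xD800) and the lower 10 bits in the low surrogate (starting at 0xDC00).

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogate pairs")
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # lower 10 bits
    return high, low


# U+20000 is the first character of Extension B in Plane 2.
high, low = to_surrogate_pair(0x20000)
print(f"U+20000 -> {high:04X} {low:04X}")
```

Python's built-in codec gives the same two values if you encode chr(0x20000) as "utf-16-be" and read the result as two 16-bit integers.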
Unfortunately, Windows by default knows and processes only 16-bit Unicode characters. When Windows encounters a surrogate pair, it treats the pair as two distinct 16-bit characters and displays them as such, which always results in two empty square boxes. The reason is that each 16-bit surrogate on its own is not a valid character in the Unicode Standard and is rendered as an empty square box by Unicode fonts; only the two surrogates combined yield a single Unicode character.
To make a long story short:
Surrogate characters are representations of Unicode characters greater than 64K. In practical terms, when a Unicode character is 5 or 6 significant hex digits long such as U+1FFAB or U+2AB45, it is represented by a surrogate pair.
If your text contains surrogates, they simply appear as two empty square boxes unless you tell Windows to use surrogate fonts by appropriate settings in Windows registry.
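The difference between seeing "two characters" and seeing one can be illustrated with a short Python sketch. It takes a sequence of 16-bit UTF-16 values and joins each valid high/low pair into a single code point, which is conceptually what surrogate-aware software does. The function name combine_surrogates is hypothetical, and leaving unpaired surrogates unchanged is a choice made for this example.

```python
from typing import Iterable, List


def combine_surrogates(units: Iterable[int]) -> List[int]:
    """Convert 16-bit UTF-16 values to code points, joining surrogate pairs."""
    code_points: List[int] = []
    pending_high = None
    for unit in units:
        if pending_high is not None:
            if 0xDC00 <= unit <= 0xDFFF:
                # A high surrogate followed by a low surrogate: one character.
                code_points.append(
                    0x10000 + ((pending_high - 0xD800) << 10) + (unit - 0xDC00)
                )
                pending_high = None
                continue
            # Unpaired high surrogate: keep it unchanged (example's choice).
            code_points.append(pending_high)
            pending_high = None
        if 0xD800 <= unit <= 0xDBFF:
            pending_high = unit
        else:
            code_points.append(unit)
    if pending_high is not None:
        code_points.append(pending_high)
    return code_points


units = [0x0041, 0xD840, 0xDC00, 0x0042]   # "A", U+20000, "B" in UTF-16
print(len(units), "16-bit values,", len(combine_surrogates(units)), "characters")
```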
According to Microsoft
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_192r.asp
users have to set up Windows registry as follows:
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\LanguagePack]
SURROGATE=(REG_DWORD)0x00000002
[HKEY_CURRENT_USER\Software\Microsoft\Internet Explorer\International\Scripts\42]
IEFixedFontName=(Surrogate Font Face Name)
IEPropFontName=(Surrogate Font Face Name)
Additional information is available at
http://www.i18nguy.com/surrogates.html: detailed explanation of these registry entries
http://www.daouyen.com/NomDoc/CJKVB.htm: instructions in Vietnamese
Based on the information obtained from the websites above, WinVNKey provides a user-interface to help users change the registry easily:
This setting tells Windows to load Uniscribe, which is an engine to process surrogate characters
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\LanguagePack]
SURROGATE=(REG_DWORD)0x00000002
This setting specifies the names of fixed and proportional fonts for Internet Explorer to use to display surrogate characters
[HKEY_CURRENT_USER\Software\Microsoft\Internet Explorer\International\Scripts\42]
IEFixedFontName=[Surrogate Font Face Name]
IEPropFontName=[Surrogate Font Face Name]
This setting is for Windows XP systems only and is optional. Basically, this setting specifies the fallback fonts for characters in supplementary planes. You can specify as many planes as you like.
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\LanguagePack\SurrogateFallback]
Plane1=(Name for fallback font for characters in Plane 1)
Plane2=(Name for fallback font for characters in Plane 2)
... etc. ...
The three registry settings above can be changed by using WinVNKey as follows:
Click on Run button ==> Preferences ==> Surrogate Fonts
Note there are no OK/Cancel buttons on the "Surrogate Fonts" page. Any change you make to the page will become effective immediately.
Setting 1:
Enable the checkbox and change surrogate registry to 2 as shown below. This will force Windows to process surrogate characters.
This change is enough to display surrogate characters in applications such as Microsoft Word. Of course, when you type surrogate characters, you have to select a Unicode font that supports those characters in order to see them.
Setting 2:
If you want Internet Explorer to use certain fonts when it encounters surrogate characters, you can specify the font names as follows.
Setting 3:
Finally, you can specify fallback fonts for Windows XP or later.
Suppose you are writing a document in Microsoft Word using the "Arial Unicode MS" font. This font does not have any surrogate characters, so if you type Han/Nom surrogate characters in your document, you will see empty boxes in place of the characters. However, if you tell Windows in advance which font is the fallback for Plane 2 characters, Windows will use that font to display the Han/Nom surrogate characters. In the example below the author uses "Han Nom 3.1B", but you should use a font that is installed on your system.
After setting up surrogate registry with supplementary fonts, you can test if the system works. You can browse several websites that use supplementary characters and check if you can see them. These sites are listed in
http://www.i18nguy.com/surrogates.html
http://www.daouyen.com/NomDoc/CJKVB.htm
Specifically,
http://www.daouyen.com/NomDoc/CuTranLacDao.htm
http://www.i18nguy.com/unicode/plane1-utf-16.html
http://www.i18nguy.com/unicode-plane1-utf8.html
http://www.i18nguy.com/unicode-example-plane1.html
http://www.i18nguy.com/unicode/unicode-example-intro.html
http://homepage.mac.com/thgewecke/BeyondBMP.html
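If none of the sites above is reachable, a small local test page serves the same purpose. The Python sketch below is one possible way to generate such a page; the function name build_test_page and the choice of sample characters (U+10000 in Plane 1 and U+20000 in Plane 2) are assumptions made for this example. Each character is written as a numeric character reference, so the page source itself stays plain ASCII.

```python
from typing import Iterable


def build_test_page(code_points: Iterable[int]) -> str:
    """Return a small HTML page showing one line per supplementary character."""
    rows = "\n".join(
        f"<p>U+{cp:04X} (plane {cp >> 16}): &#x{cp:X};</p>" for cp in code_points
    )
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">'
        "<title>Supplementary character test</title></head>\n"
        "<body>\n" + rows + "\n</body></html>\n"
    )


# Sample characters only: one from Plane 1 and one from Plane 2.
page = build_test_page([0x10000, 0x20000])
print(page)
```

Save the printed text as an .html file and open it in your browser. If the registry and fonts are set up correctly, each line should show one character rather than two empty boxes, provided an installed font covers that character.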
If you have no fonts for supplementary characters, you can download some from the Internet for free. At present the author of WinVNKey knows of one such font, CODE2001.ZIP (207,345 bytes), which covers a number of characters in Plane 1 (U+10000 through U+1FFFF):
http://home.att.net/~jameskass/code2001.htm
This font does not contain the Han/Nom surrogate characters in Plane 2 (Extension B).
Plane 2 contains mostly Han/Nom surrogate characters in Extension B. Commercial fonts for Extension B characters are available. The largest font is perhaps SURSONG.TTF (41MB), which contains about 65K Han/Nom characters in both the BMP and Plane 2. It is shipped with the Chinese version of Windows and with Microsoft Office Proofing Tools. You can search on www.google.com for "sursong.ttf" and may be lucky enough to find a site that offers a free download of the font.
There are a few free Han/Nom fonts that are downloadable from the Viet Unicode website:
1. Microsoft font "Arial Unicode MS" (Aruniupd.exe, 14 MB). This is a Plane 0 font, not a surrogate font.
2. Fonts "HAN NOM A" (not a surrogate font) and "HAN NOM B" (a surrogate font), which are packaged together in HannomH.zip (27 MB, high resolution) or Hannom.zip (19 MB, low resolution).
Font "Arial Unicode MS" contains Han/Nom and Latin-based characters less than 64K in values, i.e., Unicode characters expressible as U+xxxx where there are at most 4 hex digits following the plus sign.
Font "HAN NOM A" contains Han/Nom characters less than 64K in values.
Font "HAN NOM B" contains Han/Nom surrogate characters in Unicode Extension B. These characters are greater than 64K in values, i.e., they are expressible as U+xx..xx with 5 or 6 hex digits following the plus sign.
It is generally a good idea to download both Aruniupd.exe and HannomH.zip. Font "Arial Unicode MS" has many characters from many languages but lacks all the Han/Nom characters in Extension B, which is why you also need HannomH.zip. Always try the high resolution version, HannomH.zip, first. If the installation fails because your Windows does not recognize the high resolution font, then try the low resolution version, Hannom.zip. You should not install both HannomH.zip and Hannom.zip because the latter will replace the former.
Because these font packages are large, if you can afford to download only one package, you should choose HannomH.zip (the two fonts HAN NOM A and HAN NOM B).