Tchar I18N Text Abstraction
This documentation describes how to write C software that supports international character sets and compiles and runs on different platforms such as Linux, Microsoft Windows, BSD, or MacOS/X with little or no modification. In short, macros and typedefs are used to abstract the character type and all functions that operate on it. This permits the software to be compiled using plain 8 bit, multi-byte, or wide character encodings. Very little extra work is necessary to benefit from this technique, although there are pitfalls that will be described in detail.
Like most modules in Libmba, the ideas and code can be extracted or adapted to meet the needs of your application and do not bind you to a particular "environment" or library. Although the text.h typedefs and macros are used throughout the libmba package, users can simply choose to ignore them and pass char * to functions that accept tchar * (or wchar_t * if libmba has been compiled with -DUSE_WCHAR).
Unicode, Charsets, and Character Encodings
To use this technique successfully it is essential to understand that each non-ASCII character may occupy a variable number of bytes in memory. Examples will be given below that illustrate why this is important but first some background information about Unicode, charsets, and character encodings might be useful.
Consider the Russian character called CYRILLIC CAPITAL LETTER GHE, which looks like an upside-down 'L' and has the Unicode value U+0413. This character's numeric value will differ depending on which charset is being discussed, but Unicode is the international standard superset of virtually all charsets, so unless otherwise specified Unicode will be used to describe characters throughout this documentation. The number of bytes that the U+0413 character occupies depends on the character encoding being used. Notice that a charset and a character encoding are different things.
Charsets A charset (or character set) is a map that defines which numeric value represents a particular character. In the Unicode charset, CYRILLIC CAPITAL LETTER GHE has the numeric value 0x0413 (written U+0413 in the Unicode standard convention). Some example charsets are ISO-8859-1, CP1251, and GB2312.
Character Encodings A character encoding defines how the numeric values representing characters are serialized into a sequence of bytes so that they can be operated on in memory, stored on disk, or transmitted over a network. For example, at least two bytes of memory are required to hold the value 0x0413. If this value were simply stored as the byte 0x04 followed by the byte 0x13, the encoding would be UCS-2BE, meaning Unicode Character Set, 2 bytes, big-endian. The following are examples of some other Unicode character encodings:
-
UCS-4 simply serializes each Unicode value into a sequence of 4 bytes. The largest possible Unicode value can be encoded in 32 bits. This is not a variable width encoding. If the string "hello", which is 6 characters if the '\0' terminator is included, was encoded in UCS-4 it would occupy 6 * 4 or 24 bytes of memory. Glibc encodes wide characters (wchar_t) using this encoding.
-
UCS-2 is like UCS-4 but only 2 bytes are used to encode each character. This is a much more reasonable use of memory but it cannot encode every value of the Unicode charset. This is not a variable width encoding. If the string "hello" including a '\0' terminator was encoded in UCS-2 it would occupy 6 * 2 or 12 bytes of memory.
-
UTF-16 is like UCS-2 but it uses certain bits to indicate that an additional pair of bytes follows, which permits values of more than 16 bits to be encoded. So most characters are encoded in 2 bytes but 4 bytes may be used if necessary. Therefore, UTF-16 is a variable width encoding. The Microsoft Windows platform uses UTF-16LE (the LE means little-endian) almost exclusively to represent Unicode strings. For a description of UTF-16, read the UTF-16 page at czyborra.com.
-
UTF-8 is like UTF-16 in that it uses certain bits (actually just the high bit of each byte) to indicate that additional bytes follow, which permits values larger than 7 bits to be encoded. The number of additional bytes varies; in the original definition of UTF-8 a single character could occupy as many as 6 bytes in total. This is called a "multi-byte sequence". Therefore UTF-8 is also a variable width encoding. It cannot be assumed that a non-ASCII character occupies one byte of memory, and it is not correct to calculate the number of characters in a string by subtracting a pointer to the beginning of the string from a pointer to the end.
For example, the sequence of bytes representing the Unicode character U+0413 in UTF-8 is 0xD0 followed by 0x93. This multi-byte sequence was determined by using a hexedit program to create a file called ucs2.bin containing the 2 bytes 0x04 followed by 0x13 (the UCS-2BE encoding described above), converting it with the command $ iconv -f UCS-2BE -t UTF-8 ucs2.bin > utf8.bin, and using hexedit again to view the result.
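The same sequence can be produced programmatically with wcrtomb(3). The following is a minimal sketch; it assumes glibc's Unicode wchar_t and that the en_US.UTF-8 locale is installed:
#include <limits.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

int
main(void)
{
    char buf[MB_LEN_MAX];
    mbstate_t ps;
    size_t n, i;

    if (setlocale(LC_CTYPE, "en_US.UTF-8") == NULL) {
        return 1;                   /* the UTF-8 locale is not installed */
    }
    memset(&ps, 0, sizeof(ps));
    if ((n = wcrtomb(buf, 0x0413, &ps)) == (size_t)-1) {
        return 1;                   /* U+0413 cannot be encoded */
    }
    for (i = 0; i < n; i++) {
        printf("%02x ", (unsigned char)buf[i]);   /* prints: d0 93 */
    }
    printf("\n");
    return 0;
}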
UTF-8 is the premier character encoding used on Unix and Unix-like platforms. For a complete description of UTF-8 read the UTF-8 and Unicode FAQ for Unix/Linux.
I18N Text Handling
There are primarily three techniques for managing I18N strings in a C program.
-
Plain C Strings Traditionally, C strings have used the plain char type with an 8 bit charset, which does not require a variable width encoding. Character sets other than ASCII or ISO-8859-1 (a.k.a. Latin1) can be used if a different codepage is set, but the program is then largely committed to that one charset. The Microsoft Windows codepage for Russian is called CP1251 or WinCyrillic. The hex code for CYRILLIC CAPITAL LETTER GHE in CP1251 is 0xC3.
-
Wide Character Strings To permit multiple international charsets to be used within the same program, wide character strings were introduced along with the larger wchar_t type. The C standard does not define the charset or encoding used for this character type by the host C library; however, all of the platforms mentioned previously use the Unicode charset.
-
Multi-Byte Strings Because wchar_t is larger than one byte, many programs would require significant modification to be converted to use wide characters. The UTF-8 encoding was devised specifically to permit the traditional char pointer strings to be used which reduces the complexity of internationalizing existing programs and kernel structures.
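To make these three representations concrete, here is CYRILLIC CAPITAL LETTER GHE expressed under each scheme (a sketch; the CP1251 and UTF-8 byte values are those given above, and the wide character value assumes a Unicode wchar_t such as glibc's):
const char    cp1251str[] = "\xC3";       /* plain 8 bit string, CP1251 codepage */
const wchar_t widestr[]   = L"\x0413";    /* wide character string, Unicode */
const char    utf8str[]   = "\xD0\x93";   /* multi-byte string, UTF-8 */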
The Tchar Text Abstraction
To write a program that compiles and runs without modification on Linux, Windows, and a variety of other platforms is a matter of abstracting the techniques listed above as used by each platform. Linux uses both multi-byte and wide character encodings. Windows uses wide characters; it is important to note that Windows does not support a UTF-8 locale, so if Unicode is desired wide character strings are the only option. Most other Unix and Unix-like systems support multi-byte strings, and possibly wide character strings, to different degrees. Programs written using the technique described here still permit runtime defined codepages using the standard setlocale(3) mechanism.
The Tchar Type
The idea behind this technique is to use a typedef for the character type that resolves to either plain char or wchar_t. In this way the character type identifier does not change in the source code.
#ifdef USE_WCHAR
typedef wchar_t tchar;
#else
typedef unsigned char tchar;
#endif
Abstracting String Functions
In addition to the character type, all functions that operate on it will need to be abstracted with macros that reference tchar rather than char or wchar_t. Consider the strncpy function. It uses the plain char type. Fortunately the major string functions have a wide character equivalent that usually has the same signature but accepts the wchar_t type.
char *strncpy(char *dest, const char *src, size_t n);
wchar_t *wcsncpy(wchar_t *dest, const wchar_t *src, size_t n);
From the above signatures it can be seen that the only difference is the character type. The number, order, and meaning of the parameters are the same. This permits the function to be abstracted with macros as follows:
#ifdef USE_WCHAR
#define tcsncpy wcsncpy
#else
#define tcsncpy strncpy
#endif
Using this function is now a matter of substituting all instances of strncpy or wcsncpy with tcsncpy. Depending on how the program is compiled, code that uses these functions will support wide character or multi-byte strings (but not both at the same time). See the Text Module API Documentation for a complete list of macros in text.h.
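Code written against the abstraction then compiles unchanged in either mode. The following is a sketch (the tcscpytrunc helper is hypothetical, and the casts in the narrow branch are added here because tchar is unsigned char rather than char):
#include <string.h>
#include <wchar.h>

#ifdef USE_WCHAR
typedef wchar_t tchar;
#define tcsncpy wcsncpy
#else
typedef unsigned char tchar;
#define tcsncpy(d, s, n) ((tchar *)strncpy((char *)(d), (const char *)(s), (n)))
#endif

/* Copy src into the n element buffer dst, always terminating it. */
tchar *
tcscpytrunc(tchar *dst, const tchar *src, size_t n)
{
    tcsncpy(dst, src, n - 1);
    dst[n - 1] = '\0';
    return dst;
}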
There are of course many other functions that operate on strings. Fortunately, most standard C library functions have wide character versions, and the identifier names are reasonably consistent: an identifier that begins with str will likely have a wide character version that begins with wcs. Other names, like vswprintf, are less obvious, and depending on the system being used there will certainly be omissions or incompatibilities (e.g. the wide character counterpart of vsnprintf is vswprintf, without the n, even though it accepts an n parameter). If a function does not have a man page, or if the compiler issues a warning, it does not necessarily mean the function does not exist on your system. For example, with the GNU C library it may be necessary to specify C99 or define _XOPEN_SOURCE=500 to indicate that a UNIX98 environment is desired. Check your C library documentation (e.g. /usr/include/features.h) and the POSIX documentation on the OpenGroup website. On my RedHat Linux 7.3 system, wcstol and several other conversion functions are not documented; it is necessary to specify -std=c99 or define -D_ISOC99_SOURCE with gcc to get those symbols exported.
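For example, with glibc the following at the very top of a source file (before any #include) exposes the UNIX98 wide character prototypes. This is only a sketch; the exact macro needed varies by system and libc version:
#define _XOPEN_SOURCE 500   /* or -D_XOPEN_SOURCE=500 on the command line */

#include <wchar.h>
#include <wctype.h>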
Variable Width Encodings
Unicode on Unix and Unix-like systems is supported using UTF-8. On Microsoft Windows UTF-16LE is used. As explained previously, these are variable width encodings; each character can occupy a variable number of bytes in memory. The question is: when does this require special processing in your code?
A good example of when UTF-8 strings require special handling is when each character needs to be examined individually. Consider caseless comparison of two strings: they cannot simply be compared element by element. Each character must be decoded to its wide character value and converted to upper or lower case for the comparison to be valid. Below is just such a function:
#include <ctype.h>
#include <errno.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Case insensitive comparison of two UTF-8 strings
 */
int
utf8casecmp(const unsigned char *str1, const unsigned char *str1lim,
        const unsigned char *str2, const unsigned char *str2lim)
{
    int n1, n2;
    wchar_t ucs1, ucs2;
    int ch1, ch2;
    mbstate_t ps1, ps2;

    memset(&ps1, 0, sizeof(ps1));
    memset(&ps2, 0, sizeof(ps2));

    while (str1 < str1lim && str2 < str2lim) {
        if ((*str1 & 0x80) && (*str2 & 0x80)) {           /* both multibyte */
            /* mbrtowc returns (size_t)-1 or (size_t)-2 on an invalid
             * or incomplete sequence, both negative when stored in an int
             */
            if ((n1 = mbrtowc(&ucs1, (const char *)str1, str1lim - str1, &ps1)) < 0 ||
                    (n2 = mbrtowc(&ucs2, (const char *)str2, str2lim - str2, &ps2)) < 0) {
                PMNO(errno);                              /* libmba error trace macro */
                return -1;
            }
            if (ucs1 != ucs2 && (ucs1 = towupper(ucs1)) != (ucs2 = towupper(ucs2))) {
                return ucs1 < ucs2 ? -1 : 1;
            }
            str1 += n1;
            str2 += n2;
        } else {                                          /* neither or one multibyte */
            ch1 = *str1;
            ch2 = *str2;
            if (ch1 != ch2 && (ch1 = toupper(ch1)) != (ch2 = toupper(ch2))) {
                return ch1 < ch2 ? -1 : 1;
            } else if (ch1 == '\0') {
                return 0;
            }
            str1++;
            str2++;
        }
    }

    return 0;
}
This is a fairly pathological example; in practice, this is probably as difficult as it gets. For example, if the objective is to search for a certain ASCII character such as a space or '\0' terminator, it is not necessary to decode Unicode values at all. It might even be reasonable to use isspace and similar functions (but probably not ispunct, for example). This will require some experimenting and research.
Another example of when a variable width encoding requires special handling is calculating the number of bytes from a string required to occupy at most a certain number of display positions in a terminal window. In this case it is necessary to convert each character to its Unicode value and then use the wcwidth(3) function. When the total of the values returned by wcwidth(3) equals or exceeds the desired number of columns, the number of bytes traversed in the substring is known.
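The following sketch shows the idea; the utf8fit name is hypothetical, error handling is minimal, and the locale is assumed to be UTF-8:
#define _XOPEN_SOURCE 500   /* expose wcwidth(3) in glibc headers */

#include <string.h>
#include <wchar.h>

/* Return the number of bytes from the multi-byte string s (at most n
 * bytes) that occupy no more than cols display columns.
 */
size_t
utf8fit(const char *s, size_t n, int cols)
{
    mbstate_t ps;
    wchar_t wc;
    size_t i = 0, k;
    int w, width = 0;

    memset(&ps, 0, sizeof(ps));
    while (i < n && s[i] != '\0') {
        k = mbrtowc(&wc, s + i, n - i, &ps);
        if (k == (size_t)-1 || k == (size_t)-2) {
            break;                  /* invalid or incomplete sequence */
        }
        if ((w = wcwidth(wc)) < 0) {
            w = 0;                  /* non-printable; count no columns */
        }
        if (width + w > cols) {
            break;                  /* character would exceed the budget */
        }
        width += w;
        i += k;
    }
    return i;
}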
Potential Problems
This technique is not perfect. The wide character functions were not designed with this technique in mind; the prototypes are largely the same only for the sake of consistency. It is important to understand where problems can occur and how to correctly fix or avoid them.
-
Wide Character I/O
It is not possible to mix wide character I/O functions like wprintf, fgetwc, and fputws with regular I/O functions on the same stream. Once a wide character I/O function is used, the associated stream switches into wide mode, and attempting to use both will result in erroneous behaviour (e.g. ESPIPE Illegal seek). All I/O could be performed in wide mode, but on non-Microsoft platforms it can be awkward to perform all I/O entirely in wide character mode. Note, however, that this restriction only applies to functions that perform I/O on a stream; the swprintf function, for example, does not operate on a stream and is fine to use alongside non-wide I/O. For non-Microsoft platforms it is recommended that the wide character I/O functions simply be avoided. They ultimately just convert wide characters to multi-byte sequences anyway, and if an unexpected encoding is encountered it will be more difficult to detect and take corrective action.
Ultimately, if the code is reading and writing plain text to sockets or files, the text will probably need to be converted to and from a well defined encoding, such as the locale dependent encoding, with wcsrtombs(3) and mbsrtowcs(3). Currently the libmba text module does not define macros for the wide character I/O functions, but that may change in the future. See src/cfg.c for a good example of converting between wide character strings and the multi-byte encoding in files and the environment.
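As a simpler illustration, a line of multi-byte text read from a byte oriented stream can be converted to a wide character string with mbsrtowcs(3). This is only a sketch; the fgetws_conv name is hypothetical, and the line is assumed to fit in the intermediate buffer:
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Read a line of multi-byte text from stream and convert it into the
 * n element wide character buffer dst.
 */
wchar_t *
fgetws_conv(wchar_t *dst, size_t n, FILE *stream)
{
    char buf[1024];
    const char *src = buf;
    mbstate_t mb;

    if (fgets(buf, sizeof buf, stream) == NULL) {
        return NULL;
    }
    memset(&mb, 0, sizeof(mb));
    if (mbsrtowcs(dst, &src, n, &mb) == (size_t)-1) {
        return NULL;                /* invalid multi-byte sequence */
    }
    return dst;
}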
-
The File System
Another form of wide character I/O pertains to the handling of file and directory names. The encoding used to read and write path names depends on the operating system and filesystem. On Linux, for example, wide character strings cannot be used as filenames; Linux requires the multi-byte encoding. This means that any wide character pathname must be converted to the locale dependent encoding using wcsrtombs(3), and any pathname read from the operating system may need to be converted to the wide character encoding using mbsrtowcs(3).
The source below illustrates how a wide character pathname can be converted to the multi-byte encoding for passing to fopen(3).
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Open a file using a wide character path name.
 */
FILE *
wcsfopen(const wchar_t *path, const char *mode)
{
    char dst[PATH_MAX + 1];
    size_t n;
    mbstate_t mb;

    memset(&mb, 0, sizeof(mbstate_t));
    if ((n = wcsrtombs(dst, &path, PATH_MAX, &mb)) == (size_t)-1) {
        return NULL;                /* invalid wide character sequence */
    }
    if (n >= PATH_MAX) {
        errno = E2BIG;              /* converted path name is too long */
        return NULL;
    }
    return fopen(dst, mode);
}
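Usage is then a matter of passing a wide character path (a sketch; the path name is hypothetical):
FILE *fp;

if ((fp = wcsfopen(L"/tmp/\x0413.txt", "r")) == NULL) {
    /* errno was set by wcsrtombs(3) or fopen(3) */
}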
Currently, libmba modules that support the tchar abstraction do not accept wide character pathnames but that may change in the future.
Note that Unicode pathnames are supported by Unix and Unix-like systems that support the UTF-8 multi-byte encoding. Just call setlocale(LC_CTYPE, "en_US.UTF-8") first, or export LC_CTYPE=en_US.UTF-8 in the environment and call setlocale(LC_CTYPE, ""). To test such a program it will be necessary to see I18N text printed somewhere. The following is a worthwhile exercise:
$ wget http://www.columbia.edu/kermit/utf8.html
$ xterm -u8 -fn '-misc-fixed-*-*-*-*-20-*-*-*-*-*-iso10646-*'
$ LANG=en_US.UTF-8 cat utf8.html
This downloads a file with a wide range of UTF-8 encoded text in it, launches an xterm in UTF-8 mode with a Unicode font, and runs cat in the UTF-8 locale to print the contents of utf8.html to the terminal window. Some newer Linux systems use the UTF-8 locale by default now so the above setup may not be necessary.
-
Format Specifiers
To format a string with snprintf, the format specifier %s is used. For this text abstraction to work completely, the equivalent wide character function swprintf would have to use %s for wide character strings as well. It does not: both snprintf and swprintf use %s to specify regular strings and %ls to specify wide character strings. This means that even though stprintf resolves to either snprintf or swprintf, the format specifiers need to be different depending on the arguments to stprintf. This requires some conditional preprocessing, such as the following example.
#if defined(USE_WCHAR)
if ((n = swprintf(path, PATH_MAX, L"/var/spool/mail/%ls", username)) == -1) {
#else
if ((n = snprintf(path, PATH_MAX, "/var/spool/mail/%s", username)) == -1) {
#endif
-
Prototype Mismatch
It is not uncommon for prototype mismatches to occur. The vsnprintf and vswprintf pair described above, where the wide character function drops the n from its name but still accepts an n parameter, is one such example.
-
Simple Errors
Be diligent when manipulating text directly now that characters occupy more than 1 byte. Frequently this just means multiplying some value by the size of an element, such as when calculating the number of bytes occupied by a run of text (or use tmemcpy):
siz = (src - start + 1) * sizeof *src;
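In context, the calculation might be used as in the following sketch, where start and src delimit a run of tchars and buf is assumed to be large enough:
/* memcpy counts bytes, not characters, so the element count must be
 * scaled by the element size
 */
siz = (src - start + 1) * sizeof *src;
memcpy(buf, start, siz);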
There are most certainly other problems and incompatibilities that I have omitted here. If you encounter any such example, please drop me a mail.
TCHAR in Microsoft Windows
For programmers who have used the variety of string handling functions on the Microsoft Windows platform, this character abstraction technique should look familiar. It is indeed the same. The abstract character type in the Win32 environment is named TCHAR, in uppercase rather than lowercase, and the string functions are prefixed with _tcs (e.g. _tcsncpy rather than tcsncpy), but after macro processing the resulting code is the same. The identifier names were chosen to match those found on Windows (minus a few Windows coding conventions that clash with Unix/Linux conventions) simply because the Windows platform is very popular and there was no practical reason to use different names. The exception is that USE_WCHAR is used to signal that wide characters should be used, rather than _UNICODE, because on Unix and Unix-like systems multi-byte strings support Unicode in the UTF-8 locale, which would make the _UNICODE macro somewhat inaccurate.
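For comparison, the Win32 equivalent of the tcsncpy example above might look like the following sketch (TCHAR, _tcsncpy, and the _T() literal macro come from <tchar.h>; the path and buffer size are arbitrary):
#include <tchar.h>

TCHAR path[260];

/* resolves to wcsncpy and L"..." if _UNICODE is defined,
 * strncpy and "..." otherwise
 */
_tcsncpy(path, _T("C:\\mail"), 260);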