Unicode encodings UTF8 and UCS on Windows and Linux

Here are some issues I faced when I had to support UTF8 on Windows and Linux.

The Universal Character Set (UCS) is a set of all the characters that software is ever expected to need. UCS defines an integer code for each symbol.
The ASCII character set is included in UCS with the same values, from 00 to 7F. The first 65,536 codes of UCS (which fit in 16 bits) are called the Basic Multilingual Plane, or BMP. Most human languages in use today are covered by the BMP, so 2 bytes are normally enough to represent any symbol of a relevant human language.

Unicode is a separate standard, but it assigns the same codes to the same characters as UCS. When either UCS or Unicode is mentioned, this normally means the same character table.
So we have a well-defined table of integer codes for every character. There are standards which define how these codes are stored in memory.
These standards are called encoding formats.
There are many encodings, but the most frequently used are those listed below.
Note that encodings built from 16-bit or 32-bit units come in two versions: LE and BE, which stand for Little Endian and Big Endian.
This notation indicates the order in which the bytes of each unit are stored.
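
To make the LE/BE difference concrete, here is a minimal sketch that stores one 16-bit unit in either byte order. The helper name put_u16 is made up for this illustration and is not part of any standard library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store one 16-bit code unit in the requested byte order. */
void put_u16(uint16_t unit, int big_endian, unsigned char out[2])
{
    assert(out != NULL);
    if (big_endian) {
        out[0] = (unsigned char)(unit >> 8);    /* high byte first */
        out[1] = (unsigned char)(unit & 0xFF);
    } else {
        out[0] = (unsigned char)(unit & 0xFF);  /* low byte first */
        out[1] = (unsigned char)(unit >> 8);
    }
}
```

For the euro sign (code 0x20AC) the BE form is the byte pair 20 AC and the LE form is AC 20.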

UCS-2: every encoded symbol is a sequence of exactly two bytes. Only the BMP is supported.

UTF-16: an extended version of UCS-2 which represents any BMP character as a 16-bit (2-byte) sequence, like UCS-2. However, unlike UCS-2, UTF-16 also uses 32-bit (4-byte) sequences, called surrogate pairs, to encode symbols which are not included in the BMP.
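
As an illustration of how UTF-16 reaches beyond the BMP, here is a small sketch of the surrogate pair arithmetic. The function name utf16_encode is my own; the constants 0xD800, 0xDC00 and 0x10000 come from the UTF-16 definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-16 code units.
   Returns the number of units written (1 or 2), or 0 for an invalid code point. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                                /* out of range or a lone surrogate */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                   /* BMP: one unit, same as UCS-2 */
        return 1;
    }
    cp -= 0x10000;                               /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}
```

For example, U+1F600 is outside the BMP and becomes the two units D83D DE00, while U+20AC stays a single unit.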

UTF-8: the most commonly recommended encoding for Unicode characters. A character can take up to four bytes (the original design allowed sequences of up to six bytes, but RFC 3629 restricts UTF-8 to four). ASCII characters use one byte and BMP symbols use one to three bytes. UTF-8 is often called a multibyte encoding. UTF-8 is backward compatible with ASCII, so plain ASCII text is encoded as one byte per character.
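
The variable length of UTF-8 is easier to see in code. Below is a minimal encoder sketch limited to the four-byte range that RFC 3629 allows; the function name utf8_encode is only an illustration, not a library call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8.
   Returns the number of bytes written (1 to 4), or 0 for an invalid code point. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                                    /* not a valid Unicode scalar value */
    if (cp < 0x80) {                                 /* ASCII: one byte, unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                                /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                              /* three bytes: rest of the BMP */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));     /* four bytes: outside the BMP */
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

With this sketch the letter A (0x41) stays a single byte 41, the Cyrillic letter U+0416 becomes D0 96, and the euro sign U+20AC becomes E2 82 AC.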

To support Unicode on Windows and Linux one must know at least the following facts.

Some Windows applications write a Byte Order Mark (BOM, value 0xFEFF) at the beginning
of a text file to indicate the encoding and byte order. The Windows API has a function named IsTextUnicode()
which runs several tests on a character buffer. Where possible, this function looks for the BOM value to determine
byte order and encoding. Linux and POSIX systems generally do not use a Byte Order Mark or any other mechanism to indicate encoding and byte order.
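
For code that has to read files produced on Windows, a simple BOM check can be written by hand. This is only a rough sketch and not a replacement for IsTextUnicode(): the names bom_kind and detect_bom are invented here, and UTF-32 BOMs are deliberately ignored, so a UTF-32 LE file would be reported as UTF-16 LE.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type for this sketch. */
enum bom_kind { BOM_NONE, BOM_UTF8, BOM_UTF16_LE, BOM_UTF16_BE };

/* Look at the first bytes of a buffer and report which BOM, if any, is present. */
enum bom_kind detect_bom(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return BOM_UTF8;        /* 0xFEFF encoded as UTF-8 */
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;    /* 0xFEFF stored low byte first */
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;    /* 0xFEFF stored high byte first */
    return BOM_NONE;
}
```

A result of BOM_NONE says nothing about the encoding; on Linux that is the normal case even for UTF-8 files.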

To select the encoding rules and character set, operating systems use "locales".
A locale is a named set of settings which defines how characters are encoded, along with conventions such as currency and number formats.

For this, the ANSI C standard defines the setlocale() function. Here we hit one problem: some locale names differ between Windows and Linux.
For example, the locale "ru_RU.utf8" on Linux would be "Russian_Russia.65001" on Windows.
Some sources, including people at Microsoft, say that VS 8.0 (and also 7.0) does not support setting the UTF8 encoding
explicitly through a setlocale() call. However, the following example worked fine on my machine with VC 7.1 (.NET 2003):

#include <stdio.h>
#include <locale.h>
int main(int argc, char * argv[])
{
	/* Full Windows locale name: language, country and code page 65001 (UTF8). */
	char * p = setlocale(LC_ALL, "Russian_Russia.65001");
	if (p) printf("changed locale to: %s\n", p);
	else printf("failed to change locale\n");

	/* Code page only: keep the default language and country, switch to UTF8. */
	p = setlocale(LC_ALL, ".65001");
	if (p) printf("changed locale to: %s\n", p);
	else printf("failed to change locale\n");

	return 0;
}

So far so good, provided you are able to set the locale on both OSes.
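
One way to cope with the different locale names is to try several candidates in turn and keep the first one the C runtime accepts. The sketch below assumes nothing beyond the standard setlocale(); the helper name set_first_locale is made up for this post.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>

/* Try each candidate locale name in order.
   Returns the string reported by setlocale() for the first name that is
   accepted, or NULL if none of them is accepted. */
const char *set_first_locale(int category, const char *const names[], size_t count)
{
    assert(names != NULL || count == 0);
    for (size_t i = 0; i < count; ++i) {
        const char *result = setlocale(category, names[i]);
        if (result != NULL)
            return result;   /* this name exists on the current system */
    }
    return NULL;
}
```

A caller could pass the two names mentioned above, "ru_RU.utf8" and "Russian_Russia.65001", and check for NULL to find out whether either was accepted on the current system.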
There are generally two types of strings which can contain Unicode symbols.
Multibyte strings are plain (char*) buffers which can contain encoded Unicode sequences. Normally multibyte strings use UTF8 as the encoding mechanism. This means a single symbol can occupy up to four bytes of memory when it is encoded in a multibyte character string.

The other type is the wide char string. These use a fixed number of bytes for each element. The C type for storing a wide symbol is "wchar_t". For wide char strings C++ defines "std::wstring", which works just like "std::string" but allocates sizeof(wchar_t) bytes for each symbol. Multibyte character strings usually contain UTF8-encoded symbols, while wide char strings store Unicode characters in a fixed-width encoding such as UCS2.

To convert between these two representations, ANSI C supplies functions like mbtowc() and wctomb().
When such a conversion is done, the C runtime library reads the current locale settings (which can be set with setlocale()) and applies the current character set to transform the symbols correctly from one encoding to the other.
This is something I really didn't understand at first, since Unicode clearly defines a code for each symbol and that code does not depend on the locale or any other localized setting. I thought there was no need for a character table when
the transformation is from UTF8 to UCS2 or vice versa. However, I have read that wchar_t was not always used to hold UCS codes: in Asia, other 16-bit encodings were used with wchar_t to support native languages. That is why mbtowc() and wctomb() depend on the locale: it is not guaranteed that the wide string on either side of the conversion holds UCS codes.
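
Here is a minimal sketch of a locale-dependent conversion using mbstowcs(), the whole-string relative of mbtowc(). It assumes the POSIX and Microsoft behaviour where passing NULL as the destination returns the required length. The function name mb_to_wide is my own.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Convert a multibyte string, interpreted in the encoding of the current
   locale, into a newly allocated wide string.
   Returns NULL on an invalid sequence or allocation failure.
   The caller must free() the result. */
wchar_t *mb_to_wide(const char *mb)
{
    assert(mb != NULL);
    /* Assumption: with a NULL destination mbstowcs() only counts the wide
       characters needed (documented by POSIX and by Microsoft). */
    size_t n = mbstowcs(NULL, mb, 0);
    if (n == (size_t)-1)
        return NULL;                      /* bytes are not valid in this locale */
    wchar_t *wide = malloc((n + 1) * sizeof *wide);
    if (wide == NULL)
        return NULL;
    mbstowcs(wide, mb, n + 1);            /* n characters plus the terminator */
    return wide;
}
```

The same bytes can give different results, or fail, depending on which locale was selected with setlocale() beforehand, which is exactly the dependency described above.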

Things get more complicated because sizeof(wchar_t) is 2 bytes on Windows but 4 bytes on Linux with gcc. Keep this in mind if you plan to develop applications which transfer Unicode data between different systems. As a result, a single wchar_t on Windows can hold only a BMP character (which covers the most frequently used languages), and characters outside the BMP need a UTF-16 surrogate pair. A wchar_t on Linux can hold any character directly, including those needed for scientific use.
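
Code that has to behave the same with both sizes of wchar_t must not assume that one element is one character. The sketch below counts characters in a wide string on either platform, assuming that a 2-byte wchar_t string holds well-formed UTF-16. The function name wide_code_points is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Count the characters (code points) in a wide string, whether wchar_t
   is 2 bytes (UTF-16 units) or 4 bytes (one element per code point). */
size_t wide_code_points(const wchar_t *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (size_t i = 0; s[i] != L'\0'; ++i) {
        unsigned long unit = (unsigned long)s[i];
        /* With a 2-byte wchar_t, a low surrogate is the second half of a
           pair whose first half was already counted, so skip it. */
        if (sizeof(wchar_t) == 2 && unit >= 0xDC00 && unit <= 0xDFFF)
            continue;
        ++count;
    }
    return count;
}
```

With this approach a string holding one character outside the BMP counts as one character on both systems, although wcslen() would report 2 on Windows and 1 on Linux.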

gcc has a compilation flag which makes wchar_t 2 bytes wide: "-fshort-wchar".
However, you must be really careful with this flag, since all binaries, including database drivers
and probably system libraries, must follow the same rule for wide char strings.
Configuring terminals to support Unicode strings is another, completely different story.

You can read about it in the UTF-8 and Unicode FAQ for Unix/Linux.
That text also contains a lot of useful information, including an example of writing Unicode strings to a file.

Another nice post about working with Unicode strings is here.

By the way, the Unicode Consortium provides free C code for converting strings between the
UTF8, UTF16 and UTF32 formats. It can be adapted for any C++ program and it works on both Linux and Windows.

For testing I would recommend using the UTF converter which shows results straight on the page.
