Because I always forget how Unicode really works…

Some notes from work, where I had to summarise what Unicode really is (no, not ‘just two bytes per character’…). Mostly just copied from Wikipedia, though with some of the gory details removed.

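Unicode defines a ‘code space’: a set of numbered code points, each standing for an abstract character. I tend to think of this as a ‘code page’ or ‘character space’, although neither term is quite right. Anyway, the point is, Unicode is the set of things which, ultimately, will be displayed.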

The encodings are ways of turning one of those code points into a sequence of bytes, and back again. They differ in how many bytes they use per character and in how well they get along with ASCII. UTF-16 and UTF-8 are the most widely used, and I’d suggest always using UTF-8 unless you’ve got a reason to do otherwise.
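To make the split between ‘code point’ and ‘encoding’ concrete, here is a minimal sketch in Python (my choice of language, not something from the original notes). It takes one character, the euro sign, and shows its code point next to the bytes that three encodings produce for it. I use the explicit big-endian codecs so that no byte order mark gets mixed into the output.

```python
# One code point, three different byte sequences.
euro = "\u20ac"  # EURO SIGN

code_point = ord(euro)                    # the Unicode code point (a number)
utf8_bytes = euro.encode("utf-8")         # 3 bytes
utf16_bytes = euro.encode("utf-16-be")    # 2 bytes
utf32_bytes = euro.encode("utf-32-be")    # 4 bytes

print(hex(code_point))       # 0x20ac
print(utf8_bytes.hex(" "))   # e2 82 ac
print(utf16_bytes.hex(" "))  # 20 ac
print(utf32_bytes.hex(" "))  # 00 00 20 ac
```

The code point never changes; only the bytes used to carry it do.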

UTF-16

This encoding uses one 16-bit word per character for characters within the Basic Multilingual Plane (BMP). For characters in other planes, it uses two 16-bit words (together called a surrogate pair), but as we don’t use characters outside the BMP (it’s rare we go outside basic Latin) we don’t normally see this. Incidentally, it’s this encoding that often makes developers think that Unicode means 2 bytes per character; that’s not necessarily the case.
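For the curious, the surrogate arithmetic is short enough to sketch. The helper name `to_surrogate_pair` is my own invention for illustration; the constants (0x10000, 0xD800, 0xDC00 and the 10-bit split) are the ones UTF-16 defines. The last line cross-checks the result against Python's built-in codec.

```python
def to_surrogate_pair(code_point):
    """Split a code point outside the BMP into its two UTF-16 words."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need a surrogate pair")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)   # U+1F600, an emoji outside the BMP
print(f"{high:04X} {low:04X}")           # D83D DE00

# Python's own UTF-16 encoder produces the same two words.
builtin = "\U0001F600".encode("utf-16-be")
print(builtin.hex(" "))                  # d8 3d de 00
```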

UTF-16 Big Endian, Little Endian

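Big Endian and Little Endian define the byte order within each 16-bit word of the encoding: big endian puts the most significant byte first, little endian puts the least significant byte first. Apparently, this is for performance reasons on different processors, which store multi-byte values in different orders.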
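A quick sketch of what that looks like on the wire, using Python's `utf-16-be` and `utf-16-le` codecs on a single letter:

```python
text = "A"  # U+0041

big = text.encode("utf-16-be")
little = text.encode("utf-16-le")

print(big.hex(" "))     # 00 41
print(little.hex(" "))  # 41 00
```

Same 16-bit value, same two bytes, just swapped.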

“Is UTF-16 Big or Little Endian if you don’t specify it?”

Both. The UTF-16 (and UCS-2) encoding scheme allows either endian representation to be used, but mandates that the byte order should be explicitly indicated by prepending a Byte Order Mark before the first serialized character. This BOM is the encoded version of the Zero-Width No-Break Space (ZWNBSP) character, code point U+FEFF, chosen because it should never legitimately appear at the beginning of any character data. This results in the byte sequence FE FF (in hexadecimal) for big-endian architectures, or FF FE for little-endian. The BOM at the beginning of UTF-16 or UCS-2 encoded data is considered to be a signature separate from the text itself; it is for the benefit of the decoder.

Technically, with the UTF-16 scheme the BOM prefix is optional, but omitting it is not recommended, as UTF-16LE or UTF-16BE should be used instead. If the BOM is missing, barring any indication of byte order from higher-level protocols, big endian is to be used or assumed. The BOM is not optional in the UCS-2 scheme.

The UTF-16BE and UTF-16LE encoding schemes (and correspondingly UCS-2BE and UCS-2LE) are similar to the UTF-16 (or UCS-2) encoding scheme. However, rather than using a BOM prepended to the data, the byte order used is implicit in the name of the encoding scheme (LE for little-endian, BE for big-endian). Since a BOM is specifically not to be prepended in these schemes, an encoded ZWNBSP character found at the beginning of any data encoded by these schemes is not to be considered a BOM, but is instead considered part of the text itself. In practice most software will ignore these “accidental” BOMs.
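Python's codecs happen to follow this distinction, which makes for a handy illustration (this is a sketch of Python's behaviour only; other libraries may differ). Decoding with plain `utf-16` treats a leading BOM as a signature and strips it, while decoding with `utf-16-be` keeps U+FEFF as part of the text.

```python
import codecs

# "hi" in big-endian UTF-16, with a BOM stuck on the front.
data = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
print(data.hex(" "))                   # fe ff 00 68 00 69

with_bom = data.decode("utf-16")       # BOM is read as a signature and dropped
explicit = data.decode("utf-16-be")    # BOM is kept as a U+FEFF character

print(repr(with_bom))                  # 'hi'
print(repr(explicit))                  # '\ufeffhi'
```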

UCS-2

The predecessor to UTF-16. It is identical except that it doesn’t have surrogate pairs, so every character is exactly one 16-bit word. Therefore, it can only encode characters in the BMP.
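Python has no UCS-2 codec that I know of, so the following is a rough sketch rather than a real API: `encode_ucs2_be` is a hypothetical helper that leans on the fact that, inside the BMP, UCS-2 and UTF-16 produce the same bytes, and simply refuses anything that would need a surrogate pair.

```python
def encode_ucs2_be(text):
    """Hypothetical UCS-2 (big-endian) encoder: BMP characters only."""
    for ch in text:
        if ord(ch) > 0xFFFF:
            raise ValueError(f"U+{ord(ch):04X} is outside the BMP; UCS-2 cannot encode it")
    # Within the BMP, UCS-2 and UTF-16 are byte-for-byte identical.
    return text.encode("utf-16-be")


bmp_bytes = encode_ucs2_be("A\u20ac")
print(bmp_bytes.hex(" "))  # 00 41 20 ac

try:
    encode_ucs2_be("\U0001F600")
    rejected = False
except ValueError:
    rejected = True
print(rejected)            # True
```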

UTF-8

A variable-length character encoding for Unicode. While it can represent any Unicode character, the main benefit is that ASCII characters are encoded as the same single bytes in UTF-8 as in ASCII, so any ASCII text is already valid UTF-8 (requiring little or no change for software that handles ASCII but preserves other values). For these reasons, it is steadily becoming the preferred encoding for email, web pages, and other places where characters are stored or streamed. It uses 1 to 4 bytes per character.
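Both properties are easy to see for yourself. A small sketch, with sample characters I picked to land in each of the four length bands:

```python
# One sample character per UTF-8 length: ASCII, Latin-1 accent, euro sign, emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)  # [1, 2, 3, 4]

# Pure ASCII text comes out byte-for-byte the same in both encodings.
ascii_text = "plain ASCII"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")
print(same_bytes)  # True
```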

UTF-32 or UCS-4

A full 4-byte encoding for Unicode. It has the advantage that it is simple, as every character takes the same fixed 4 bytes. However, it is very inefficient, as characters outside the BMP are rarely used. For us, what with all the Latin characters that we use, it would add 3 unnecessary bytes per character compared with UTF-8. UCS-4 is a similar encoding that was originally defined over a larger (31-bit) code space, but the extra space is reserved. Therefore, we can assume that UCS-4 and UTF-32 are the same.
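To put a number on the inefficiency, here is a small sketch comparing the encoded size of a short, all-Latin string (the sample text is mine) across the three encodings. The little-endian codecs are used so that no BOM inflates the counts.

```python
text = "Hello, world"  # 12 characters, all basic Latin

sizes = {name: len(text.encode(name)) for name in ("utf-8", "utf-16-le", "utf-32-le")}
print(sizes)  # {'utf-8': 12, 'utf-16-le': 24, 'utf-32-le': 48}
```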

UTF-7

UTF-7 (7-bit Unicode Transformation Format) is a variable-length character encoding that was proposed for representing Unicode-encoded text using a stream of ASCII characters, for example for use in Internet e-mail messages.

The basic Internet e-mail standard SMTP specifies that the transmission format is US-ASCII and does not allow byte values above the ASCII range. MIME provides a way to specify the character set, allowing for use of other character sets including UTF-8 and UTF-16. However, the underlying transmission infrastructure is still not guaranteed to be 8-bit clean, and therefore content transfer encodings (e.g. base64) have to be used with them.
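Python ships a `utf-7` codec, so the ‘everything stays inside ASCII’ property can be checked directly. A minimal sketch, using a pound sign followed by a digit as the sample text (my choice):

```python
text = "\u00a31"  # POUND SIGN followed by the digit 1

encoded = text.encode("utf-7")
print(encoded)  # non-ASCII characters appear as a '+...-' base64 run

# Every byte of the result is in the 7-bit ASCII range...
is_ascii = all(byte < 0x80 for byte in encoded)
# ...and decoding gets the original text back.
decoded = encoded.decode("utf-7")

print(is_ascii, decoded == text)  # True True
```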

CESU-8

CESU-8 is a variant of UTF-8. It is similar to Java’s Modified UTF-8 but does not have the special encoding of the NUL character (U+0000). Like Modified UTF-8, it can be decoded into one UTF-16 word at a time. Because it doesn’t have special treatment of NUL, the resulting string will not be safe for NUL-terminated string handling if the original string contained NUL characters.

In practice, CESU-8 is often used to communicate with the Oracle database software, which in modern configurations apparently uses UTF-16 as an internal character representation. Oracle’s “UTF-8” (actually CESU-8) codec rejects proper UTF-8 sequences for characters from outside the Basic Multilingual Plane, but happily accepts and generates technically invalid UTF-8 sequences for code points in the surrogate range (U+D800 .. U+DFFF), as specified in CESU-8.
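As far as I know Python has no built-in CESU-8 codec, so here is a rough sketch of the idea instead. `cesu8_encode` is a hypothetical helper: it goes via UTF-16 and then encodes each 16-bit word, surrogates included, as if it were its own code point. The `surrogatepass` error handler is what lets Python's UTF-8 encoder emit the (technically invalid) surrogate sequences.

```python
import struct


def cesu8_encode(text):
    """Hypothetical CESU-8 encoder: UTF-8-style bytes for each UTF-16 word."""
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    for (unit,) in struct.iter_unpack(">H", utf16):
        # Surrogates are normally rejected by UTF-8; 'surrogatepass' allows them.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


outside_bmp = "\U00010400"  # a character outside the BMP
proper = outside_bmp.encode("utf-8")
cesu = cesu8_encode(outside_bmp)

print(proper.hex(" "))  # f0 90 90 80
print(cesu.hex(" "))    # ed a0 81 ed b0 80
```

For characters inside the BMP the two encodings produce identical bytes; they only part ways once surrogate pairs are involved.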

Other Definitions:

BOM – Byte Order Mark

The character at code point U+FEFF (“zero-width no-break space”), when that character is used to denote the endianness of a string of UCS/Unicode characters encoded in UTF-16 or UTF-32 and/or as a marker to indicate that text is encoded in UTF-8, UTF-16 or UTF-32.

In most encodings the BOM is a sequence which is unlikely to be seen in more conventional encodings or other Unicode encodings (usually looking like a sequence of obscure control codes). If a BOM is misinterpreted as an actual character within the text then it will generally be invisible, because it is a zero-width no-break space. The “zero-width no-break space” semantics of the U+FEFF character were deprecated in Unicode 3.2, allowing it to be used solely with the semantic of BOM.
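Using the BOM as an encoding marker can be sketched in a few lines. The `sniff_bom` helper and its table are my own illustration; the BOM constants come from Python's `codecs` module. The UTF-32 entries have to be checked before UTF-16, because the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE one (FF FE). That also means the guess is only a guess: FF FE 00 00 could in principle be UTF-16LE text that starts with a NUL character.

```python
import codecs

# Longest BOMs first, so UTF-32LE is not mistaken for UTF-16LE.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data):
    """Return (encoding name, BOM length), or (None, 0) if there is no BOM."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


raw = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
encoding, skip = sniff_bom(raw)
print(encoding, skip)              # utf-16-le 2
print(raw[skip:].decode(encoding)) # hi
```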

BMP – Basic Multilingual Plane

The Unicode code space for characters is divided into 17 planes, each with 65,536 code points, although currently only a few planes are used:

Plane 0 (0000
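Since every plane holds 65,536 (0x10000) code points, working out which plane a code point lives in is just a 16-bit shift. A minimal sketch, with `plane_of` being a name I made up for the example:

```python
def plane_of(code_point):
    """Return the plane number (0 to 16) that a code point belongs to."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return code_point >> 16  # each plane spans 0x10000 code points


planes = [plane_of(cp) for cp in (0x0041, 0xFFFF, 0x1F600, 0x10FFFF)]
print(planes)  # [0, 0, 1, 16]
```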

Andy Burns' Blog

Whatever I'm working on
