Unicode and UTF-8

- November 11, 2020

Photo by Tim Marshall on Unsplash

Unicode aims to create a worldwide convention that assigns a unique number, called a code point, to each character, so that the character can be represented everywhere by that number. The code space has room for 1,114,112 code points (17 * 65,536), and more than 100,000 characters are currently assigned. These cover the characters of the world's writing systems, along with symbols and emojis.
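As a small illustration, here is a sketch in Python (the choice of language and the sample characters are assumptions, not part of the original text). The built-in ord and chr functions convert between a character and its code point, and the unicodedata module gives the official character name:

```python
import unicodedata

# Sample characters, written as escapes so the source stays pure ASCII:
# "A", e with acute accent, euro sign, grinning face emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# ord() returns the code point of a single character.
code_points = [ord(ch) for ch in samples]

for ch, cp in zip(samples, code_points):
    # Code points are conventionally written as U+ followed by hex digits.
    print(f"U+{cp:04X} ({cp}) {unicodedata.name(ch)}")

# The code space is 17 planes of 65,536 code points each.
MAX_CODE_POINTS = 17 * 65536

# chr() is the inverse of ord(); the highest code point is U+10FFFF.
highest = chr(MAX_CODE_POINTS - 1)
```

The emoji has a code point above 65,535, which is why two bytes per character are not enough.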

Because there are more than 2^16 Unicode code points, they cannot all be represented in two bytes. A byte is a number between 0 and 255, written 00 to FF in hexadecimal, so a combination of 2 bytes can represent at most 65,536 distinct values.

A combination of 4 bytes can represent more than 4 billion values, which is more than enough to hold every Unicode code point. But storing each character in a 4-byte chunk is inefficient, since ASCII characters need only 1 byte.

The trick used by UTF-8 is to work with 1-byte units and to represent the more complex code points as sequences of 2 to 4 such units, rather than using 2-byte units (UTF-16) or 4-byte units (UTF-32).
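To make the size trade-off concrete, the following Python sketch counts the bytes each encoding needs for a few sample characters. The helper name encoded_sizes and the sample characters are hypothetical, chosen for illustration only:

```python
def encoded_sizes(ch):
    """Return the number of bytes needed to encode ch in each encoding."""
    # The "-le" codec variants omit the byte order mark,
    # so only the character itself is counted.
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }

# Sample characters, written as escapes so the source stays pure ASCII.
samples = {
    "A": "A",
    "e with acute": "\u00e9",
    "euro sign": "\u20ac",
    "grinning face": "\U0001F600",
}

for name, ch in samples.items():
    print(name, encoded_sizes(ch))
```

UTF-8 grows from 1 to 4 bytes as the code point gets larger, UTF-16 uses 2 bytes (or 4 for characters beyond the first 65,536 code points), and UTF-32 always uses 4.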

Whenever we read a byte of UTF-8 encoded text, we can immediately determine whether it is (1) an ASCII character, (2) the start of a multi-byte sequence, or (3) a continuation byte.

(1) The byte represents a single-byte ASCII character

This is the case when the byte value is between 0 and 127:

Decimal 0 to 127, hexadecimal 00 to 7F, binary 0000 0000 to 0111 1111 (128 different values)

(2) The byte represents the start of a byte sequence

The byte starts a 2-byte sequence when its value is between:

Decimal 194 to 223, hexadecimal C2 to DF, binary 1100 0010 to 1101 1111 (30 different values)

The byte starts a 3-byte sequence when its value is between:

Decimal 224 to 239, hexadecimal E0 to EF, binary 1110 0000 to 1110 1111 (16 different values)

The byte starts a 4-byte sequence when its value is between:

Decimal 240 to 244, hexadecimal F0 to F4, binary 1111 0000 to 1111 0100 (5 different values)

(3) The byte represents the continuation or the end of a sequence

This is the case when the byte value is between:

Decimal 128 to 191, hexadecimal 80 to BF, binary 1000 0000 to 1011 1111 (64 different values)
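The three cases above can be sketched as a small Python function. The name classify_byte and the labels it returns are hypothetical; the ranges are the ones listed above, and any byte outside them is reported as invalid:

```python
def classify_byte(b):
    """Classify one byte of UTF-8 text using the ranges listed above."""
    if 0x00 <= b <= 0x7F:
        return "ascii"
    if 0x80 <= b <= 0xBF:
        return "continuation"
    if 0xC2 <= b <= 0xDF:
        return "start-2"
    if 0xE0 <= b <= 0xEF:
        return "start-3"
    if 0xF0 <= b <= 0xF4:
        return "start-4"
    # C0, C1 and F5 to FF never appear in valid UTF-8.
    return "invalid"

# "h", e with acute accent, euro sign, grinning face emoji.
text = "h\u00e9\u20ac\U0001F600"
kinds = [classify_byte(b) for b in text.encode("utf-8")]

for b, kind in zip(text.encode("utf-8"), kinds):
    print(f"{b:02x} {kind}")
```

Each start byte is followed by exactly the number of continuation bytes its range announces: one after a 2-byte start, two after a 3-byte start, three after a 4-byte start.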

Use the od utility to view the bytes of a text file

Display the bytes of a text file containing "Antoine" with the following commands:

od -t xC hello.txt   # hexadecimal bytes
od -t uC hello.txt   # decimal bytes

Result:

Hexadecimal output: 41  6e  74  6f  69  6e  65
Decimal output:     65 110 116 111 105 110 101
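The same byte values can be reproduced without a file. This is a minimal Python sketch that encodes the string directly; the variable names are only illustrative:

```python
# Encode the text as UTF-8 and list each byte in hexadecimal and decimal.
data = "Antoine".encode("utf-8")

hex_bytes = " ".join(f"{b:02x}" for b in data)
dec_bytes = " ".join(str(b) for b in data)

print(hex_bytes)
print(dec_bytes)
```

Every byte is below 128, so each one is a single-byte ASCII character. A file saved with a trailing newline would show one extra byte, 0a (decimal 10), at the end.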