The Linux Console Tools: What is Unicode

4. What is Unicode

Traditionally, character encodings use 8 bits, and are thus limited to 256 characters. This causes problems because:

  1. it's not enough for some languages;
  2. people whose languages use different encodings have to choose one encoding at a time, and have to switch the system's state when changing language, which makes it difficult to mix several languages in the same file;
  3. etc.
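
The second problem can be illustrated with a small Python sketch. The byte value and the three encodings are chosen only for illustration; they are not taken from the console tools themselves.

```python
# One byte value, three 8-bit encodings, three different characters.
# The byte 0xE9 and the encodings listed here are illustrative choices.
raw = bytes([0xE9])

decoded = {}
for encoding in ("iso-8859-1", "iso-8859-7", "koi8-r"):
    decoded[encoding] = raw.decode(encoding)
    print(f"{encoding:12} 0x{raw[0]:02X} -> U+{ord(decoded[encoding]):04X}")
```

The file itself carries no hint of which interpretation is intended, so a text mixing, say, French and Russian cannot be stored in any single one of these encodings.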

Thus the UCS (Universal Character Set), also known as Unicode, was created to handle and mix all of the world's scripts. Its characters are stored as 32-bit (4-byte) values, of which 31 bits are used; this form is known as UCS4 because of the size of its characters, and is standardised by ISO as the 10646-1 standard. The most widely used characters from UCS are contained in UCS2, the 16-bit subset of UCS; this is the subset used by the Linux console.
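
As a rough sketch of the difference between UCS4 and its UCS2 subset, the following Python fragment prints the 4-byte form of a few characters and checks whether each fits in 16 bits. The helper names `ucs4_bytes` and `fits_in_ucs2` are hypothetical, and big-endian byte order is an assumption made for readability.

```python
def ucs4_bytes(char: str) -> bytes:
    """Return the character's code as a 4-byte value (big-endian assumed)."""
    return ord(char).to_bytes(4, "big")


def fits_in_ucs2(char: str) -> bool:
    """True if the character's code fits in the 16-bit UCS2 subset."""
    return ord(char) <= 0xFFFF


# U+0041, U+20AC and U+1F600 are arbitrary sample characters.
for char in ("A", "\u20ac", "\U0001F600"):
    print(f"U+{ord(char):04X}", ucs4_bytes(char).hex(), fits_in_ucs2(char))
```

The last sample lies outside the 16-bit range, so it belongs to UCS4 but not to the UCS2 subset.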

For convenience, the UTF8 encoding was designed as a variable-length encoding (at most 6 bytes per character, which covers the full UCS4 range) with ASCII compatibility; every character that has a UCS4 encoding can be expressed as a UTF8 sequence, and vice versa.
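
The mapping between a UCS4 value and its UTF8 sequence can be sketched in Python as follows. This is a minimal illustration of the original 1 to 6 byte scheme described in utf-8(7), not code from the console tools; the function names are hypothetical, and the decoder does no validation of malformed input.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one UCS4 value (0 .. 0x7FFFFFFF) as a 1 to 6 byte sequence."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("outside the 31-bit UCS4 range")
    if code_point < 0x80:
        return bytes([code_point])  # ASCII maps to itself
    # (exclusive upper limit, total length, lead byte marker)
    for limit, length, lead in (
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8),
        (0x80000000, 6, 0xFC),
    ):
        if code_point < limit:
            break
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (code_point & 0x3F))  # 6 payload bits per byte
        code_point >>= 6
    return bytes([lead | code_point] + tail[::-1])


def utf8_decode_one(data: bytes) -> int:
    """Decode the first sequence in data (assumed well-formed)."""
    first = data[0]
    if first < 0x80:
        return first
    length = 0
    while first & (0x80 >> length):  # count the leading 1 bits
        length += 1
    value = first & (0x7F >> length)
    for byte in data[1:length]:
        value = (value << 6) | (byte & 0x3F)
    return value


for cp in (0x41, 0xE9, 0x20AC, 0x7FFFFFFF):
    encoded = utf8_encode(cp)
    print(f"U+{cp:04X} -> {encoded.hex(' ')} -> U+{utf8_decode_one(encoded):04X}")
```

ASCII characters come out as single unchanged bytes, which is what makes UTF8 text compatible with ASCII-only tools.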

The Unicode Consortium defines additional properties for UCS2 characters.
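
A few such properties can be inspected with Python's standard `unicodedata` module, used here only as a convenient stand-in; which properties the console tools rely on is not stated in this section. The sample characters are arbitrary.

```python
import unicodedata

# Look up the name and general category of a few sample characters.
properties = {}
for char in ("A", "\u00e9", "5", " "):
    properties[char] = (unicodedata.name(char), unicodedata.category(char))
    print(f"U+{ord(char):04X}", *properties[char])
```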

See: unicode(7), utf-8(7).

