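4. What is Unicode

Traditionally, character encodings use 8 bits, and thus are limited to 256 characters. This causes problems because no single 8-bit encoding can cover all of the world's scripts, and text written in different encodings cannot easily be mixed.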
Thus the UCS (Universal Character Set), also known as Unicode, was created to handle and mix all of the world's scripts. This is a 32-bit (4-byte) encoding, otherwise known as UCS4 because of the size of its characters, and it is standardised by ISO as the 10646-1 standard. The most widely used characters from UCS are contained in UCS2, a 16-bit subset of UCS; this is the subset used by the Linux console.

For convenience, the UTF8 encoding was designed as a variable-length encoding with ASCII compatibility. Its sequences were up to 6 bytes long in the original definition, a limit later reduced to 4 bytes by RFC 3629. All chars that have a UCS4 encoding can be expressed as a UTF8 sequence, and vice-versa.

The Unicode consortium defines additional properties for UCS2 characters.
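
To make the variable-length idea more concrete, here is a minimal Python sketch of how a single UCS code point can be turned into a UTF8 byte sequence and back. The function names utf8_encode and utf8_decode are hypothetical, chosen only for this illustration. The sketch covers only the 1 to 4 byte forms used by present-day UTF8, not the longer forms of the original definition, and the decoder does not reject overlong sequences.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one UCS code point as UTF8 (hypothetical helper, 1 to 4 bytes)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("code point outside the range encodable here")
    if code_point < 0x80:
        # ASCII compatibility: values below 128 are stored unchanged in one byte.
        return bytes([code_point])
    if code_point < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx (enough for all of UCS2)
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


def utf8_decode(data: bytes) -> int:
    """Decode exactly one UTF8 sequence back to its code point.

    Simplified: overlong encodings are not detected.
    """
    if not data:
        raise ValueError("empty input")
    first = data[0]
    if first < 0x80:
        length, value = 1, first
    elif first >> 5 == 0b110:
        length, value = 2, first & 0x1F
    elif first >> 4 == 0b1110:
        length, value = 3, first & 0x0F
    elif first >> 3 == 0b11110:
        length, value = 4, first & 0x07
    else:
        raise ValueError("invalid leading byte")
    if len(data) != length:
        raise ValueError("sequence length does not match leading byte")
    for byte in data[1:]:
        if byte >> 6 != 0b10:
            raise ValueError("invalid continuation byte")
        # Each continuation byte contributes 6 more bits.
        value = (value << 6) | (byte & 0x3F)
    return value


if __name__ == "__main__":
    for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
        encoded = utf8_encode(cp)
        # Cross-check against Python's built-in codec.
        assert encoded == chr(cp).encode("utf-8")
        assert utf8_decode(encoded) == cp
        print(f"U+{cp:04X} -> {encoded.hex(' ')}")
```

In real programs the built-in codec (str.encode("utf-8") and bytes.decode("utf-8")) should be preferred; the hand-written version is only meant to show where the bits of the code point end up.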