GNU Info

Info Node: (elisp)Text Representations

Text Representations
====================

   Emacs has two "text representations"--two ways to represent text in
a string or buffer.  These are called "unibyte" and "multibyte".  Each
string, and each buffer, uses one of these two representations.  For
most purposes, you can ignore the issue of representations, because
Emacs converts text between them as appropriate.  Occasionally in Lisp
programming you will need to pay attention to the difference.

   In unibyte representation, each character occupies one byte and
therefore the possible character codes range from 0 to 255.  Codes 0
through 127 are ASCII characters; the codes from 128 through 255 are
used for one non-ASCII character set (you can choose which character
set by setting the variable `nonascii-insert-offset').
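
   As a rough illustration of the unibyte case, the sketch below builds
a three-character unibyte string and inspects it.  It assumes an Emacs
recent enough to provide `unibyte-string'; the function name
`text-rep-demo-unibyte' is hypothetical and exists only for this
example.

```elisp
;; Hypothetical demo function, not part of Emacs.
;; Assumes `unibyte-string' is available in the running Emacs.
(defun text-rep-demo-unibyte ()
  "Return facts about a small unibyte string as a list."
  (let ((s (unibyte-string 65 200 255)))
    (list (multibyte-string-p s)   ; nil, because S is unibyte
          (length s)               ; number of characters
          (string-bytes s)         ; number of bytes
          (aref s 1))))            ; code of the second character

(text-rep-demo-unibyte)
;; => (nil 3 3 200)
```

   The character count and the byte count agree, as expected when
each character occupies exactly one byte.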

   In multibyte representation, a character may occupy more than one
byte, and as a result, the full range of Emacs character codes can be
stored.  The first byte of a multibyte character is always in the range
128 through 159 (octal 0200 through 0237).  These values are called
"leading codes".  The second and subsequent bytes of a multibyte
character are always in the range 160 through 255 (octal 0240 through
0377); these values are "trailing codes".
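
   The byte ranges above can be summarized with a small classifier.
This is only a sketch of the ranges as stated in this node; the helper
`text-rep-demo-byte-class' is hypothetical, and other Emacs versions
may lay out multibyte text differently internally.

```elisp
;; Hypothetical helper, not part of Emacs: classify a byte value
;; according to the ranges given in the text above.
(defun text-rep-demo-byte-class (byte)
  "Return a symbol naming the role of BYTE in multibyte text."
  (cond ((not (and (integerp byte) (>= byte 0) (<= byte 255)))
         (error "Not a byte value: %S" byte))
        ((<= byte 127) 'ascii)          ; 0 through 127
        ((<= byte 159) 'leading-code)   ; 128 through 159
        (t 'trailing-code)))            ; 160 through 255

(mapcar #'text-rep-demo-byte-class '(65 128 159 160 255))
;; => (ascii leading-code leading-code trailing-code trailing-code)
```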

   Some sequences of bytes are not valid in multibyte text: for example,
a single isolated byte in the range 128 through 159 is not allowed.  But
character codes 128 through 159 can appear in multibyte text,
represented as two-byte sequences.  All the character codes 128 through
255 are possible (though slightly abnormal) in multibyte text; they
appear in multibyte buffers and strings when you do explicit encoding
and decoding (see Explicit Encoding).

   In a buffer, the buffer-local value of the variable
`enable-multibyte-characters' specifies the representation used.  The
representation for a string is determined and recorded in the string
when the string is constructed.

 - Variable: enable-multibyte-characters
     This variable specifies the current buffer's text representation.
     If it is non-`nil', the buffer contains multibyte text; otherwise,
     it contains unibyte text.

     You cannot set this variable directly; instead, use the function
     `set-buffer-multibyte' to change a buffer's representation.
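
     A minimal sketch of reading the variable and switching the
     representation with `set-buffer-multibyte' might look like the
     following.  The function name `text-rep-demo-toggle' is
     hypothetical; the first element of the result depends on how the
     session was started.

```elisp
;; Hypothetical demo function, not part of Emacs.
(defun text-rep-demo-toggle ()
  "Return the representation flag of a temporary buffer before and after
switching it to unibyte."
  (with-temp-buffer
    (let ((before enable-multibyte-characters))
      ;; Switch with `set-buffer-multibyte' rather than `setq'.
      (set-buffer-multibyte nil)
      (list before enable-multibyte-characters))))

(text-rep-demo-toggle)
;; The second element is nil after the switch.
```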

 - Variable: default-enable-multibyte-characters
     This variable's value is entirely equivalent to `(default-value
     'enable-multibyte-characters)', and setting this variable changes
     that default value.  Setting the local binding of
     `enable-multibyte-characters' in a specific buffer is not allowed,
     but changing the default value is supported, and it is a reasonable
     thing to do, because it has no effect on existing buffers.

     The `--unibyte' command line option does its job by setting the
     default value to `nil' early in startup.

 - Function: position-bytes position
     Return the byte-position corresponding to buffer position POSITION
     in the current buffer.  Byte-positions, like buffer positions,
     start at 1 at the beginning of the buffer.

 - Function: byte-to-position byte-position
     Return the buffer position corresponding to byte-position
     BYTE-POSITION in the current buffer.
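
     To see how the two functions relate, the sketch below fills a
     temporary multibyte buffer with one non-ASCII character between
     two ASCII ones and maps each buffer position to a byte-position
     and back.  The name `text-rep-demo-positions' is hypothetical, and
     the exact byte-positions depend on how many bytes the running
     Emacs uses for the non-ASCII character.

```elisp
;; Hypothetical demo function, not part of Emacs.
(defun text-rep-demo-positions ()
  "Return a list of (POS BYTE-POS ROUND-TRIP) for a small multibyte buffer."
  (with-temp-buffer
    (set-buffer-multibyte t)
    ;; Insert `a', a non-ASCII character (code #xe9), and `b'.
    (insert ?a #xe9 ?b)
    (let ((pos (point-min))
          (result nil))
      (while (<= pos (point-max))
        (let ((byte-pos (position-bytes pos)))
          ;; Convert back to check that the mapping round-trips.
          (push (list pos byte-pos (byte-to-position byte-pos)) result))
        (setq pos (1+ pos)))
      (nreverse result))))

(text-rep-demo-positions)
```

     In each entry the first and third elements should match.  Once a
     character that occupies more than one byte has been passed, the
     byte-position runs ahead of the buffer position.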

 - Function: multibyte-string-p string
     Return `t' if STRING is a multibyte string, and `nil' otherwise.
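
     A short sketch of the predicate on a few inputs follows.  It
     assumes an Emacs that provides `string-to-multibyte'; the function
     name `text-rep-demo-string-kinds' is hypothetical.

```elisp
;; Hypothetical demo function, not part of Emacs.
;; Assumes `string-to-multibyte' is available in the running Emacs.
(defun text-rep-demo-string-kinds ()
  "Return the results of `multibyte-string-p' on three sample objects."
  (list (multibyte-string-p "abc")                        ; ASCII-only literal
        (multibyte-string-p (string-to-multibyte "abc"))  ; converted copy
        (multibyte-string-p 42)))                         ; not a string

(text-rep-demo-string-kinds)
;; => (nil t nil)
```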

