GNU Info

Info Node: (elisp)Converting Representations


Converting Text Representations
===============================

   Emacs can convert unibyte text to multibyte; it can also convert
multibyte text to unibyte, though this conversion loses information.  In
general these conversions happen when inserting text into a buffer, or
when putting text from several strings together in one string.  You can
also explicitly convert a string's contents to either representation.

   Emacs chooses the representation for a string based on the text that
it is constructed from.  The general rule is to convert unibyte text to
multibyte text when combining it with other multibyte text, because the
multibyte representation is more general and can hold whatever
characters the unibyte text has.

   When inserting text into a buffer, Emacs converts the text to the
buffer's representation, as specified by `enable-multibyte-characters'
in that buffer.  In particular, when you insert multibyte text into a
unibyte buffer, Emacs converts the text to unibyte, even though this
conversion cannot in general preserve all the characters that might be
in the multibyte text.  The other natural alternative, to convert the
buffer contents to multibyte, is not acceptable because the buffer's
representation is a choice made by the user that cannot be overridden
automatically.

   Converting unibyte text to multibyte text leaves ASCII characters
unchanged, and likewise character codes 128 through 159.  It converts
the non-ASCII codes 160 through 255 by adding the value
`nonascii-insert-offset' to each character code.  By setting this
variable, you specify which character set the unibyte characters
correspond to (see Character Sets).  For example, if
`nonascii-insert-offset' is 2048, which is `(- (make-char
'latin-iso8859-1) 128)', then the unibyte non-ASCII characters
correspond to Latin 1.  If it is 2688, which is `(- (make-char
'greek-iso8859-7) 128)', then they correspond to Greek letters.
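
   As a rough illustration of the arithmetic just described, the
following sketch models the rule in plain Lisp.  The function
`my-model-unibyte-to-multibyte' is a hypothetical helper, not part of
Emacs; it takes the offset as an argument instead of reading
`nonascii-insert-offset', and it does not model
`nonascii-translation-table'.

```elisp
;; Hypothetical helper for illustration only; not an Emacs function.
(defun my-model-unibyte-to-multibyte (code offset)
  "Return the multibyte code that the rule above gives for unibyte CODE.
OFFSET stands in for the value of `nonascii-insert-offset'."
  (if (and (>= code 160) (<= code 255))
      (+ code offset)   ; non-ASCII codes 160 through 255 are shifted
    code))              ; ASCII and codes 128 through 159 are unchanged

(my-model-unibyte-to-multibyte ?a 2048)    ; => 97
(my-model-unibyte-to-multibyte 130 2048)   ; => 130
(my-model-unibyte-to-multibyte 233 2048)   ; => 2281
```

   The offset 2048 is the Latin 1 value quoted above; passing 2688
instead would model the Greek case.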

   Converting multibyte text to unibyte is simpler: it discards all but
the low 8 bits of each character code.  If `nonascii-insert-offset' has
a reasonable value, corresponding to the beginning of some character
set, this conversion is the inverse of the other: converting unibyte
text to multibyte and back to unibyte reproduces the original unibyte
text.
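
   The round trip can be checked with a small model of both
directions.  This is only a sketch of the two rules as they are worded
above, not the conversion code Emacs actually runs; all three function
names are hypothetical, and the offset is passed as an argument rather
than read from `nonascii-insert-offset'.

```elisp
;; Hypothetical helpers for illustration only; not Emacs functions.
(defun my-model-unibyte-to-multibyte (code offset)
  "Model of unibyte-to-multibyte conversion of CODE using OFFSET."
  (if (and (>= code 160) (<= code 255))
      (+ code offset)
    code))

(defun my-model-multibyte-to-unibyte (code)
  "Model of multibyte-to-unibyte conversion: keep the low 8 bits of CODE."
  (logand code 255))

(defun my-model-round-trip-ok-p (offset)
  "Return t if every unibyte code 0 through 255 survives a round trip."
  (let ((code 0)
        (ok t))
    (while (< code 256)
      (unless (= code (my-model-multibyte-to-unibyte
                       (my-model-unibyte-to-multibyte code offset)))
        (setq ok nil))
      (setq code (1+ code)))
    ok))

(my-model-round-trip-ok-p 2048)   ; => t
```

   Under this literal model the round trip holds for 2048, which is a
multiple of 256.  For 2688 the same model returns `nil', so the sketch
should be read only as an illustration of the wording above, not as a
description of how Emacs handles every offset.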

 - Variable: nonascii-insert-offset
     This variable specifies the amount to add to a non-ASCII character
     when converting unibyte text to multibyte.  It also applies when
     `self-insert-command' inserts a character in the unibyte non-ASCII
     range, 128 through 255.  However, the functions `insert' and
     `insert-char' do not perform this conversion.

     The right value to use to select character set CS is `(-
     (make-char CS) 128)'.  If the value of `nonascii-insert-offset' is
     zero, then conversion actually uses the value for the Latin 1
     character set, rather than zero.

 - Variable: nonascii-translation-table
     This variable provides a more general alternative to
     `nonascii-insert-offset'.  You can use it to specify independently
     how to translate each code in the range of 128 through 255 into a
     multibyte character.  The value should be a char-table, or `nil'.
     If this is non-`nil', it overrides `nonascii-insert-offset'.
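
     As a minimal sketch of what such a value might look like, the
     code below builds a char-table with one explicit entry.  The
     function name `my-make-example-table' is hypothetical, and the
     mapping of 233 to 2281 is chosen purely for illustration (2281 is
     233 plus the Latin 1 offset 2048 mentioned earlier).  The sketch
     does not assign the table to `nonascii-translation-table'.

```elisp
;; Hypothetical helper for illustration only; not an Emacs function.
(defun my-make-example-table ()
  "Return a translation-table char-table with a single explicit entry."
  (let ((table (make-char-table 'translation-table)))
    ;; Map the unibyte code 233 to the illustrative code 2281.
    (aset table 233 2281)
    table))

(let ((table (my-make-example-table)))
  (list (char-table-p table)   ; the value is a char-table
        (aref table 233)       ; the explicit entry
        (aref table 200)))     ; an entry that was never set
```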

 - Function: string-make-unibyte string
     This function converts the text of STRING to unibyte
     representation, if it isn't already, and returns the result.  If
     STRING is a unibyte string, it is returned unchanged.  Multibyte
     character codes are converted to unibyte by using just the low 8
     bits.

 - Function: string-make-multibyte string
     This function converts the text of STRING to multibyte
     representation, if it isn't already, and returns the result.  If
     STRING is a multibyte string, it is returned unchanged.  The
     function `unibyte-char-to-multibyte' is used to convert each
     unibyte character to a multibyte character.
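
     The two functions can be tried together as follows.  This is a
     sketch that assumes an Emacs in which `string-make-multibyte' and
     `string-make-unibyte' are available; the function name
     `my-representation-round-trip' is hypothetical.  The argument is
     expected to be a unibyte string containing a non-ASCII code, such
     as the one-character string written with an octal escape below.

```elisp
;; Hypothetical helper for illustration only; not an Emacs function.
(defun my-representation-round-trip (unibyte-string)
  "Convert UNIBYTE-STRING to multibyte and back, and report the results.
The return value is a list of four elements:
whether the argument is multibyte, whether the converted string is
multibyte, the length of the converted string, and whether converting
back reproduces the argument."
  (let* ((multi (string-make-multibyte unibyte-string))
         (back (string-make-unibyte multi)))
    (list (multibyte-string-p unibyte-string)
          (multibyte-string-p multi)
          (length multi)
          (equal unibyte-string back))))

;; "\351" is a unibyte string holding the single code 233.
(my-representation-round-trip "\351")
```

     Converting back with `string-make-unibyte' is expected to
     reproduce the original string, as described in the discussion of
     round trips above.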

