GNU Info

Info Node: (emacs)Recognize Coding

(emacs)Recognize Coding


Next: Specify Coding Prev: Coding Systems Up: International
Enter node , (file) or (file)node

Recognizing Coding Systems
==========================

   Emacs tries to recognize which coding system to use for a given text
as an integral part of reading that text.  (This applies to files being
read, output from subprocesses, text from X selections, etc.)  Emacs
can select the right coding system automatically most of the time--once
you have specified your preferences.

   Some coding systems can be recognized or distinguished by which byte
sequences appear in the data.  However, there are coding systems that
cannot be distinguished, not even potentially.  For example, there is no
way to distinguish between Latin-1 and Latin-2; they use the same byte
values with different meanings.

   Emacs handles this situation by means of a priority list of coding
systems.  Whenever Emacs reads a file, if you do not specify the coding
system to use, Emacs checks the data against each coding system,
starting with the first in priority and working down the list, until it
finds a coding system that fits the data.  Then it converts the file
contents assuming that they are represented in this coding system.

   The priority list of coding systems depends on the selected language
environment (Note: Language Environments).  For example, if you use
French, you probably want Emacs to prefer Latin-1 to Latin-2; if you use
Czech, you probably want Latin-2 to be preferred.  This is one of the
reasons to specify a language environment.

   However, you can alter the priority list in detail with the command
`M-x prefer-coding-system'.  This command reads the name of a coding
system from the minibuffer, and adds it to the front of the priority
list, so that it is preferred to all others.  If you use this command
several times, each use adds one element to the front of the priority
list.

   If you use a coding system that specifies the end-of-line conversion
type, such as `iso-8859-1-dos', what this means is that Emacs should
attempt to recognize `iso-8859-1' with priority, and should use DOS
end-of-line conversion when it does recognize `iso-8859-1'.

   Sometimes a file name indicates which coding system to use for the
file.  The variable `file-coding-system-alist' specifies this
correspondence.  There is a special function
`modify-coding-system-alist' for adding elements to this list.  For
example, to read and write all `.txt' files using the coding system
`china-iso-8bit', you can execute this Lisp expression:

     (modify-coding-system-alist 'file "\\.txt\\'" 'china-iso-8bit)

The first argument should be `file', the second argument should be a
regular expression that determines which files this applies to, and the
third argument says which coding system to use for these files.

   Emacs recognizes which kind of end-of-line conversion to use based on
the contents of the file: if it sees only carriage-returns, or only
carriage-return linefeed sequences, then it chooses the end-of-line
conversion accordingly.  You can inhibit the automatic use of
end-of-line conversion by setting the variable `inhibit-eol-conversion'
to non-`nil'.  If you do that, DOS-style files will be displayed with
the `^M' characters visible in the buffer; some people prefer this to
the more subtle `(DOS)' end-of-line type indication near the left edge
of the mode line (Note: eol-mnemonic.).

   By default, the automatic detection of coding system is sensitive to
escape sequences.  If Emacs sees a sequence of characters that begin
with an escape character, and the sequence is valid as an ISO-2022
code, that tells Emacs to use one of the ISO-2022 encodings to decode
the file.

   However, there may be cases that you want to read escape sequences
in a file as is.  In such a case, you can set the variable
`inhibit-iso-escape-detection' to non-`nil'.  Then the code detection
ignores any escape sequences, and never uses an ISO-2022 encoding.  The
result is that all escape sequences become visible in the buffer.

   The default value of `inhibit-iso-escape-detection' is `nil'.  We
recommend that you not change it permanently, only for one specific
operation.  That's because many Emacs Lisp source files in the Emacs
distribution contain non-ASCII characters encoded in the coding system
`iso-2022-7bit', and they won't be decoded correctly when you visit
those files if you suppress the escape sequence detection.

   You can specify the coding system for a particular file using the
`-*-...-*-' construct at the beginning of a file, or a local variables
list at the end (Note: File Variables).  You do this by defining a
value for the "variable" named `coding'.  Emacs does not really have a
variable `coding'; instead of setting a variable, this uses the
specified coding system for the file.  For example, `-*-mode: C;
coding: latin-1;-*-' specifies use of the Latin-1 coding system, as
well as C mode.  When you specify the coding explicitly in the file,
that overrides `file-coding-system-alist'.

   The variables `auto-coding-alist' and `auto-coding-regexp-alist' are
the strongest way to specify the coding system for certain patterns of
file names, or for files containing certain patterns; these variables
even override `-*-coding:-*-' tags in the file itself.  Emacs uses
`auto-coding-alist' for tar and archive files, to prevent it from being
confused by a `-*-coding:-*-' tag in a member of the archive and
thinking it applies to the archive file as a whole.  Likewise, Emacs
uses `auto-coding-regexp-alist' to ensure that RMAIL files, whose names
in general don't match any particular pattern, are decoded correctly.

   If Emacs recognizes the encoding of a file incorrectly, you can
reread the file using the correct coding system by typing `C-x <RET> c
CODING-SYSTEM <RET> M-x revert-buffer <RET>'.  To see what coding
system Emacs actually used to decode the file, look at the coding
system mnemonic letter near the left edge of the mode line (Note: Mode
Line), or type `C-h C <RET>'.

   Once Emacs has chosen a coding system for a buffer, it stores that
coding system in `buffer-file-coding-system' and uses that coding
system, by default, for operations that write from this buffer into a
file.  This includes the commands `save-buffer' and `write-region'.  If
you want to write files from this buffer using a different coding
system, you can specify a different coding system for the buffer using
`set-buffer-file-coding-system' (Note: Specify Coding).

   You can insert any possible character into any Emacs buffer, but
most coding systems can only handle some of the possible characters.
This means that it is possible for you to insert characters that cannot
be encoded with the coding system that will be used to save the buffer.
For example, you could start with an ASCII file and insert a few
Latin-1 characters into it, or you could edit a text file in Polish
encoded in `iso-8859-2' and add some Russian words to it.  When you
save the buffer, Emacs cannot use the current value of
`buffer-file-coding-system', because the characters you added cannot be
encoded by that coding system.

   When that happens, Emacs tries the most-preferred coding system (set
by `M-x prefer-coding-system' or `M-x set-language-environment'), and
if that coding system can safely encode all of the characters in the
buffer, Emacs uses it, and stores its value in
`buffer-file-coding-system'.  Otherwise, Emacs displays a list of
coding systems suitable for encoding the buffer's contents, and asks
you to choose one of those coding systems.

   If you insert the unsuitable characters in a mail message, Emacs
behaves a bit differently.  It additionally checks whether the
most-preferred coding system is recommended for use in MIME messages;
if not, Emacs tells you that the most-preferred coding system is not
recommended and prompts you for another coding system.  This is so you
won't inadvertently send a message encoded in a way that your
recipient's mail software will have difficulty decoding.  (If you do
want to use the most-preferred coding system, you can still type its
name in response to the question.)

   When you send a message with Mail mode (Note: Sending Mail), Emacs
has four different ways to determine the coding system to use for
encoding the message text.  It tries the buffer's own value of
`buffer-file-coding-system', if that is non-`nil'.  Otherwise, it uses
the value of `sendmail-coding-system', if that is non-`nil'.  The third
way is to use the default coding system for new files, which is
controlled by your choice of language environment, if that is
non-`nil'.  If all of these three values are `nil', Emacs encodes
outgoing mail using the Latin-1 coding system.

   When you get new mail in Rmail, each message is translated
automatically from the coding system it is written in, as if it were a
separate file.  This uses the priority list of coding systems that you
have specified.  If a MIME message specifies a character set, Rmail
obeys that specification, unless `rmail-decode-mime-charset' is `nil'.

   For reading and saving Rmail files themselves, Emacs uses the coding
system specified by the variable `rmail-file-coding-system'.  The
default value is `nil', which means that Rmail files are not translated
(they are read and written in the Emacs internal character code).


automatically generated by info2www version 1.2.2.9