GNU Info

Info Node: (cppinternals-300.info)Lexer


The Lexer
*********

   The lexer is contained in the file `cpplex.c'.  For efficiency, we
want the lexer to be single-pass.  We would also like the lexer to only
step forwards through the input files, and never step back.  This will
make future changes to support different character sets, in particular
state- or shift-dependent ones, much easier.

   This file also contains all information needed to spell a token,
i.e. to output it either in a diagnostic or to a preprocessed output
file.  This information is not exported, but made available to clients
through such functions as `cpp_spell_token' and `cpp_token_len'.

   The most painful aspect of lexing ISO-standard C and C++ is handling
trigraphs and backslash-escaped newlines.  Trigraphs are processed before
any interpretation of the meaning of a character is made, and
unfortunately there is a trigraph representation for a backslash, so it
is possible for the trigraph `??/' to introduce an escaped newline.
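
   As an illustration only, the following sketch shows the nine ISO C
trigraph mappings and how an escaped newline can be spelled in two
ways.  The names `trigraph_char' and `starts_escaped_newline' are
hypothetical and are not taken from `cpplex.c'; the sketch also
assumes that trigraph conversion is always enabled.

```c
#include <assert.h>
#include <stddef.h>

/* Map the third character of a "??x" sequence to the character the
   trigraph stands for, or return 0 if "??x" is not a trigraph.
   These are the nine trigraphs defined by ISO C.  */
char
trigraph_char (char c)
{
  switch (c)
    {
    case '=':  return '#';
    case '(':  return '[';
    case '/':  return '\\';
    case ')':  return ']';
    case '\'': return '^';
    case '<':  return '{';
    case '!':  return '|';
    case '>':  return '}';
    case '-':  return '~';
    default:   return 0;
    }
}

/* Hypothetical helper: return nonzero if the text at S begins with an
   escaped newline, spelled either as backslash + newline or as the
   backslash trigraph + newline.  Assumes trigraphs are enabled.  */
int
starts_escaped_newline (const char *s)
{
  assert (s != NULL);
  if (s[0] == '\\' && s[1] == '\n')
    return 1;
  /* Short-circuit evaluation keeps us from reading past the NUL.  */
  if (s[0] == '?' && s[1] == '?'
      && trigraph_char (s[2]) == '\\' && s[3] == '\n')
    return 1;
  return 0;
}
```

   Note that a test string containing the backslash trigraph is best
written as "?\?/" in C source, so that the compiler does not convert
it itself.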

   Escaped newlines are tedious because theoretically they can occur
anywhere--between the `+' and `=' of the `+=' token, within the
characters of an identifier, and even between the `*' and `/' that
terminate a comment.  Moreover, you cannot be sure there is just
one--there might be an arbitrarily long sequence of them.

   So the routine `parse_identifier', which lexes an identifier, cannot
assume that it can scan forwards until the first non-identifier
character and be done with it, because this could be the `\'
introducing an escaped newline, or the `?' introducing the trigraph
sequence that represents the `\' of an escaped newline.  The same
applies to `parse_number', the routine that handles numbers.  If these
routines stumble upon a `?' or `\', they call `skip_escaped_newlines'
to skip over any potential escaped newlines before checking whether
they can finish.
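
   The idea can be sketched roughly as follows.  This is not the code
in `cpplex.c': the function names, the NUL-terminated input and the
output buffer are all assumptions made for the example, and trigraphs
are assumed to be enabled.  The caller is assumed to have checked that
the first character can start an identifier.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: length of one escaped newline at S (2 for
   backslash + newline, 4 for the trigraph form), or 0 if none.  */
size_t
escaped_newline_len (const char *s)
{
  if (s[0] == '\\' && s[1] == '\n')
    return 2;
  if (s[0] == '?' && s[1] == '?' && s[2] == '/' && s[3] == '\n')
    return 4;
  return 0;
}

/* Skip an arbitrarily long run of escaped newlines starting at S.  */
const char *
skip_escaped_newlines_sketch (const char *s)
{
  size_t n;

  while ((n = escaped_newline_len (s)) != 0)
    s += n;
  return s;
}

/* Copy the spelling of the identifier at S into OUT (capacity CAP,
   truncating if necessary) and return a pointer to the first input
   character that is not part of the identifier.  */
const char *
lex_identifier_sketch (const char *s, char *out, size_t cap)
{
  size_t len = 0;

  assert (s != NULL && out != NULL && cap > 0);
  for (;;)
    {
      /* A backslash or question mark might only be the start of an
         escaped newline, so look past it before deciding to stop.  */
      if (*s == '\\' || *s == '?')
        s = skip_escaped_newlines_sketch (s);
      if (!(isalnum ((unsigned char) *s) || *s == '_'))
        break;
      if (len + 1 < cap)
        out[len++] = *s;
      s++;
    }
  out[len] = '\0';
  return s;
}
```

   With this sketch, the input `fo' followed by two backslash-newline
pairs, `o', a trigraph-escaped newline and `bar+' yields the spelling
`foobar' and stops at the `+'.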

   Similarly, code in the main body of `_cpp_lex_token' cannot simply
check for a `=' after a `+' character to determine whether it has a
`+=' token; it needs to be prepared for an escaped newline of some
sort.  These cases use the function `get_effective_char', which returns
the first character after any intervening escaped newlines.
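
   A minimal sketch of this pattern is given below.  The names
`effective_char_sketch' and `lex_after_plus', the `plus_kind'
enumeration and the pointer-based interface are assumptions for the
example; the real `get_effective_char' works on the reader's buffer
and is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

enum plus_kind { PLUS, PLUS_EQ, PLUS_PLUS };

/* Return a pointer to the first character at or after S that is not
   part of an escaped newline (either spelling).  Hypothetical
   stand-in for the role played by get_effective_char.  */
const char *
effective_char_sketch (const char *s)
{
  assert (s != NULL);
  for (;;)
    {
      if (s[0] == '\\' && s[1] == '\n')
        s += 2;
      else if (s[0] == '?' && s[1] == '?' && s[2] == '/' && s[3] == '\n')
        s += 4;
      else
        return s;
    }
}

/* S points just past a '+'.  Classify the token and set *END to the
   first character after it.  */
enum plus_kind
lex_after_plus (const char *s, const char **end)
{
  const char *p = effective_char_sketch (s);

  assert (end != NULL);
  if (*p == '=')
    {
      *end = p + 1;
      return PLUS_EQ;
    }
  if (*p == '+')
    {
      *end = p + 1;
      return PLUS_PLUS;
    }
  /* Plain '+': leave any escaped newlines for the next token.  */
  *end = s;
  return PLUS;
}
```

   Ordinary whitespace is not skipped, so `+ =' is still lexed as two
tokens, as the language requires.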

   The lexer needs to keep track of the correct column position,
including counting tabs as specified by the `-ftabstop=' option.  This
should be done even within comments; C-style comments can appear in the
middle of a line, and we want to report diagnostics in the correct
position for text appearing after the end of the comment.
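
   One way to keep columns correct across tabs is sketched below.  It
assumes 1-based columns and tab stops every TABSTOP columns (so with a
tab stop of 8, a tab moves to column 9, 17, and so on); the function
names are hypothetical and the code is not taken from `cpplex.c'.

```c
#include <assert.h>
#include <stddef.h>

/* Return the column reached after consuming character C at 1-based
   column COL, with tab stops every TABSTOP columns.  */
unsigned int
advance_column (unsigned int col, char c, unsigned int tabstop)
{
  assert (col >= 1 && tabstop >= 1);
  if (c == '\n')
    return 1;
  if (c == '\t')
    /* Round up to the column just after the next tab stop.  */
    return ((col - 1) / tabstop + 1) * tabstop + 1;
  return col + 1;
}

/* Return the column of the character that would follow TEXT, counting
   every character, including those inside comments.  */
unsigned int
column_after (const char *text, unsigned int tabstop)
{
  unsigned int col = 1;

  assert (text != NULL);
  for (; *text != '\0'; text++)
    col = advance_column (col, *text, tabstop);
  return col;
}
```

   Because the column is advanced for every character, text following
a mid-line comment is still reported at the right position.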

   Some identifiers, such as `__VA_ARGS__' and poisoned identifiers,
may be invalid and require a diagnostic.  However, if they appear in a
macro expansion we don't want to complain with each use of the macro.
It is therefore best to catch them during the lexing stage, in
`parse_identifier'.  In both cases, whether a diagnostic is needed
depends upon lexer state.  For example, we don't want to issue
a diagnostic for re-poisoning a poisoned identifier, or for using
`__VA_ARGS__' in the expansion of a variable-argument macro.  Therefore
`parse_identifier' makes use of flags to determine whether a diagnostic
is appropriate.  Since we change state on a per-token basis, and don't
lex whole lines at a time, this is not a problem.
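
   The flag-driven decision might look something like the following
sketch.  The structure `lexer_state_sketch', its two flags and the
function `check_identifier' are invented for the example and do not
correspond to the actual fields used by `parse_identifier'.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-token lexer state.  */
struct lexer_state_sketch
{
  int va_args_ok;   /* Nonzero while lexing a variable-argument
                       macro's expansion.  */
  int poisoned_ok;  /* Nonzero while lexing the operands of a
                       directive that poisons identifiers.  */
};

enum ident_diag { DIAG_NONE, DIAG_VA_ARGS, DIAG_POISONED };

/* Decide whether the identifier NAME, which is poisoned if
   IS_POISONED is nonzero, needs a diagnostic in the current STATE.  */
enum ident_diag
check_identifier (const struct lexer_state_sketch *state,
                  const char *name, int is_poisoned)
{
  assert (state != NULL && name != NULL);
  if (is_poisoned && !state->poisoned_ok)
    return DIAG_POISONED;
  if (strcmp (name, "__VA_ARGS__") == 0 && !state->va_args_ok)
    return DIAG_VA_ARGS;
  return DIAG_NONE;
}
```

   Since the check happens once, when the token is lexed, later
expansions of a macro containing the identifier do not repeat the
diagnostic.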

   Another place where state flags are used to change behaviour is
whilst parsing header names.  Normally, a `<' would be lexed as a single
token on its own.  After a `#include' directive, though, everything
from the `<' as far as the nearest `>' character should be lexed as a
single header-name token.  Note that we don't allow the terminators of
header names to be escaped; the first `"' or `>' terminates the header
name.
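
   A rough sketch of header-name scanning under these rules follows.
The function name and interface are hypothetical; the only behaviour
taken from the text above is that the first terminator ends the name
and that a backslash does not escape it.

```c
#include <assert.h>
#include <stddef.h>

/* S points at the opening '<' or '"' of a header name.  Return a
   pointer to the terminating '>' or '"' and store the length of the
   name in *LEN, or return NULL if the line ends first.  */
const char *
lex_header_name_sketch (const char *s, size_t *len)
{
  char term;
  const char *p;

  assert (s != NULL && len != NULL);
  assert (*s == '<' || *s == '"');
  term = (*s == '<') ? '>' : '"';
  for (p = s + 1; *p != '\0' && *p != '\n'; p++)
    if (*p == term)
      {
        /* No escape processing: a preceding backslash is just part
           of the name.  */
        *len = (size_t) (p - (s + 1));
        return p;
      }
  return NULL;
}
```

   For instance, in the quoted form a backslash followed by `"' ends
the name at that `"', leaving the backslash as the last character of
the header name.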

   Interpretation of some character sequences depends upon whether we
are lexing C, C++ or Objective-C, and on the revision of the standard in
force.  For example, `::' is a single token in C++, but two separate
`:' tokens, and almost certainly a syntax error, in C.  Such cases are
handled in the main function `_cpp_lex_token', based upon the flags set
in the `cpp_options' structure.
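
   As a simplified illustration of a language-dependent decision, the
sketch below chooses the length of a token beginning with `:'.  The
structure `options_sketch' and its single flag are stand-ins invented
for the example, not the real `cpp_options'; digraphs and escaped
newlines are deliberately ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant language option.  */
struct options_sketch
{
  int cplusplus;  /* Nonzero when lexing C++.  */
};

/* S points at a ':'.  Return the length of the token starting there:
   2 for the C++ scope operator, otherwise 1.  */
int
colon_token_len (const struct options_sketch *opts, const char *s)
{
  assert (opts != NULL && s != NULL && *s == ':');
  if (opts->cplusplus && s[1] == ':')
    return 2;
  return 1;
}
```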

   Note that we have almost, but not quite, achieved the goal of not
stepping backwards in the input stream.  Currently
`skip_escaped_newlines' does step back, though with care it should be
possible to adjust it so that this does not happen.  One tricky case
arises when we meet a trigraph while the command line option
`-trigraphs' is not in force but `-Wtrigraphs' is: we need to warn
about it, but then buffer it and continue to treat it as 3 separate
characters.

