GNU Info

Info Node: (g77-300.info)lex.c


lex.c
-----

   To keep the lexer simple, fast, and easy to maintain, and to have
`g77' encourage Fortran programmers to write simple, maintainable,
portable code by making that kind of code the fastest to compile, the
lexer will follow these rules:

   * There'll be just one lexer, for both fixed-form and free-form
     source.

   * It'll care about the form only when handling the first 7 columns of
     text, stuff like spaces between strings of alphanumerics, and how
     lines are continued.

     Some other distinctions will be handled by subsequent phases, so
     at least one of them will have to know which form is involved.

     For example, `I = 2 . 4' is acceptable in fixed form, and works in
     free form as well given the implementation `g77' presently uses.
     But the standard requires a diagnostic for it in free form, so the
     parser has to be able to recognize that the lexemes aren't
     contiguous (information the lexer _does_ have to provide) and that
     free-form source is being parsed, so it can provide the diagnostic.

     The `g77' lexer doesn't try to gather `2 . 4' into a single lexeme.
     Otherwise, it'd have to know a whole lot more about how to parse
     Fortran, or subsequent phases (mainly parsing) would have two
     paths through lots of critical code--one to handle the lexemes `2',
     `.', and `4' in sequence, another to handle the lexeme `2.4'.

   * It won't worry about line lengths (beyond the first 7 columns for
     fixed-form source).

     That is, once it starts parsing the "statement" part of a line
     (column 7 for fixed-form, column 1 for free-form), it'll keep
     going until it finds a newline, rather than ignoring everything
     past a particular column (72 or 132).

     The implication here is that there shouldn't _be_ anything past
     that last column, other than whitespace or commentary, because
     users using typical editors (or viewing output as typically
     printed) won't necessarily know just where the last column is.

     Code that has "garbage" beyond the last column (almost certainly
     only fixed-form code with a punched-card legacy, such as code
     using columns 73-80 for "sequence numbers") will have to be run
     through `g77stripcard' first.

     Also, keeping track of the maximum column position while also
     watching out for the end of a line _and_ while reading from a file
     just makes things slower.  Since a file must be read, and watching
     for the end of the line is necessary (unless the typical input
     file was preprocessed to include the necessary number of trailing
     spaces), dropping the tracking of the maximum column position is
     the only way to reduce the complexity of the pertinent code while
     maintaining high performance.

   * ASCII encoding is assumed for the input file.

     Code written in other character sets will have to be converted
     first.

   * Tabs (ASCII code 9) will be converted to spaces via the
     straightforward approach.

     Specifically, a tab is converted to between one and eight spaces
     as necessary to reach column N, where dividing `(N - 1)' by eight
     results in a remainder of zero.

     That saves having to pass most source files through `expand'.

   * Linefeeds (ASCII code 10) mark the ends of lines.

   * A carriage return (ASCII code 13) is accepted if it immediately
     precedes a linefeed, in which case it is ignored.

     Otherwise, it is rejected (with a diagnostic).

   * Any characters other than the above that are not part of the GNU
     Fortran Character Set (see Character Set) are rejected with a
     diagnostic.

     This includes backspaces, form feeds, and the like.

     (It might make sense to allow a form feed in column 1 as long as
     that's the only character on a line.  It certainly wouldn't seem
     to cost much in terms of performance.)

   * The end of the input stream (EOF) ends the current line.

   * The distinction between uppercase and lowercase letters will be
     preserved.

     It will be up to subsequent phases to decide to fold case.

     Current plans are to permit any casing for Fortran (reserved)
     keywords while preserving casing for user-defined names.  (This
     might not be made the default for `.f' files, though.)

     Preserving case seems necessary to provide more direct access to
     facilities outside of `g77', such as to C or Pascal code.

     Names of intrinsics will probably be matchable in any case.
     However, there probably won't be any option to require a
     particular mixed-case appearance of intrinsics (as there was for
     `g77' prior to version 0.6), because that's painful to maintain,
     and probably nobody uses it.

     (How `external SiN; r = sin(x)' would be handled is TBD.  I think
     old `g77' might already handle that pretty elegantly, but whether
     we can cope with allowing the same fragment to reference a
     _different_ procedure, even with the same interface, via `s =
     SiN(r)', needs to be determined.  If we can't, we need to make
     sure that when code introduces a user-defined name, any intrinsic
     matching that name using a case-insensitive comparison is "turned
     off".)

   * Backslashes in `CHARACTER' and Hollerith constants are not allowed.

     This avoids the confusion introduced by some Fortran compiler
     vendors providing C-like interpretation of backslashes, while
     others provide straight-through interpretation.

     Some kind of lexical construct (TBD) will be provided to allow
     flagging of a `CHARACTER' (but probably not a Hollerith) constant
     that permits backslashes.  It'll necessarily be a prefix, such as:

          PRINT *, C'This line has a backspace \b here.'
          PRINT *, F'This line has a straight backslash \ here.'

     Further, command-line options might be provided to specify that
     one prefix or the other is to be assumed as the default for
     `CHARACTER' constants.

     However, it seems more helpful for `g77' to provide a program that
     prefixes all constants (or just those containing
     backslashes) with the desired designation, so printouts of code
     can be read without knowing the compile-time options used when
     compiling it.

     If such a program is provided (let's name it `g77slash' for now),
     then a command-line option to `g77' should not be provided.
     (Though, given that it'll be easy to implement, it might be hard
     to resist user requests for it "to compile faster than if we have
     to invoke another filter".)

     This program would take a command-line option to specify the
     default interpretation of backslashes, affecting which prefix it
     uses for constants.

     `g77slash' probably should automatically convert Hollerith
     constants that contain backslashes to the appropriate `CHARACTER'
     constants.  Then `g77' wouldn't have to define a prefix syntax for
     Hollerith constants specifying whether they want C-style or
     straight-through backslashes.

   * To allow for form-neutral INCLUDE files without requiring them to
     be preprocessed, the fixed-form lexer should offer an extension
     (if possible) allowing a trailing `&' to be ignored, especially if
     after column 72, as it would be using the traditional Unix Fortran
     source model (which ignores _everything_ after column 72).
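
   As a rough illustration of the `I = 2 . 4' item above, here is a
minimal Python sketch (not the actual `lex.c' code) of a lexer that
keeps `2', `.', and `4' as separate lexemes while recording whether
whitespace preceded each one.  The names `Lexeme', `split_lexemes',
and `has_embedded_space' are made up for this example, and the
character classes are deliberately oversimplified.

```python
from typing import List, NamedTuple


class Lexeme(NamedTuple):
    """Hypothetical lexeme record; not g77's real data structure."""
    text: str
    column: int          # 1-based column of the first character
    follows_space: bool  # True if a space separated it from the previous lexeme


def split_lexemes(statement: str) -> List[Lexeme]:
    """Split a statement into digit runs, names, and single punctuation
    characters.  Assumes ASCII input with tabs already expanded."""
    lexemes = []
    i = 0
    saw_space = False
    while i < len(statement):
        ch = statement[i]
        if ch == " ":
            saw_space = True
            i += 1
            continue
        start = i
        if ch.isdigit():
            while i < len(statement) and statement[i].isdigit():
                i += 1
        elif ch.isalpha():
            while i < len(statement) and statement[i].isalnum():
                i += 1
        else:
            i += 1  # any other character is a lexeme by itself
        lexemes.append(Lexeme(statement[start:i], start + 1, saw_space))
        saw_space = False
    return lexemes


def has_embedded_space(lexemes: List[Lexeme], first: int, last: int) -> bool:
    """True if a space separates any two of lexemes[first..last]."""
    return any(lex.follows_space for lex in lexemes[first + 1:last + 1])
```

   Both `I = 2 . 4' and `I = 2.4' yield the same five lexemes here;
only the `follows_space' flags differ.  A later phase that knows
free-form source is being parsed could use something like
`has_embedded_space' to decide whether to issue the diagnostic, without
the lexer having to know what a real constant looks like.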

   The above implements nearly exactly what is specified by the
Character Set and Lines nodes, except it also provides automatic
conversion of tabs and ignoring of newline-related carriage returns, as
well as accommodating form-neutral INCLUDE files.
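
   The tab rule given earlier (advance to the next column N for which
`(N - 1)' divided by eight leaves a remainder of zero) can be sketched
in a few lines of Python.  This is only an illustration of the
arithmetic, not the lexer's code; `expand_tabs' is a name invented for
the example.

```python
def expand_tabs(line: str) -> str:
    """Replace each tab with one to eight spaces so that the next
    character lands in column 1, 9, 17, ... (1-based columns)."""
    out = []
    column = 1  # column where the next character will be placed
    for ch in line:
        if ch == "\t":
            # Distance to the next column N with (N - 1) % 8 == 0.
            spaces = 8 - (column - 1) % 8
            out.append(" " * spaces)
            column += spaces
        else:
            out.append(ch)
            column += 1
    return "".join(out)
```

   A tab in column 1 becomes eight spaces, and a tab in column 8
becomes a single space; either way the following character lands in
column 9.  For a single line this matches what `expand' does with its
default tab stops.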

   It also implements the "pure visual" model, by which is meant that a
user viewing his code in a typical text editor (assuming it's not
preprocessed via `g77stripcard' or similar) doesn't need any special
knowledge of whether spaces on the screen are really tabs, whether
lines end immediately after the last visible non-space character or
after a number of spaces and tabs that follow it, or whether the last
line in the file is ended by a newline.

   Most editors don't make these distinctions, the ANSI FORTRAN 77
standard doesn't require them to, and it permits a standard-conforming
compiler to define a method for transforming source code to "standard
form" however it wants.

   So, GNU Fortran defines it such that users have the best chance of
having the code be interpreted the way it looks on the screen of the
typical editor.
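
   To make the line-ending part of this model concrete, here is a
small Python sketch of the rules as stated: a linefeed ends a line, a
carriage return is ignored only when a linefeed immediately follows
it, and end of input ends whatever line is in progress.  `split_lines'
and `LexError' are hypothetical names (the latter stands in for
whatever diagnostic `g77' would actually issue), and rejection of
other characters outside the GNU Fortran Character Set is left out.

```python
from typing import List


class LexError(ValueError):
    """Hypothetical stand-in for a lexer diagnostic."""


def split_lines(data: str) -> List[str]:
    """Split input text into lines under the rules sketched above."""
    lines = []
    current = []
    for i, ch in enumerate(data):
        if ch == "\n":
            lines.append("".join(current))
            current = []
        elif ch == "\r":
            # Accepted (and dropped) only directly before a linefeed.
            if i + 1 >= len(data) or data[i + 1] != "\n":
                raise LexError("carriage return not followed by linefeed")
        else:
            current.append(ch)
    if current:
        # End of input ends the current line, newline or not.
        lines.append("".join(current))
    return lines
```

   Under this sketch, a file with Unix line endings, the same file with
CR-LF line endings, and the same file missing its final newline all
produce the same list of lines, which is the point of the model.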

   (Fancy editors should _never_ be required to correctly read code
written in classic two-dimensional-plaintext form.  By correct reading
I mean ability to read it, book-like, without mistaking text ignored by
the compiler for program code and vice versa, and without having to
count beyond the first several columns.  The vague meaning of ASCII
TAB, among other things, complicates this somewhat, but as long as
"everyone", including the editor, other tools, and printer, agrees
about the every-eighth-column convention, the GNU Fortran "pure visual"
model meets these requirements.  Any language or user-visible source
form requiring special tagging of tabs, the ends of lines after
spaces/tabs, and so on, fails to meet this fairly straightforward
specification.  Fortunately, Fortran _itself_ does not mandate such a
failure, though most vendor-supplied defaults for their Fortran
compilers _do_ fail to meet this specification for readability.)

   Further, this model provides a clean interface to whatever
preprocessors or code-generators are used to produce input to this
phase of `g77'.  Mainly, they need not worry about long lines.
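
   The difference for long lines can be shown with a short Python
sketch contrasting the approach described here with the traditional
column-72 cutoff.  Both function names are invented for the example,
and it assumes tabs have already been expanded and ignores comment and
continuation handling entirely.

```python
def statement_text(line: str, fixed_form: bool) -> str:
    """Approach described above: start at column 7 (fixed form) or
    column 1 (free form) and keep everything up to the end of the line."""
    start = 6 if fixed_form else 0
    return line[start:]


def statement_text_card(line: str) -> str:
    """Traditional fixed-form model, for contrast: columns 7-72 only."""
    return line[6:72]
```

   A generated line that runs past column 72 comes through
`statement_text' intact, whereas `statement_text_card' silently drops
the tail.  The flip side is that a card image with sequence numbers in
columns 73-80 comes through `statement_text' with those digits
attached, which is why such code would need `g77stripcard' first.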

