lex.c
-----
To help make the lexer simple, fast, and easy to maintain, while
also having `g77' generally encourage Fortran programmers to write
simple, maintainable, portable code by maximizing the performance of
compiling that kind of code:
* There'll be just one lexer, for both fixed-form and free-form
source.
* It'll care about the form only when handling the first 7 columns of
text, stuff like spaces between strings of alphanumerics, and how
lines are continued.
Some other distinctions will be handled by subsequent phases, so
at least one of them will have to know which form is involved.
For example, `I = 2 . 4' is acceptable in fixed form, and works in
free form as well given the implementation `g77' presently uses.
But the standard requires a diagnostic for it in free form, so the
parser has to be able to recognize that the lexemes aren't
contiguous (information the lexer _does_ have to provide) and that
free-form source is being parsed, so it can provide the diagnostic.
The `g77' lexer doesn't try to gather `2 . 4' into a single lexeme.
Otherwise, it'd have to know a whole lot more about how to parse
Fortran, or subsequent phases (mainly parsing) would have two
paths through lots of critical code--one to handle the lexemes `2',
`.', and `4' in sequence, another to handle the lexeme `2.4'.
* It won't worry about line lengths (beyond the first 7 columns for
fixed-form source).
That is, once it starts parsing the "statement" part of a line
(column 7 for fixed-form, column 1 for free-form), it'll keep
going until it finds a newline, rather than ignoring everything
past a particular column (72 or 132).
The implication here is that there shouldn't _be_ anything past
that last column, other than whitespace or commentary, because
users using typical editors (or viewing output as typically
printed) won't necessarily know just where the last column is.
Code that has "garbage" beyond the last column (almost certainly
only fixed-form code with a punched-card legacy, such as code
using columns 73-80 for "sequence numbers") will have to be run
through `g77stripcard' first.
Also, keeping track of the maximum column position while also
watching out for the end of a line _and_ while reading from a file
just makes things slower. Since a file must be read, and watching
for the end of the line is necessary (unless the typical input
file was preprocessed to include the necessary number of trailing
spaces), dropping the tracking of the maximum column position is
the only way to reduce the complexity of the pertinent code while
maintaining high performance.
* ASCII encoding is assumed for the input file.
Code written in other character sets will have to be converted
first.
* Tabs (ASCII code 9) will be converted to spaces via the
straightforward approach.
Specifically, a tab is converted to between one and eight spaces
as necessary to reach column N, where dividing `(N - 1)' by eight
results in a remainder of zero.
That saves having to pass most source files through `expand'.
* Linefeeds (ASCII code 10) mark the ends of lines.
* A carriage return (ASCII code 13) is accepted if it immediately
precedes a linefeed, in which case it is ignored.
Otherwise, it is rejected (with a diagnostic).
* Any characters other than the above that are not part of the
GNU Fortran Character Set (see Character Set) are rejected
with a diagnostic.
This includes backspaces, form feeds, and the like.
(It might make sense to allow a form feed in column 1 as long as
that's the only character on a line. It certainly wouldn't seem
to cost much in terms of performance.)
* The end of the input stream (EOF) ends the current line.
* The distinction between uppercase and lowercase letters will be
preserved.
It will be up to subsequent phases to decide to fold case.
Current plans are to permit any casing for Fortran (reserved)
keywords while preserving casing for user-defined names. (This
might not be made the default for `.f' files, though.)
Preserving case seems necessary to provide more direct access to
facilities outside of `g77', such as to C or Pascal code.
Names of intrinsics will probably be matchable in any case.
However, there probably won't be any option to require a
particular mixed-case appearance of intrinsics (as there was for
`g77' prior to version 0.6), because that's painful to maintain,
and probably nobody uses it.
(How `external SiN; r = sin(x)' would be handled is TBD. I think
old `g77' might already handle that pretty elegantly, but whether
we can cope with allowing the same fragment to reference a
_different_ procedure, even with the same interface, via `s =
SiN(r)', needs to be determined. If we can't, we need to make
sure that when code introduces a user-defined name, any intrinsic
matching that name using a case-insensitive comparison is "turned
off".)
* Backslashes in `CHARACTER' and Hollerith constants are not allowed.
This avoids the confusion introduced by some Fortran compiler
vendors providing C-like interpretation of backslashes, while
others provide straight-through interpretation.
Some kind of lexical construct (TBD) will be provided to allow
flagging of a `CHARACTER' (but probably not a Hollerith) constant
that permits backslashes. It'll necessarily be a prefix, such as:
     PRINT *, C'This line has a backspace \b here.'
     PRINT *, F'This line has a straight backslash \ here.'
Further, command-line options might be provided to specify that
one prefix or the other is to be assumed as the default for
`CHARACTER' constants.
However, it seems more helpful for `g77' to provide a program that
prefixes all constants (or just those containing
backslashes) with the desired designation, so printouts of code
can be read without knowing the compile-time options used when
compiling it.
If such a program is provided (let's name it `g77slash' for now),
then a command-line option to `g77' should not be provided.
(Though, given that it'll be easy to implement, it might be hard
to resist user requests for it "to compile faster than if we have
to invoke another filter".)
This program would take a command-line option to specify the
default interpretation of backslashes, affecting which prefix it uses
for constants.
`g77slash' probably should automatically convert Hollerith
constants that contain backslashes to the appropriate `CHARACTER'
constants. Then `g77' wouldn't have to define a prefix syntax for
Hollerith constants specifying whether they want C-style or
straight-through backslashes.
* To allow for form-neutral INCLUDE files without requiring them to
be preprocessed, the fixed-form lexer should offer an extension
(if possible) allowing a trailing `&' to be ignored, especially if
after column 72, as it would be using the traditional Unix Fortran
source model (which ignores _everything_ after column 72).
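The `I = 2 . 4' item earlier in this list asks the lexer to emit `2',
`.', and `4' as separate lexemes while still telling the parser
whether they were contiguous. A minimal sketch of that idea follows.
The names `struct lexeme', `split_lexemes', and
`real_has_embedded_blanks' are hypothetical and are not taken from
`g77' itself; the splitting rule (runs of alphanumerics, single
punctuation characters) is a simplifying assumption.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical lexeme record; the real g77 structures may differ. */
struct lexeme {
    const char *start;   /* first character of the lexeme */
    size_t len;          /* number of characters in the lexeme */
    int space_before;    /* nonzero if blanks preceded the lexeme */
};

/* Split `text' into runs of alphanumerics and single punctuation
   characters, recording whether blanks preceded each lexeme.
   Returns the number of lexemes stored (at most `max'). */
size_t split_lexemes(const char *text, struct lexeme *out, size_t max)
{
    size_t n = 0;
    const char *p = text;

    while (*p != '\0' && n < max) {
        int space = 0;

        while (*p == ' ') {
            space = 1;
            p++;
        }
        if (*p == '\0')
            break;
        out[n].start = p;
        out[n].space_before = space;
        if (isalnum((unsigned char)*p)) {
            while (isalnum((unsigned char)*p))
                p++;
        } else {
            p++;
        }
        out[n].len = (size_t)(p - out[n].start);
        n++;
    }
    return n;
}

/* Parser-side check for a digits `.' digits sequence that starts at
   lexeme `i': nonzero if blanks separate the three lexemes.  The
   caller must ensure lx[i + 1] and lx[i + 2] exist. */
int real_has_embedded_blanks(const struct lexeme *lx, size_t i)
{
    return lx[i + 1].space_before || lx[i + 2].space_before;
}
```

Under this sketch, a parser handling free-form source would issue the
diagnostic when `real_has_embedded_blanks' returns nonzero, and would
otherwise follow the same single code path for both source forms.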
The above implements nearly exactly what is specified by the Character
Set and Lines sections (see Character Set and Lines), except it also
provides automatic conversion of tabs and ignoring of newline-related
carriage returns, as well as accommodating form-neutral INCLUDE files.
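The tab and carriage-return rules summarized above can be sketched as
a single pass over one line. This is an illustration only: the
function name `normalize_line' and its return convention are
assumptions, and a real lexer would issue a proper diagnostic rather
than return -1.

```c
#include <stddef.h>

/* Copy one source line from `in' to `out', stopping at a linefeed or
   at the end of the string (which stands in for EOF).  Tabs become
   one to eight spaces so that the next character lands in a column N
   with (N - 1) % 8 == 0.  A carriage return is dropped only when it
   immediately precedes the linefeed.  Returns the output length, or
   -1 for a stray carriage return or an output buffer that is too
   small. */
long normalize_line(const char *in, char *out, size_t outsz)
{
    size_t col = 1;   /* 1-based column of the next output character */
    size_t n = 0;     /* characters written so far */

    if (outsz == 0)
        return -1;
    for (; *in != '\0' && *in != '\n'; in++) {
        if (*in == '\r') {
            if (in[1] == '\n')
                continue;            /* CR LF: ignore the CR */
            return -1;               /* stray CR: caller diagnoses */
        }
        if (*in == '\t') {
            size_t pad = 8 - (col - 1) % 8;

            if (n + pad >= outsz)
                return -1;
            while (pad-- > 0) {
                out[n++] = ' ';
                col++;
            }
            continue;
        }
        if (n + 1 >= outsz)
            return -1;
        out[n++] = *in;
        col++;
    }
    out[n] = '\0';
    return (long)n;
}
```

For example, a tab in column 1 expands to eight spaces (next column
9), while a tab in column 8 expands to a single space.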
It also implements the "pure visual" model, by which is meant that a
user viewing his code in a typical text editor (assuming it's not
preprocessed via `g77stripcard' or similar) doesn't need any special
knowledge of whether spaces on the screen are really tabs, whether
lines end immediately after the last visible non-space character or
after a number of spaces and tabs that follow it, or whether the last
line in the file is ended by a newline.
Most editors don't make these distinctions, the ANSI FORTRAN 77
standard doesn't require them to, and it permits a standard-conforming
compiler to define a method for transforming source code to "standard
form" however it wants.
So, GNU Fortran defines it such that users have the best chance of
having the code be interpreted the way it looks on the screen of the
typical editor.
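One way to state the "pure visual" model precisely is as an
equivalence test: two raw lines are interchangeable if they look the
same on screen. The sketch below is an assumption-laden illustration
(the names `visual_form' and `visually_equal' are hypothetical, and
only tabs, trailing blanks, and a missing final newline are covered).

```c
#include <stddef.h>
#include <string.h>

/* Render one raw line as it appears on screen: tabs are expanded to
   every-eighth-column stops, trailing blanks are dropped, and the
   line ends at a linefeed or at the end of the string.  Lines too
   long for `out' are truncated, which is acceptable only in a
   sketch.  `outsz' must be at least 1. */
static size_t visual_form(const char *raw, char *out, size_t outsz)
{
    size_t n = 0;

    for (; *raw != '\0' && *raw != '\n' && n + 8 < outsz; raw++) {
        if (*raw == '\t') {
            size_t pad = 8 - n % 8;

            while (pad-- > 0)
                out[n++] = ' ';
        } else {
            out[n++] = *raw;
        }
    }
    while (n > 0 && out[n - 1] == ' ')
        n--;
    out[n] = '\0';
    return n;
}

/* Nonzero if the two raw lines are indistinguishable on screen. */
int visually_equal(const char *a, const char *b)
{
    char va[256], vb[256];

    visual_form(a, va, sizeof va);
    visual_form(b, vb, sizeof vb);
    return strcmp(va, vb) == 0;
}
```

Under this definition, a line that begins with a tab and a line that
begins with eight spaces compare equal, as do a final line with and
without its terminating newline.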
(Fancy editors should _never_ be required to correctly read code
written in classic two-dimensional-plaintext form. By correct reading
I mean ability to read it, book-like, without mistaking text ignored by
the compiler for program code and vice versa, and without having to
count beyond the first several columns. The vague meaning of ASCII
TAB, among other things, complicates this somewhat, but as long as
"everyone", including the editor, other tools, and printer, agrees
about the every-eighth-column convention, the GNU Fortran "pure visual"
model meets these requirements. Any language or user-visible source
form requiring special tagging of tabs, the ends of lines after
spaces/tabs, and so on, fails to meet this fairly straightforward
specification. Fortunately, Fortran _itself_ does not mandate such a
failure, though most vendor-supplied defaults for their Fortran
compilers _do_ fail to meet this specification for readability.)
Further, this model provides a clean interface to whatever
preprocessors or code-generators are used to produce input to this
phase of `g77'. Mainly, they need not worry about long lines.
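A reader that accepts lines of any length, as this interface implies,
might look like the following sketch. The function name
`read_whole_line' and the buffer-doubling strategy are assumptions
made for illustration, not a description of what `g77' does.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read one line of any length from `fp'.  Returns a malloc'ed,
   NUL-terminated buffer without the linefeed (the caller frees it)
   and stores its length in `*len', or returns NULL at end of input
   or on allocation failure.  End of input also ends a final line
   that lacks a linefeed. */
char *read_whole_line(FILE *fp, size_t *len)
{
    size_t cap = 128, n = 0;
    char *buf;
    int c = getc(fp);

    if (c == EOF)
        return NULL;
    buf = malloc(cap);
    if (buf == NULL)
        return NULL;
    for (; c != EOF && c != '\n'; c = getc(fp)) {
        if (n + 1 == cap) {          /* keep room for the NUL */
            char *bigger = realloc(buf, cap * 2);

            if (bigger == NULL) {
                free(buf);
                return NULL;
            }
            buf = bigger;
            cap *= 2;
        }
        buf[n++] = (char)c;
    }
    buf[n] = '\0';
    *len = n;
    return buf;
}
```

Note that the loop tests only for the linefeed and end of input; no
column counter is consulted once the line has started.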