GNU Info

Info Node: (g77-295.info)Gotchas (Transforming)

(g77-295.info)Gotchas (Transforming)


Next: TBD (Transforming) Prev: ste.c Up: Overview of Translation Process
Enter node , (file) or (file)node

Gotchas (Transforming)
----------------------

   This section is not about transforming "gotchas" into something else.
It is about the weirder aspects of transforming Fortran, however that's
defined, into a more modern, canonical form.

Multi-character Lexemes
.......................

   Each lexeme carries with it a pointer to where it appears in the
source.

   To provide the ability for diagnostics to point to column numbers,
in addition to line numbers and names, lexemes that represent more than
one (significant) character in the source code need, generally, to
provide pointers to where each *character* appears in the source.

   This provides the ability to properly identify the precise location
of the problem in code like

     SUBROUTINE X
     END
     BLOCK DATA X
     END

   which, in fixed-form source, would result in single lexemes
consisting of the strings `SUBROUTINEX' and `BLOCKDATAX'.  (The problem
is that `X' is defined twice, so a pointer to the `X' in the second
definition, as well as a follow-up pointer to the corresponding pointer
in the first, would be preferable to pointing to the beginnings of the
statements.)

   This need also arises when parsing (and diagnosing) `FORMAT'
statements.

   Further, it arises when diagnosing `FMT=' specifiers that contain
constants (or partial constants, or even propagated constants!)  in I/O
statements, as in:

     PRINT '(I2, 3HAB)', J

   (A pointer to the beginning of the prematurely-terminated Hollerith
constant, and/or to the close parenthese, is preferable to a pointer to
the open-parenthese or the apostrophe that precedes it.)

   Multi-character lexemes, which would seem to naturally include at
least digit strings, alphanumeric strings, `CHARACTER' constants, and
Hollerith constants, therefore need to provide location information on
each character.  (Maybe Hollerith constants don't, but it's unnecessary
to except them.)

   The question then arises, what about *other* multi-character lexemes,
such as `**' and `//', and Fortran 90's `(/', `/)', `::', and so on?

   Turns out there's a need to identify the location of the second
character of these two-character lexemes.  For example, in `I(/J) = K',
the slash needs to be diagnosed as the problem, not the open parenthese.
Similarly, it is preferable to diagnose the second slash in `I = J //
K' rather than the first, given the implicit typing rules, which would
result in the compiler disallowing the attempted concatenation of two
integers.  (Though, since that's more of a semantic issue, it's not
*that* much preferable.)

   Even sequences that could be parsed as digit strings could use
location info, for example, to diagnose the `9' in the octal constant
`O'129''.  (This probably will be parsed as a character string, to be
consistent with the parsing of `Z'129A''.)

   To avoid the hassle of recording the location of the second
character, while also preserving the general rule that each significant
character is distinctly pointed to by the lexeme that contains it, it's
best to simply not have any fixed-size lexemes larger than one
character.

   This new design is expected to make checking for two `*' lexemes in
a row much easier than the old design, so this is not much of a
sacrifice.  It probably makes the lexer much easier to implement than
it makes the parser harder.

Space-padding Lexemes
.....................

   Certain lexemes need to be padded with virtual spaces when the end
of the line (or file) is encountered.

   This is necessary in fixed form, to handle lines that don't extend
to column 72, assuming that's the line length in effect.

Bizarre Free-form Hollerith Constants
.....................................

   Last I checked, the Fortran 90 standard actually required the
compiler to silently accept something like

     FORMAT ( 1 2   Htwelve chars )

   as a valid `FORMAT' statement specifying a twelve-character
Hollerith constant.

   The implication here is that, since the new lexer is a zero-feedback
one, it won't know that the special case of a `FORMAT' statement being
parsed requires apparently distinct lexemes `1' and `2' to be treated as
a single lexeme.

   (This is a horrible misfeature of the Fortran 90 language.  It's one
of many such misfeatures that almost make me want to not support them,
and forge ahead with designing a new "GNU Fortran" language that has
the features, but not the misfeatures, of Fortran 90, and provide
utility programs to do the conversion automatically.)

   So, the lexer must gather distinct chunks of decimal strings into a
single lexeme in contexts where a single decimal lexeme might start a
Hollerith constant.

   (Which probably means it might as well do that all the time for all
multi-character lexemes, even in free-form mode, leaving it to
subsequent phases to pull them apart as they see fit.)

   Compare the treatment of this to how

     CHARACTER * 4 5 HEY

   and

     CHARACTER * 12 HEY

   must be treated--the former must be diagnosed, due to the separation
between lexemes, the latter must be accepted as a proper declaration.

Hollerith Constants
...................

   Recognizing a Hollerith constant--specifically, that an `H' or `h'
after a digit string begins such a constant--requires some knowledge of
context.

   Hollerith constants (such as `2HAB') can appear after:

   * `('

   * `,'

   * `='

   * `+', `-', `/'

   * `*', except as noted below

   Hollerith constants don't appear after:

   * `CHARACTER*', which can be treated generally as any `*' that is
     the second lexeme of a statement

Confusing Function Keyword
..........................

   While

     REAL FUNCTION FOO ()

   must be a `FUNCTION' statement and

     REAL FUNCTION FOO (5)

   must be a type-definition statement,

     REAL FUNCTION FOO (NAMES)

   where NAMES is a comma-separated list of names, can be one or the
other.

   The only way to disambiguate that statement (short of mandating
free-form source or a short maximum length for name for external
procedures) is based on the context of the statement.

   In particular, the statement is known to be within an
already-started program unit (but not at the outer level of the
`CONTAINS' block), it is a type-declaration statement.

   Otherwise, the statement is a `FUNCTION' statement, in that it
begins a function program unit (external, or, within `CONTAINS',
nested).

Weird READ
..........

   The statement

     READ (N)

   is equivalent to either

     READ (UNIT=(N))

   or

     READ (FMT=(N))

   depending on which would be valid in context.

   Specifically, if `N' is type `INTEGER', `READ (FMT=(N))' would not
be valid, because parentheses may not be used around `N', whereas they
may around it in `READ (UNIT=(N))'.

   Further, if `N' is type `CHARACTER', the opposite is true--`READ
(UNIT=(N))' is not valid, but `READ (FMT=(N))' is.

   Strictly speaking, if anything follows

     READ (N)

   in the statement, whether the first lexeme after the close
parenthese is a comma could be used to disambiguate the two cases,
without looking at the type of `N', because the comma is required for
the `READ (FMT=(N))' interpretation and disallowed for the `READ
(UNIT=(N))' interpretation.

   However, in practice, many Fortran compilers allow the comma for the
`READ (UNIT=(N))' interpretation anyway (in that they generally allow a
leading comma before an I/O list in an I/O statement), and much code
takes advantage of this allowance.

   (This is quite a reasonable allowance, since the juxtaposition of a
comma-separated list immediately after an I/O control-specification
list, which is also comma-separated, without an intervening comma,
looks sufficiently "wrong" to programmers that they can't resist the
itch to insert the comma.  `READ (I, J), K, L' simply looks cleaner than
`READ (I, J) K, L'.)

   So, type-based disambiguation is needed unless strict adherence to
the standard is always assumed, and we're not going to assume that.


automatically generated by info2www version 1.2.2.9