`gcc' and Pentium hacks
=======================

The `configure' option `--enable-i386-hacks' enables optimizations
specific to the Pentium and later x86 CPUs under `gcc', which can
significantly improve the performance of double-precision transforms.
We have tested these hacks on Linux with `gcc' 2.[789] and with
versions of `egcs' since 1.0.3. These optimizations affect only the
performance and not the correctness of FFTW (i.e., it is always safe
to try them out).

These hacks provide a workaround for the incorrect alignment of local
`double' variables in `gcc'. The compiler aligns these variables to
multiples of 4 bytes, but execution is much faster (on Pentium and
PentiumPro) if `double's are aligned to a multiple of 8 bytes. By
carefully counting the number of variables allocated by the compiler in
performance-critical regions of the code, we have been able to
introduce dummy allocations (using `alloca') that align the stack
properly. The hack depends crucially on the compiler flags that are
used. For example, it won't work without `-fomit-frame-pointer'.
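
The alignment arithmetic behind this hack can be sketched in C as
follows. This is only an illustration under stated assumptions, not
FFTW's actual code: the helper names `misalignment' and `is_aligned'
are hypothetical, and a real stack fix-up would have to pass the
computed padding to `alloca' at a point that depends on the compiler's
frame layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of bytes by which `addr' exceeds the
   nearest lower multiple of `align' (which must be a power of two).
   On x86 the stack grows downward, so allocating this many extra
   bytes moves the stack pointer down to an aligned address. */
static size_t misalignment(uintptr_t addr, size_t align)
{
     assert(align != 0 && (align & (align - 1)) == 0);
     return (size_t) (addr & (uintptr_t) (align - 1));
}

/* Hypothetical helper: nonzero if `p' is a multiple of `align'. */
static int is_aligned(const void *p, size_t align)
{
     return misalignment((uintptr_t) p, align) == 0;
}
```

When the stack is already aligned to 4 bytes, the misalignment with
respect to 8 bytes is either 0 or 4, so at most one 4-byte dummy
allocation is needed.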

In principle, these hacks are no longer required under `gcc'
versions 2.95 and later, which automatically align the stack correctly
(see `-mpreferred-stack-boundary' in the `gcc' manual). However, we
have encountered a bug
(http://egcs.cygnus.com/ml/gcc-bugs/1999-11/msg00259.html) in the stack
alignment of versions 2.95.[012] that causes FFTW's stack to be
misaligned under some circumstances. The `configure' script
automatically detects this bug and disables `gcc''s stack alignment in
favor of our own hacks when `--enable-i386-hacks' is used.

The `fftw_test' program outputs speed measurements that you can use
to check whether these hacks are beneficial on your machine.

The `configure' option `--enable-pentium-timer' enables the use of
the Pentium and PentiumPro cycle counter for timing purposes. To get
correct results, you must define `FFTW_CYCLES_PER_SEC' in
`fftw/config.h' to be the clock speed of your processor in cycles per
second; because this value is fixed at compile time, the resulting
FFTW library will be nonportable. The use of this option is
deprecated. On serious operating systems (such as Linux), FFTW uses
`gettimeofday()', which has enough resolution and is portable. (Note
that Win32 has its own high-resolution timing routines as well. FFTW
contains unsupported code to use these routines.)
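
As a rough illustration of portable timing with `gettimeofday()', the
following C sketch measures the average time per call of a routine.
It is not FFTW's actual timing code: the helper names
`elapsed_seconds' and `time_per_call' are hypothetical, and the sketch
assumes a POSIX system that provides `<sys/time.h>'.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical helper: seconds elapsed between two `gettimeofday'
   readings, combining the whole-second and microsecond fields. */
static double elapsed_seconds(struct timeval start, struct timeval end)
{
     return (double) (end.tv_sec - start.tv_sec)
          + (double) (end.tv_usec - start.tv_usec) / 1.0e6;
}

/* Hypothetical helper: time `iterations' calls of `work' and return
   the average number of seconds per call. */
static double time_per_call(void (*work)(void), int iterations)
{
     struct timeval t0, t1;
     int i;

     assert(iterations > 0);
     gettimeofday(&t0, NULL);
     for (i = 0; i < iterations; ++i)
          work();
     gettimeofday(&t1, NULL);
     return elapsed_seconds(t0, t1) / iterations;
}
```

Repeating the call many times between the two readings keeps the
measured interval well above the microsecond granularity of
`struct timeval'.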