MPI Tips
--------

   There are several things you should consider in order to get the best
performance out of the MPI FFTW routines.

   First, if possible, the first and second dimensions of your data
should be divisible by the number of processes you are using.  (If only
one of them can be divisible, make it the first dimension.)
This allows the computational load to be spread evenly among the
processes, and also reduces the communication complexity and overhead.
In the one-dimensional transform case, the size of the transform
should ideally be divisible by the *square* of the number of processes.
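
   The divisibility rules above can be checked before any plan is
created.  The following is only a minimal sketch: the function names
`nd_divisibility' and `one_d_size_is_ideal' are hypothetical helpers
and are not part of FFTW; `nprocs' is assumed to be the process count
obtained from `MPI_Comm_size'.

```c
#include <assert.h>

/* Hypothetical helper, not part of FFTW.
 * Classifies a multi-dimensional problem whose first two dimensions
 * are nx and ny when run on nprocs processes:
 *   2 = both nx and ny are divisible by nprocs (best case),
 *   1 = only nx, the first dimension, is divisible,
 *   0 = nx is not divisible (the load cannot be spread evenly). */
int nd_divisibility(int nx, int ny, int nprocs)
{
    assert(nprocs > 0);
    if (nx % nprocs != 0)
        return 0;
    return (ny % nprocs == 0) ? 2 : 1;
}

/* Hypothetical helper, not part of FFTW.
 * Returns 1 if a one-dimensional transform of size n is divisible by
 * the square of the number of processes, 0 otherwise. */
int one_d_size_is_ideal(long n, long nprocs)
{
    assert(nprocs > 0);
    return n % (nprocs * nprocs) == 0;
}
```

   A program might call these on one process after `MPI_Comm_size' and
print a warning when the result is not the best case, rather than
refusing to run.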

   Second, you should consider using the `FFTW_TRANSPOSED_ORDER' output
format if it is not too burdensome.  The speed gains from
communications savings are usually substantial.
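
   The burden of transposed output is mostly one of indexing.  The
sketch below assumes the layout described in the MPI data layout
section of the FFTW 2 manual for a three-dimensional `nx' by `ny' by
`nz' complex transform: in normal order each process holds a slab of
x values, while in transposed order the first two dimensions are
swapped and each process holds a slab of y values.  The function names
are hypothetical and not part of FFTW; check the layout against your
FFTW version before relying on it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of FFTW.
 * Offset of element (x_local, y, z) in the local slab for normal
 * order output, where x_local counts from the start of this
 * process's slab of x values. */
size_t normal_order_index(size_t x_local, size_t y, size_t z,
                          size_t ny, size_t nz)
{
    assert(y < ny && z < nz);
    return (x_local * ny + y) * nz + z;
}

/* Hypothetical helper, not part of FFTW.
 * Offset of element (x, y_local, z) in the local slab for transposed
 * order output (assumed layout: y is now the distributed, slowest
 * dimension, followed by the full x dimension, then z). */
size_t transposed_order_index(size_t y_local, size_t x, size_t z,
                              size_t nx, size_t nz)
{
    assert(x < nx && z < nz);
    return (y_local * nx + x) * nz + z;
}
```

   Code that walks the output only needs to swap the roles of x and y
in its loops, which is often a small price for the saved
communication step.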

   Third, you should consider allocating a workspace for
`(r)fftw(nd)_mpi', as this can often (but not always) improve
performance (at the cost of extra storage).

   Fourth, you should experiment to find the best number of processors
to use for your problem.  (There comes a point of diminishing returns,
when the communications costs outweigh the computational benefits.(1))
The `fftw_mpi_test' program can output helpful performance benchmarks.
It accepts the same parameters as the uniprocessor test programs (see
`tests/README') and is run like an ordinary MPI program.  For example,
`mpirun -np 4 fftw_mpi_test -s 128x128x128' will benchmark a
`128x128x128' transform on four processors, reporting timings and
parallel speedups for all variants of `fftwnd_mpi' (transposed, with
workspace, and so on).  (Note also that there is the `rfftw_mpi_test'
program for the real transforms.)
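
   When comparing runs with different process counts, it can help to
reduce the timings to a speedup and a parallel efficiency.  The sketch
below uses the usual definitions (speedup is the one-process time
divided by the many-process time; efficiency is the speedup divided by
the process count).  The function names are hypothetical and not part
of FFTW, and the timings are assumed to be ones you measured yourself
for the same transform size.

```c
#include <assert.h>

/* Hypothetical helper, not part of FFTW.
 * Speedup of a run taking t_many seconds relative to a one-process
 * run of the same problem taking t_one seconds. */
double parallel_speedup(double t_one, double t_many)
{
    assert(t_one > 0.0 && t_many > 0.0);
    return t_one / t_many;
}

/* Hypothetical helper, not part of FFTW.
 * Parallel efficiency: speedup divided by the number of processes.
 * A value of 1.0 is ideal scaling; values that fall steadily as
 * nprocs grows indicate the point of diminishing returns. */
double parallel_efficiency(double t_one, double t_many, int nprocs)
{
    assert(nprocs > 0);
    return parallel_speedup(t_one, t_many) / nprocs;
}
```

   Tabulating the efficiency over several values of `-np' makes it
easier to see where adding processes stops paying off.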

   ---------- Footnotes ----------

   (1) An FFT is particularly hard on communications systems, as it
requires an "all-to-all" communication, which is more or less the worst
possible case.
