Tips for Optimal Threading
--------------------------
Not all transforms are equally well-parallelized by the
multi-threaded FFTW routines. (This is merely a consequence of
laziness on the part of the implementors, and is not inherent to the
algorithms employed.) The limitations lie mainly in the parallel
one-dimensional transforms. If you want optimal parallelization,
avoid the following:
Parallelization deficiencies in one-dimensional transforms
----------------------------------------------------------
* Large prime factors can sometimes parallelize poorly. Of course,
you should avoid these anyway if you want high performance.
* Single in-place transforms don't parallelize completely. (Multiple
in-place transforms, i.e., `howmany > 1', are fine.) Again, you
should avoid these in any case if you want high performance, as
they require transforming to a scratch array and copying back.
* Single real-complex (`rfftw') transforms don't parallelize
completely. This is unfortunate, but parallelizing this correctly
would have involved a lot of extra code (and a much larger
library). You still get some benefit from additional processors,
but if you have a very large number of processors you will
probably be better off using the parallel complex (`fftw')
transforms. Note that multi-dimensional real transforms or
multiple one-dimensional real transforms are fine.