Whole document tree
Parallel FFTWIn this chapter we discuss the use of FFTW in a parallel environment, documenting the different parallel libraries that we have provided. (Users calling FFTW from a multi-threaded program should also consult Section Thread safety.) The FFTW package currently contains three parallel transform implementations that leverage the uniprocessor FFTW code:
Multi-threaded FFTWIn this section we document the parallel FFTW routines for shared-memory threads on SMP hardware. These routines, which support parallel one- and multi-dimensional transforms of both real and complex data, are the easiest way to take advantage of multiple processors with FFTW. They work just like the corresponding uniprocessor transform routines, except that they take the number of parallel threads to use as an extra parameter. Any program that uses the uniprocessor FFTW can be trivially modified to use the multi-threaded FFTW. Installation and Supported Hardware/Software
All of the FFTW threads code is located in the
The threads routines require your operating system to have some sort of
shared-memory threads support. Specifically, the FFTW threads package
works with POSIX threads (available on most Unix variants, including
Linux), Solaris threads, BeOS threads (tested
on BeOS DR8.2), Mach C threads (reported to work by users), and Win32
threads (reported to work by users). (There is also untested code to
use MacOS MP threads.) If you have a shared-memory machine that uses a
different threads API, it should be a simple matter of programming to
include support for it; see the file SMP hardware is not required, although of course you need multiple processors to get any benefit from the multithreaded transforms. Usage of Multi-threaded FFTWHere, it is assumed that the reader is already familiar with the usage of the uniprocessor FFTW routines, described elsewhere in this manual. We only describe what one has to change in order to use the multi-threaded routines.
First, instead of including Second, before calling any FFTW routines, you should call the function: int fftw_threads_init(void);
This function, which should only be called once (probably in your
Third, when you want to actually compute the transform, you should use one of the following transform routines instead of the ordinary FFTW functions: fftw_threads(nthreads, plan, howmany, in, istride, idist, out, ostride, odist); fftw_threads_one(nthreads, plan, in, out); fftwnd_threads(nthreads, plan, howmany, in, istride, idist, out, ostride, odist); fftwnd_threads_one(nthreads, plan, in, out); rfftw_threads(nthreads, plan, howmany, in, istride, idist, out, ostride, odist); rfftw_threads_one(nthreads, plan, in, out); rfftwnd_threads_real_to_complex(nthreads, plan, howmany, in, istride, idist, out, ostride, odist); rfftwnd_threads_one_real_to_complex(nthreads, plan, in, out); rfftwnd_threads_complex_to_real(nthreads, plan, howmany, in, istride, idist, out, ostride, odist); rfftwnd_threads_one_real_to_complex(nthreads, plan, in, out); rfftwnd_threads_one_complex_to_real(nthreads, plan, in, out);
All of these routines take exactly the same arguments and have exactly
the same effects as their uniprocessor counterparts (i.e. without the
`
For example, to parallelize a single one-dimensional transform of
complex data, instead of calling the uniprocessor These are the only changes you need to make to your source code. Calls to all other FFTW routines (plan creation, destruction, wisdom, etcetera) are not parallelized and remain the same. (The same plans and wisdom are used by both uniprocessor and multi-threaded transforms.) Your arrays are allocated and formatted in the same way, and so on.
Programs using the parallel complex transforms should be linked with
How Many Threads to Use?There is a fair amount of overhead involved in spawning and synchronizing threads, so the optimal number of threads to use depends upon the size of the transform as well as on the number of processors you have. As a general rule, you don't want to use more threads than you have processors. (Using more threads will work, but there will be extra overhead with no benefit.) In fact, if the problem size is too small, you may want to use fewer threads than you have processors.
You will have to experiment with your system to see what level of
parallelization is best for your problem size. Useful tools to help you
do this are the test programs that are automatically compiled along with
the threads libraries, fftw_threads_test 2 -s 128x128 will benchmark complex 128x128 transforms using two threads and report the speedup relative to the uniprocessor transform. For instance, on a 4-processor 200MHz Pentium Pro system running Linux 2.2.0, we found that the "crossover" point at which 2 threads became beneficial for complex transforms was about 4k points, while 4 threads became beneficial at 8k points. Using Multi-threaded FFTW in a Multi-threaded ProgramIt is perfectly possible to use the multi-threaded FFTW routines from a multi-threaded program (e.g. have multiple threads computing multi-threaded transforms simultaneously). If you have the processors, more power to you! However, the same restrictions apply as for the uniprocessor FFTW routines (see Section Thread safety). In particular, you should recall that you may not create or destroy plans in parallel. Tips for Optimal ThreadingNot all transforms are equally well-parallelized by the multi-threaded FFTW routines. (This is merely a consequence of laziness on the part of the implementors, and is not inherent to the algorithms employed.) Mainly, the limitations are in the parallel one-dimensional transforms. The things to avoid if you want optimal parallelization are as follows: Parallelization deficiencies in one-dimensional transforms
MPI FFTWThis section describes the MPI FFTW routines for distributed-memory (and shared-memory) machines supporting MPI (Message Passing Interface). The MPI routines are significantly different from the ordinary FFTW because the transform data here are distributed over multiple processes, so that each process gets only a portion of the array. Currently, multi-dimensional transforms of both real and complex data, as well as one-dimensional transforms of complex data, are supported. MPI FFTW Installation
The FFTW MPI library code is all located in the The only requirement of the FFTW MPI code is that you have the standard MPI 1.1 (or later) libraries and header files installed on your system. A free implementation of MPI is available from the MPICH home page.
Previous versions of the FFTW MPI routines have had an unfortunate
tendency to expose bugs in MPI implementations. The current version has
been largely rewritten, and hopefully avoids some of the problems. If
you run into difficulties, try passing the optional workspace to
Several test programs are included in the Usage of MPI FFTW for Complex Multi-dimensional TransformsUsage of the MPI FFTW routines is similar to that of the uniprocessor FFTW. We assume that the reader already understands the usage of the uniprocessor FFTW routines, described elsewhere in this manual. Some familiarity with MPI is also helpful. A typical program performing a complex two-dimensional MPI transform might look something like: #include <fftw_mpi.h> int main(int argc, char **argv) { const int NX = ..., NY = ...; fftwnd_mpi_plan plan; fftw_complex *data; MPI_Init(&argc,&argv); plan = fftw2d_mpi_create_plan(MPI_COMM_WORLD, NX, NY, FFTW_FORWARD, FFTW_ESTIMATE); ...allocate and initialize data... fftwnd_mpi(p, 1, data, NULL, FFTW_NORMAL_ORDER); ... fftwnd_mpi_destroy_plan(plan); MPI_Finalize(); }
The calls to
As in the ordinary FFTW, the first thing we do is to create a plan (of
type fftwnd_mpi_plan fftw2d_mpi_create_plan(MPI_Comm comm, int nx, int ny, fftw_direction dir, int flags);
Except for the first argument, the parameters are identical to those of
Next, one has to allocate and initialize the data. This is somewhat tricky, because the transform data is distributed across the processes involved in the transform. It is discussed in detail by the next section (see Section MPI Data Layout).
The actual computation of the transform is performed by the function
void fftwnd_mpi(fftwnd_mpi_plan p, int n_fields, fftw_complex *local_data, fftw_complex *work, fftwnd_mpi_output_order output_order); There are several things to notice here:
The output of
The same plan can be used to compute many transforms of the same size.
After you are done with it, you should deallocate it by calling
Important: The FFTW MPI routines must be called in the same order by
all processes involved in the transform. You should assume that they
all are blocking, as if each contained a call to
Programs using the FFTW MPI routines should be linked with
MPI Data LayoutThe transform data used by the MPI FFTW routines is distributed: a distinct portion of it resides with each process involved in the transform. This allows the transform to be parallelized, for example, over a cluster of workstations, each with its own separate memory, so that you can take advantage of the total memory of all the processors you are parallelizing over.
In particular, the array is divided according to the rows (first
dimension) of the data: each process gets a subset of the rows of the
data. (This is sometimes called a "slab decomposition.") One
consequence of this is that you can't take advantage of more processors
than you have rows (e.g.
Below, the first dimension of the data will be referred to as
` FFTW supplies a routine to tell you exactly how much data resides on the current process: void fftwnd_mpi_local_sizes(fftwnd_mpi_plan p, int *local_nx, int *local_x_start, int *local_ny_after_transpose, int *local_y_start_after_transpose, int *total_local_size);
Given a plan
The data on the current process has
For instance, suppose you want to transform three-dimensional data of
size
The following is an example of allocating such a three-dimensional array
array ( fftwnd_mpi_local_sizes(plan, &local_nx, &local_x_start, &local_ny_after_transpose, &local_y_start_after_transpose, &total_local_size); local_data = (fftw_complex*) malloc(sizeof(fftw_complex) * total_local_size); for (x = 0; x < local_nx; ++x) for (y = 0; y < ny; ++y) for (z = 0; z < nz; ++z) local_data[(x*ny + y)*nz + z] = f(x + local_x_start, y, z); Some important things to remember:
Usage of MPI FFTW for Real Multi-dimensional Transforms
MPI transforms specialized for real data are also available, similiar to
the uniprocessor
The following is an example of a program using #include <rfftw_mpi.h> int main(int argc, char **argv) { const int nx = ..., ny = ..., nz = ...; int local_nx, local_x_start, local_ny_after_transpose, local_y_start_after_transpose, total_local_size; int x, y, z; rfftwnd_mpi_plan plan, iplan; fftw_real *data, *work; fftw_complex *cdata; MPI_Init(&argc,&argv); /* create the forward and backward plans: */ plan = rfftw3d_mpi_create_plan(MPI_COMM_WORLD, nx, ny, nz, FFTW_REAL_TO_COMPLEX, FFTW_ESTIMATE); iplan = rfftw3d_mpi_create_plan(MPI_COMM_WORLD, /* dim.'s of REAL data --> */ nx, ny, nz, FFTW_COMPLEX_TO_REAL, FFTW_ESTIMATE); rfftwnd_mpi_local_sizes(plan, &local_nx, &local_x_start, &local_ny_after_transpose, &local_y_start_after_transpose, &total_local_size); data = (fftw_real*) malloc(sizeof(fftw_real) * total_local_size); /* workspace is the same size as the data: */ work = (fftw_real*) malloc(sizeof(fftw_real) * total_local_size); /* initialize data to f(x,y,z): */ for (x = 0; x < local_nx; ++x) for (y = 0; y < ny; ++y) for (z = 0; z < nz; ++z) data[(x*ny + y) * (2*(nz/2+1)) + z] = f(x + local_x_start, y, z); /* Now, compute the forward transform: */ rfftwnd_mpi(plan, 1, data, work, FFTW_TRANSPOSED_ORDER); /* the data is now complex, so typecast a pointer: */ cdata = (fftw_complex*) data; /* multiply imaginary part by 2, for fun: (note that the data is transposed) */ for (y = 0; y < local_ny_after_transpose; ++y) for (x = 0; x < nx; ++x) for (z = 0; z < (nz/2+1); ++z) cdata[(y*nx + x) * (nz/2+1) + z].im *= 2.0; /* Finally, compute the inverse transform; the result is transposed back to the original data layout: */ rfftwnd_mpi(iplan, 1, data, work, FFTW_TRANSPOSED_ORDER); free(data); free(work); rfftwnd_mpi_destroy_plan(plan); rfftwnd_mpi_destroy_plan(iplan); MPI_Finalize(); }
There's a lot of stuff in this example, but it's all just what you would
have guessed, right? We replaced all the Some other important things to know about the real MPI transforms:
Programs using the MPI FFTW real transforms should link with
Usage of MPI FFTW for Complex One-dimensional TransformsThe MPI FFTW also includes routines for parallel one-dimensional transforms of complex data (only). Although the speedup is generally worse than it is for the multi-dimensional routines,(6) these distributed-memory one-dimensional transforms are especially useful for performing one-dimensional transforms that don't fit into the memory of a single machine.
The usage of these routines is straightforward, and is similar to that
of the multi-dimensional MPI transform functions. You first include the
header fftw_mpi_plan fftw_mpi_create_plan(MPI_Comm comm, int n, fftw_direction dir, int flags);
The last three arguments are the same as for
If you don't care about the ordering of the input or output data of the
transform, you can include The transform itself is computed by: void fftw_mpi(fftw_mpi_plan p, int n_fields, fftw_complex *local_data, fftw_complex *work);
To find out what portion of the array is stored local to the current process, you call the following routine: void fftw_mpi_local_sizes(fftw_mpi_plan p, int *local_n, int *local_start, int *local_n_after_transform, int *local_start_after_transform, int *total_local_size);
Note that, if you compute both a forward and a backward transform of the same size, the local sizes are guaranteed to be consistent. That is, the local size after the forward transform will be the same as the local size before the backward transform, and vice versa.
Programs using the FFTW MPI routines should be linked with
MPI TipsThere are several things you should consider in order to get the best performance out of the MPI FFTW routines. First, if possible, the first and second dimensions of your data should be divisible by the number of processes you are using. (If only one can be divisible, then you should choose the first dimension.) This allows the computational load to be spread evenly among the processes, and also reduces the communications complexity and overhead. In the one-dimensional transform case, the size of the transform should ideally be divisible by the square of the number of processors.
Second, you should consider using the
Third, you should consider allocating a workspace for
Fourth, you should experiment with the best number of processors to use
for your problem. (There comes a point of diminishing returns, when the
communications costs outweigh the computational benefits.(7)) The Go to the first, previous, next, last section, table of contents. |