Next: Other Multiplication,  Prev: Toom-Cook 3-Way Multiplication,  Up: Multiplication Algorithms

FFT Multiplication
------------------

   At large to very large sizes a Fermat style FFT multiplication is
used, following Schönhage and Strassen (Note: References).
Descriptions of FFTs in various forms can be found in many textbooks,
for instance Knuth section 4.3.3 part C or Lipson chapter IX.  A brief
description of the form used in GMP is given here.

   The multiplication done is x*y mod 2^N+1, for a given N.  A full
product x*y is obtained by choosing N>=bits(x)+bits(y) and padding x
and y with high zero limbs.  The modular product is the native form for
the algorithm, so padding to get a full product is unavoidable.
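
   The choice of N is what makes the padding work: if N is at least
bits(x)+bits(y) then x*y is below 2^N+1 and the reduction never wraps.
A minimal Python sketch of that point (illustrative names, and ignoring
the limb-boundary rounding GMP actually applies):

     # Full product recovered from the mod 2^N+1 form: with
     # N >= bits(x)+bits(y), x*y < 2^N+1 and the reduction is a no-op.
     def full_product_via_fermat_mod(x, y):
         N = x.bit_length() + y.bit_length()
         return (x * y) % ((1 << N) + 1)

     p = full_product_via_fermat_mod(123456789, 987654321)
     assert p == 123456789 * 987654321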

   The algorithm follows a split, evaluate, pointwise multiply,
interpolate and combine sequence similar to that described above for Karatsuba
and Toom-3.  A k parameter controls the split, with an FFT-k splitting
into 2^k pieces of M=N/2^k bits each.  N must be a multiple of
(2^k)*mp_bits_per_limb so the split falls on limb boundaries, avoiding
bit shifts in the split and combine stages.
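
   A minimal sketch of the split in Python (illustrative names; GMP of
course works on limb arrays rather than integers).  With N a multiple
of 2^k*mp_bits_per_limb each M-bit piece falls on a limb boundary:

     # Split an N-bit operand into 2^k pieces of M = N/2^k bits each.
     def fft_split(x, N, k):
         M = N >> k                     # piece size in bits
         mask = (1 << M) - 1
         return [(x >> (i * M)) & mask for i in range(1 << k)]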

   The evaluations, pointwise multiplications, and interpolation are
all done modulo 2^N'+1 where N' is 2M+k+3 rounded up to a multiple of
2^k and of `mp_bits_per_limb'.  The results of interpolation will be
the following negacyclic convolution of the input pieces, and the
choice of N' ensures these sums aren't truncated.
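
   Since 2^k and mp_bits_per_limb are both powers of 2, rounding up to
a multiple of both just means rounding up to a multiple of the larger.
A sketch of the choice of N', assuming a 64-bit limb and illustrative
names:

     # N' is 2M+k+3 rounded up to a multiple of 2^k and the limb size.
     def fft_next_size(M, k, limb_bits=64):
         step = max(1 << k, limb_bits)  # lcm of two powers of 2
         n = 2 * M + k + 3
         return ((n + step - 1) // step) * step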

     w[n] = sum of (-1)^b * x[i]*y[j]  over  i+j == b*2^k + n,  b = 0 or 1
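
   Evaluating this convolution directly from the definition (no FFT at
all) makes a handy reference; the following Python sketch assumes piece
lists like those from the hypothetical fft_split above:

     # w[n] = sum of (-1)^b * x[i]*y[j] over i+j == b*2^k + n, b = 0 or 1.
     def negacyclic_convolution(xp, yp):
         pieces = len(xp)               # pieces = 2^k
         w = [0] * pieces
         for i in range(pieces):
             for j in range(pieces):
                 n = (i + j) % pieces
                 b = (i + j) // pieces  # 0 or 1
                 w[n] += (-1) ** b * xp[i] * yp[j]
         return w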

   The points used for the evaluation are g^i for i=0 to 2^k-1 where
g=2^(2N'/2^k).  g is a 2^k'th root of unity mod 2^N'+1, which produces
necessary cancellations at the interpolation stage, and it's also a
power of 2 so the fast Fourier transforms used for the evaluation and
interpolation do only shifts, adds and negations.
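
   The root of unity property is easy to check numerically; a small
Python sketch, with N' and k chosen freely for illustration:

     # g = 2^(2N'/2^k), so g^(2^k) = 2^(2N') = (2^N')^2 == (-1)^2 == 1
     # modulo 2^N'+1, making g a 2^k'th root of unity there.
     def is_2k_root_of_unity(k, Nprime):
         m = (1 << Nprime) + 1
         g = pow(2, (2 * Nprime) >> k, m)
         return pow(g, 1 << k, m) == 1

     print(is_2k_root_of_unity(4, 1024))    # True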

   The pointwise multiplications are done modulo 2^N'+1 and either
recurse into a further FFT or use a plain multiplication (Toom-3,
Karatsuba or basecase), whichever is optimal at the size N'.  The
interpolation is an inverse fast Fourier transform.  The resulting
sums of x[i]*y[j] are added at appropriate offsets to give the final
result.
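
   The combine step just adds each w[n] back in at an offset of n*M
bits and reduces mod 2^N+1.  A sketch, reusing the hypothetical helpers
above to check the whole identity on small numbers (still with no
actual FFT, only the direct convolution):

     def fft_combine(w, M, N):
         r = 0
         for n, wn in enumerate(w):
             r += wn << (n * M)         # w[n] belongs at offset n*M bits
         return r % ((1 << N) + 1)

     x, y = 0xdeadbeefdeadbeefdeadbeef, 0xcafebabecafebabecafebabe
     N, k = 128, 2
     xp, yp = fft_split(x, N, k), fft_split(y, N, k)
     w = negacyclic_convolution(xp, yp)
     assert fft_combine(w, N >> k, N) == (x * y) % ((1 << N) + 1)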

   Squaring is the same, but x is the only input so it's one transform
at the evaluate stage and the pointwise multiplies are squares.  The
interpolation is the same.

   For a mod 2^N+1 product, an FFT-k is an O(N^(k/(k-1))) algorithm,
the exponent representing 2^k recursed modular multiplies each
1/2^(k-1) the size of the original.  Each successive k is an asymptotic
improvement, but overheads mean each is only faster at bigger and
bigger sizes.  In the code, `FFT_MUL_TABLE' and `FFT_SQR_TABLE' are the
thresholds where each k is used.  Each new k effectively swaps some
multiplying for some shifts, adds and overheads.
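
   A quick way to see how fast the exponent falls with k (plain Python,
nothing GMP-specific):

     # k/(k-1) exponents for the mod 2^N+1 product: k=4 gives 1.33...,
     # k=5 gives 1.25, and so on towards 1.
     for k in range(4, 9):
         print(k, k / (k - 1))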

   A mod 2^N+1 product can be formed with a normal NxN->2N bit multiply
plus a subtraction, so an FFT and Toom-3 etc can be compared directly.
A k=4 FFT at O(N^1.333) can be expected to be the first faster than
Toom-3 at O(N^1.465).  In practice this is what's found, with
`FFT_MODF_MUL_THRESHOLD' and `FFT_MODF_SQR_THRESHOLD' being between 300
and 1000 limbs, depending on the CPU.  So far it's been found that only
very large FFTs recurse into pointwise multiplies above these sizes.
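
   That reduction is only a split and a subtraction, because 2^N is
congruent to -1 modulo 2^N+1.  A Python sketch with illustrative names:

     # With p = x*y = hi*2^N + lo, p == lo - hi (mod 2^N+1).
     def mulmod_fermat(x, y, N):
         p = x * y                      # ordinary NxN -> 2N bit multiply
         lo = p & ((1 << N) - 1)
         hi = p >> N
         return (lo - hi) % ((1 << N) + 1)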

   When an FFT is to give a full product, the change of N to 2N doesn't
alter the theoretical complexity for a given k, but for the purposes of
considering where an FFT might be first used it can be assumed that the
FFT is recursing into a normal multiply and that on that basis it's
doing 2^k recursed multiplies each 1/2^(k-2) the size of the inputs,
making it O(N^(k/(k-2))).  This would mean k=7 at O(N^1.4) would be the
first FFT faster than Toom-3.  In practice `FFT_MUL_THRESHOLD' and
`FFT_SQR_THRESHOLD' have been found to be in the k=8 range, somewhere
between 3000 and 10000 limbs.

   The way N is split into 2^k pieces and then 2M+k+3 is rounded up to
a multiple of 2^k and `mp_bits_per_limb' means that when
2^k>=mp_bits_per_limb the effective N is a multiple of 2^(2k-1) bits.
The +k+3 means some values of N just under such a multiple will be
rounded to the next.  The complexity calculations above assume that a
favourable size is used, meaning one which isn't padded through
rounding, and it's also assumed that the extra +k+3 bits are negligible
at typical FFT sizes.

   The practical effect of the 2^(2k-1) constraint is to introduce a
step-effect into measured speeds.  For example k=8 will round N up to a
multiple of 32768 bits, so for a 32-bit limb there'll be groups of 512
limb sizes for which `mpn_mul_n' runs at the same speed.  Or for k=9
groups of 2048 limbs, k=10 groups of 8192 limbs, etc.  In practice it's
been found each k is used at quite small multiples of its size
constraint and so the step effect is quite noticeable in a time versus
size graph.
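
   Those group sizes follow directly from the 2^(2k-1) bit constraint,
remembering that N covers both operands of `mpn_mul_n'.  A quick check,
assuming a 32-bit limb:

     # Per-operand step size in limbs for a given k.
     def step_limbs(k, limb_bits=32):
         step_bits = 1 << (2 * k - 1)   # N rounds to a multiple of this
         return step_bits // (2 * limb_bits)

     print(step_limbs(8), step_limbs(9), step_limbs(10))   # 512 2048 8192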

   The threshold determinations currently measure at the mid-points of
size steps, but this is sub-optimal since at the start of a new step it
can happen that it's better to go back to the previous k for a while.
Something more sophisticated for `FFT_MUL_TABLE' and `FFT_SQR_TABLE'
will be needed.

