GNU Info

Info Node: (gmp.info)Toom-Cook 3-Way Multiplication

(gmp.info)Toom-Cook 3-Way Multiplication


Next: FFT Multiplication Prev: Karatsuba Multiplication Up: Multiplication Algorithms
Enter node , (file) or (file)node

Toom-Cook 3-Way Multiplication
------------------------------

   The Karatsuba formula is the simplest case of a general approach to
splitting inputs that leads to both Toom-Cook and FFT algorithms.  A
description of Toom-Cook can be found in Knuth section 4.3.3, with an
example 3-way calculation after Theorem A.  The 3-way form used in GMP
is described here.

   The operands are each considered split into 3 pieces of equal length
(or the most significant part 1 or 2 limbs shorter than the others).

      high                         low
     +----------+----------+----------+
     |    x2    |    x1    |    x0    |
     +----------+----------+----------+
     
     +----------+----------+----------+
     |    y2    |    y1    |    y0    |
     +----------+----------+----------+

These parts are treated as the coefficients of two polynomials

     X(t) = x2*t^2 + x1*t + x0
     Y(t) = y2*t^2 + y1*t + y0

   Again let b equal the power of 2 which is the size of the x0, x1, y0
and y1 pieces, ie. if they're k limbs each then
b=2^(k*mp_bits_per_limb).  With this x=X(b) and y=Y(b).

   Let a polynomial W(t)=X(t)*Y(t) and suppose its coefficients are

     W(t) = w4*t^4 + w3*t^3 + w2*t^2 + w1*t + w0

The w[i] are going to be determined, and when they are they'll give the
final result using w=W(b), since x*y=X(b)*Y(b)=W(b).  The coefficients
will be roughly b^2 each, and the final W(b) will be an addition like,

      high                                        low
     +-------+-------+
     |       w4      |
     +-------+-------+
            +--------+-------+
            |        w3      |
            +--------+-------+
                    +--------+-------+
                    |        w2      |
                    +--------+-------+
                            +--------+-------+
                            |        w1      |
                            +--------+-------+
                                     +-------+-------+
                                     |       w0      |
                                     +-------+-------+

   The w[i] coefficients could be formed by a simple set of cross
products, like w4=x2*y2, w3=x2*y1+x1*y2, w2=x2*y0+x1*y1+x0*y2 etc, but
this would need all nine x[i]*y[j] for i,j=0,1,2, and would be
equivalent merely to a basecase multiply.  Instead the following
approach is used.

   X(t) and Y(t) are evaluated and multiplied at 5 points, giving
values of W(t) at those points.  The points used can be chosen in
various ways, but in GMP the following are used

     Point    Value
     t=0      x0*y0, which gives w0 immediately
     t=2      (4*x2+2*x1+x0)*(4*y2+2*y1+y0)
     t=1      (x2+x1+x0)*(y2+y1+y0)
     t=1/2    (x2+2*x1+4*x0)*(y2+2*y1+4*y0)
     t=inf    x2*y2, which gives w4 immediately

   At t=1/2 the value calculated is actually 16*X(1/2)*Y(1/2), giving a
value for 16*W(1/2), and this is always an integer.  At t=inf the value
is actually X(t)*Y(t)/t^4 in the limit as t approaches infinity, but
it's much easier to think of as simply x2*y2 giving w4 immediately
(much like x0*y0 at t=0 gives w0 immediately).

   Now each of the points substituted into W(t)=w4*t^4+...+w0 gives a
linear combination of the w[i] coefficients, and the value of those
combinations has just been calculated.

        W(0)   =                                 w0
     16*W(1/2) =    w4 + 2*w3 + 4*w2 + 8*w1 + 16*w0
        W(1)   =    w4 +   w3 +   w2 +   w1 +    w0
        W(2)   = 16*w4 + 8*w3 + 4*w2 + 2*w1 +    w0
        W(inf) =    w4

   This is a set of five equations in five unknowns, and some
elementary linear algebra quickly isolates each w[i], by subtracting
multiples of one equation from another.

   In the code the set of five values W(0),...,W(inf) will represent
those certain linear combinations.  By adding or subtracting one from
another as necessary, values which are each w[i] alone are arrived at.
This involves only a few subtractions of small multiples (some of which
are powers of 2), and so is fast.  A couple of divisions remain by
powers of 2 and one division by 3 (or by 6 rather), and that last uses
the special `mpn_divexact_by3' (Note: Exact Division).

   In the code the values w4, w2 and w0 are formed in the destination
with pointers `E', `C' and `A', and w3 and w1 in temporary space `D'
and `B' are added to them.  There are extra limbs `tD', `tC' and `tB'
at the high end of w3, w2 and w1 which are handled separately.  The
final addition then is as follows.

      high                                        low
     +-------+-------+-------+-------+-------+-------+
     |       E       |       C       |       A       |
     +-------+-------+-------+-------+-------+-------+
              +------+-------++------+-------+
              |      D       ||      B       |
              +------+-------++------+-------+
           --      --      --
          |tD|    |tC|    |tB|
           --      --      --

   The conversion of W(t) values to the coefficients is interpolation.
A polynomial of degree 4 like W(t) is uniquely determined by values
known at 5 different points.  The points can be chosen to make the
linear equations come out with a convenient set of steps for isolating
the w[i].

   In `mpn/generic/mul_n.c' the `interpolate3' routine performs the
interpolation.  The open-coded one-pass version may be a bit hard to
understand, the steps performed can be better seen in the `USE_MORE_MPN'
version.

   Squaring follows the same procedure as multiplication, but there's
only one X(t) and it's evaluated at 5 points, and those values squared
to give values of W(t).  The interpolation is then identical, and in
fact the same `interpolate3' subroutine is used for both squaring and
multiplying.

   Toom-3 is asymptotically O(N^1.465), the exponent being
log(5)/log(3), representing 5 recursive multiplies of 1/3 the original
size.  This is an improvement over Karatsuba at O(N^1.585), though
Toom-Cook does more work in the evaluation and interpolation and so it
only realizes its advantage above a certain size.

   Near the crossover between Toom-3 and Karatsuba there's generally a
range of sizes where the difference between the two is small.
`TOOM3_MUL_THRESHOLD' is a somewhat arbitrary point in that range and
successive runs of the tune program can give different values due to
small variations in measuring.  A graph of time versus size for the two
shows the effect, see `tune/README'.

   At the fairly small sizes where the Toom-3 thresholds occur it's
worth remembering that the asymptotic behaviour for Karatsuba and
Toom-3 can't be expected to make accurate predictions, due of course to
the big influence of all sorts of overheads, and the fact that only a
few recursions of each are being performed.  Even at large sizes
there's a good chance machine dependent effects like cache architecture
will mean actual performance deviates from what might be predicted.

   The formula given above for the Karatsuba algorithm has an
equivalent for Toom-3 involving only five multiplies, but this would be
complicated and unenlightening.

   An alternate view of Toom-3 can be found in Zuras (Note:
References), using a vector to represent the x and y splits and a
matrix multiplication for the evaluation and interpolation stages.  The
matrix inverses are not meant to be actually used, and they have
elements with values much greater than in fact arise in the
interpolation steps.  The diagram shown for the 3-way is attractive,
but again doesn't have to be implemented that way and for example with
a bit of rearrangement just one division by 6 can be done.


automatically generated by info2www version 1.2.2.9