GNU Info

Info Node: (gmp.info)Single Limb Division

(gmp.info)Single Limb Division


Next: Basecase Division Prev: Division Algorithms Up: Division Algorithms
Enter node , (file) or (file)node

Single Limb Division
--------------------

   Nx1 division is implemented using repeated 2x1 divisions from high
to low, either with a hardware divide instruction or a multiplication by
inverse, whichever is best on a given CPU.

   The multiply by inverse follows section 8 of "Division by Invariant
Integers using Multiplication" by Granlund and Montgomery (Note:
References) and is implemented as `udiv_qrnnd_preinv' in
`gmp-impl.h'.  The idea is to have a fixed-point approximation to 1/d
(see `invert_limb') and then multiply by the high limb (plus one bit)
of the dividend to get a quotient q.  With d normalized (high bit set),
q is no more than 1 too small.  Subtracting q*d from the dividend gives
a remainder, and reveals whether q or q-1 is correct.

   The result is a division done with two multiplications and four or
five arithmetic operations.  On CPUs with low latency multipliers this
can be much faster than a hardware divide, though the cost of
calculating the inverse at the start may mean it's only better on
inputs bigger than say 4 or 5 limbs.

   When a divisor must be normalized, either for the generic C
`__udiv_qrnnd_c' or the multiply by inverse, the division performed is
actually a*2^k by d*2^k where a is the dividend and k is the power
necessary to have the high bit of d*2^k set.  The bit shifts for the
dividend are usually accomplished "on the fly" meaning by extracting
the appropriate bits at each step.  Done this way the quotient limbs
come out aligned ready to store.  When only the remainder is wanted, an
alternative is to take the dividend limbs unshifted and calculate r = a
mod d*2^k followed by an extra final step r*2^k mod d*2^k.  This can
help on CPUs with poor bit shifts or few registers.

   The multiply by inverse can be done two limbs at a time.  The
calculation is basically the same, but the inverse is two limbs and the
divisor treated as if padded with a low zero limb.  This means more
work, since the inverse will need a 2x2 multiply, but the four 1x1s to
do that are independent and can therefore be done partly or wholly in
parallel.  Likewise for a 2x1 calculating q*d.  The net effect is to
process two limbs with roughly the same two multiplies worth of latency
that one limb at a time gives.  This extends to 3 or 4 limbs at a time,
though the extra work to apply the inverse will almost certainly soon
reach the limits of multiplier throughput.

   A similar approach in reverse can be taken to process just half a
limb at a time if the divisor is only a half limb.  In this case the
1x1 multiply for the inverse effectively becomes two (1/2)x1 for each
limb, which can be a saving on CPUs with a fast half limb multiply, or
in fact if the only multiply is a half limb, and especially if it's not
pipelined.


automatically generated by info2www version 1.2.2.9