GNU Info

Info Node: (gmp.info)Assembler Loop Unrolling

(gmp.info)Assembler Loop Unrolling


Prev: Assembler Software Pipelining Up: Assembler Coding
Enter node , (file) or (file)node

Loop Unrolling
--------------

   Loop unrolling consists of replicating code so that several limbs are
processed in each loop.  At a minimum this reduces loop overheads by a
corresponding factor, but it can also allow better register usage, for
example alternately using one register combination and then another.
Judicious use of `m4' macros can help avoid lots of duplication in the
source code.

   Unrolling is commonly done to a power of 2 multiple so the number of
unrolled loops and the number of remaining limbs can be calculated with
a shift and mask.  But other multiples can be used too, just by
subtracting each N limbs processed from a counter and waiting for less
than N remaining (or offsetting the counter by N so it goes negative
when there's less than N remaining).

   The limbs not a multiple of the unrolling can be handled in various
ways, for example

   * A simple loop at the end (or the start) to process the excess.
     Care will be wanted that it isn't too much slower than the
     unrolled part.

   * A set of binary tests, for example after an 8-limb unrolling, test
     for 4 more limbs to process, then a further 2 more or not, and
     finally 1 more or not.  This will probably take more code space
     than a simple loop.

   * A `switch' statement, providing separate code for each possible
     excess, for example an 8-limb unrolling would have separate code
     for 0 remaining, 1 remaining, etc, up to 7 remaining.  This might
     take a lot of code, but may be the best way to optimize all cases
     in combination with a deep pipelined loop.

   * A computed jump into the middle of the loop, thus making the first
     iteration handle the excess.  This should make times smoothly
     increase with size, which is attractive, but setups for the jump
     and adjustments for pointers can be tricky and could become quite
     difficult in combination with deep pipelining.

   One way to write the setups and finishups for a pipelined unrolled
loop is simply to duplicate the loop at the start and the end, then
delete instructions at the start which have no valid antecedents, and
delete instructions at the end whose results are unwanted.  Sizes not a
multiple of the unrolling can then be handled as desired.


automatically generated by info2www version 1.2.2.9