Loops optimization technique: unrolling

By Andrey Dmitriev

Sunday, November 09, 2003

This material shows how works "loops unrolling" technique for optimization. Its a very easy, but effective way to increase performance of C-based application (or, of course, DLL, which called from LabVIEW). Two compilers will be compared: Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 12.00.8804 and Intel(R) C++ Compiler for 32-bit applications, Version 7.0.

 

 

Complete code for testing shown below:

// Unroll Loop.cpp : Defines the entry point for the console application.//

#define BODY sum += (int)(a[i]*a[i]); i++;

#include "stdafx.h"
#include <malloc.h>
#include <stdio.h>
#include <windows.h>

#define BLOCK_SIZE  (64*524288)

int main(int argc, char* argv[])
{
    int i;
    int sum = 0;
    int StartTime, CalcTime;

    short int *a;

    a = (short int*)malloc (BLOCK_SIZE * sizeof(short int));

    printf ("Unrolling test:\n");
    StartTime = GetTickCount();

    for (i = 0; i < BLOCK_SIZE; i++){
        sum += (int)(a[i]*a[i]);
    }

    CalcTime = GetTickCount() - StartTime;
    printf("Execution Time = %d ms\n", CalcTime);

    free(a);

    return sum;
}

On my computer this code executed approximately 200 milliseconds.
Unrolling a loop reduces the cost of loop overhead by decreasing the number of times you check the loop condition. Essentially, loop unrolling increases the number of computations per iteration. To unroll a loop, you perform two or more of the same statements for each iteration, and increment the counter accordingly. So instead of code of for-loop above:

    for (i = 0; i < BLOCK_SIZE; i+=4){
        sum += (int)(a[i]*a[i]);
        sum += (int)(a[i+1]*a[i+1]);
        sum += (int)(a[i+2]*a[i+2]);
        sum += (int)(a[i+3]*a[i+3]);
}

This code a little bit faster: on my P4 1.6 GHz needs approximately 170 millisecond. But this code is a very hard for support. If number of unrolling steps will be changed, then you must manually correct counter, and add increment... One idea - using preprocessor for todo this. Unfortunately C preprocessor doesn't supported cyclic macros, but amount of work will be reduced, if increment of counter will be added into define declaration:

#define BODY sum += (int)(a[i]*a[i]); i++;

    for (i = 0; i < BLOCK_SIZE;){
        BODY; BODY; BODY; BODY;
        BODY; BODY; BODY; BODY;
        BODY; BODY; BODY; BODY;
        BODY; BODY; BODY; BODY;
    }

With INTEL C++ compiler situation much more better, because this compiler able to unroll loops automatically!

You can unroll loops and specify the maximum number of times you want the compiler to do so.

How to Enable Loop Unrolling

Use the -Qunroll[ n] option to unroll loops. The n variable you enter determines the maximum number of times for the unrolling operation. This applies only to loops that the compiler determines should be unrolled. Omit n to let the compiler decide whether to perform unrolling or not.

When specifying high values to unroll loops, be aware that your application may exhaust certain resources, such as registers, that can slow program performance. You should consider timing your application if you specify high values to unroll loops.

How to Disable Loop Unrooling

Disable loop unrolling by setting the n variable to 0: -Qunroll0. The default for both IA-32 and Itanium-based systems is ON. However, since n is not specified, the compiler automatically chooses the maximum number of times to unroll a loop based on internal optimization algorithms. You can also omit n to allow the compiler to decide whether or not to perform loop unrolling. In these cases there is no limit to the number of times a loop may be unrolled.

unroll Directive

The unroll directive ( unroll (n)|nounroll ) tells the compiler how many times to unroll a counted loop. The syntax for this directive is: #pragma unroll #pragma unroll (n) #pragma nounroll where n is an integer constant from 0 through 255. The unroll directive must precede the for statement for each for loop it affects. If n is specified, the optimizer unrolls the loop n times. If n is omitted, or if it is outside the allowed range, the optimizer assigns the number of times to unroll the loop. The unroll directive overrides any setting of loop unrolling from the command line. The directive can be applied only for the innermost nested loop. If applied to the outer loops, it is ignored. The compiler generates correct code by comparing n and the loop count:

#pragma unroll(4)

    for (i = 0; i < BLOCK_SIZE; i++){
        sum += (int)(a[i]*a[i]);
    }

LabVIEW cycle

It is a little bit complicated to create exactly the same construction in LabVIEW. But may be following diagram will be according to code above:

This code needs 1,5 seconds for execution... LabVIEW code more than 6 times slowly. Not very interesting for unrolling - its in any case will be more slowly...
More optimal will be following construction:

Thic code without cycle. Now this code needs less than half second for execution.

At the end summary graph for all samples above:

Resume. If you using Intel C++ compiler, then you don't need any special modifications for unrolling. Compiler done with this automatically.
Calculations with long cycles in LabVIEW usually slowly, and it will be good idea to replace some parts of code with C-based DLLs.