Loop Unrolling

Loop unrolling is a key mechanism for driving parallelism in the High Level Synthesis (HLS) process. Designers must be aware that dependencies from iteration to iteration can limit absolute parallelism when the loop is partially or even fully unrolled.

Stuart Clubb

Overview
A simple loop example in C++ can be an accumulator or multiply-accumulate. Here we show the simple loop code for an accumulation of four values from an array:
```
ac_int<8, true> value[4];
ac_int<10, true> acc = 0;
LOOP:for (int i = 0; i < 4; i++) {
	acc+= value[i];
}
```
Conceptually if the loop is not constrained, we can think of the loop as having a virtual clock cycle “wait” at the end of every iteration. Loop unrolling takes the sequential iterations and attempts to make them parallel. Loops can be left rolled, partially unrolled, or fully unrolled, depending upon the parallelism and throughput that the designer is seeking to achieve.
If left rolled, this four-iteration loop would have a theoretical minimum throughput of four clock cycles. Typically, you would expect to see one adder for the accumulator, a small counter for the loop index, and a multiplexer controlled by ethe two-bit counter (0,1,2,3)
Partial unrolling specifies how many copies of the sequential loop body are to be made in an attempt to parallelize the implementation. In this case, partially unrolling by a factor of two would result in a representation that involves two loop iterations. The first iteration represents iterations 0 and 1, while the second iteration represents iterations 2 and 3 of the original loop.
We don’t need to rewrite the code, we just apply an unrolling constraint. But if we were to rewrite it, it would look like this:
```
LOOP:for (int i = 0; i < 2; i++) {
	acc += value[2i];
	acc += value[2i + 1];
}
```
This can alternatively be rewritten to show that there is now an adder as well as an accumulator:
```
LOOP:for (int i = 0; i < 2; i++) {
	acc += value[2i] + value[2i + 1] ;
}
```
The multiplexing access to the “value” variable will be doubled, with half the number of inputs. That assumes that the “value” variable can be accessed in parallel. If the variable were to be stored in a single ported RAM, we might be limited to taking one value at a time, in which case trying to create more parallelism would be wasteful.
Fully unrolling the loop, or unrolling by the maximal number of iterations means that the loop will conceptually disappear:
acc = value[0] + value[1] + value[2] + value[3] ;
This requires three adders in an adder tree. It eliminates the index counter, and the multiplexing as now everything is hard-wired with constant indexes. Again, this assumes that the “value” can be accessed in parallel.
Loop unrolling constraints invariably go together with loop pipelining constraints to drive system-level throughput. It is almost always the case that matching the loop unrolling to the bandwidth of variables involved and using appropriate pipelining (Initiation Interval) will produce the optimum area/performance results.
For example if our accumulated value is stored in a single ported RAM, unrolling partially by two would not be beneficial because the loop body would contain two accesses to the RAM:
```
LOOP:for (int i = 0; i < 2; i++) {
	acc += value[2i];
	acc += value[2i + 1];
}
```
Therefore, any Finite State Machine would have to use one state (clock cycle) to read the first value, and another state (clock cycle) to read the second value then adding the two values together and then accumulating the result. That would mean that you could also only then repeat this iterative action for the [2] and [3] address accesses after two clock cycles, due to needing two reads of the RAM. Attempting to read two values from the RAM would be impossible (as it stands)
It would therefore be more efficient to keep the loop body rolled, such that the loop body could iterate every clock cycle and not require the intermediate adder or storage.
Only if we can increase the bandwidth of the RAM by making it dual-port, or increasing the data width by two (such that address [0] and [1] are side-by side) can we improve system0level throughput.
Loops can be unrolled by unequal factors. This tends to be uncommon, but it can be done.
For example, a loop that iterates 7 times could be unrolled by a factor of two.
```
LOOP:for (int i = 0; i < 7; i++) {
	acc += value[i];
}
```
Becomes manually unrolled code:
```
LOOP:for (int i = 0; i < 2; i++) {
	acc += value[2*i];
	if (i/=3) {
		acc += value[2*i+1]
	}
}
```
However, as you become experienced with HLS, you will learn that conditionals in unrolled loops are not desirable, and instead we can think of this code as:
```
LOOP:for (int i = 0; i < 2; i++) {
	acc += value[2 * i] + (i==1) ? 0 : value[2 * i + 1];
}
```
Resulting in a more efficient implementation.
Loop Unrolling Forum
- No content found.

HLS Hackathon

Wednesday, Jul 02nd-8:00 AM PDT

Accelerating Inferencing Using HLS Hackathon

Loop Unrolling

Overview

Loop Unrolling Forum