Cyclic Control Flow Graphs: Loop Unrolling

Posted on Sun 22 December 2024 in high_level_synthesis

Recall from Part III that we considered the GCD algorithm and manually converted this into RTL in FSMD form; we made no effort to automate this conversion. This was clearly a cop-out. In Part II we discussed an approach to scheduling which utilised the topological sort from a DFS, and were careful to curate those examples so that they contained no cycles; this is because a topological sort only exists for directed acyclic graphs (DAGs), i.e. those graphs which contain no cycles. Clearly this is a rather limiting constraint; the GCD's CFG had a backedge from while.body to while.cond, and consequently can't be scheduled using the DFS.
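As a quick illustration of why the backedge is fatal here, the standard way to detect one is a DFS with white/grey/black colouring: an edge into a grey (still-on-the-stack) vertex is a backedge, and hence a cycle. The sketch below is mine, purely for illustration -- the adjacency-matrix encoding and function names don't come from any earlier part of this series:

```c
#include <stdbool.h>

/* Sketch: detect whether a CFG contains a backedge via DFS colouring.
 * WHITE = unvisited, GREY = on the DFS stack, BLACK = finished. An edge
 * to a GREY vertex is a backedge, i.e. a cycle, which is exactly what
 * rules out a topological-sort-based schedule. */

#define MAX_NODES 8

enum colour { WHITE, GREY, BLACK };

static bool dfs_has_backedge(bool adj[MAX_NODES][MAX_NODES], int n,
                             int node, enum colour colours[MAX_NODES]) {
    colours[node] = GREY;
    for (int succ = 0; succ < n; ++succ) {
        if (!adj[node][succ])
            continue;
        if (colours[succ] == GREY)      /* edge back into the DFS stack */
            return true;
        if (colours[succ] == WHITE &&
            dfs_has_backedge(adj, n, succ, colours))
            return true;
    }
    colours[node] = BLACK;
    return false;
}

bool cfg_has_cycle(bool adj[MAX_NODES][MAX_NODES], int n, int entry) {
    enum colour colours[MAX_NODES] = { WHITE };
    return dfs_has_backedge(adj, n, entry, colours);
}
```

Run on a GCD-like CFG (entry, while.cond, while.body, exit), the while.body to while.cond edge is flagged as a cycle; remove it and the graph becomes a DAG again.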

While I promise that we will get around to scheduling something like the GCD, for the moment it's convenient to treat a particular subset of algorithms which contain cycles; those algorithms with for-loops that can be unrolled.

Loop Unrolling

Loop-unrolling is an optimisation that compilers can undertake to manipulate loop nests. Consider a for-loop which iterates \(N\) times; this can be replaced with \(N\) copies of the loop body, with the loop index variable substituted by the corresponding numerical value in each copy. For instance, subjecting the following kernel

for (int idx=0; idx<N; ++idx) {
    c[idx] = a[idx] + b[idx];
}

to loop unrolling should yield something like the following:

c[0] = a[0] + b[0];
c[1] = a[1] + b[1];
c[2] = a[2] + b[2];
...
c[N-1] = a[N-1] + b[N-1];

For loop-unrolling in its simplest form, \(N\) should be a compile-time constant; this allows the compiler to unroll the loop nest fully (optimisations like loop-tiling can be undertaken on loop nests with runtime bounds, although we'll ignore this here for the sake of ease). The utility of this for scheduling is clear; we've completely eliminated the backedge resulting from the index bounds-check and subsequent branch.
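As an aside, when \(N\) is only known at runtime the loop can still be partially unrolled by some fixed factor, with a scalar remainder loop mopping up any leftover iterations. A minimal sketch (the function name and unroll factor are mine, purely for illustration, not any particular compiler's output):

```c
/* Partial unrolling by a factor of two for a runtime bound n: the loop
 * body is duplicated, and a remainder loop handles any odd trailing
 * iteration. */
void vector_add_unroll2(const int *a, const int *b, int *c, int n) {
    int idx = 0;
    for (; idx + 1 < n; idx += 2) {   /* unrolled by factor 2 */
        c[idx]     = a[idx]     + b[idx];
        c[idx + 1] = a[idx + 1] + b[idx + 1];
    }
    for (; idx < n; ++idx)            /* remainder iteration(s) */
        c[idx] = a[idx] + b[idx];
}
```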

As an aside, you may be wondering why a compiler would want to provide this functionality in the first place. There are a few reasons, but principal amongst them are:

  1. Branching is an expensive operation for a CPU; a mispredicted branch may require the instruction pipeline to be flushed.
  2. If there are no carried dependencies between loop iterations, SIMD operations can be leveraged (like in the vector addition example above).

An Example: Vector Addition

Consider the vector addition kernel:

#define SIZE 3
void vector_add(const int* __restrict__ a, const int* __restrict__ b, int* __restrict__ c) {
    for (int idx=0; idx<SIZE; ++idx) {
        c[idx] = a[idx] + b[idx];
    }
}

We can transform this into LLVM IR using the mem2reg and simplifycfg passes:

$> clang -O0 -Xclang -disable-O0-optnone -emit-llvm -S vector_add.c -o vector_add.ll
$> opt -S vector_add.ll -passes='mem2reg,simplifycfg' -o vector_add.ll

which yields:

define dso_local void @vector_add(i32* noalias noundef %a, i32* noalias noundef %b, i32* noalias noundef %c) #0 {
entry:
  br label %check

check:
  %loop_idx.current = phi i32 [ 0, %entry ], [ %loop_idx.next, %sum ]
  %done = icmp slt i32 %loop_idx.current, 3
  br i1 %done, label %sum, label %exit

sum:
  %a_idx = sext i32 %loop_idx.current to i64
  %a_ptr = getelementptr inbounds i32, i32* %a, i64 %a_idx
  %a_val = load i32, i32* %a_ptr, align 4
  %b_idx = sext i32 %loop_idx.current to i64
  %b_ptr = getelementptr inbounds i32, i32* %b, i64 %b_idx
  %b_val = load i32, i32* %b_ptr, align 4
  %c_val = add nsw i32 %a_val, %b_val
  %c_idx = sext i32 %loop_idx.current to i64
  %c_ptr = getelementptr inbounds i32, i32* %c, i64 %c_idx
  store i32 %c_val, i32* %c_ptr, align 4
  %loop_idx.next = add nsw i32 %loop_idx.current, 1
  br label %check, !llvm.loop !6

exit:
  ret void
}

where I've substituted the auto-generated variable names for ones that are a little more descriptive. We can get LLVM to generate the CFG pictorially as follows:

$> opt -dot-cfg -disable-output -enable-new-pm=0 vector_add.ll

The mechanics of this are hopefully clear; we branch from the check basic block to the sum basic block, which subsequently branches back to check, and so on until check can finally branch to exit.

Unrolling the Kernel

We can enforce loop-unrolling through the addition of a further loop-unroll pass; each loop iteration's body is duplicated in turn, and once the now-unconditional branches between iterations have been folded away we're left with the following:

define dso_local void @vector_add(i32* noalias noundef %a, i32* noalias noundef %b, i32* noalias noundef %c) #0 {
entry:
  ;; Compute c[0] = a[0] + b[0]
  %a_val = load i32, i32* %a, align 4
  %b_val = load i32, i32* %b, align 4
  %c_val = add nsw i32 %a_val, %b_val
  store i32 %c_val, i32* %c, align 4

  ;; Compute c[1] = a[1] + b[1]
  %a_ptr.1 = getelementptr inbounds i32, i32* %a, i64 1
  %a_val.1 = load i32, i32* %a_ptr.1, align 4
  %b_ptr.1 = getelementptr inbounds i32, i32* %b, i64 1
  %b_val.1 = load i32, i32* %b_ptr.1, align 4
  %c_val.1 = add nsw i32 %a_val.1, %b_val.1
  %c_ptr.1 = getelementptr inbounds i32, i32* %c, i64 1
  store i32 %c_val.1, i32* %c_ptr.1, align 4

  ;; Compute c[2] = a[2] + b[2]
  %a_ptr.2 = getelementptr inbounds i32, i32* %a, i64 2
  %a_val.2 = load i32, i32* %a_ptr.2, align 4
  %b_ptr.2 = getelementptr inbounds i32, i32* %b, i64 2
  %b_val.2 = load i32, i32* %b_ptr.2, align 4
  %c_val.2 = add nsw i32 %a_val.2, %b_val.2
  %c_ptr.2 = getelementptr inbounds i32, i32* %c, i64 2
  store i32 %c_val.2, i32* %c_ptr.2, align 4

  ;; Finished
  ret void
}

Clearly there are no loops here; the entire function has been implemented in a single basic block. Furthermore, since there are no carried loop dependencies, each element in the vector addition can be computed in parallel -- the first four instructions which deal with c[0] = a[0] + b[0] have no bearing on the subsequent sections where c[i] = a[i] + b[i] for \(i\gt 0\). This unrolling then allows us to extract concurrency from the loop nest; vector addition could, in principle, be implemented in an FPGA in \(\mathcal{O}(1)\) time, where the RTL resembles one large SIMD operation.

However, this does not come for free. For a start, unrolling the loop and simply doing a SIMD addition would require the RAM module to service the entire SIMD register in a single cycle, while our RAM module can only supply a single element per cycle. We could trivially alter the RAM module to provide burst access to the underlying memory, but this is a tweak we can defer to a later post. The more pressing issue is that we can't just arbitrarily unroll a loop, since the FPGA has finite resources. We might be able to do this for a 100- or even 1000-element vector, but at some point we're going to run out of resources. Consider the case where we can only implement 10 ALUs in the FPGA fabric; if the vectors have 16 elements, we can add 10 of the elements in one cycle, but the remaining 6 additions must be performed in a subsequent cycle. This mapping of operations to hardware resources is known as "binding". Hence, the schedule we obtain through analysis of the CFG alone only provides a "best case" schedule; we obtain a more realistic schedule when taking into account the constraints posed by binding.
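The arithmetic behind the 10-ALU example is just an integer ceiling; a throwaway helper (assuming, for simplicity, identical single-cycle adders) makes it concrete:

```c
/* Toy illustration of the binding constraint: with a fixed number of
 * adders, an n-element vector addition needs ceil(n / n_adders) rounds
 * of additions rather than one. */
int addition_rounds(int n_elements, int n_adders) {
    return (n_elements + n_adders - 1) / n_adders;  /* integer ceiling */
}
```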

List Scheduling

In the previous post, we alluded to some physical constraints of the RAM modules: only a single read or write can happen at a time, and reads and writes cannot happen concurrently. For a realistic schedule, such constraints must be provided for each resource that we use. An arithmetic unit used for addition, for instance, could take some non-zero number of cycles to produce a result; this is the latency of the resource. Similarly, we touched on the fact that there aren't infinite resources in the FPGA fabric allowing us to do an arbitrary number of additions in a single cycle. The combination of each resource's latency and the number available of each resource forms the resource constraints, and these must be accounted for when scheduling. This is referred to as list scheduling: it combines the heuristics we've talked about previously (ASAP and ALAP) with the physical resource constraints. Even if an instruction is ready to be scheduled by virtue of its dependencies having been resolved, there must be a physical resource with capacity to undertake the instruction, else it must be deferred to a later cycle when a resource is available.
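To make the idea concrete, here's a minimal list scheduler in C under stated assumptions: the dependency graph is acyclic, every operation occupies one unit of its resource class for that class's full latency, and ready operations are issued greedily in index order. The data layout is my own for illustration, not any real HLS tool's representation:

```c
#define MAX_DEPS 4

typedef struct {
    int rclass;             /* which resource class this op uses        */
    int n_deps;
    int deps[MAX_DEPS];     /* indices of operations this op depends on */
} op_t;

/* List-schedule n_ops operations. latency[c] and count[c] give the
 * latency and unit count of resource class c; start[i] receives the
 * cycle on which op i is issued. Returns the cycle on which everything
 * has finished. Assumes the dependency graph is acyclic. */
int list_schedule(const op_t *ops, int n_ops,
                  const int *latency, const int *count, int *start) {
    int scheduled = 0, cycle = 0, makespan = 0;
    for (int i = 0; i < n_ops; ++i)
        start[i] = -1;
    while (scheduled < n_ops) {
        for (int i = 0; i < n_ops; ++i) {
            if (start[i] >= 0)
                continue;                         /* already issued      */
            int ready = 1;
            for (int d = 0; d < ops[i].n_deps; ++d) {
                int dep = ops[i].deps[d];
                if (start[dep] < 0 ||
                    start[dep] + latency[ops[dep].rclass] > cycle)
                    ready = 0;                    /* dependency pending  */
            }
            if (!ready)
                continue;
            int busy = 0;                         /* units occupied now  */
            for (int j = 0; j < n_ops; ++j)
                if (start[j] >= 0 && ops[j].rclass == ops[i].rclass &&
                    start[j] + latency[ops[j].rclass] > cycle)
                    ++busy;
            if (busy >= count[ops[i].rclass])
                continue;                         /* defer: no free unit */
            start[i] = cycle;
            ++scheduled;
            if (cycle + latency[ops[i].rclass] > makespan)
                makespan = cycle + latency[ops[i].rclass];
        }
        ++cycle;
    }
    return makespan;
}
```

With a single adder of latency three, two independent additions finish on cycle six: the second is ready on cycle zero but must be deferred until the adder frees up on cycle three, which is exactly the deferral described above.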

For the vector addition example, we'll assume that we have a single adder resource which has a latency of three cycles. We'll find it convenient to have a dedicated adder entity which fakes the addition latency:

adder : process(clk) is
    variable count : integer := 0;
begin
    if rising_edge(clk) then
        if (rst = '1') then
            done <= '0';
            c    <= (others => '0');
        else
            if (start = '1') then
                count := 0;
                done  <= '0';
            end if;
            count := count + 1;
            if (count = C_LATENCY) then
                c    <= std_logic_vector(unsigned(a) + unsigned(b));
                done <= '1';
            end if;
        end if;
    end if;
end process adder;

When the start signal is asserted for a single cycle, the process shall do nothing for C_LATENCY cycles, and on the third subsequent clock rising edge shall return the sum of the values presented on the a and b ports, simultaneously asserting the done signal.
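The same behaviour can be modelled cycle-by-cycle in C, which is a handy way to sanity-check the latency before simulating. This is my own sketch, with each call to adder_tick standing in for one rising clock edge and done cleared whenever a new start arrives:

```c
#define C_LATENCY 3  /* matches the three-cycle adder described above */

/* Cycle-level model of the latency-faking adder process. One call to
 * adder_tick models one rising clock edge. */
typedef struct {
    int count;  /* cycles elapsed since start was asserted */
    int c;      /* result port                             */
    int done;   /* result-valid flag                       */
} adder_t;

void adder_tick(adder_t *s, int rst, int start, int a, int b) {
    if (rst) {
        s->done = 0;
        s->c = 0;
        return;
    }
    if (start) {
        s->count = 0;
        s->done = 0;  /* a new operation invalidates the old result */
    }
    s->count += 1;
    if (s->count == C_LATENCY) {
        s->c = a + b;
        s->done = 1;
    }
}
```

Pulsing start with a = 2 and b = 3 leaves done low on that edge and the next; on the third edge c holds 5 and done goes high, mirroring the VHDL process.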

For the loop unrolled vector addition, we have the following dependency graph:

Dependency graph for the vector addition kernel. Each unrolled section takes the same form, and N is a placeholder from 0 through to the size of the vector.

We can draw the corresponding schedule based on the resource utilisation at different times. The following depicts such a schedule; each resource is enumerated vertically, and each operation takes some number of clock cycles to complete, graphed as a horizontal span for a particular resource:

Resource schedule for 3-element vector addition where the latency for each addition is three cycles and the latency for each memory operation is one cycle.

We see the effects of an ASAP schedule here; while load 1 and load 2 instructions aren't needed until cycles 6 and 9, respectively, to feed the adder resource, the a_ram and b_ram resources that they rely on are free from cycle 2. As a result, the instructions are scheduled as early as is practical. If we were to use the ALAP heuristic, then load 1 and load 2 would be scheduled on cycles 5 and 8, respectively, to feed the adder on the subsequent cycles.
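The ALAP arithmetic here is just "needed cycle minus load latency"; a trivial check, with the cycle numbers taken from the schedule above and single-cycle loads assumed:

```c
#define MEM_LATENCY 1  /* one cycle per RAM read, as in the schedule above */

/* ALAP issue cycle for a load whose result feeds the adder on use_cycle. */
int alap_issue_cycle(int use_cycle) {
    return use_cycle - MEM_LATENCY;
}
```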

You might wonder why there's a cycle delay between the end of load 0 and the start of the addition to compute c_val.0. The issue here is that once we've requested the values in the A and B RAMs, they must be registered so that they can persist on the input ports to the adder module, hence an additional cycle to register the values returned from the A and B RAMs is required.

FSMD Implementation

The following is the trace from the FSMD which implements the vector addition kernel as scheduled above:

Simulation of the vector add module; performs addition of the vectors [0, 1, 2] and [2, 2, 0]. On completion, we can see that the contents of RAM C are [2, 3, 2], as expected. Note that you can horizontally scroll this image.

Hopefully it's clear that the above matches up well with the schedule we depicted; on the first, second and third clock rising edges, RAMs A and B are presented with as ascending address, and on the following cycles the corresponding data_out ports from the RAM modules contain the values at those locations. In the adder module, the start signal is deasserted on cycles four, seven and ten indicating that the addition commences on the previous cycles (three, six and nine) in accordance with our schedule, and the result on the c port transitions on cycles six, nine and twelve; this port then directly feeds into the data_in port on RAM C, whose wr_enable is asserted the same cycle, storing the value at the provided address. The contents of RAM C are then [2, 3, 2], as expected, from cycle thirteen onwards.