Memory Subsystems
Posted on Sun 08 December 2024 in high_level_synthesis
So far, we've dealt with algorithms where the data is magically presented to the various arithmetic operators which constitute our algorithm. While this is a reasonable assumption for some small dataflow element which is presented with new data on each clock cycle from another amorphous dataflow element, we're lacking any notion of persistence of the values we're computing. For a microprocessor, this is the motivation behind the von Neumann architecture; data flows from some persistent memory to the arithmetic units, and the result is streamed back to this persistent memory once computed. However, it's a little less clear how this manifests on an FPGA; data is spatially streaming through the various instantiated operators that form our algorithm.
For the moment, we'll ignore peripheral memory elements like DRAMs; we can come to those later, but ultimately it should all boil down to what we'll explore in this post. The FPGA isn't restricted to instantiating arithmetic operators; it's an enormous array of flip-flops, so the persistence that we require is already there. We can configure these flip-flops to take on the form of a RAM without issue. The purpose of this post is to explore how we infer RAMs from LLVM IR, and how we use these inferred RAMs in the FSMD.
Single-Port RAM Module
First things first, if we're going to be inferring a RAM to RTL, we'd better understand what the RAM looks like in RTL. When reading, we provide some index and we get the value stored at that index on an output bus. Similarly, when writing, we provide some index and a value on an input bus, and the RAM stores the value at that offset in its internal memory space. We need to guard against hazards like reading and writing on the same cycle, hence some control signals must also be used. There are many protocols we could use for these interfaces; AXI is something of a de facto standard, and we'll undoubtedly visit it at some point, but for the moment we'll do something simple:
-- ram_pkg.vhd
package ram_pkg is
    type ram_interface_input is record
        wr_enable : std_logic;
        addr      : std_logic_vector;
        data_in   : std_logic_vector;
    end record ram_interface_input;

    type ram_interface_output is record
        data_out : std_logic_vector;
    end record ram_interface_output;
end package ram_pkg;
The package provides a few convenient record types to bundle signals together and provide cleaner port interfaces.
-- ram.vhd
entity ram is
    generic(
        C_ADDR_WIDTH : positive := 8;
        C_DATA_WIDTH : positive := 8
    );
    port(
        clk    : in  std_logic;
        rst    : in  std_logic;
        input  : in  work.ram_pkg.ram_interface_input(addr(C_ADDR_WIDTH - 1 downto 0), data_in(C_DATA_WIDTH - 1 downto 0));
        output : out work.ram_pkg.ram_interface_output(data_out(C_DATA_WIDTH - 1 downto 0)) := (data_out => (others => '0'))
    );
end entity ram;
In the entity declaration, we have a couple of generics with which we can parameterise the RAM module and constrain the sizing of the record elements we're using on the interface: the address width, which renders the RAM module capable of storing \(2^{\mathrm{C\_ADDR\_WIDTH}}\) addressable elements, and the data width, which allows the RAM to store elements comprising \(\mathrm{C\_DATA\_WIDTH}\) bits. The addr signal is used to request an element at the corresponding address; the RAM module shall obligingly place the stored value at that address on the data_out bus. If wr_enable is set high, then the RAM module shall take the value on the data_in bus and store it at the address indicated by addr. All of these operations are synchronous with the clk signal.
The implementation of this module is straightforward:
architecture behavioural of ram is
    type data_array is array(0 to 2 ** C_ADDR_WIDTH - 1) of std_logic_vector(C_DATA_WIDTH - 1 downto 0);
    signal data : data_array := (others => (others => '0'));
begin
    process(clk) is
    begin
        if rising_edge(clk) then
            if (rst = '1') then
                data <= (others => (others => '0'));
            else
                if (input.wr_enable = '1') then
                    data(to_integer(unsigned(input.addr))) <= input.data_in;
                end if;
                output.data_out <= data(to_integer(unsigned(input.addr)));
            end if;
        end if;
    end process;
end architecture behavioural;
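One subtlety worth calling out: because data is a signal, the final assignment to output.data_out reads the value from before any write scheduled on the same edge, so a simultaneous read and write of one address returns the old contents ("read-first" behaviour). A minimal C++ model of these per-cycle semantics may make this concrete; the SinglePortRam name and shape here are illustrative, not part of the RTL (reset is omitted for brevity):

#include <array>
#include <cstddef>

// Illustrative model of the single-port RAM's per-cycle behaviour.
template <unsigned AddrWidth, typename Data>
struct SinglePortRam {
    std::array<Data, std::size_t{1} << AddrWidth> data{};
    Data data_out{};

    // Models one rising clock edge: a synchronous read whose result lands on
    // data_out, plus an optional write gated by wr_enable.
    void clock(bool wr_enable, std::size_t addr, Data data_in) {
        // Reading before writing mirrors the VHDL signal semantics: a read of
        // an address written on the same edge returns the old contents.
        data_out = data[addr];
        if (wr_enable) {
            data[addr] = data_in;
        }
    }
};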
The following trace comes from a testbench which instantiates the entity with C_ADDR_WIDTH and C_DATA_WIDTH both equal to two. The entity is first stimulated with a reset signal, which ensures the internal state of the RAM is zeroed, followed by four consecutive cycles of writing the negation of the address to the corresponding element in the RAM. Finally, the wr_enable signal is deasserted and the value at address 0x0 is read back out on the data_out bus.

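For reference, the same stimulus can be sketched against the illustrative SinglePortRam model from above; this is a C++ sketch, not the actual VHDL testbench:

#include <cstddef>
#include <cstdint>
#include <cstdio>

int main() {
    // C_ADDR_WIDTH = C_DATA_WIDTH = 2: four elements of two bits each
    SinglePortRam<2, std::uint8_t> ram;

    // Four consecutive write cycles: store the two-bit negation of the address
    for (std::size_t addr = 0; addr < 4; ++addr) {
        ram.clock(/*wr_enable=*/true, addr, static_cast<std::uint8_t>(~addr & 0x3u));
    }

    // Deassert wr_enable and read address 0x0 back; ~0 & 0x3 == 0x3
    ram.clock(/*wr_enable=*/false, 0x0, 0);
    std::printf("data_out = 0x%x\n", ram.data_out);  // prints: data_out = 0x3
}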
Loads and Stores
The simplest type of loop we can consider involves looping over the elements of some iterable; a vector of numbers, a string of characters, etc. Ultimately, this boils down to the same operation; we have some base pointer which points to the start of the iterable, and we increment a loop pointer by some stride that's dictated by the size of the elements to be iterated over (one byte for a character in a string, eight bytes for double precision floating point values in a vector, etc.) to arrive at a neighbouring element in the iterable. For a CPU, this translates into a series of load and/or store instructions (sketched in code just after this list):
1. Initialise the loop pointer to the start of the iterable (i.e. the loop pointer points to the same thing as the base pointer of the iterable).
2. Load the value from the address pointed to by the loop pointer into a register.
3. If the value is to be mutated, manipulate the register, then store the manipulated value back to the address pointed to by the loop pointer.
4. If the loop pointer is not at the end of the iterable, increment the loop pointer by the datum stride and go to (2). Else finish.
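Here's a sketch of these steps with explicit pointer arithmetic; the increment_all function and its signature are hypothetical, and the stride falls out of the element type:

#include <cstddef>

// Hypothetical kernel walking an iterable by pointer; ++p advances by
// sizeof(double) == 8 bytes, the stride dictated by the element type.
void increment_all(double* base, std::size_t count) {
    double* end = base + count;              // one past the final element
    for (double* p = base; p != end; ++p) {  // (1) loop pointer starts at base
        double value = *p;                   // (2) load into a "register"
        value += 1.0;                        // (3) mutate...
        *p = value;                          //     ...and store back
    }                                        // (4) increment and test
}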
Single Element
Let's take a look at what this looks like in LLVM IR for a special case where the iterable has a single element:
void single_element(int* __restrict__ a) {
int a_val = *a;
++a_val;
*a = a_val;
}
Here we load the value from the pointer, increment the underlying value by one and store this incremented value back to the iterable.
define dso_local void @single_element(i32* noalias noundef %a) #0 {
  %2 = load i32, i32* %a, align 4
  %3 = add nsw i32 %2, 1
  store i32 %3, i32* %a, align 4
  ret void
}
Iterating on Multiple Elements
Generalising the above to iterate on multiple elements is similarly trivial, although we have an additional instruction to consider in LLVM IR:
#define SIZE 10

void multiple_element(int a[SIZE]) {
    for (int idx = 0; idx < SIZE; ++idx) {
        int a_val = a[idx];
        ++a_val;
        a[idx] = a_val;
    }
}
define dso_local void @multiple_element(i32* noundef %a) #0 {
  br label %2

2:
  %.0 = phi i32 [ 0, %1 ], [ %11, %4 ]
  %3 = icmp slt i32 %.0, 10
  br i1 %3, label %4, label %12

4:
  %5 = sext i32 %.0 to i64
  %6 = getelementptr inbounds i32, i32* %a, i64 %5
  %7 = load i32, i32* %6, align 4
  %8 = add nsw i32 %7, 1
  %9 = sext i32 %.0 to i64
  %10 = getelementptr inbounds i32, i32* %a, i64 %9
  store i32 %8, i32* %10, align 4
  %11 = add nsw i32 %.0, 1
  br label %2, !llvm.loop !6

12:
  ret void
}
The first (non-trivial) basic block handles the loop index bookkeeping; if we've come from the start of the function, set the index to zero, else take the incremented index from the subsequent basic block, then either branch to the exit basic block (if the loop index has reached its upper bound) or to the loop logic basic block. The next basic block introduces a new instruction, getelementptr, which takes the base pointer and the loop index as arguments, simply returning a pointer to the element offset from the base pointer by the loop index. The rest of this basic block then simply does the same load/store logic as we saw in the previous section, with a second call to getelementptr to retrieve the write address.
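In C terms, getelementptr is just the address computation behind &a[idx]; a small sketch of the equivalence (gep_i32 is a hypothetical name):

#include <cstdint>

// A one-index `getelementptr` over i32 is pure pointer arithmetic: the byte
// offset is idx * sizeof(int32_t), and no memory is read or written.
std::int32_t* gep_i32(std::int32_t* base, std::int64_t idx) {
    return base + idx;  // == (int32_t*)((char*)base + idx * 4), i.e. &base[idx]
}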
Inferring RAMs
For the sake of brevity, we'll assume that any RAM modules shall be passed in as arguments to the function; loads and stores are then restricted to these (mutable) arguments, and no RAM modules shall be inferred for variables that are local to the function. This constraint can be removed without too much additional logical complexity (we just look for getelementptr instructions and infer a RAM from the base pointer argument), but we save on some boilerplate by enforcing it. For each function argument, if the argument is a pointer type, we infer a RAM:
for (auto& arg : function.args()) {
    if (arg.getType()->isPointerTy()) {
        hw_constraints_.AddRAM(std::string(arg.getName()));
    }
}
Constraints
Regarding the sizing of RAM modules: since statically-bounded array arguments in C, like int a[10], simply decay to int *, it's difficult to bound the size of the RAM module, since in LLVM we lose the static size. This is rectified easily enough by adding ancillary information, such as a size parameter to accompany the input array, but for the sake of keeping things as simple as possible, we'll assume that the size of all arrays can be bounded from above by some constant. Of course, this means we'll end up being a little wasteful for those arrays that are smaller than this upper bound, but we're only working with toy problems at the moment anyway, so we needn't concern ourselves too much at this point.
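As a sketch of what that constant bound might look like when inferring RAMs; the kMaxArrayDepth constant and the depth parameter on AddRAM are assumptions for illustration, not the tool's actual API:

// Hypothetical sizing policy: every inferred RAM gets the same conservative
// power-of-two depth, so it maps directly onto an address width.
constexpr std::size_t kMaxArrayDepth = 1024;  // assumed upper bound

for (auto& arg : function.args()) {
    if (arg.getType()->isPointerTy()) {
        hw_constraints_.AddRAM(std::string(arg.getName()), kMaxArrayDepth);
    }
}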
For each unique base pointer, we can infer the need to create a RAM module as we've just explained above. This introduces the constraint that we do not permit pointer aliasing; if we're generating RAMs based on unique base pointers in the getelementptr instruction, we can't have multiple pointers to the same data, else we end up duplicating RAMs. Hence, all vector inputs to a function should be marked __restrict__.
Given our RAM is single-port, we can only request a single value per clock cycle, meaning that consecutive load instructions from the same RAM module must be serialised; the CFG in LLVM IR does not encode this constraint, so it's something we must enforce with logic in our HLS tool. Similarly, we can't load and store to/from the same RAM module in a single cycle, and again provisions must be made in our HLS tool. We could eventually relax this by adding dual-port RAM modules, but then the constraint becomes not being able to load three values in the same clock cycle, so we'll always have to wrangle the logic ourselves to some degree.
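One minimal way a scheduler might enforce this, sketched with hypothetical names: track which RAMs have already been claimed in the current cycle, and defer any instruction whose RAM is busy to a later state.

#include <set>
#include <string>

// Returns true if `ram_name` is still free this cycle and claims it; a false
// return means the load/store must be serialised into a later cycle.
bool TryClaimRamThisCycle(const std::string& ram_name,
                          std::set<std::string>& rams_busy_this_cycle) {
    // insert() reports whether the RAM was newly claimed (i.e. not yet busy)
    return rams_busy_this_cycle.insert(ram_name).second;
}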
Single Element
Returning to the single_element example above, the load and store take the form:
%2 = load i32, i32* %a, align 4
...
store i32 %3, i32* %a, align 4
So, for the load instruction, the RAM that we're interacting with is encoded in the first argument, and for the store instruction it's the second argument. Consequently, if %a corresponds to a function argument that has had a RAM inferred, we can immediately manipulate the corresponding ram module in our FSMD.
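A sketch of that direct lookup (the function name is hypothetical; LLVM's LoadInst and StoreInst expose the pointer operand, which is operand 0 and operand 1 respectively):

#include "llvm/IR/Instructions.h"
#include <string>

// For loads/stores whose address is the argument pointer itself, the RAM name
// falls straight out of the pointer operand.
std::string RamNameForDirectAccess(llvm::Instruction* instr) {
    if (auto* load = llvm::dyn_cast<llvm::LoadInst>(instr)) {
        return std::string(load->getPointerOperand()->getName());
    }
    if (auto* store = llvm::dyn_cast<llvm::StoreInst>(instr)) {
        return std::string(store->getPointerOperand()->getName());
    }
    return {};  // not a memory access
}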
As mentioned previously, we're working with a single-port RAM here, so we need to take some care to ensure that we're not storing and loading on the same cycle, and similarly that we're not doing multiple stores or loads on the same cycle. We'll discuss in the next post how we enforce this.
Multiple Elements
Let's revisit the multiple_element example from the previous section. The loads and stores are slightly different from the single_element example, where we just interacted with the data at the base address:
%6 = getelementptr inbounds i32, i32* %a, i64 %5
%7 = load i32, i32* %6, align 4
...
%10 = getelementptr inbounds i32, i32* %a, i64 %9
store i32 %8, i32* %10, align 4
While the load and store instructions still have the piece of memory they're interacting with as their first and second arguments respectively, these are now the results of getelementptr instructions and don't directly encode the function argument %a, so we can't immediately retrieve the RAM module to manipulate. Rather, we have to follow the getelementptr instruction and retrieve both the RAM module and the offset from there. The following is a snippet which parses a store instruction:
// The store instruction we're parsing
llvm::Instruction* instr = /* ... */;
// Try to get the pointer operand of the store (its second operand); this is
// either the base pointer itself or the result of a `getelementptr`
if (auto* ptr = llvm::dyn_cast<llvm::Instruction>(instr->getOperand(1))) {
    // If it was the result of a `getelementptr`, then interpret it as such and
    // retrieve the base pointer as the first operand
    if (llvm::isa<llvm::GetElementPtrInst>(ptr)) {
        auto ram_name = std::string(ptr->getOperand(0)->getName());
        auto offset = ptr->getOperand(1);
    }
}
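The load counterpart is symmetric, with the address in operand 0; a sketch using LLVM's GetElementPtrInst accessors (the ParseLoad name is hypothetical):

#include "llvm/IR/Instructions.h"
#include <string>

// Parse a load whose address comes from a `getelementptr`: the GEP's pointer
// operand names the RAM, and operand 1 is the index for the one-index GEPs
// produced by the loops above.
void ParseLoad(llvm::LoadInst* load) {
    if (auto* gep = llvm::dyn_cast<llvm::GetElementPtrInst>(load->getPointerOperand())) {
        auto ram_name = std::string(gep->getPointerOperand()->getName());
        auto* offset = gep->getOperand(1);
        (void)ram_name;  // in the real tool these feed the FSMD
        (void)offset;
    }
}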
It's worth noting that for HLS, when we encounter a getelementptr instruction, we needn't take any scheduling action; the only things we care about are the base pointer and offset arguments. We simply take these values and provide them to the subsequent load or store instruction.