Hello, Spatial!

In this section, you will learn about the following components in Spatial:

Application skeleton (import statements, application creation, accel scope, host scope)
DRAM
SRAM
ArgIn
ArgOut
HostIO
Reg
Typing system
Data transfer between host and accel
Basic debugging hooks

Note that a large collection of Spatial applications can be found here.

Overview

In this section, you will see how to put together the bare-minimum Spatial application. While the code does not do any “meaningful” work, it demonstrates the basic primitives that almost all applications have and is intended to be the “Hello, world!” program for hardware. You will start by generating input and output registers to get the accelerator and host to interact with each other, and then add tile transfers between the off-chip DRAM and on-chip SRAM. You will then learn what functions are provided to test functionality and utilize the host. Finally, you will learn the basic compilation flows for testing the functionality of the algorithm, cycle-accurate simulation of the generated RTL, and bitstream generation to deploy to a supported FPGA or architecture.

On the right is a visualization of what we will be doing in this tutorial. We start with a host and an FPGA, both connected to DRAM. We will then instantiate each of the ways the host and the FPGA can interact with each other. We will generate RTL that sits inside the FPGA, as well as C++ code that runs on the host. Spatial automatically instantiates a box called “Fringe,” which is an FPGA-agnostic hardware design that allows the RTL to interact with peripherals, DRAM, PCIe buses, and whatever else is available on a given SoC or FPGA board.

Application template

import spatial.dsl._

@spatial object HelloSpatial extends SpatialApp {
  def main(args: Array[String]): Void = {
    // Host Code
    Accel {
      // Acceleratable Code
    }
    // More Host Code
  }
}

All Spatial programs have a few basic components. The code example above shows each of those components for an application called HelloSpatial. This is a complete app and you are free to go ahead and try to compile it. It won’t do much of anything, but we won’t prevent you from using it anyway.

To compile the app for a particular target, see the Targets page.

ArgIn/ArgOut interfaces

import spatial.dsl._

@spatial object HelloSpatial extends SpatialApp {
  def main(args: Array[String]): Void = {
    // Create Args
    val x = ArgIn[Int]
    val y = ArgOut[Int]
    val z = HostIO[Int]
    
    // Set `x` to the value of the first command line argument
    setArg(x, args(0).to[Int])
    // Set `z` to the value of the second command line argument
    setArg(z, args(1).to[Int])
    
    Accel {
      // Set `y` to the sum of `x` and `z`
      y := x + z
    }
    // Report the answer
    println(r"Result is ${getArg(y)}")
  }
}

We will now continue developing our Spatial app above and add ArgIns, ArgOuts, and HostIOs.

The most basic way to get data in and out of the FPGA is to pass individual arguments between the Accel and the host. Some use-cases for ArgIns include passing application parameters to the Accel, such as a damping factor in PageRank or data structure dimensions in an algorithm like GEMM, and use-cases for ArgOuts include cases like reporting the scalar result of Dot Product. We define a few of these registers above the Accel block so that the CPU can allocate them.

By this point, you have probably noticed that we keep specifying everything as an Int in square brackets. These square brackets are how Scala passes along type arguments. Spatial is a hardware language that supports types beyond the standard 32-bit integer, and you can define your own as needed. Some examples include:

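// Unsigned fixed-point type with 16 integer bits and 0 fractional bits
type T = FixPt[FALSE,_16,_0]
// Alias for the floating-point type
type Flt = Float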
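
To make the type parameters concrete, the following plain-Scala sketch (it does not use Spatial) computes the value range and resolution implied by a fixed-point format. It assumes the three parameters of FixPt are, in order, signedness, integer bits, and fractional bits, and that for signed formats the integer-bit count includes the sign bit. FixFormat and Format are hypothetical names introduced only for this illustration.

```scala
// Plain-Scala model of a fixed-point format; not part of the Spatial API.
// Assumption: for signed formats, intBits includes the sign bit.
object FixFormat {
  final case class Format(signed: Boolean, intBits: Int, fracBits: Int) {
    // Total storage width in bits
    def width: Int = intBits + fracBits
    // Smallest step between two representable values
    def resolution: Double = math.pow(2.0, -fracBits)
    // Smallest representable value
    def min: Double = if (signed) -math.pow(2.0, intBits - 1) else 0.0
    // Largest representable value
    def max: Double =
      (if (signed) math.pow(2.0, intBits - 1) else math.pow(2.0, intBits)) - resolution
  }

  def main(args: Array[String]): Unit = {
    // Models the format written as FixPt[FALSE,_16,_0]
    val t = Format(signed = false, intBits = 16, fracBits = 0)
    println(s"width=${t.width} min=${t.min} max=${t.max} step=${t.resolution}")
  }
}
```

Under these assumptions, a format with 16 integer bits and 16 fractional bits would occupy 32 bits and resolve steps of 2^-16.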

Notice that in the println, the opening quote is preceded by an “r”. This substitutes each piece of text of the form ${value} with its actual value when the program runs.

Reg, SRAM, and basic control

import spatial.dsl._

@spatial object HelloSpatial extends SpatialApp {
  def main(args: Array[String]): Void = {
    // Create ArgIn
    val x = ArgIn[Int]
    
    // Set `x` to the value of the first command line argument
    setArg(x, args(0).to[Int])
    
    Accel {
      // Create 16x32 SRAM and a Register
      val s = SRAM[Int](16,32)
      val r = Reg[Int]
      
      // Loop over each element in SRAM
      Foreach(16 by 1, 32 by 1){(i,j) => 
        s(i,j) = i + j
      }
      // Store element into the register, based on the input arg
      r := s(x,x)

      // Print value of register (only shows in Scala simulation)
      println(r"Value of SRAM at (${x.value},${x.value}) is ${r.value}")
    }

  }
}

Next, we will demonstrate how to add memory structures to the Accel. For a complete list of memories, please reference the documentation. In this example, we will only show Reg and SRAM.

In the snippet above, we keep the arg x. We will use this to index into a memory inside the Accel. We create the memories s and r. In Spatial, SRAMs can be created with up to five dimensions, while Regs are single-element data structures.

We create a Foreach loop to access each element of s. The first set of arguments for the Foreach specifies the counter bounds for the loop. In this example, we indicate that the first index should run from 0 up to, but not including, 16, and the second from 0 up to, but not including, 32. Most generally, loop ranges can be specified with:

<start> until <stop> by <step> par <par>

In the example, 16 by 1 is equivalent to 0 until 16 by 1 par 1. The lower bound, upper bound, and step size permit negative values, while the par factor must be a positive integer. Later tutorials will discuss more complicated usage of control structures and counters. For now, we will not parallelize or nest any loop.
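
Plain Scala ranges use the same until and by keywords, so the indices a counter visits can be previewed on the host without Spatial. The sketch below is ordinary Scala, not Spatial code. It assumes Spatial counters follow the same half-open convention (start included, stop excluded), as the equivalence with 0 until 16 suggests. The par factor has no plain-Scala counterpart and is omitted. CounterPreview and indices are hypothetical names.

```scala
// Plain-Scala preview of counter indices; not part of the Spatial API.
object CounterPreview {
  // Indices visited by `start until stop by step` (stop is exclusive)
  def indices(start: Int, stop: Int, step: Int): List[Int] =
    (start until stop by step).toList

  def main(args: Array[String]): Unit = {
    println(indices(0, 16, 1))   // the 16 indices 0 through 15
    println(indices(0, 16, 4))   // every fourth index: 0, 4, 8, 12
    println(indices(16, 0, -4))  // negative step counts down: 16, 12, 8, 4
  }
}
```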

The second argument to the Foreach loop binds the counter values to variables, in this case i and j. It then specifies the action to take for each iteration.

Note that to write to an SRAM, you specify the address and use =. To write to a Reg, you use :=. To read from an SRAM, you can directly pass the address, and to read from a Reg, you can use .value.
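
To check what the Accel block should produce, the same computation can be modeled in plain Scala on the host. This is only a software reference model, assuming s(i,j) holds i + j once the loop finishes; it says nothing about hardware timing. SramModel and expected are hypothetical names. Because x indexes both dimensions of a 16x32 memory, the model requires it to be below 16.

```scala
// Plain-Scala reference model of the Reg/SRAM example; not Spatial code.
object SramModel {
  // Reference value of r after `r := s(x,x)`, where s(i,j) = i + j
  def expected(x: Int): Int = {
    require(x >= 0 && x < 16, "x must index both dimensions of the 16x32 SRAM")
    // Array.tabulate fills a 16x32 array with i + j, mirroring the Foreach loop
    val s = Array.tabulate(16, 32)((i, j) => i + j)
    s(x)(x)
  }

  def main(args: Array[String]): Unit = {
    val x = 3
    println(s"Value of SRAM at ($x,$x) is ${expected(x)}")
  }
}
```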

DRAMs and transfers

import spatial.dsl._

@spatial object HelloSpatial extends SpatialApp {
  def main(args: Array[String]): Void = {
    // Create DRAM (malloc)
    val d = DRAM[Int](16)
    
    // Set DRAM (memcpy)
    val data = Array.fill[Int](16)(0)
    setMem(d, data)
    
    Accel {
      // Create 16-element SRAM
      val s = SRAM[Int](16)
      
      // Transfer data from d to s
      s load d(0::16)
    
      // Add number to each element
      Foreach(16 by 1){i => s(i) = s(i) + i}

      // Transfer data back to d
      d(0::16) store s
    }
    
    // Print contents in memory
    printArray(getMem(d), "Result: ")

  }
}

In most cases, we will want to use Spatial to write applications that use the FPGA as an application accelerator that offloads computation from the CPU. This means there will be large data structures that we will want to share between the two. To do this, we introduce DRAMs, which are data structures that are allocated by the CPU and can be read from and written to by the FPGA.

In the example above, we create a DRAM, set the contents of the DRAM in the host, tell the accelerator to modify the contents, and then read them back out in the host. Creating a DRAM is effectively a malloc, where the host allocates a region of memory and passes the pointer for this region to the FPGA. We then create a data structure on the host using Array and fill each element of this array with the value 0. See the documentation for more complicated ways of creating Arrays, Matrices, and Tensors. We set the data in the DRAM with the contents of our Array, which is effectively a memcpy to the region of memory.

With the DRAM configured and loaded, we can create a local memory in the accelerator, s, and transfer the contents of d to this SRAM. Note that data cannot be manipulated directly in DRAM, and transfers from DRAM to SRAM occur at burst granularity. For a device like the ZC706 SoC, a single burst from DRAM provides 512 bits. If the total bitwidth of the elements you load is not a multiple of the burst size, you will sacrifice some bandwidth to data that the accelerator ignores.
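
The bandwidth cost can be estimated with a little arithmetic. The plain-Scala sketch below assumes the 512-bit burst quoted for the ZC706 and a transfer that starts on a burst boundary. BurstMath and its methods are hypothetical helpers, not Spatial functions.

```scala
// Plain-Scala burst arithmetic; not part of the Spatial API.
object BurstMath {
  val BurstBits = 512 // burst size quoted for the ZC706

  // Number of whole bursts needed to move `elements` values of `bitsPerElement` bits
  def bursts(elements: Int, bitsPerElement: Int): Int = {
    val totalBits = elements.toLong * bitsPerElement
    ((totalBits + BurstBits - 1) / BurstBits).toInt
  }

  // Bits fetched from DRAM but ignored by the accelerator
  def wastedBits(elements: Int, bitsPerElement: Int): Long =
    bursts(elements, bitsPerElement).toLong * BurstBits - elements.toLong * bitsPerElement

  def main(args: Array[String]): Unit = {
    println(bursts(16, 32))     // sixteen 32-bit values fill exactly one burst
    println(wastedBits(16, 32)) // so no bits are wasted
    println(wastedBits(20, 32)) // twenty values need two bursts, the second partly unused
  }
}
```

With 32-bit elements, transfer lengths that are multiples of 16 elements use every bit of each burst under these assumptions.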

In this example, we load and store from d using (0::16). This means elements from address 0 until address 16 will be transferred sequentially. More generally, it is possible to add parallelization to this with the syntax (<start>::<stop> by <step> par <par>). Having a parallelization factor allows the transfer hardware to place more elements on the bus simultaneously, and speeds up the transaction at the expense of extra hardware.

We modify each element in s using a one-dimensional Foreach, and then write the data back to the original DRAM. Outside of the accelerator, we can use printArray to have the host print out the contents of the memory.
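
A plain-Scala reference model of this example makes the expected printout easy to check. It assumes the load, the Foreach, and the store behave as described (copy in, add the index to each element, copy out). DramModel and run are hypothetical names, and no Spatial API is used.

```scala
// Plain-Scala reference model of the DRAM example; not Spatial code.
object DramModel {
  // d starts as 16 zeros; the modeled accelerator adds i to element i
  def run(): Array[Int] = {
    val d = Array.fill[Int](16)(0)          // contents written by setMem
    val s = d.slice(0, 16)                  // models: s load d(0::16)
    for (i <- 0 until 16) s(i) = s(i) + i   // models: Foreach(16 by 1){i => s(i) = s(i) + i}
    Array.copy(s, 0, d, 0, 16)              // models: d(0::16) store s
    d
  }

  def main(args: Array[String]): Unit =
    println("Result: " + run().mkString(" "))
}
```

Since every element starts at 0, the model ends with element i equal to i.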

Like SRAMs, DRAMs can have up to five dimensions.
