Parallel Programming on FPGAs

One of the hardware resources that has attracted growing interest in recent years is the FPGA (Field-Programmable Gate Array). Figure 1 shows the structure of one of these devices. FPGAs contain four types of components:

1. Programmable Logic, consisting of:

  • Look-Up Tables (LUTs), implementing truth tables (see the sketch after this list)
  • Memory elements, including Flip-flops/Latches
  • Gates and carry and control logic to implement arithmetic functions
  • Digital Signal Processors (DSPs)

2. Programmable input/output cells

3. Programmable interconnect

4. Block RAM (local memory)
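
To make the LUT concept concrete, the small C model below mimics what a 3-input LUT does: an 8-bit configuration word acts as the truth table, and the inputs select which bit of it to output. This is only an illustrative software model, not how FPGAs are actually programmed, and the function name lut3 is ours:

    #include <stdint.h>

    /* Software model of a 3-input LUT: "config" holds the 8 entries of
     * the truth table; the inputs a, b, c (each 0 or 1) select one of
     * them. In a real FPGA, the bitstream loads "config" into the
     * LUT's memory cells. */
    static int lut3(uint8_t config, int a, int b, int c)
    {
        unsigned index = (unsigned)((a << 2) | (b << 1) | c);  /* 0..7 */
        return (config >> index) & 1;
    }

    /* Example: config = 0x80 (binary 10000000) implements a AND b AND c,
     * since only entry 7 (a = b = c = 1) of the truth table is 1. */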

Figure 1: Schematics of an FPGA device

These basic components can be configured (activated and interconnected) in a programmable way, offering a nearly unlimited set of alternatives for implementing algorithms. For example, to implement matrix multiplication, one would connect a set of DSPs performing multiplications and additions on data obtained from Block RAMs, and store the results in a different set of Block RAMs. In addition, a specific interconnect must be set up to bring data from outside the device into the Block RAMs, and later to move the final result of the multiplication out of the FPGA. Such a configuration is described in a vendor-specific configuration file, usually called a bitstream. The functional components synthesized from a bitstream are known as IP (Intellectual Property) blocks.

FPGA vendors (Intel, Xilinx, etc.) provide the compiler tools to configure the FPGA hardware properly, that is, to generate the bitstreams. These tools are highly complex and usually proprietary, so there are no open-source options to configure the major vendors' FPGA devices. Their task is to transform a (relatively) high-level description of the hardware into the proper configuration of the device. Traditionally, FPGAs have been programmed with VHDL/Verilog hardware descriptions, in the same way as custom chips or microprocessors are described. Other hardware description languages include Bluespec [1] and Chisel [2].

More recently, high-level programming has also become available for FPGA devices. Vendors provide C/C++ and OpenCL compilers, along with the associated firmware frameworks supporting them on the FPGA. Such firmware consists of the basic logic/IO/interconnect/BRAM configuration needed to support data transfers to and from the device, and the invocation of the IP blocks instantiated on the FPGA. These high-level compiler tools for C/C++ and OpenCL generate VHDL/Verilog projects that are later used to generate the bitstream that configures the FPGA.
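
As an illustration, the kind of C code these compilers accept looks like the sketch below, written in the same HLS pragma style used in Figure 3. It is a generic example of ours, not the input or output of any specific vendor flow:

    #define N 1024

    /* Simple vector addition written for a high-level synthesis tool.
     * The pragma asks the compiler to pipeline the loop so that a new
     * iteration starts every clock cycle (initiation interval II=1).
     * The tool translates this C into VHDL/Verilog, from which the
     * bitstream is finally generated. */
    void vadd(const int a[N], const int b[N], int out[N])
    {
        for (int i = 0; i < N; i++) {
    #pragma HLS pipeline II=1
            out[i] = a[i] + b[i];
        }
    }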

In the context of the EPEEC project, we work towards implementing support for tasking within accelerators and increasing programmers' productivity on complex heterogeneous environments. To this end, we propose using higher-level programming models to target FPGAs. We use OmpSs [3] to generate hardware descriptions and configurations for FPGAs. Figure 2 shows the OmpSs compiler toolchain that we are developing. We call this version of OmpSs OmpSs@FPGA [4].

Figure 2: OmpSs toolchain for SMP and FPGA parallelism exploitation

Code annotated with the target FPGA directive is outlined to separate files, which are then compiled with the autoVivado tool, along with a few wrapper functions and IP cores, to generate the bitstream. The rest of the code is compiled in the usual way to execute on the host processors, going through the traditional GCC compilation.
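
Conceptually, the outlining step splits the annotated source into an FPGA-side kernel and a host-side stub that invokes it through the runtime. The sketch below illustrates only the idea; fpga_submit_task and FPGA_MATMUL_BLOCK_ID are hypothetical names we use for illustration, not the actual interface generated by our toolchain:

    #include <stddef.h>

    typedef float elem_t;          /* element type, as in Figure 3 */
    #define BSIZE 128              /* block size, as in Figure 3   */

    /* Hypothetical runtime entry point; the real interface differs. */
    extern void fpga_submit_task(int ip_id,
                                 const void *in1, size_t n1,
                                 const void *in2, size_t n2,
                                 void *inout,     size_t n3);
    #define FPGA_MATMUL_BLOCK_ID 1 /* hypothetical IP identifier */

    /* Host-side stub that conceptually replaces a call to matmulBlock():
     * it hands the block pointers and sizes to the runtime, which does
     * the BRAM transfers and starts one of the IP instances. */
    void matmulBlock_stub(const elem_t *a, const elem_t *b, elem_t *c)
    {
        fpga_submit_task(FPGA_MATMUL_BLOCK_ID,
                         a, BSIZE * BSIZE * sizeof(elem_t),  /* in    */
                         b, BSIZE * BSIZE * sizeof(elem_t),  /* in    */
                         c, BSIZE * BSIZE * sizeof(elem_t)); /* inout */
    }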

Figure 3 shows the matrix multiplication benchmark, annotated with OmpSs@FPGA directives. In the first task (function matmulBlock, lines 1-4), the directives indicate that three instances of the IP block implementing this function should be generated, and specify which variables (a, b, and c) should be transferred to the FPGA, together with their directionality and sizes. Since the system incorporates three independent instances of the matrix multiplication IP core, up to three matrix blocks can be computed in parallel.

In the second task (function matmul, lines 14-18), the directives indicate that the task-spawning functionality should also be generated in the FPGA. This allows our system to implement part of the runtime functionality on the FPGA, greatly reducing the overhead of task creation and management, especially when dealing with fine-grained tasks.

Figure 4 shows the evaluation of this matrix multiplication benchmark on the AXIOM board, an FPGA board developed in the AXIOM project [5]. It shows the performance obtained with 2048x2048 matrices of single-precision floating-point elements and block sizes of 64x64, 128x128, and 256x256 elements. The first (blue) bar shows the performance obtained when task creation (function matmul, lines 14-29 of Figure 3) is annotated to target the host cores ("#pragma omp target device(smp)..."). In this case, the executions using block sizes of 64x64 and 128x128 elements suffer the overhead of task creation and management on the host cores, reaching 38 and 62 GFlop/s, respectively. When task creation and management are moved to the FPGA ("#pragma omp target device(fpga)..."), the performance in these two cases increases to 80 and 81 GFlop/s.

 1.  #pragma omp target device(fpga) num_instances(3) copy_deps
 2.  #pragma omp task in([BSIZE*BSIZE]a, [BSIZE*BSIZE]b) \
 3.                   inout([BSIZE*BSIZE]c)
 4.  void matmulBlock(const elem_t *a, const elem_t *b, elem_t *c)
 5.  {
 6.    for (int k = 0; k < BSIZE; k++) {
 7.      for (int i = 0; i < BSIZE; i++) {
 8.  #pragma HLS pipeline II=MBLOCK_II
 9.        for (int j = 0; j < BSIZE; j++)
10.          c[i*BSIZE + j] += a[i*BSIZE + k] * b[k*BSIZE + j];
11.      }
12.    }
13.  }

14.  #pragma omp target device(fpga) copy_deps
15.  #pragma omp task in([msize*msize]a, [msize*msize]b) \
16.                   inout([msize*msize]c)
17.  void matmul(const elem_t *a, const elem_t *b,
18.              elem_t *c, const unsigned int msize)
19.  {
20.    ...
21.    for (int i = 0; i < msize/BSIZE; i++) {
22.      for (int j = 0; j < msize/BSIZE; j++) {
23.        for (int k = 0; k < msize/BSIZE; k++) {
24.          matmulBlock(&a[(i*msize/BSIZE + k)*BSIZE*BSIZE],
                        &b[(k*msize/BSIZE + j)*BSIZE*BSIZE],
                        &c[(i*msize/BSIZE + j)*BSIZE*BSIZE]);
25.        }
26.      }
27.    }
28.  #pragma omp taskwait
29.  }


Figure 3: Matrix multiplication benchmark implemented with OmpSs@FPGA
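
For completeness, a minimal host-side driver for the code in Figure 3 might look as follows. Note one assumption we make explicit here: the matrices use a tiled (block-contiguous) layout, so that each BSIZE x BSIZE block is stored consecutively, matching the block pointers computed on line 24 of Figure 3:

    #include <stdlib.h>
    #include <string.h>

    typedef float elem_t;   /* single precision, as in the evaluation */
    #define BSIZE 128       /* one of the block sizes used in Figure 4 */
    #define MSIZE 2048      /* matrix dimension used in Figure 4 */

    /* matmul() as declared in Figure 3 */
    void matmul(const elem_t *a, const elem_t *b,
                elem_t *c, const unsigned int msize);

    int main(void)
    {
        /* Matrices in a tiled (block-contiguous) layout: block (i, j)
         * starts at offset (i * MSIZE/BSIZE + j) * BSIZE * BSIZE. */
        elem_t *a = malloc(MSIZE * MSIZE * sizeof(elem_t));
        elem_t *b = malloc(MSIZE * MSIZE * sizeof(elem_t));
        elem_t *c = malloc(MSIZE * MSIZE * sizeof(elem_t));

        /* ... fill a and b with input data ... */
        memset(c, 0, MSIZE * MSIZE * sizeof(elem_t));

        matmul(a, b, c, MSIZE);  /* spawns the matmulBlock tasks; the
                                    taskwait inside matmul() waits for
                                    all blocks to complete */

        free(a); free(b); free(c);
        return 0;
    }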


For larger block sizes, the overhead of task creation and management on the host side is small in proportion to the work done, and the two approaches provide the same performance: 150 GFlop/s.
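
Summarizing the numbers reported above:

    Block size   Tasks created on host (smp)   Tasks created on FPGA
    64x64                 38 GFlop/s                 80 GFlop/s
    128x128               62 GFlop/s                 81 GFlop/s
    256x256              150 GFlop/s                150 GFlop/s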

References:
[1] R. Nikhil, "Bluespec System Verilog: efficient, correct RTL from high level specifications", Proceedings of the 2nd ACM and IEEE International Conference on Formal Methods and Models for Co-Design (MEMOCODE '04), San Diego, CA, 23-25 June 2004, DOI: 10.1109/MEMCOD.2004.1459818.
[2] M. Schoeberl, “Digital Design with Chisel”, Kindle Direct Publishing, 2019, https://github.com/schoeberl/chisel-book.
[3] A. Duran, E. Ayguadé, R.M. Badia, J. Labarta, L. Martinell, X. Martorell, J. Planas, “OmpSs: a Proposal for Programming Heterogeneous Multi-Core Architectures”, Parallel Processing Letters 21(2), pp. 173-193, 2011.
[4] J. Bosch, X. Tan, A. Filgueras, M. Vidal, M. Mateu, D. Jiménez-González, C. Álvarez, X. Martorell, E. Ayguadé, and J. Labarta, “Application Acceleration on FPGAs with OmpSs@FPGA”, Proc. of the 2018 Int. Conf. on Field-Programmable Technology (FPT’18), pp. 70-77, Naha, Japan, December 2018.
[5] AXIOM project, 2015-2018, Horizon 2020 research and innovation programme under grant agreement No 645496, http://www.axiom-project.eu/.