
Adapting a HPC runtime system to FPGAs

Authors
  • Christodoulis, Georgios
Publication Date
Dec 05, 2019
Source
HAL-Descartes
Language
English
License
Unknown

Abstract

Along with traditional CPU cores, the HPC community has employed processing units of different architectures to obtain improved efficiency and performance. A Field-Programmable Gate Array (FPGA) is a hardware fabric composed of interconnected, reprogrammable logic and memory blocks. This type of processing unit is a promising candidate for amplifying the computational power of heterogeneous HPC platforms: because few abstraction layers separate the programming level from the actual hardware, FPGAs can deliver the efficiency and performance gains mentioned above. However, exploiting them requires in-depth knowledge of low-level hardware design and considerable expertise with vendor-provided tools, which is not aligned with the expertise of HPC application programmers. In the scope of this thesis, we have designed a framework that allows straightforward development of scientific applications on heterogeneous platforms enhanced with FPGAs. The work targets a programming environment that requires minimal knowledge of the underlying architecture, in which an FPGA can be used in the same way as any other accelerator.
At the core of the environment is the StarPU heterogeneous runtime system, which we extended to support FPGAs. It hides from the programmer the complex operations that stem from the underlying architecture, while still allowing fine control of performance through different scheduling strategies. For communication with the FPGA device, we created Conor, a communication library based on RIFFA that keeps the accelerator in a consistent state when several software threads interact with it concurrently.

We evaluate our approach along two dimensions: the programmability of the framework, and the performance overhead imposed by the additional components attached to the FPGA. Programmability was evaluated using a basic blocked version of matrix multiplication, which is also used to demonstrate that our development did not impose any additional overhead on the rest of the platform. Building on this first matrix multiplication example, we created an efficient hardware design of GEMM, which will allow the execution of more complex and interesting applications such as the Cholesky decomposition.
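For readers unfamiliar with the blocked formulation of matrix multiplication mentioned above, the following is a minimal CPU-side sketch in C of computing C += A·B tile by tile. It is an illustration of the general technique only, not the thesis's FPGA kernel, StarPU codelet, or Conor code: the tile size `BS`, the function names `block_kernel` and `blocked_matmul`, and the square row-major layout are all hypothetical choices. In a task-based runtime such as StarPU, each call of the per-tile kernel could plausibly become one task that the scheduler places on a CPU core or an accelerator.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile size; a real design would tune it to cache
 * or on-chip memory capacity. */
#define BS 4

/* Multiply one tile: C[i0.., j0..] += A[i0.., k0..] * B[k0.., j0..].
 * Tiles at the matrix edge may be smaller than BS x BS.
 * All matrices are n-by-n and stored row-major. */
static void block_kernel(size_t n, const double *A, const double *B,
                         double *C, size_t i0, size_t j0, size_t k0)
{
    size_t imax = (i0 + BS < n) ? i0 + BS : n;
    size_t jmax = (j0 + BS < n) ? j0 + BS : n;
    size_t kmax = (k0 + BS < n) ? k0 + BS : n;

    /* i-k-j order: the innermost loop walks contiguous rows of B and C. */
    for (size_t i = i0; i < imax; i++)
        for (size_t k = k0; k < kmax; k++) {
            double a = A[i * n + k];
            for (size_t j = j0; j < jmax; j++)
                C[i * n + j] += a * B[k * n + j];
        }
}

/* Blocked matrix multiplication: C += A * B, one tile product at a time. */
void blocked_matmul(size_t n, const double *A, const double *B, double *C)
{
    assert(n == 0 || (A != NULL && B != NULL && C != NULL));
    for (size_t i0 = 0; i0 < n; i0 += BS)
        for (size_t j0 = 0; j0 < n; j0 += BS)
            for (size_t k0 = 0; k0 < n; k0 += BS)
                block_kernel(n, A, B, C, i0, j0, k0);
}
```

Each (i0, j0, k0) tile product is independent of other tiles sharing no output block, which is what makes this decomposition a natural unit of work for a heterogeneous scheduler; tiles writing the same C block still have to be serialized or reduced.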
