Efficient, Scalable and High-Throughput Runtime Reconfigurable Arrays for Accelerator as a Service

Authors
  • Nagi, Sumeet Singh
Publication Date
Jan 01, 2022
Source
eScholarship - University of California
Language
English
License
Unknown
Abstract

Advancements in silicon processing are responsible for the exponential growth in computing performance and algorithmic development. With the end of Dennard scaling, conventional computing architectures, such as CPUs, are unable to keep up with the increasing computation requirements of modern algorithms. Hardware accelerators are designed for each such computation-heavy algorithm and incorporated into the system; a modern System-On-Chip (SoC) for phones can have up to 30 different accelerators. Modern high-compute applications such as 5G, machine learning, and autonomous vehicles require accelerators to keep up with their rapidly evolving standards and computation needs. However, with rising design costs at newer technology nodes, the iterative development of inflexible accelerators becomes prohibitively expensive. Reconfigurable architectures, with their ability to adapt to rapidly evolving standards and to accommodate several such high-performance applications in the system, provide an ideal solution. The motivation of this dissertation is to develop such a Coarse Grain Reconfigurable Architecture, called Universal Digital Signal Processor (UDSP), which could replace accelerator blocks in an SoC, and to develop a hardware management system that enables concurrent multiprogram functionality in reconfigurable architectures. UDSP consists of 196 Compute Elements (CEs) and a statistics-based, scalable, delayless, high-speed routing network. It is developed using an algorithm-driven framework to allow for faster development of each successive revision of the design. The tileable and scalable nature of UDSP allowed us to put together 4 UDSP dies on a 10µm fine-pitch interposer, Silicon Interconnect Fabric, as a 2×2 UDSP Multi-Chip Module (MCM), quadrupling the number of compute resources.
The UDSP 2×2 assembly has a peak throughput of 3,450 Giga-Operations per second (GOPs), or 1,725 Giga-Multiply Accumulates per second (GMACs), at a 1.1GHz clock frequency while consuming 6W of power, including 0.38pJ/bit to transfer data across dies, in TSMC 16nm. It achieves a peak efficiency of 785GMACs/J (0.42V, 315MHz). UDSP lies within a 4.2× energy-efficiency and 6.4× area-efficiency gap relative to ASICs at nominal operating conditions (0.8V, 1.1GHz).
Multiprogram tenancy on conventional reconfigurable arrays requires high manual effort from the programmer to foresee and account for runtime program dynamics during compilation. The inability to predict runtime and multiprogram dynamics places the recompilation time of programs in the critical timing path, leading to long reconfiguration times, poor active resource utilization, and low acceleration performance. We developed an active hardware resource management system for reconfigurable arrays that automatically accounts for multiprogram dynamics at runtime, eases the workload of the programmer, and improves the array’s performance. These hardware management techniques enable dynamic runtime relocation of programs on the Runtime Reconfigurable Array (RTRA) with minimal reconfiguration latency overhead, which allows the array to offer Accelerator as a Service (ACAS). The ACAS architecture virtualizes the array by spatially and temporally scheduling multiple programs on its available resources, thus achieving higher active utilization for the mapped programs on the array. ACAS allows developers to compile programs for acceleration on a reconfigurable array without requiring additional manual steps for runtime resource planning at compile time. Under high program pressure, ACAS exceeds 90% active utilization of the array.
For signal processing workloads, our simulated 9×12 RTRA uses 3× less area and delivers 3.2–4.3× more throughput than an 18×18 UDSP, and the 18×18 RTRA delivers 8–14× more throughput than its equivalent-sized 18×18 UDSP counterpart.
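
The headline figures in the abstract can be cross-checked with simple arithmetic. The sketch below is a back-of-envelope check only, not a model from the dissertation. It assumes that the quoted 6W is the total power at peak throughput, that array area scales with the number of elements, and that 9×12 and 18×18 are element counts. All variable and function names are hypothetical.

```python
# Back-of-envelope check of figures quoted in the abstract.
# Assumptions (not stated in the source): the 6 W figure is total power at
# peak throughput, and array area scales linearly with element count.

CES_PER_DIE = 196      # Compute Elements per UDSP die (from the abstract)
DIES = 4               # 2x2 Multi-Chip Module
CLOCK_GHZ = 1.1        # nominal clock frequency
PEAK_GOPS = 3450.0     # peak Giga-Operations per second
PEAK_GMACS = 1725.0    # peak Giga-Multiply-Accumulates per second
POWER_W = 6.0          # power at peak, as quoted


def ops_per_ce_per_cycle(peak_gops: float, clock_ghz: float, n_ces: int) -> float:
    """Operations each CE must complete per clock cycle to reach peak_gops."""
    return peak_gops / (clock_ghz * n_ces)


def element_count_ratio(big: tuple, small: tuple) -> float:
    """Ratio of element counts between two rectangular arrays (rows, cols)."""
    return (big[0] * big[1]) / (small[0] * small[1])


total_ces = CES_PER_DIE * DIES
ops_per_mac = PEAK_GOPS / PEAK_GMACS
per_ce = ops_per_ce_per_cycle(PEAK_GOPS, CLOCK_GHZ, total_ces)
# (GMAC/s) / (J/s) = GMAC/J
nominal_gmacs_per_joule = PEAK_GMACS / POWER_W
area_ratio = element_count_ratio((18, 18), (9, 12))

print(f"total CEs in the 2x2 module: {total_ces}")
print(f"operations per MAC:          {ops_per_mac:.1f}")
print(f"ops per CE per cycle:        {per_ce:.3f}")
print(f"GMACs/J at 1.1GHz and 6W:    {nominal_gmacs_per_joule:.1f}")
print(f"18x18 vs 9x12 element ratio: {area_ratio:.1f}")
```

Under these assumptions the numbers are mutually consistent: 784 CEs at roughly 4 operations (2 MACs) per CE per cycle reproduce the quoted peak throughput, and the 3× area figure matches the ratio of element counts. The derived 287.5 GMACs/J at the 1.1GHz operating point is lower than the quoted 785GMACs/J peak efficiency because that peak is measured at 0.42V and 315MHz.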