Abstract: Graphics Processing Units (GPUs) have emerged as the predominant hardware platform for massively parallel computing. However, their inherent von Neumann architecture still suffers from performance inefficiencies that stem from sequential instruction execution and frequent data transfers within the memory system. These architectural limitations impose heavy overheads in latency, area, and energy, rendering GPUs suboptimal for edge computing applications. To tackle these challenges, this paper introduces a novel circular Reconfigurable Parallel Processor (RPP) that enables massively parallel applications in edge computing with high efficiency. RPP features a circular array of reconfigurable compute engines, enabling efficient streaming dataflow processing. In contrast to traditional Coarse-Grained Reconfigurable Architectures (CGRAs), the circular network topology of RPP is formed by linear switch networks with an innovative gasket memory, which reduces complicated network routing overheads while allowing versatile datapath mapping and optimized data reuse. A dedicated hierarchical memory system is proposed to support different memory access patterns and address mapping strategies, enabling flexible data access with high memory efficiency. Several hardware optimizations are further introduced to improve hardware utilization and performance, such as concurrent kernel execution, register split and refill, and heterogeneous scalar and vector computing. To fully utilize the hardware capability of RPP, we develop an end-to-end software stack consisting of a compiler, a runtime environment, and a set of RPP libraries. This software stack is designed to be compatible with the GPGPU computing paradigm, enhancing its potential for broader adoption. Fabricated in a 14 nm process, RPP occupies an area of 119 mm² and operates at a maximum power of 15 W with a 1 GHz clock frequency. Runtime measurements across various workloads show that RPP achieves up to 27.5× higher energy efficiency than Nvidia edge GPUs in deep learning inference and up to 14062× lower latency than an AMD Ryzen 5 CPU in linear algebra operations.
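To put the reported physical figures in context, the short sketch below derives two back-of-the-envelope quantities from the numbers stated in the abstract (119 mm², 15 W maximum power, 1 GHz clock). The function names are illustrative and not taken from the paper, and both results are upper bounds at maximum power rather than measured per-workload values.

```python
# Figures as reported in the abstract.
AREA_MM2 = 119.0      # die area in mm^2
MAX_POWER_W = 15.0    # maximum power in watts
CLOCK_HZ = 1e9        # 1 GHz clock frequency


def power_density_w_per_mm2(power_w: float, area_mm2: float) -> float:
    """Peak power density: maximum power divided by die area."""
    return power_w / area_mm2


def energy_per_cycle_nj(power_w: float, clock_hz: float) -> float:
    """Upper bound on chip-level energy per clock cycle, in nanojoules."""
    return power_w / clock_hz * 1e9


if __name__ == "__main__":
    density = power_density_w_per_mm2(MAX_POWER_W, AREA_MM2)
    energy = energy_per_cycle_nj(MAX_POWER_W, CLOCK_HZ)
    print(f"Peak power density: {density:.3f} W/mm^2")
    print(f"Energy per cycle at max power: {energy:.1f} nJ")
```

These values describe the whole chip at its power ceiling; the 27.5× and 14062× comparisons in the abstract come from the authors' runtime measurements and cannot be reproduced from these three numbers alone.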

Circular Reconfigurable Parallel Processor for Edge Computing: Industrial Product