Abstract: Deep convolutional neural networks (DNNs) are widely used in many applications, particularly in machine vision. Accelerating DNNs on embedded systems is challenging because real-world machine vision applications must reserve much of the external memory bandwidth for other tasks, such as video capture and display, which leaves little bandwidth for DNN acceleration. To address this issue, we propose a high-throughput accelerator for bandwidth-limited systems, called the reconfigurable tiny neural network accelerator (ReTiNNA), and present a real-time object detection system for high-resolution video. We first present a dedicated computation engine that applies a different data mapping method to each filter type to improve data reuse and reduce hardware resource usage. We then propose an adaptive layer-wise tiling strategy that tiles the feature maps into strips, which dramatically reduces the control complexity of data transmission and improves its efficiency. Finally, we present a design space exploration (DSE) approach that explores the design space more accurately when bandwidth is insufficient, improving the performance of the low-bandwidth accelerator. With a bandwidth of only 2.23 GB/s and a hardware cost of only 90.261K LUTs and 448 DSPs, ReTiNNA achieves 155.86 GOPS on VGG16 and 68.20 GOPS on ResNet50, outperforming other state-of-the-art FPGA-based designs. Furthermore, the real-time object detection system achieves a detection speed of 19 fps on high-resolution video.
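The reported figures imply how much data reuse the accelerator must achieve. The following is a minimal sketch, not part of the original work, that derives two quantities from the numbers quoted in the abstract: a lower bound on operations performed per byte of external memory traffic (throughput divided by available bandwidth) and throughput per DSP. It assumes that GB/s and GOPS both use decimal prefixes (10^9) and that the 2.23 GB/s figure is the maximum bandwidth available to the accelerator; the function names are illustrative only.

```python
# Figures quoted in the abstract.
BANDWIDTH_GBPS = 2.23   # external memory bandwidth in GB/s (assumed 10^9 bytes/s)
DSP_COUNT = 448         # DSP slices used by the design
THROUGHPUT_GOPS = {"VGG16": 155.86, "ResNet50": 68.20}


def min_ops_per_byte(gops: float, bandwidth_gbps: float) -> float:
    """Lower bound on operations per byte of external memory traffic.

    If the accelerator sustains `gops` while moving at most
    `bandwidth_gbps` of data, each transferred byte must feed
    at least gops / bandwidth_gbps operations on average.
    """
    return gops / bandwidth_gbps


def gops_per_dsp(gops: float, dsp_count: int) -> float:
    """Average throughput delivered per DSP slice."""
    return gops / dsp_count


if __name__ == "__main__":
    for model, gops in THROUGHPUT_GOPS.items():
        reuse = min_ops_per_byte(gops, BANDWIDTH_GBPS)
        per_dsp = gops_per_dsp(gops, DSP_COUNT)
        print(f"{model}: >= {reuse:.2f} ops/byte, {per_dsp:.3f} GOPS/DSP")
```

Under these assumptions, VGG16 requires roughly 70 operations per byte transferred and ResNet50 roughly 31, which illustrates why the data reuse and tiling techniques described above matter on a bandwidth-limited system.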

Optimizing Inference Quality with SmartNIC for Recommendation System

Software-Hardware Co-design of Heterogeneous SmartNIC System for Recommendation Models Inference and Training

MicroRec: Efficient Recommendation Inference by Hardware and Data Structure Solutions

Pushing the Performance Envelope of DNN-based Recommendation Systems Inference on GPUs

High-performance Reconfigurable DNN Accelerator on a Bandwidth-limited Embedded System

Fleche: an efficient GPU embedding cache for personalized recommendations

ESPN: Memory-Efficient Multi-Vector Information Retrieval

RecNMP: Accelerating Personalized Recommendation with Near-Memory Processing

A Flexible Embedding-Aware Near Memory Processing Architecture for Recommendation System

NDRec: A Near-Data Processing System for Training Large-Scale Recommendation Models

Mixed-Precision Embedding Using a Cache

DeepRecSys: A System for Optimizing End-To-End At-scale Neural Recommendation Inference

Woodpecker-DL: Accelerating Deep Neural Networks via Hardware-Aware Multifaceted Optimizations

Accelerating Recommendation System Training by Leveraging Popular Choices

Centaur: A Chiplet-based, Hybrid Sparse-Dense Accelerator for Personalized Recommendations

Near-Memory Processing in Action: Accelerating Personalized Recommendation with AxDIMM

Dynamic Space-Time Scheduling for GPU Inference

Disaggregating Embedding Recommendation Systems with FlexEMR

UpDLRM: Accelerating Personalized Recommendation using Real-World PIM Architecture

RecSSD: near data processing for solid state drive based recommendation inference

A Comprehensive Study on Optimizing Systems with Data Processing Units