Abstract: Because on-chip computing capability for deep neural network (DNN) processing is limited, high-definition video recognition (VOR) is difficult to achieve in real time on a consumer SoC. Although many accelerators have been proposed for fast VOR, they remain isolated from the video compression knowledge inherent in a video decoder. Therefore, in this article, we propose a video decoder-assisted neural network acceleration framework for real-time video recognition. First, given that nonkey frames can be dynamically reconstructed from key frames with high fidelity during video compression, we propose the VR-DANN algorithm, which reconstructs the VOR results of nonkey frames in a similar way so as to save a large amount of NN computing power. In VR-DANN, we leverage motion vectors, the spatio-temporal information already available in the video decoding process, to facilitate the recognition process, and propose a lightweight NN-based refinement scheme to suppress the nonpixel recognition noise. Moreover, video frames contain a large amount of redundant information because the objects of interest usually occupy only a small portion of a frame. We therefore propose the object-based acceleration algorithm (Jigsaw-VOR), which avoids unnecessary computation by dropping the redundant information in the frames before they go through the computing-intensive DNN process. Concretely, we use the motion vectors to track the rough positions of the objects of interest and then merge them into a consolidated frame for DNN processing, like a jigsaw game. The acceleration comes from processing far fewer consolidated frames than there are raw frames in the video stream. VR-DANN and Jigsaw-VOR can be integrated for further speedup. On the hardware side, we propose the VR-DANN and Jigsaw-VOR architectures to accelerate the VR-DANN and Jigsaw-VOR algorithms, respectively. These two architectures can be combined for a higher performance improvement. Our experimental results show that the VR-DANN architecture achieves a $2.9\times$ performance improvement with less than 1% accuracy loss compared with the state-of-the-art “FAVOS” scheme. In addition, applying Jigsaw-VOR to all frames achieves a $2.4\times$ performance improvement with accuracy comparable to FAVOS. By combining the VR-DANN and Jigsaw-VOR schemes, the performance improvement reaches up to $3.6\times$.
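
The idea of reusing motion vectors to rebuild recognition results for nonkey frames can be illustrated with a minimal sketch. It is not the VR-DANN implementation: the abstract does not give the algorithm's details, so the function name `propagate_mask`, the per-block motion-vector dictionary, and the block size are all assumptions made for illustration, and the NN-based refinement step is omitted entirely. The sketch copies each block of a key-frame segmentation mask to its new position, much as a decoder performs motion compensation on pixels.

```python
def propagate_mask(key_mask, motion_vectors, block=2):
    """Rebuild a nonkey-frame mask by copying blocks from the key-frame mask.

    Hypothetical helper for illustration only.
    key_mask: 2D list of 0/1 labels for the key frame.
    motion_vectors: dict {(block_row, block_col): (dy, dx)}; each vector points
        from a block in the nonkey frame to its reference block in the key
        frame. Blocks with no entry are treated as having zero motion.
    """
    height, width = len(key_mask), len(key_mask[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dy, dx = motion_vectors.get((y // block, x // block), (0, 0))
            # Clamp the reference position so it stays inside the key frame.
            ry = min(max(y + dy, 0), height - 1)
            rx = min(max(x + dx, 0), width - 1)
            out[y][x] = key_mask[ry][rx]
    return out


# Key frame: the object occupies the top-left 2x2 block.
key = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
# In the nonkey frame the object has moved two pixels to the right, so block
# (0, 1) references key-frame content two pixels to its left, and block (0, 0)
# references the background two pixels to its right.
vectors = {(0, 0): (0, 2), (0, 1): (0, -2)}
nonkey = propagate_mask(key, vectors)
# nonkey == [[0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0]]
```

In this sketch no DNN inference is run for the nonkey frame; the only work is a table lookup and a copy per pixel, which is the kind of saving the abstract attributes to reusing decoder information.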

Recurrent Residual Module for Fast Inference in Videos

Frame Prediction Using Recurrent Convolutional Encoder with Residual Learning

RT-VENet: A Convolutional Network for Real-time Video Enhancement.

FASTER Recurrent Networks for Efficient Video Classification

Deep RNN Framework for Visual Sequential Applications

Recurrent Convolutional Neural Network for Video Classification.

EvConv: Fast CNN Inference on Event Camera Inputs For High-Speed Robot Perception

14.2 A 65nm 24.7 µJ/Frame 12.3 mW Activation-Similarity-Aware Convolutional Neural Network Video Processor Using Hybrid Precision, Inter-Frame Data Reuse and Mixed-Bit-Width Difference-Frame Data Codec

Rapid-INR: Storage Efficient CPU-free DNN Training Using Implicit Neural Representation

DeltaCNN: End-to-End CNN Inference of Sparse Frame Differences in Videos

Design Light-weight 3D Convolutional Networks for Video Recognition Temporal Residual, Fully Separable Block, and Fast Algorithm

Adaptive Focus for Efficient Video Recognition

Dual-module Inference for Efficient Recurrent Neural Networks

Real-Time Video Recognition Via Decoder-Assisted Neural Network Acceleration Framework

ResMap: Exploiting Sparse Residual Feature Map for Accelerating Cross-Edge Video Analytics.

RN-VID: A Feature Fusion Architecture for Video Object Detection

A 65-Nm Energy-Efficient Interframe Data Reuse Neural Network Accelerator for Video Applications

MoViNets: Mobile Video Networks for Efficient Video Recognition

A Flexible Recurrent Residual Pyramid Network for Video Frame Interpolation