Abstract: Long short-term memory (LSTM) is a powerful type of deep neural network that has been widely used in many sequence analysis and modeling applications. However, the large model size of LSTM networks makes their practical deployment very challenging, especially for video recognition tasks that require high-dimensional input data. To overcome this limitation and fully unlock the potential of LSTM models, in this paper we propose algorithm and hardware co-design towards high-performance, energy-efficient LSTM networks. At the algorithm level, we develop a fully decomposed hierarchical Tucker (FDHT) structure-based LSTM, namely FDHT-LSTM, which enjoys ultra-low model complexity while still achieving high accuracy. To fully reap this algorithmic benefit, we further develop a customized hardware architecture that supports efficient execution of the proposed FDHT-LSTM model. With a carefully designed memory access scheme, the complicated matrix transformation is supported by the underlying hardware on the fly, without any access conflict. Our evaluation results show that both the proposed ultra-compact FDHT-LSTM models and the corresponding hardware accelerator achieve very high performance. Compared with state-of-the-art compressed LSTM models, FDHT-LSTM achieves both an order-of-magnitude reduction in model size and a significant accuracy improvement across different video recognition datasets. Meanwhile, compared with TIE, the state-of-the-art hardware for tensor-decomposed models, our proposed FDHT-LSTM architecture achieves better throughput, area efficiency, and energy efficiency on the LSTM-Youtube workload. For the LSTM-UCF workload, our proposed design also outperforms TIE, with higher throughput, higher energy efficiency, and comparable area efficiency.
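
To give a rough sense of why a hierarchical Tucker (HT) structure can shrink an LSTM weight matrix so sharply, the following minimal sketch compares the parameter count of a dense input-to-hidden matrix with that of a generic HT-format factorization. It is not the authors' implementation. It assumes a uniform rank at every tree node, a binary dimension tree, and leaf factors of size (m_i * n_i) x rank; the function names, the factorizations of the input and hidden sizes, and the rank value are all hypothetical and chosen only for illustration.

```python
from math import prod


def dense_params(in_factors, out_factors):
    """Parameter count of a dense weight matrix of shape (prod(out), prod(in))."""
    return prod(in_factors) * prod(out_factors)


def ht_params(in_factors, out_factors, rank):
    """Parameter count of a generic HT-format matrix (illustrative assumption).

    Assumes a binary dimension tree with d leaves and one uniform rank:
      - each leaf holds an (m_i * n_i) x rank factor,
      - the root transfer tensor holds rank * rank entries,
      - each of the other d - 2 interior nodes holds rank**3 entries.
    """
    if len(in_factors) != len(out_factors):
        raise ValueError("input and output must have the same number of factors")
    d = len(in_factors)
    if d < 2:
        raise ValueError("need at least two factors to build a dimension tree")
    leaf = rank * sum(m * n for m, n in zip(in_factors, out_factors))
    transfer = (d - 2) * rank**3 + rank**2
    return leaf + transfer


# Hypothetical sizes: a 57600-dimensional video frame input and 256 hidden units.
in_factors = (8, 10, 10, 9, 8)   # product is 57600
out_factors = (4, 4, 2, 4, 2)    # product is 256
rank = 4                         # hypothetical uniform HT rank

dense = dense_params(in_factors, out_factors)
compact = ht_params(in_factors, out_factors, rank)
print(f"dense: {dense}, HT: {compact}, ratio: {dense / compact:.1f}x")
```

The count covers only one input-to-hidden matrix; a full LSTM has several such matrices plus hidden-to-hidden weights and biases, so the overall compression ratio of a real model will differ.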

Algorithm and Hardware Co-Design of Energy-Efficient LSTM Networks for Video Recognition with Hierarchical Tucker Tensor Decomposition
