Abstract:Long short-term memory (LSTM) networks are widely used to handle temporal or sequential data, and have great potential for video recognition. Existing LSTM-based video recognition methods either insert LSTM modules at the end of 2D convolutional neural networks (CNNs), called global LSTM methods, or build networks solely by stacking multiple LSTM modules. Unfortunately, these LSTM-based video recognition methods are not competitive, compared to state-of-the-art 3D CNNs or two-stream CNNs. In order to fully explore the potential of LSTM, this paper rethinks its role in video recognition network architectures and proposes a novel Temporal Grafter Network (TGN). Specifically, we develop an efficient and effective variant of convolutional LSTM module, which is grafted between different stages of very deep 2D CNNs for temporal modeling and delivery. Our TGN can capture local motion patterns of varying scales inherent in feature maps from high to low resolutions, while attending to the spatial context information and modeling global temporal dependency across the whole video. The proposed TGN can capture and transmit temporal information throughout very deep 2D CNNs, overcoming the downsides of existing LSTM-based methods and being able to make full use of the potential of LSTM for effective video recognition and early action recognition. We perform extensive ablation study to verify the effectiveness of our proposed methods, and experiments on three widely used video benchmarks show that our methods can achieve performance matching or better than the state-of-the-arts.

Exploration of Network Optimization Strategies Based On the TSN Model.

Temporal Interaction and Excitation for Action Recognition

STA-TSN: Spatial-Temporal Attention Temporal Segment Network for Action Recognition in Video.

Temporal Segment Networks for Action Recognition in Videos

Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

TEINet: Towards an Efficient Architecture for Video Recognition.

Temporal Grafter Network: Rethinking LSTM for Effective Video Recognition

A Human Action Recognition Model Inspired By Multiple Scale Temporal Segments Model Fusion

Improve Video Representation with Temporal Adversarial Augmentation

TDSNet: A Temporal Difference Based Network for Video Semantic Segmentation

Not all temporal shift modules are profitable

TFAE: temporal feature adjustable enhancement for video anomaly detection

Temporal Stochastic Linear Encoding Networks.

TDN: Temporal Difference Networks for Efficient Action Recognition

Temporal Modulation Network for Controllable Space-Time Video Super-Resolution

DTCM: Joint Optimization of Dark Enhancement and Action Recognition in Videos

AdaFocus V2: End-to-End Training of Spatial Dynamic Networks for Video Recognition

Behavior Recognition Based on Two-Stream Temporal Relation-Time Pyramid Pooling Network (TTR-TPPN)

Attentional Fused Temporal Transformation Network for Video Action Recognition.

Design Light-weight 3D Convolutional Networks for Video Recognition Temporal Residual, Fully Separable Block, and Fast Algorithm

Temporal Transformer Networks with Self-Supervision for Action Recognition.