Abstract: Extracting temporal and representation features efficiently plays a pivotal role in understanding visual sequence information. To this end, we propose a new recurrent neural framework that can be stacked deep effectively. Our deep RNN framework has two main novel designs. The first is a new RNN module called the Context Bridge Module (CBM), which splits the information flowing along the sequence (temporal direction) from the information flowing along depth (spatial representation direction); balancing these two directions makes deep stacks easier to train. The second is the Overlap Coherence Training Scheme, which reduces the training complexity of long visual sequential tasks under limited computing resources. We provide empirical evidence that our deep RNN framework is easy to optimize and gains accuracy from increased depth on several visual sequence problems. On these tasks, we evaluate our deep RNN framework with 15 layers, 7× deeper than conventional RNN networks, and it remains easy to train. Our deep framework achieves a relative improvement of more than 11% over shallow RNN models on Kinetics, UCF-101, and HMDB-51 for video classification. For auxiliary annotation, replacing the shallow RNN part of Polygon-RNN with our 15-layer deep CBM improves performance by 14.7%. For video future prediction, our deep RNN improves the state-of-the-art shallow model’s performance by 2.4% on PSNR and SSIM. The code and trained models are released with this paper: https://github.com/BoPang1996/Deep-RNN-Framework.
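
The idea of separating the two information flows can be illustrated with a toy sketch. This is not the authors' CBM: the class name `TwoFlowCell`, the helper `run_stack`, the fully connected weights, and the ReLU and tanh activations are all assumptions made for illustration, and the published module may differ in every detail (for example, by using convolutions or gating). The sketch only shows a stacked recurrent cell that emits one output for the layer above (depth direction) and a separate state for its own next time step (temporal direction).

```python
import math
import random


def linear(weights, vec):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def init_matrix(rows, cols, rng):
    """Small random weights; the range is an arbitrary choice."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class TwoFlowCell:
    """Hypothetical cell with separate depth-wise and time-wise outputs."""

    def __init__(self, size, rng):
        self.w_rep_in = init_matrix(size, size, rng)
        self.w_rep_ctx = init_matrix(size, size, rng)
        self.w_tmp_in = init_matrix(size, size, rng)
        self.w_tmp_ctx = init_matrix(size, size, rng)

    def step(self, below, prev_ctx):
        # Representation flow: sent up to the next layer at the same time step.
        rep = [max(0.0, a + b) for a, b in
               zip(linear(self.w_rep_in, below), linear(self.w_rep_ctx, prev_ctx))]
        # Temporal flow: kept by this layer for the next time step.
        ctx = [math.tanh(a + b) for a, b in
               zip(linear(self.w_tmp_in, below), linear(self.w_tmp_ctx, prev_ctx))]
        return rep, ctx


def run_stack(frames, num_layers, size, seed=0):
    """Run a stack of cells over a sequence of feature vectors."""
    rng = random.Random(seed)
    cells = [TwoFlowCell(size, rng) for _ in range(num_layers)]
    ctx = [[0.0] * size for _ in range(num_layers)]  # one temporal state per layer
    outputs = []
    for frame in frames:
        signal = frame
        for i, cell in enumerate(cells):
            signal, ctx[i] = cell.step(signal, ctx[i])
        outputs.append(signal)  # top-layer representation for this time step
    return outputs, ctx


# Usage: five 4-dimensional "frames" through a 15-layer stack.
frames = [[0.1 * (t + 1)] * 4 for t in range(5)]
outputs, final_ctx = run_stack(frames, num_layers=15, size=4)
```

Keeping the two outputs separate means each path can use its own transformation, which is the balancing the abstract refers to; how the actual module achieves this is described in the paper and the linked repository.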

Deep Visual-semantic for Crowded Video Understanding

Real-time Semantic Segmentation with Weighted Factorized-Depthwise Convolution

Global and Compact Video Context Embedding for Video Semantic Segmentation

Semantic Annotation for Complex Video Street Views Based on 2D–3D Multi-Feature Fusion and Aggregated Boosting Decision Forests

Spatially-Aware Context Neural Networks

Learning Semantic Feature Map for Visual Content Recognition

Exploring Inter-feature and Inter-class Relationships with Deep Neural Networks for Video Classification

Semantic Feature Mining for Video Event Understanding

Deep Common Feature Mining for Efficient Video Semantic Segmentation

Harnessing Object and Scene Semantics for Large-Scale Video Understanding

Deep RNN Framework for Visual Sequential Applications

Foreground Detection in Surveillance Video with Fully Convolutional Semantic Network

Towards Open-Vocabulary Video Semantic Segmentation

Deep Feature Selection-And-Fusion for RGB-D Semantic Segmentation

STFCN: Spatio-Temporal FCN for Semantic Video Segmentation

Visual Content Recognition by Exploiting Semantic Feature Map with Attention and Multi-task Learning

Exploiting Feature and Class Relationships in Video Categorization with Regularized Deep Neural Networks

Enhanced 3D convolutional networks for crowd counting

Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification

Unsupervised learning of object semantic parts from internal states of CNNs by population encoding

Attention-guided chained context aggregation for semantic segmentation