Abstract: RGB and depth modalities together contain abundant, interactive information, and convolutional neural networks (ConvNets) based on multi-modal data have made substantial progress in action recognition. However, a single-stream network is limited in its ability to learn multi-modal interactive features, which makes it difficult to improve recognition performance. Inspired by the multi-stream learning mechanism and spatial-temporal information representation methods, we construct dynamic images using the rank pooling method and design an interactive learning dual-ConvNet (ILD-ConvNet) with a multiplexer module to improve action recognition performance. Built with the rank pooling method, the constructed visual dynamic images capture the spatial-temporal information of entire RGB videos. We extend this method to depth sequences to obtain richer multi-modal spatial-temporal information as the inputs of the ConvNets. In addition, we design the ILD-ConvNet with multiplexer modules to jointly learn interactive features across the two streams from the RGB and depth modalities. The proposed recognition framework has been tested on two benchmark multi-modal datasets, NTU RGB + D 120 and PKU-MMD. The proposed ILD-ConvNet with a temporal segmentation mechanism achieves accuracies of 86.9% and 89.4% for Cross-Subject (C-Sub) and Cross-Setup (C-Set) on NTU RGB + D 120, and 92.0% and 93.1% for Cross-Subject (C-Sub) and Cross-View (C-View) on PKU-MMD, which are comparable to the state of the art. The experimental results show that the proposed ILD-ConvNet with a multiplexer module can extract interactive features from different modalities to enhance action recognition performance.
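
To illustrate how a dynamic image can be built from a frame sequence, the following is a minimal sketch of approximate rank pooling, a closed-form surrogate for rank pooling in which frame t of T is weighted by a coefficient alpha_t = 2t - T - 1. The abstract does not state which rank pooling variant is used, so the choice of this approximation, the function names `approximate_rank_pooling` and `to_uint8`, and the (T, H, W, C) array layout are assumptions made for illustration, not a description of the authors' implementation.

```python
import numpy as np


def approximate_rank_pooling(frames):
    """Collapse a frame sequence into one dynamic image.

    Assumed layout: frames has shape (T, H, W, C), with time on axis 0.
    Uses the approximate rank pooling weights alpha_t = 2t - T - 1
    for t = 1..T (an assumption; the exact variant is not given in the text).
    """
    frames = np.asarray(frames, dtype=np.float64)
    num_frames = frames.shape[0]
    # Later frames get positive weights, earlier frames negative ones,
    # so the result encodes the temporal ordering of appearance changes.
    alphas = 2.0 * np.arange(1, num_frames + 1) - num_frames - 1
    # Weighted sum over the time axis: sum_t alpha_t * frame_t.
    return np.tensordot(alphas, frames, axes=(0, 0))


def to_uint8(image):
    """Min-max rescale a dynamic image to [0, 255] so it can be fed to a ConvNet."""
    lo, hi = image.min(), image.max()
    if hi == lo:
        # A constant image carries no contrast; map it to zeros.
        return np.zeros(image.shape, dtype=np.uint8)
    return np.round(255.0 * (image - lo) / (hi - lo)).astype(np.uint8)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rgb_clip = rng.random((16, 8, 8, 3))    # hypothetical 16-frame RGB clip
    depth_clip = rng.random((16, 8, 8, 1))  # hypothetical depth sequence
    print(to_uint8(approximate_rank_pooling(rgb_clip)).shape)
    print(to_uint8(approximate_rank_pooling(depth_clip)).shape)
```

Because the weights sum to zero, a static clip yields an all-zero dynamic image, so only temporal change contributes. The same function applies unchanged to RGB and depth sequences, which matches the idea of producing one dynamic image per modality as input to each stream.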

Rank Pooling Dynamic Network: Learning End-to-end Dynamic Characteristic for Action Recognition

Rank Pooling for Action Recognition

End-to-end Video-level Representation Learning for Action Recognition

Human Action Recognition Based on Three-Stream Network with Frame Sequence Features

Multi-scale residual network model combined with Global Average Pooling for action recognition

Action Representation Using Classifier Decision Boundaries

Ordered Pooling of Optical Flow Sequences for Action Recognition

Human Action Recognition Combining Sequential Dynamic Images and Two-Stream Convolutional Network

DC3D: A Video Action Recognition Network Based on Dense Connection

Action Recognition By Learning Deep Multi-Granular Spatio-Temporal Video Representation

Temporal Distinct Representation Learning for Action Recognition

Empowering Efficient Spatio-Temporal Learning with a 3D CNN for Pose-Based Action Recognition

Neural Koopman Pooling: Control-Inspired Temporal Dynamics Encoding for Skeleton-Based Action Recognition

An Improved Graph Pooling Network for Skeleton-Based Action Recognition

Deep multiple aggregation networks for action recognition

Temporal Pyramid Pooling-Based Convolutional Neural Network for Action Recognition

Dynamic Sampling Networks for Efficient Action Recognition in Videos

Interactive Learning of a Dual Convolution Neural Network for Multi-Modal Action Recognition

Deep Local Video Feature for Action Recognition

Order-aware Convolutional Pooling for Video Based Action Recognition

Multi-Level Recurrent Residual Networks for Action Recognition