Abstract: Data representation learning is one of the most important problems in machine learning. Unsupervised representation learning is attractive because it requires no label information for the observed data. Because training deep-learning models is highly time-consuming, many machine-learning models directly adapt well-trained deep models, obtained in a supervised and end-to-end manner, as feature abstractors for different problems. However, different machine-learning tasks require different representations of the original input data. Taking human action recognition as an example, human actions in a video sequence are 3-D signals containing both the visual appearance and the motion dynamics of humans and objects. Data representation approaches that can capture both spatial and temporal correlations in videos are therefore valuable. Most existing human motion recognition models build classifiers on deep-learning structures such as deep convolutional networks. These models require a large quantity of annotated training videos, and, being supervised, they cannot recognize samples from a different dataset without retraining. In this article, we propose a new 3-D deconvolutional network (3DDN) for representation learning of high-dimensional video data, in which the high-level features are obtained through an optimization approach. The proposed 3DDN decomposes the video frames into spatiotemporal features under a sparse constraint in an unsupervised way. It can also serve as a building block for developing deep architectures by stacking. The high-level representation of the input sequential data can be used in multiple downstream machine-learning tasks; we evaluate the proposed 3DDN and its deep models on human action recognition. The experimental results on three datasets, 1) KTH; 2) HMDB-51; and 3) UCF-101, demonstrate that the proposed 3DDN is an alternative to feedforward convolutional neural networks (CNNs) that attains comparable results.
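
The abstract does not state the objective that the 3DDN optimizes. As a rough illustration only, a sparse deconvolutional decomposition of a video volume is often written in the following form. This extends the familiar 2-D deconvolutional-network objective to 3-D and is an assumption, not the authors' formulation. All symbols are hypothetical: $y$ is the input video volume, $f_k$ are 3-D filters, $z_k$ are the sparse spatiotemporal feature maps, $\ast$ denotes 3-D convolution, and $\lambda$ weights the reconstruction term.

```latex
\min_{\{z_k\},\,\{f_k\}} \;
  \frac{\lambda}{2} \Bigl\| \sum_{k=1}^{K} z_k \ast f_k - y \Bigr\|_2^2
  \;+\; \sum_{k=1}^{K} \| z_k \|_1
```

Under this assumed form, the first term asks the feature maps and filters to reconstruct the video, and the $\ell_1$ term enforces the sparse constraint. The features are found by optimization rather than by a feedforward pass, which matches the contrast with feedforward CNNs drawn in the abstract.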

Ensembling 3D CNN Framework for Video Recognition

Recurrent Volume-Based 3-D Feature Fusion for Real-Time Multiview Object Pose Estimation.

Human Action Recognition Based on Three-Stream Network with Frame Sequence Features

DC3D: A Video Action Recognition Network Based on Dense Connection

Vehicle Behavior Recognition using Multi-Stream 3D Convolutional Neural Network

An Enhanced 3DCNN-ConvLSTM for Spatiotemporal Multimedia Data Analysis

Spatiotemporal Multimodal Learning With 3D CNNs for Video Action Recognition

TEINet: Towards an Efficient Architecture for Video Recognition.

Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks

Recurrent Volume-based 3D Feature Fusion for Real-time Multi-view Object Pose Estimation

Trunk-Branch Ensemble Convolutional Neural Networks for Video-Based Face Recognition

Face Recognition Using $Sf_{3}CNN$ With Higher Feature Discrimination

Intelligent 3D Network Protocol for Multimedia Data Classification using Deep Learning

Enhanced Action Recognition With Visual Attribute-Augmented 3D Convolutional Neural Network

Maximizing Spatio-Temporal Entropy of Deep 3D CNNs for Efficient Video Recognition

3-D Deconvolutional Networks for the Unsupervised Representation Learning of Human Motions

Empowering Efficient Spatio-Temporal Learning with a 3D CNN for Pose-Based Action Recognition

3D Convolutional Neural Network for Human Behavior Analysis in Intelligent Sensor Network

Human Action Recognition Combining Sequential Dynamic Images and Two-Stream Convolutional Network

Visual Attribute-augmented Three-dimensional Convolutional Neural Network for Enhanced Human Action Recognition.

Facial Expression Recognition in Video Using 3D-CNN Deep Features Discrimination