Abstract: This paper addresses the recognition of human actions in videos. Human action recognition can be seen as the automatic labeling of a video according to the actions occurring in it. It has become one of the most challenging and attractive problems in the pattern recognition and video classification fields. The problem is difficult to solve with traditional video processing methods because of challenges such as background noise, the varying sizes of subjects across videos, and the varying speed of actions. Building on progress in deep learning, several approaches have been developed to recognize human actions in video, such as long short-term memory (LSTM)-based models, two-stream convolutional neural network (CNN) models, and convolutional 3D models. In this paper, we focus on the two-stream structure. The traditional two-stream CNN addresses the unsatisfactory performance of CNNs on temporal features. By training a temporal stream that takes optical flow as its input, a CNN gains the ability to extract temporal features. However, optical flow contains only limited temporal information because it records only the movements of pixels along the x-axis and the y-axis. Therefore, we design and implement a new two-stream model that uses an LSTM-based model in its spatial stream to extract both spatial and temporal features from RGB frames. This is in contrast to traditional approaches, which typically use the spatial stream to extract only spatial features. In addition, we implement a DenseNet in the temporal stream to improve the recognition accuracy. The quantitative evaluation and experiments are conducted on UCF-101, a well-developed public video dataset. For the temporal stream, we use the optical flow of UCF-101; the optical flow images are provided by the Graz University of Technology.
The experimental results show that the proposed method outperforms the traditional two-stream CNN method by at least 3% in accuracy. The proposed model also achieves higher recognition accuracy in each of the spatial and temporal streams individually. In addition, compared with state-of-the-art methods, the new model still achieves the best recognition performance.
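
The abstract does not state how the outputs of the two streams are combined. As an illustration only, the following minimal sketch assumes weighted-average late fusion of per-class softmax scores, a common choice in two-stream models. The function names, the `temporal_weight` parameter, and the example logits are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_streams(
    spatial_logits: Sequence[float],
    temporal_logits: Sequence[float],
    temporal_weight: float = 0.5,
) -> List[float]:
    """Weighted average of the class probabilities of the two streams.

    Assumption: late fusion by weighted averaging. The abstract does not
    specify the fusion rule, so this is a stand-in, not the authors' method.
    """
    if len(spatial_logits) != len(temporal_logits):
        raise ValueError("both streams must score the same set of classes")
    if not 0.0 <= temporal_weight <= 1.0:
        raise ValueError("temporal_weight must lie in [0, 1]")
    p_spatial = softmax(spatial_logits)
    p_temporal = softmax(temporal_logits)
    return [
        (1.0 - temporal_weight) * s + temporal_weight * t
        for s, t in zip(p_spatial, p_temporal)
    ]


def predict(
    spatial_logits: Sequence[float],
    temporal_logits: Sequence[float],
    temporal_weight: float = 0.5,
) -> int:
    """Return the index of the class with the highest fused probability."""
    fused = fuse_streams(spatial_logits, temporal_logits, temporal_weight)
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical scores for three action classes from each stream.
spatial_scores = [2.0, 1.0, 0.1]   # e.g. output of the LSTM-based spatial stream
temporal_scores = [0.5, 2.5, 0.3]  # e.g. output of the DenseNet temporal stream
predicted_class = predict(spatial_scores, temporal_scores)
```

With equal weights, the fused prediction here is class 1, because the temporal stream is more confident about class 1 than the spatial stream is about class 0. Setting `temporal_weight` to 0 reduces the model to the spatial stream alone, which is one way to compare single-stream and fused accuracy.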

Improved two-stream model for human action recognition