Abstract: Human-robot collaborative assembly (HRCA) is a current trend in intelligent manufacturing, and assembly action recognition is the basis of and the key to HRCA. This paper proposes a multi-scale and multi-stream graph convolutional network (2MSGCN) for assembly action recognition. 2MSGCN takes a temporal skeleton sample as input and outputs the class of the assembly action to which the sample belongs. RGBD images of the operator performing the assembly actions are captured by three RGBD cameras mounted at different viewpoints and pre-processed to generate the complete human skeleton. A multi-scale and multi-stream (2MS) mechanism and a feature fusion mechanism are proposed to improve the recognition accuracy of 2MSGCN. The 2MS mechanism feeds the skeleton data to 2MSGCN as a joint stream, a bone stream and a motion stream; the joint stream additionally generates two sets of coarser-scale inputs that represent higher-dimensional features of the human skeleton, so that the network obtains information at different scales and from different streams of the temporal skeleton samples. The feature fusion mechanism enables the fused feature to retain the information of each sub-feature while incorporating the union information between the sub-features. In addition, an improved convolution operation based on the Ghost module is introduced into 2MSGCN to reduce the number of parameters and floating-point operations (FLOPs) and to improve real-time performance. Because transitional actions occur when the operator switches between assembly actions in a continuous assembly process, a transitional action classification (TAC) method is proposed to distinguish transitional actions from assembly actions. Experiments on the public dataset NTU RGB+D 60 (NTU 60) and a self-built assembly action dataset indicate that the proposed 2MSGCN outperforms mainstream models in recognition accuracy and real-time performance.
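The joint, bone and motion streams and the coarser-scale joint inputs named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact definitions, so the code below assumes the conventions commonly used in multi-stream skeleton networks: a bone is a joint minus its parent joint, motion is the frame-to-frame difference of a joint, and a coarser scale is obtained by averaging groups of joints. The function names, the `parents` list, the joint grouping and the toy three-joint skeleton are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Joint = Tuple[float, float, float]   # one 3D joint position
Frame = List[Joint]                  # one skeleton (all joints in one frame)
Sample = List[Frame]                 # temporal skeleton sample (list of frames)


def bone_stream(sample: Sample, parents: Sequence[int]) -> Sample:
    """Assumed definition: bone j = joint j minus its parent joint.
    A root joint listed as its own parent yields a zero bone."""
    return [
        [tuple(c - p for c, p in zip(frame[j], frame[parents[j]]))
         for j in range(len(frame))]
        for frame in sample
    ]


def motion_stream(sample: Sample) -> Sample:
    """Assumed definition: motion at frame t = position at t+1 minus position at t.
    The last frame is zero-padded so the sample length is unchanged."""
    out = []
    for cur, nxt in zip(sample, sample[1:]):
        out.append([tuple(b - a for a, b in zip(ja, jb))
                    for ja, jb in zip(cur, nxt)])
    if sample:
        out.append([(0.0, 0.0, 0.0)] * len(sample[-1]))
    return out


def coarse_scale(sample: Sample, groups: Sequence[Sequence[int]]) -> Sample:
    """Assumed definition: average the joints in each group into one coarse node."""
    return [
        [tuple(sum(frame[j][k] for j in g) / len(g) for k in range(3))
         for g in groups]
        for frame in sample
    ]


# Toy 3-joint chain (0 -> 1 -> 2) observed over two frames.
parents = [0, 0, 1]                  # joint 0 is the root (its own parent)
sample = [
    [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 2.0, 0.0)],
    [(1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (1.0, 3.0, 0.0)],
]

bones = bone_stream(sample, parents)
motion = motion_stream(sample)
coarse = coarse_scale(sample, groups=[[0, 1], [2]])
```

Under these assumptions, each stream has the same temporal length as the input sample, so the streams could be fed to parallel network branches and fused afterwards; how 2MSGCN actually builds and fuses them is described only at a high level in the abstract.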
