Trear: Transformer-Based RGB-D Egocentric Action Recognition

Xiangyu Li, Yonghong Hou, Pichao Wang, Zhimin Gao, Mingliang Xu, Wanqing Li
DOI: https://doi.org/10.1109/tcds.2020.3048883
IF: 4.546
2022-03-01
IEEE Transactions on Cognitive and Developmental Systems
Abstract: In this article, we propose a transformer-based RGB-D egocentric action recognition framework called Trear. It consists of two modules: 1) an interframe attention encoder and 2) a mutual-attentional fusion block. Instead of using optical flow or recurrent units, we adopt a self-attention mechanism to model the temporal structure of the data from different modalities. Input frames are cropped randomly to mitigate the effect of data redundancy. Features from each modality interact through the proposed fusion block and are combined through a simple yet effective fusion operation to produce a joint RGB-D representation. Experiments on two large egocentric RGB-D data sets, 1) THU-READ and 2) First-Person Hand Action, and one small data set, Wearable Computer Vision Systems, show that the proposed method outperforms state-of-the-art results by a large margin.
robotics, computer science, artificial intelligence, neurosciences
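
To make the two modules named in the abstract more concrete, the following is a minimal NumPy sketch of the general idea, not the authors' implementation. Per-frame features of each modality first attend to the other frames of the same modality (interframe self-attention), then each modality attends to the other (mutual attention) before the two streams are combined. The function names, the feature sizes, the absence of learned projection weights, and the choice of element-wise summation followed by temporal averaging as the fusion operation are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    """Scaled dot-product attention. Returns (output, attention weights)."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d), axis=-1)
    return weights @ values, weights

def interframe_self_attention(frames):
    """Each frame attends to every frame of the same modality.
    Assumption: no learned projections, to keep the sketch short."""
    return attention(frames, frames, frames)

def mutual_attention_fusion(rgb, depth):
    """Each modality queries the other; the results are then fused.
    Assumption: fusion is an element-wise sum followed by a temporal mean."""
    rgb_from_depth, _ = attention(rgb, depth, depth)
    depth_from_rgb, _ = attention(depth, rgb, rgb)
    joint = (rgb + rgb_from_depth) + (depth + depth_from_rgb)
    return joint.mean(axis=0)  # one joint RGB-D vector per clip

# Hypothetical sizes: 8 sampled frames, 16-dimensional per-frame features.
rng = np.random.default_rng(0)
T, D = 8, 16
rgb_feats = rng.standard_normal((T, D))
depth_feats = rng.standard_normal((T, D))

rgb_enc, rgb_weights = interframe_self_attention(rgb_feats)
depth_enc, _ = interframe_self_attention(depth_feats)
joint = mutual_attention_fusion(rgb_enc, depth_enc)
print(joint.shape)  # (16,)
```

In a trained model the queries, keys, and values would come from learned linear projections and the joint vector would feed a classifier; both are omitted here.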