Enhancing Motion Visual Cues for Self-Supervised Video Representation Learning

Mu Nie, Zhibin Quan, Weiping Ding, Wankou Yang
DOI: https://doi.org/10.1016/j.engappai.2023.106203
IF: 8
2023-01-01
Engineering Applications of Artificial Intelligence
Abstract: Building general features from unlabeled videos is the core of self-supervised video representation learning. However, recent research on video representation focuses on static visual pixel information, which leaves these models unable to capture dynamics in the temporal dimension. To address this issue and improve model generalization, we propose an enhanced motion visual cue (EMVC) method for self-supervised video representation learning that reduces background bias and increases motion information. EMVC includes a background replacing module and a foreground fixing module that leverage the foreground and background of the original video sequence to make the backgrounds of same-action videos less similar and their motion cues more distinct. Experimental results show that the proposed method effectively reduces background bias and significantly improves the model's ability to comprehend motion information, leading to higher recognition accuracy. Specifically, EMVC achieves accuracies of 84.6%, 53.1%, and 68.3% on the UCF101, HMDB51, and Diving48 datasets, respectively, outperforming existing algorithms. Additionally, significant improvements are obtained in the video retrieval task on benchmark datasets, with an average improvement of over 14.6%.
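The abstract does not specify how the background replacing module is implemented. As a rough illustration of the general idea only, the sketch below composites the moving region of one clip onto a background frame taken from elsewhere. The function names `motion_mask` and `replace_background`, the frame-difference mask, and the `threshold` parameter are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def motion_mask(clip, threshold=0.1):
    """Estimate a rough foreground mask from frame differences.

    Assumption for illustration: pixels that change between frames
    are treated as foreground. The paper may use a different method.

    clip: float array of shape (T, H, W, C) with values in [0, 1].
    Returns an (H, W) array of 0s and 1s.
    """
    # Mean absolute difference between consecutive frames,
    # averaged over time and channels.
    diff = np.abs(np.diff(clip, axis=0)).mean(axis=(0, 3))
    return (diff > threshold).astype(clip.dtype)


def replace_background(clip, background, mask):
    """Keep masked (foreground) pixels from clip, take the rest from background.

    background: float array of shape (H, W, C), a single frame.
    mask: (H, W) array of 0s and 1s.
    """
    m = mask[None, :, :, None]  # broadcast over time and channels
    return m * clip + (1.0 - m) * background[None]
```

Applying `replace_background` to clips of the same action with different `background` frames would yield samples that share motion but differ in scene, which is the kind of variation the abstract describes in words.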