Mitigating background bias in self-supervised video representation learning

Senturk, Ufuk Umut
DOI: https://doi.org/10.1007/s11760-024-03644-w
IF: 1.583
2024-12-05
Signal Image and Video Processing
Abstract: This paper addresses the problem of self-supervised video representation learning focused on motion features, aiming to capture features from foreground motion with reduced reliance on background bias. Recent successful methods often employ instance discrimination approaches, which entail heavy computation and may lead to inefficient and exhaustive pretraining. To this end, we utilize the augmentation technique MAC: Mask-Augmentation teChnique. MAC blends foreground motion using frame-difference-based masks and sets up a pretext task to recognize the applied transformation. By incorporating a game of predicting the correct blending multiplier at the pretraining stage, our model is compelled to encode motion-based features, which are then successfully transferred to downstream tasks such as action recognition. Moreover, we expand our approach within a joint contrastive framework and integrate additional tasks in the spatial and temporal domains to further enhance representation capabilities. Experimental results demonstrate that our method achieves superior performance on the UCF-101, HMDB51 and Diving-48 datasets under low-resource settings and competitive results with instance discrimination methods under costly computation settings.
engineering, electrical & electronic, imaging science & photographic technology
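
The abstract describes MAC only at a high level: foreground motion is blended using frame-difference-based masks, and the pretext task is to predict the blending multiplier. The NumPy sketch below illustrates that idea and should not be read as the authors' implementation. The function names (`motion_mask`, `mac_blend`), the difference threshold, the candidate multiplier set, the exact blending formula, and the framing of the pretext target as a class index are all assumptions made for illustration.

```python
import numpy as np


def motion_mask(clip, threshold=0.1):
    """Binary foreground mask from absolute frame differences.

    clip: float array of shape (T, H, W, C) with values in [0, 1].
    Returns an array of shape (T, H, W, 1) with entries in {0, 1}.
    The threshold value is an assumption, not taken from the paper.
    """
    # Channel-averaged absolute difference between consecutive frames.
    diff = np.abs(np.diff(clip, axis=0)).mean(axis=-1, keepdims=True)
    # Repeat the first difference so the mask has one entry per frame.
    diff = np.concatenate([diff[:1], diff], axis=0)
    return (diff > threshold).astype(clip.dtype)


def mac_blend(target, source, multipliers=(0.25, 0.5, 0.75, 1.0), rng=None):
    """Blend the moving regions of `source` into `target`.

    Returns the blended clip and the index of the sampled multiplier,
    which a network would be trained to predict (assumed here to be a
    classification target). The multiplier set is hypothetical.
    """
    rng = np.random.default_rng() if rng is None else rng
    label = int(rng.integers(len(multipliers)))
    alpha = multipliers[label]
    mask = motion_mask(source)
    # Outside the mask the target is untouched; inside it, the source
    # foreground is mixed in with weight alpha.
    blended = (1.0 - alpha * mask) * target + alpha * mask * source
    return blended, label
```

Under these assumptions, static pixels of the source contribute nothing to the blended clip, so the sampled multiplier can only be recovered from the moving regions, which is the intuition the abstract gives for why the task pushes the model toward motion features rather than background cues.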