Recognizing Video Activities in the Wild Via View-to-Scene Joint Learning
Jiahui Yu, Yifan Chen, Xuna Wang, Xu Cheng, Zhaojie Ju, Yingke Xu
DOI: https://doi.org/10.1109/tase.2024.3431128
IF: 6.636
2024-01-01
IEEE Transactions on Automation Science and Engineering
Abstract: Recognizing video actions in the wild is challenging for visual control systems. In-the-wild videos contain actions not seen in the training data, and videos with the same label may be recorded from different angles and in different scenes. Most existing methods address this challenge by developing complex frameworks to extract spatiotemporal features. To achieve view robustness and scene generalization cost-effectively, we explore view consistency and joint scene understanding. Based on this, we propose a neural network (called Wild-VAR) that learns view and scene information jointly without any 3D pose ground-truth labels, a new approach to recognizing video actions in the wild. Unlike most existing methods, we first propose a Cubing module that self-learns body consistency between views instead of comprehensive image features, boosting generalization in cross-view settings. Specifically, we map 3D representations to multiple 2D features and then adopt a self-adaptive scheme to constrain the 2D features from different perspectives. Moreover, we propose temporal neural networks (called T-Scene) to build the recognition framework, enabling Wild-VAR to flexibly learn scenes across time, including key interactors and context, in video sequences. Extensive experiments show that Wild-VAR consistently outperforms state-of-the-art methods on four benchmarks. Notably, with only half the computation cost, Wild-VAR improves accuracy by 2.2% and 1.3% on the Kinetics-400 and Something-Something V2 datasets, respectively.
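The idea of mapping a 3D representation to several 2D features and constraining them across views can be illustrated with a minimal NumPy sketch. This is not the Cubing module itself, whose details the abstract does not give. The sketch assumes a cubic feature volume, axis-wise mean pooling as the 3D-to-2D mapping, and softmax weights standing in for the "self-adaptive scheme"; the function names `project_views` and `view_consistency_loss` are hypothetical.

```python
import numpy as np


def project_views(volume):
    """Collapse a cubic 3D feature volume (S, S, S) into three 2D maps.

    Assumption: each "view" is obtained by mean-pooling along one axis.
    """
    return [volume.mean(axis=a) for a in range(3)]


def view_consistency_loss(views, logits):
    """Penalize disagreement between per-view 2D features.

    Assumption: the adaptive part is modeled as softmax weights over views
    (in a real network, `logits` would be learned parameters).
    """
    # One flat descriptor per view; shapes match because the volume is cubic.
    descs = np.stack([v.ravel() for v in views])        # (V, S*S)

    # Numerically stable softmax over the view logits.
    w = np.exp(logits - logits.max())
    w = w / w.sum()

    # Weighted consensus descriptor, then weighted squared deviation from it.
    center = (w[:, None] * descs).sum(axis=0)           # (S*S,)
    per_view = ((descs - center) ** 2).mean(axis=1)     # (V,)
    return float((w * per_view).sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    volume = rng.normal(size=(8, 8, 8))
    views = project_views(volume)
    print(view_consistency_loss(views, np.zeros(3)))
```

In this toy setup, a volume whose projections agree (for example, a constant volume) yields zero loss, while disagreeing projections yield a positive loss that a training procedure could minimize.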