A simulated two-stream network via multilevel distillation of reviewed features and decoupled logits for video action recognition

Wang, Anna K.
DOI: https://doi.org/10.1007/s00371-024-03638-2
IF: 2.835
2024-10-22
The Visual Computer
Abstract: Video-based action recognition, which models human activities and classifies actions from video frame sequences, offers significant utility in computer vision and computer graphics applications. Methods based on a two-stream architecture, which consists of an appearance stream for RGB video frames and a motion stream for optical flow frames, are commonly used. Because optical flow is computationally expensive, recent studies have attempted to use it only in the training phase and avoid computing it in the testing phase. However, these approaches fail to fully leverage the benefits of optical flow during training, leading to limited improvements in accuracy. To address this issue, we introduce a Simulated Two-Stream Network (STS-Net), which uses a multilevel knowledge distillation approach to capture motion representations. First, we distill the motion features of optical flow at multiple levels through a review mechanism, thereby capturing both low-level features and high-level semantic information. Second, we apply a decoupled logit distillation loss to achieve more comprehensive knowledge transfer. Additionally, we analyze the role of the activation function in fusing the two streams and propose an effective fusion strategy named "ActivNo." Experimental results on benchmark datasets (i.e., HMDB51, UCF101, and Kinetics400) demonstrate that the proposed STS-Net achieves superior performance, surpassing comparable methods in both efficiency and accuracy. The code is openly available at https://github.com/BiBiKo219/STS-Net.
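The abstract does not spell out the decoupled logit distillation loss, so the following is only a minimal sketch of the general idea as it is commonly formulated: the teacher-student KL divergence is split into a target-class term and a non-target-class term, which are then weighted separately. The function names, the weights `alpha` and `beta`, and the temperature value are illustrative assumptions, not values taken from the paper or its repository.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def decoupled_terms(student_logits, teacher_logits, target, temperature=4.0):
    """Return (target-class term, non-target-class term) of the decoupled loss."""
    p_s = softmax(student_logits, temperature)
    p_t = softmax(teacher_logits, temperature)

    # Target-class term: KL between the binary distributions
    # [p(target), 1 - p(target)] of teacher and student.
    tckd = kl_divergence([p_t[target], 1.0 - p_t[target]],
                         [p_s[target], 1.0 - p_s[target]])

    # Non-target term: KL between the distributions over the remaining
    # classes, each renormalised so that it sums to one.
    rest_s = [p for i, p in enumerate(p_s) if i != target]
    rest_t = [p for i, p in enumerate(p_t) if i != target]
    rest_s = [p / sum(rest_s) for p in rest_s]
    rest_t = [p / sum(rest_t) for p in rest_t]
    nckd = kl_divergence(rest_t, rest_s)
    return tckd, nckd


def decoupled_kd_loss(student_logits, teacher_logits, target,
                      alpha=1.0, beta=8.0, temperature=4.0):
    """Weighted sum of the two terms (alpha, beta are illustrative values)."""
    tckd, nckd = decoupled_terms(student_logits, teacher_logits,
                                 target, temperature)
    return (alpha * tckd + beta * nckd) * temperature ** 2


if __name__ == "__main__":
    # Hypothetical logits: the flow-based teacher and the RGB-only student.
    teacher = [2.0, 0.5, -1.0, 0.1]
    student = [1.2, 0.9, -0.3, 0.0]
    print(decoupled_kd_loss(student, teacher, target=0))
```

In classical logit distillation the non-target term is implicitly scaled by the teacher's non-target probability mass, so it shrinks when the teacher is confident. Weighting the two terms independently removes that coupling, which is one plausible reading of the "more comprehensive knowledge transfer" described above.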
computer science, software engineering