Two-Stream Temporal Feature Aggregation Based on Clustering for Few-Shot Action Recognition

Long Deng, Ao Li, Bingxin Zhou, Yongxin Ge
DOI: https://doi.org/10.1109/lsp.2024.3456670
2024-09-21
IEEE Signal Processing Letters
Abstract: The metric learning paradigm has achieved notable success in few-shot action recognition; however, it still faces unaddressed challenges. Specifically, (1) limited training data can impede the exploration of temporal action relations, and (2) outliers present during frame-level feature alignment reduce precision. To address these challenges, we propose a two-stream temporal feature aggregation method based on clustering, incorporating a temporal augmentation module (TAM) and a feature aggregation module (FAM). The TAM integrates three consecutive grayscale frames into the original RGB frame through weighted summation, thereby reducing color-related misguidance and enhancing temporal information extraction. Meanwhile, the FAM employs clustering to aggregate the frame-level features into high-level semantic sub-actions and replaces the original features with the cluster centers to mitigate the adverse impact of outliers on model performance. Experimental results on benchmark datasets demonstrate the effectiveness of our method in few-shot action recognition, and comprehensive ablation experiments validate the proposed approach.
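
The two modules described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not state which three frames are used, how the summation is weighted, or which clustering algorithm the FAM uses. The sketch assumes a window of frames t-1, t, t+1, equal grayscale weights, a blend factor `alpha`, BT.601 luminance weights, and plain k-means with evenly spaced initial centers. The function names `temporal_augment` and `aggregate_by_clustering` and all parameters are hypothetical.

```python
import numpy as np


def temporal_augment(frames, t, alpha=0.5, gray_weights=(1 / 3, 1 / 3, 1 / 3)):
    """Blend RGB frame t with a weighted sum of three grayscale frames.

    frames: float array of shape (T, H, W, 3).
    Assumption: the three frames are t-1, t, t+1, clamped at clip borders.
    """
    num_frames = frames.shape[0]
    luma = np.array([0.299, 0.587, 0.114])  # BT.601 RGB-to-gray weights
    idx = np.clip([t - 1, t, t + 1], 0, num_frames - 1)
    gray = frames[idx] @ luma  # (3, H, W): one gray image per frame
    # Weighted sum over the three gray frames gives one (H, W) map.
    motion = np.tensordot(np.asarray(gray_weights), gray, axes=1)
    # Add the gray map to every color channel of the original frame.
    return alpha * frames[t] + (1 - alpha) * motion[..., None]


def assign_labels(features, centers):
    """Return the index of the nearest center for each feature row."""
    dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)


def aggregate_by_clustering(features, k=2, iters=10):
    """Replace each frame-level feature with its cluster center.

    features: array of shape (T, D). Uses plain k-means (an assumption).
    Returns the replaced features (T, D) and the cluster labels (T,).
    """
    features = np.asarray(features, dtype=float)
    # Deterministic init: k evenly spaced frames serve as initial centers.
    init = np.linspace(0, len(features) - 1, k).round().astype(int)
    centers = features[init].copy()
    for _ in range(iters):
        labels = assign_labels(features, centers)
        for c in range(k):
            members = features[labels == c]
            if len(members):  # keep the old center if a cluster is empty
                centers[c] = members.mean(axis=0)
    labels = assign_labels(features, centers)
    return centers[labels], labels


if __name__ == "__main__":
    clip = np.random.default_rng(0).random((8, 4, 4, 3))
    print(temporal_augment(clip, t=3).shape)
    feats = np.array([[0.0, 0.0], [0.2, 0.0], [10.0, 10.0], [10.2, 10.0]])
    print(aggregate_by_clustering(feats, k=2)[1])
```

In this sketch, frames assigned to the same cluster end up with identical features, so a single outlying frame is pulled toward the center of its sub-action before any frame-level alignment takes place.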
engineering, electrical & electronic