MA-VLAD: a fine-grained local feature aggregation scheme for action recognition
Na Feng, Ying Tang, Zikai Song, Junqing Yu, Yi-Ping Phoebe Chen, Wei Yang
DOI: https://doi.org/10.1007/s00530-024-01341-9
IF: 3.9
2024-05-08
Multimedia Systems
Abstract: A recent trend in action recognition involves aggregating local features into a more compact representation to eliminate redundancy in video features while retaining essential components for recognition. An exemplary approach is NetVLAD and its variations, which learn cluster centers for local features and represent them as VLAD descriptors. However, these methods process multi-frame features in a generic and straightforward manner, while overlooking the intricate semantic shifts within consecutive frames. More specifically, they fail to acknowledge that a pivotal aspect of events/actions is the local dynamics of semantic entities. In this paper, we propose Multi-head Attention Modularized VLAD (MA-VLAD) for fine-grained semantic-inclination clustering of features, enhancing VLAD descriptors with a strong local focusing capability. Specifically, we utilize a multi-head mechanism to partition the input features along the channel dimension, and integrate it with the attention mechanism to conduct fine-grained clustering. Additionally, to consolidate temporal information for enhanced recognition, we utilize temporal position embeddings to address order-sensitive events/actions. Our MA-VLAD delivers more dependable video representations than some of the most widely used and potent methods. Extensive experiments on UCF101, HMDB51, and SoccerNet-v2 datasets demonstrate that our MA-VLAD achieves state-of-the-art performance, underscoring its effectiveness.
computer science, information systems, theory & methods
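
Illustrative sketch (not from the paper): the abstract describes splitting frame features along the channel dimension into heads, soft-assigning each head's features to learned cluster centers with an attention mechanism, and adding temporal position embeddings. The NumPy code below is one possible reading of that idea, not the authors' implementation. The function names, the sinusoidal form of the position embedding, the scaled dot-product assignment, and the normalization steps are all assumptions made here for illustration; the abstract does not specify them.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sinusoidal_positions(num_frames, dim):
    # Assumption: fixed sinusoidal embeddings; the paper may use a learned variant.
    pos = np.arange(num_frames)[:, None]
    i = np.arange(dim)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))


def multi_head_vlad(features, centers, num_heads):
    """Hypothetical multi-head VLAD aggregation.

    features: (T, D) array of per-frame local features.
    centers:  (H, K, D // H) array of cluster centers, K per head.
    Returns a flat, L2-normalized descriptor of length H * K * (D // H).
    """
    num_frames, dim = features.shape
    assert dim % num_heads == 0, "channel dimension must split evenly into heads"
    head_dim = dim // num_heads

    # Inject frame order so the otherwise order-invariant sum becomes order-sensitive.
    x = features + sinusoidal_positions(num_frames, dim)

    # Partition channels into heads: (H, T, d).
    heads = x.reshape(num_frames, num_heads, head_dim).transpose(1, 0, 2)

    # Attention-style soft assignment of each frame to each center: (H, T, K).
    logits = heads @ centers.transpose(0, 2, 1) / np.sqrt(head_dim)
    assign = softmax(logits, axis=-1)

    # VLAD residual sum per center: sum_t a[t, k] * (x_t - c_k) -> (H, K, d).
    weighted_sum = assign.transpose(0, 2, 1) @ heads
    mass = assign.sum(axis=1)[..., None]
    vlad = weighted_sum - mass * centers

    # Intra-normalize each cluster residual, then L2-normalize the whole descriptor.
    vlad = vlad / (np.linalg.norm(vlad, axis=-1, keepdims=True) + 1e-12)
    flat = vlad.reshape(-1)
    return flat / (np.linalg.norm(flat) + 1e-12)
```

In this sketch the cluster centers are passed in as plain arrays; in a trainable model they would be learned parameters. Without the position embedding the descriptor would be unchanged under any reordering of the frames, which is why the embedding is added before the assignment step.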