Cross-modal Token Selection for Video Understanding.

Liyong Pan, Zechao Li, Henghao Zhao, Rui Yan
DOI: https://doi.org/10.1145/3552458.3556444
2022-01-01
Abstract: Multi-modal action recognition is an essential task in human-centric machine learning. Humans perceive the world by processing and fusing information from multiple modalities such as vision and audio. We introduce a novel transformer-based multi-modal architecture that outperforms existing state-of-the-art methods while significantly reducing the computational cost. The key to our idea is a Token-Selector module that collates and condenses the most useful token combinations and shares only what is necessary for cross-modal modeling. We conduct extensive experiments on multiple multi-modal benchmark datasets and achieve state-of-the-art performance under similar experimental conditions while reducing computational cost by 30 percent. Extensive ablation studies showcase the benefits of our improved method over naive approaches.
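The abstract does not specify how the Token-Selector scores or condenses tokens, so the following is only a minimal sketch of the general idea of sharing a reduced token set across modalities. Everything in it is an assumption made for illustration: the function names `select_top_k` and `attend`, the use of the vector norm as a stand-in for a learned importance score, and the toy token values. It is not the authors' implementation.

```python
import math


def select_top_k(tokens, scores, k):
    """Keep the k highest-scoring tokens, preserving their original order."""
    if not 0 < k <= len(tokens):
        raise ValueError("k must be between 1 and the number of tokens")
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])
    return [tokens[i] for i in keep]


def attend(queries, shared):
    """Scaled dot-product attention of each query over the shared tokens."""
    dim = len(shared[0])
    outputs = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, s)) / math.sqrt(dim) for s in shared]
        peak = max(logits)  # subtract the maximum for numerical stability
        weights = [math.exp(value - peak) for value in logits]
        total = sum(weights)
        outputs.append([
            sum(w / total * s[d] for w, s in zip(weights, shared))
            for d in range(dim)
        ])
    return outputs


# Hypothetical toy tokens: four audio tokens and two video tokens of width 2.
audio_tokens = [[0.1, 0.0], [2.0, 1.0], [0.0, 0.2], [1.0, 3.0]]
video_tokens = [[1.0, 0.0], [0.0, 1.0]]

# Assumption: the vector norm stands in for a learned importance score.
audio_scores = [math.hypot(*token) for token in audio_tokens]

# Only the selected audio tokens are exposed to the video stream.
shared = select_top_k(audio_tokens, audio_scores, k=2)
fused = attend(video_tokens, shared)
```

In this toy setting each video token attends over two audio tokens instead of four, which illustrates why passing fewer tokens between modalities lowers the cost of cross-modal attention.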