SAM: Modeling Scene, Object and Action with Semantics Attention Modules for Video Recognition

Xing Zhang,Zuxuan Wu,Yu-Gang Jiang
DOI: https://doi.org/10.1109/tmm.2021.3050058
IF: 7.3
2022-01-01
IEEE Transactions on Multimedia
Abstract:Video recognition aims at understanding semantic contents that normally involve the interactions of humans and related objects under certain scenes. A common practice to improve recognition accuracy is to combine object, scene and action features for classification directly, assuming that they are explicitly complementary. In this paper, we break down the fusion of three features into two pairwise feature relation modeling processes, which mitigates the difficulty of correlation learning in high dimensional features. Towards this goal, we introduce a Semantics Attention Module that captures the relations of a pair of features by refining the relatively "weak" feature with the guidance from the "strong" feature using attention mechanisms. The refined representation is further combined with the "strong" feature using a residual design for downstream tasks. Two SAMs are applied in a Semantics Attention Network (SAN) for improving video recognition. Extensive experiments are conducted on two large-scale video benchmarks, FCVID and ActivityNet v1.3-the proposed approach achieves better results while requiring much less computational effort than alternative methods.
What problem does this paper attempt to address?