Interaction augmented transformer with decoupled decoding for video captioning

Tao Jin, Zhou Zhao, Peng Wang, Jun Yu, Fei Wu
DOI: https://doi.org/10.1016/j.neucom.2022.03.065
IF: 6
2022-07-01
Neurocomputing
Abstract: Transformer-based architectures achieve competitive performance in video captioning. However, they still have several limitations: (1) Existing methods consider only the correlation between the query and key modalities when calculating attention weights and ignore their interaction with other modalities. (2) Deeply stacked cross-modal encoding blocks make the different modalities assimilate and lose their original discriminative properties. (3) The decoder usually employs only the output of the last encoding block, which is not a comprehensive representation. To address these concerns, we propose a novel method called Interaction Augmented Transformer (IAT) with discriminative encoding and decoupled decoding for video captioning. Concretely, by concatenating "[CLS]" tokens to multimodal features, we apply reconstructive contrastive constraints to the encoded results. Based on the conclusive information carried by these tokens, we first introduce global-gated interaction into multi-head attention, where this information is transformed into multiple interaction augmented functions. Additionally, the dot-product operation is replaced by a Tucker-fused operation to better capture the query-to-key correlation. Furthermore, we employ fine-grained layer-wise decoding of the multi-layer multimodal features from the encoder with a decoupled strategy. We conduct extensive quantitative, qualitative, and ablation experiments on benchmark datasets, and the results show that IAT outperforms state-of-the-art methods on most metrics.
computer science, artificial intelligence
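
The abstract mentions replacing the dot-product in attention with a Tucker-fused operation and gating attention with global "[CLS]" information. The abstract does not give the formulation, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a low-rank bilinear score with a 2D core matrix (a simplified stand-in for a full Tucker core tensor) and a sigmoid gate computed from a global vector. All names (`tucker_scores`, `gated_attention`, `Wq`, `Wk`, `Wg`, `core`, `cls`) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def tucker_scores(Q, K, Wq, Wk, core, gate=None):
    """Bilinear (Tucker-style) query-to-key scores.

    Q: (n_q, d), K: (n_k, d), Wq and Wk: (d, r), core: (r, r).
    gate: optional (r,) vector that rescales the projected queries.
    With identity projections and an identity core, this reduces to
    the plain dot product Q @ K.T.
    """
    Qp = Q @ Wq                  # (n_q, r) projected queries
    Kp = K @ Wk                  # (n_k, r) projected keys
    if gate is not None:
        Qp = Qp * gate           # assumed form of global gating
    return Qp @ core @ Kp.T      # (n_q, n_k)


def gated_attention(Q, K, V, cls, Wq, Wk, core, Wg):
    """Single-head attention with a gate derived from a global vector.

    cls: (d,) global summary vector (assumption: a "[CLS]"-style token).
    Wg: (d, r) hypothetical projection producing the gate.
    Returns (output, attention_weights).
    """
    gate = 1.0 / (1.0 + np.exp(-(cls @ Wg)))       # sigmoid, shape (r,)
    r = core.shape[0]
    scores = tucker_scores(Q, K, Wq, Wk, core, gate) / np.sqrt(r)
    weights = softmax(scores, axis=-1)             # rows sum to 1
    return weights @ V, weights
```

In this sketch the core matrix lets every projected query dimension interact with every projected key dimension, whereas a dot product only pairs matching dimensions. The gate lets information outside the query-key pair modulate the score, which is the general idea the abstract describes.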