Some Can Be Better Than All: Multimodal Star Transformer for Visual Dialog

Qiangqiang He, Jie Zhang, Shuwei Qian, Chongjun Wang
DOI: https://doi.org/10.1109/icip51287.2024.10647294
2024-01-01
Abstract: Visual dialog involves answering questions by analyzing both an image and the dialogue history. Existing multimodal approaches model the interactions among images, dialogue history, and questions effectively, but they incur significant computational overhead and complexity. To address these challenges, this paper introduces a MultiModal Star Transformer (MMST) that models the interactions between the visual and textual modalities, as well as within each modality, with linear computational overhead. MMST uses one relay token per modality and lets each satellite token interact with its two adjacent tokens, its own previous state, and the two relay tokens. Because of the relay tokens, any two non-adjacent satellite tokens are two-hop neighbors, so MMST supports both intramodal long-range connections and intermodal interactions efficiently. Experimental results on the VisDial v0.9 and v1.0 datasets show that MMST performs comparably to full-attention models.
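The connectivity pattern described in the abstract can be illustrated with a small graph sketch. This is not the authors' implementation and does not compute attention; it only builds the links the abstract names (two adjacent tokens and the two relay tokens per satellite) and checks the two-hop and linear-growth properties. The function names, the token counts, the absence of wrap-around at sequence ends, and the absence of a direct relay-to-relay link are all assumptions made for illustration. The "previous state" link is a self-connection across layers and is therefore left out of the graph.

```python
from collections import deque


def build_star_graph(n_visual, n_text):
    """Build an undirected adjacency map for a two-modality star topology.

    Nodes are ("v", i) and ("t", j) for satellite tokens, plus one relay
    per modality: ("relay", "v") and ("relay", "t"). Hypothetical helper.
    """
    relays = [("relay", "v"), ("relay", "t")]
    adj = {relay: set() for relay in relays}
    for name, n in (("v", n_visual), ("t", n_text)):
        for i in range(n):
            node = (name, i)
            adj.setdefault(node, set())
            # Link to the two adjacent tokens of the same modality
            # (assumption: no wrap-around at the ends of the sequence).
            for j in (i - 1, i + 1):
                if 0 <= j < n:
                    adj[node].add((name, j))
                    adj.setdefault((name, j), set()).add(node)
            # Link to both relay tokens, which gives intermodal access.
            for relay in relays:
                adj[node].add(relay)
                adj[relay].add(node)
    return adj


def count_edges(adj):
    """Each undirected edge is stored twice, once per endpoint."""
    return sum(len(neighbors) for neighbors in adj.values()) // 2


def hops(adj, src, dst):
    """Breadth-first search distance between two nodes, or None."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


if __name__ == "__main__":
    # Illustrative sizes only: 36 visual tokens and 20 text tokens.
    adj = build_star_graph(36, 20)
    n_tokens = 36 + 20
    print("star edges:", count_edges(adj))
    print("full-attention pairs:", n_tokens * (n_tokens - 1) // 2)
    print("hops, first visual to last text token:",
          hops(adj, ("v", 0), ("t", 19)))
```

Under these assumptions the number of links is 3(n_v + n_t) - 2, which grows linearly with the token count, whereas the number of token pairs in full attention grows quadratically. Any two satellites share a relay as a common neighbor, which is why the hop distance between them never exceeds two.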