Spatial-Temporal Aligned Multi-Agent Learning for Visual Dialog Systems

Yong Zhuang, Tong Yu, Junda Wu, Shiqu Wu, Shuai Li
DOI: https://doi.org/10.1145/3503161.3548345
2022-01-01
Abstract: Existing interactive learning systems usually train models on simulators that serve as surrogates for real users. Because user data is limited, the trained simulators may represent real users poorly and lead to biased results. One solution is to model users as agents and then train the interactive system and the user agents simultaneously with a multi-agent reinforcement learning (MARL) framework. However, developing efficient MARL frameworks for modern interactive multimodal systems remains challenging. First, given multimodal data, it is unclear how to achieve accurate multimodal fusion within and between agents in each interaction. Second, interactions between users and systems are complex, which makes them difficult to track and synchronize over time. Both multimodal fusion between agents and synchronization over time become even more challenging when the amount of user data is limited. To jointly address these challenges and achieve more sample-efficient learning, we propose a novel spatial-temporal aligned (STA) multi-agent reinforcement learning framework that better aligns the multimodal data within and between agents over time. Based on this framework, we develop sample-efficient visual dialog systems. Through extensive experiments and analysis, we validate the effectiveness of the STA framework in visual dialog systems.