Learning What and When to Drop

Feiyu Chen, Zhengxiao Sun, Deqiang Ouyang, Xueliang Liu, Jie Shao
DOI: https://doi.org/10.1145/3474085.3475661
2021-01-01
Abstract: Multi-sensory data has shown a clear advantage in expressing richer and more complex feelings on the Emotion Recognition in Conversation (ERC) task. Yet current methods for multimodal dynamics, which either aggregate modalities or employ additional modality-specific and modality-shared networks, still fail to balance sufficient multimodal processing against scalability as new multi-sensory data types are added. This creates a bottleneck for performance improvement on ERC. To this end, we present MetaDrop, a differentiable, end-to-end approach for the ERC task that learns module-wise decisions across modalities and conversation flows simultaneously, supporting adaptive information-sharing patterns and dynamic fusion paths. Our framework mitigates the difficulty of modelling complex multimodal relations while scaling well with the number of modalities. Experiments on two popular multimodal ERC datasets show that MetaDrop achieves new state-of-the-art results.
What problem does this paper attempt to address?