Interpretable Multimodal Capsule Fusion

Jianfeng Wu, Sijie Mai, Haifeng Hu
DOI: https://doi.org/10.1109/taslp.2022.3178236
2022-01-01
IEEE/ACM Transactions on Audio Speech and Language Processing
Abstract: With the development of social networking platforms, multimodal sentiment analysis has become increasingly prominent. Existing models focus on capturing intramodal and intermodal interactions to produce effective modality representations. However, they overlook interpretability, which reveals how modalities interact with each other and which modality contributes most to the final prediction. In this paper, we propose an interpretable model called Interpretable Multimodal Capsule Fusion (IMCF), which integrates the routing mechanism of the Capsule Network (CapsNet) with Long Short-Term Memory (LSTM) to produce refined modality representations and provide interpretation. By arranging the features of different modalities into an input sequence, we obtain a highly expressive representation of intermodal dynamics, owing to the strong ability of LSTM to represent sequences. Because the routing mechanism is applied during the modality fusion and prediction stages, the values of the routing coefficients reveal the contributions of different modalities or dynamics, which provides interpretation. Meanwhile, the routing mechanism iteratively adjusts the information flow of each modality, which makes the modality fusion process more reasonable. Experimental results show that our model achieves competitive performance on two benchmark datasets, with effective modality fusion by LSTM and interpretation provided by the routing mechanism.
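To make the interpretability claim concrete, the sketch below shows how routing coefficients can be read as per-modality contributions. It is a minimal illustration of generic routing-by-agreement as used in capsule networks, not the IMCF implementation: the function names (`squash`, `softmax`, `route`), the toy prediction vectors in `u_hat`, the three modality labels, and the choice of two output capsules are all assumptions made for this example.

```python
import math


def squash(v):
    """Capsule squashing nonlinearity: keeps direction, maps the norm into [0, 1)."""
    norm_sq = sum(x * x for x in v)
    if norm_sq == 0.0:
        return [0.0 for _ in v]
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in v]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(u_hat, num_iters=3):
    """Generic routing-by-agreement.

    u_hat[i][j] is the prediction vector that input capsule i (here: a modality)
    makes for output capsule j. Returns (outputs, coefficients), where
    coefficients[i] sums to 1 over the output capsules.
    """
    if num_iters < 1:
        raise ValueError("num_iters must be at least 1")
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]  # routing logits, start uniform
    for _ in range(num_iters):
        c = [softmax(row) for row in b]  # routing coefficients
        v = []
        for j in range(n_out):
            # Weighted sum of the predictions for output j, then squash.
            s = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in)) for d in range(dim)]
            v.append(squash(s))
        for i in range(n_in):
            for j in range(n_out):
                # Agreement (dot product) raises the logit for the next iteration.
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v, c


# Hypothetical toy inputs: 3 modalities, 2 output capsules, 2-dimensional vectors.
modalities = ["language", "acoustic", "visual"]
u_hat = [
    [[1.0, 0.5], [0.1, 0.0]],   # language mostly supports output 0
    [[0.8, 0.6], [0.0, 0.1]],   # acoustic mostly supports output 0
    [[-0.1, 0.0], [0.9, 0.4]],  # visual mostly supports output 1
]

outputs, coeffs = route(u_hat, num_iters=3)
for name, row in zip(modalities, coeffs):
    print(name, [round(x, 3) for x in row])
```

Each row of `coeffs` shows how one modality distributes its information across the output capsules, so a larger coefficient indicates a larger contribution of that modality to that output. The iterative update of the routing logits is what lets the information flow of each modality be adjusted across iterations.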