MMAN-M2: Multiple Multi-head Attentions Network based on Encoder with Missing Modalities

Jiayao Li, Li Li, Ruizhi Sun, Gang Yuan, Shufan Wang, Shulin Sun
DOI: https://doi.org/10.1016/j.patrec.2023.11.029
IF: 4.757
2023-12-04
Pattern Recognition Letters
Abstract: Multi-modal fusion is an active topic in the field of multi-modal learning. Most previous multi-modal fusion work assumes that every modality is complete. Existing research on fusion with missing modalities fails to consider modalities that go missing at random and therefore lacks robustness. Moreover, most methods rely on the correlation between missing and non-missing modalities and ignore the contextual information of the missing modalities. To address these two issues, we design a multiple multi-head attentions network based on an encoder with missing modalities (MMAN-M2). First, a multi-head attention network represents each single modality by extracting latent features over the entire sequence, and the resulting representations are fused. Then, the context features of the missing modalities are extracted by optimizing the multi-modal fusion result, which includes both missing and non-missing feature data, and the missing modalities are encoded through the encoding module. Finally, the Transformer encoder-decoder module is used to train the network model by mapping the obtained global information to multiple spaces and integrating our uncertain multi-modal encoding, and it performs multi-modal fusion classification, which is used to evaluate model performance. Extensive experiments on public multi-modal datasets show that the proposed method achieves the best results and effectively improves the classification performance of multi-modal fusion.
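The first two stages described in the abstract (per-modality multi-head attention, then fusion that stays aware of which modalities are missing) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `multi_head_self_attention` and `fuse_with_missing`, the random projection weights (standing in for learned ones), the zero-filling of missing modalities, the presence flag (a hypothetical stand-in for the paper's missing-modality encoding), and fusion by concatenation are all assumptions made for illustration. The Transformer encoder-decoder stage and training are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, num_heads, rng):
    """Self-attention over a whole sequence x of shape (seq_len, d_model).

    Random matrices stand in for learned projection weights (assumption).
    """
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0
    d_head = d_model // num_heads
    w_q, w_k, w_v, w_o = (
        rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        for _ in range(4)
    )

    def split(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    out = softmax(scores, axis=-1) @ v                   # (heads, seq, d_head)
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ w_o

def fuse_with_missing(modalities, present, num_heads=2, seed=0):
    """Fuse per-modality features and mark which modalities are missing.

    modalities: dict name -> (seq_len, d_model) array
    present:    dict name -> bool (False means the modality is missing)
    A missing modality contributes zeros plus a 0.0 flag column; a present
    one contributes attention features plus a 1.0 flag column (hypothetical
    encoding, not taken from the paper).
    """
    rng = np.random.default_rng(seed)
    parts = []
    for name in sorted(modalities):
        x = modalities[name]
        if present[name]:
            feat = multi_head_self_attention(x, num_heads, rng)
        else:
            feat = np.zeros_like(x)
        flag = np.full((x.shape[0], 1), 1.0 if present[name] else 0.0)
        parts.append(np.concatenate([feat, flag], axis=1))
    return np.concatenate(parts, axis=1)

# Usage: two modalities of 5 time steps and 4 features each, with text missing.
data_rng = np.random.default_rng(1)
mods = {
    "audio": data_rng.standard_normal((5, 4)),
    "text": data_rng.standard_normal((5, 4)),
}
fused = fuse_with_missing(mods, {"audio": True, "text": False})
```

In this sketch the fused matrix keeps a fixed width regardless of which modalities are absent, so a downstream classifier could consume it under any random missing pattern; how MMAN-M2 actually encodes and optimizes the missing-modality context is described in the paper itself.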
computer science, artificial intelligence