Type-Specific Modality Alignment for Multi-Modal Information Extraction

Shaowei Chen, Shuaipeng Liu, Jie Liu
DOI: https://doi.org/10.1109/lsp.2024.3396705
2024-06-12
IEEE Signal Processing Letters
Abstract: Multi-modal information extraction aims to identify structured information, such as entities or relations between entities, from text with the help of visual clues. Although existing studies have made substantial progress, they mainly focus on modality interactions in a global space and neglect fine-grained modality alignment within the semantic subspace specific to each entity type or relation type. To address this problem, we propose a multi-space modality alignment method (MSMA) in this letter. The core of our model is a type-specific modality interaction module (TMI), which constructs a unique semantic subspace for each entity/relation type and independently performs type-specific modality alignment within each subspace. To enable mutual promotion between different types, a global modality integration module (GMI) is designed to learn the associations between different subspaces. Furthermore, we execute these two modules iteratively for high-level semantic fusion. Extensive experiments on three benchmark datasets show that our model significantly outperforms advanced methods.
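The data flow described in the abstract (per-type subspaces, alignment within each subspace, integration across subspaces, iteration) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the alignment or integration operators, so scaled dot-product attention, the residual connections, all tensor shapes, and every name below (`type_specific_interaction`, `global_integration`, `msma_sketch`, `W_text`, `W_image`, `W_out`) are assumptions chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def type_specific_interaction(text, image, W_text, W_image):
    """Illustrative stand-in for TMI (assumed form, not from the paper).

    text:  (n_tokens, d) token features
    image: (n_regions, d) visual features
    W_text, W_image: (n_types, d, d_sub) one projection per entity/relation type
    Returns (n_types, n_tokens, d_sub): text aligned with image per subspace.
    """
    t = np.einsum("nd,kds->kns", text, W_text)    # project text into each subspace
    v = np.einsum("md,kds->kms", image, W_image)  # project image into each subspace
    scores = np.einsum("kns,kms->knm", t, v) / np.sqrt(t.shape[-1])
    attn = softmax(scores, axis=-1)               # each token attends over regions
    return t + np.einsum("knm,kms->kns", attn, v)


def global_integration(sub, W_out):
    """Illustrative stand-in for GMI (assumed form, not from the paper).

    sub:   (n_types, n_tokens, d_sub) outputs of the per-type step
    W_out: (n_types * d_sub, d) projection back to the shared space
    Returns (n_tokens, d).
    """
    per_token = sub.transpose(1, 0, 2)            # (n_tokens, n_types, d_sub)
    scores = per_token @ per_token.transpose(0, 2, 1) / np.sqrt(sub.shape[-1])
    mixed = softmax(scores, axis=-1) @ per_token  # attention across type subspaces
    return mixed.reshape(mixed.shape[0], -1) @ W_out


def msma_sketch(text, image, layers):
    """Apply the two steps iteratively, one (W_text, W_image, W_out) per layer."""
    for W_text, W_image, W_out in layers:
        sub = type_specific_interaction(text, image, W_text, W_image)
        text = text + global_integration(sub, W_out)  # residual update (assumed)
    return text


# Toy usage with random features and weights (hypothetical sizes).
rng = np.random.default_rng(0)
n_tokens, n_regions, d, d_sub, n_types, n_layers = 5, 7, 8, 4, 3, 2
text = rng.normal(size=(n_tokens, d))
image = rng.normal(size=(n_regions, d))
layers = [
    (
        0.1 * rng.normal(size=(n_types, d, d_sub)),
        0.1 * rng.normal(size=(n_types, d, d_sub)),
        0.1 * rng.normal(size=(n_types * d_sub, d)),
    )
    for _ in range(n_layers)
]
fused = msma_sketch(text, image, layers)
```

In this sketch each type owns its own projection matrices, so alignment scores in one subspace never depend on another; only the second step mixes information across types, mirroring the division of labor the abstract attributes to TMI and GMI.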
engineering, electrical & electronic