Multimodal Relational Triple Extraction with Query-based Entity Object Transformer

Lei Hei, Ning An, Tingjing Liao, Qi Ma, Jiaqi Wang, Feiliang Ren
2024-08-16
Abstract: Multimodal Relation Extraction is crucial for constructing flexible and realistic knowledge graphs. Recent studies focus on extracting the relation type for entity pairs that appear in different modalities, such as one entity in the text and another in the image. However, existing approaches require the entities and objects to be given beforehand, which is costly and impractical. To address this limitation, we propose a novel task, Multimodal Entity-Object Relational Triple Extraction, which aims to extract all triples (entity span, relation, object region) from image-text pairs. To facilitate this study, we modified MORE, a multimodal relation extraction dataset with 21 relation types, to create a new dataset containing 20,264 triples, averaging 5.75 triples per image-text pair. Moreover, we propose QEOT, a query-based model with a selective attention mechanism, to dynamically explore the interaction and fusion of textual and visual information. In particular, the proposed method can simultaneously accomplish entity extraction, relation classification, and object detection with a set of queries. Our method is suitable for downstream applications and reduces the error accumulation caused by pipeline-style approaches. Extensive experimental results demonstrate that our proposed method outperforms the existing baselines by 8.06% and achieves state-of-the-art performance.
Information Retrieval
What problem does this paper attempt to address?
The paper attempts to address the following issues:

1. **Multimodal Entity-Object Relational Triple Extraction**: Existing methods for multimodal relation extraction require pre-specified entity pairs, which is both expensive and impractical in real-world applications. The paper therefore proposes a new task: extracting all triples of the form (entity span, relation type, object region) from image-text pairs. This task requires identifying entities in the text and objects in the image, and also predicting the relationships between them.
2. **End-to-End Approach**: Existing methods typically adopt a pipeline approach, performing entity extraction, relation classification, and object detection separately, which leads to error accumulation. The paper proposes an end-to-end approach that accomplishes these three tasks simultaneously, reducing error accumulation.
3. **New Dataset**: To facilitate research on this new task, the paper modifies the existing multimodal relation extraction dataset MORE to create a new dataset, MORTE, which contains 20,264 triples, with an average of 5.75 triples per image-text pair.

To address these issues, the paper proposes a new model, QEOT (Query-based Entity-Object Transformer), which employs a selective attention mechanism and a gated fusion mechanism to dynamically explore the interaction and fusion of textual and visual information. Experimental results show that this method outperforms existing baseline methods on multiple metrics; the abstract reports a gain of 8.06%.
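To make the triple format (entity span, relation type, object region) concrete, here is a minimal Python sketch of how one such triple might be represented and compared against a gold annotation. Everything in it is an illustrative assumption rather than the paper's implementation: the `Triple` class, the token-offset span convention, the `(x_min, y_min, x_max, y_max)` box format, the relation label `"holding"`, and the IoU threshold of 0.5 are all hypothetical, and the paper's actual matching criterion may differ.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class Triple:
    """Hypothetical container for one (entity span, relation, object region) triple."""
    entity_span: Tuple[int, int]  # assumed token offsets, [start, end)
    relation: str                 # one of the dataset's relation types
    object_box: Box               # region of the image containing the object


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred: Triple, gold: Triple, iou_threshold: float = 0.5) -> bool:
    """Assumed criterion: exact span, exact relation, and sufficient box overlap."""
    return (
        pred.entity_span == gold.entity_span
        and pred.relation == gold.relation
        and iou(pred.object_box, gold.object_box) >= iou_threshold
    )


# Hypothetical example data, not taken from the dataset.
gold = Triple((0, 2), "holding", (10.0, 10.0, 50.0, 50.0))
pred = Triple((0, 2), "holding", (12.0, 12.0, 50.0, 50.0))
print(is_match(pred, gold))
```

Because a prediction counts only when all three components are right at once, an error in any single stage of a pipeline system invalidates the whole triple, which is the error-accumulation problem the end-to-end design is meant to reduce.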