Multimodal Cross-Document Event Coreference Resolution Using Linear Semantic Transfer and Mixed-Modality Ensembles

Abhijnan Nath, Huma Jamil, Shafiuddin Rehan Ahmed, George Baker, Rahul Ghosh, James H. Martin, Nathaniel Blanchard, Nikhil Krishnaswamy
2024-04-13
Abstract: Event coreference resolution (ECR) is the task of determining whether distinct mentions of events within a multi-document corpus are actually linked to the same underlying occurrence. Images of the events can help facilitate resolution when language is ambiguous. Here, we propose a multimodal cross-document event coreference resolution method that integrates visual and textual cues with a simple linear map between vision and language models. As existing ECR benchmark datasets rarely provide images for all event mentions, we augment the popular ECB+ dataset with event-centric images scraped from the internet and generated using image diffusion models. We establish three methods that incorporate images and text for coreference: 1) a standard fused model with finetuning, 2) a novel linear mapping method without finetuning and 3) an ensembling approach based on splitting mention pairs by semantic and discourse-level difficulty. We evaluate on 2 datasets: the augmented ECB+, and AIDA Phase 1. Our ensemble systems using cross-modal linear mapping establish an upper limit (91.9 CoNLL F1) on ECB+ ECR performance given the preprocessing assumptions used, and establish a novel baseline on AIDA Phase 1. Our results demonstrate the utility of multimodal information in ECR for certain challenging coreference problems, and highlight a need for more multimodal resources in the coreference resolution space.
Computation and Language
What problem does this paper attempt to address?
The paper addresses cross-document event coreference resolution (ECR): deciding whether event mentions in different documents, often from different sources and worded differently, refer to the same underlying event. It proposes using multimodal information (text and images) to improve resolution when language alone is ambiguous. The key contributions are:

1. **A multimodal cross-document ECR method**: The method integrates visual and textual cues and connects vision and language models through a simple linear map.
2. **An augmented dataset**: Because existing ECR benchmark datasets rarely provide images for all event mentions, the authors augment ECB+ with event-centric images scraped from the internet and generated with image diffusion models.
3. **Three multimodal coreference methods**: A standard fused model with finetuning, a novel linear mapping method without finetuning, and an ensembling approach that splits mention pairs by semantic and discourse-level difficulty.
4. **Evaluation results**: The methods are evaluated on the augmented ECB+ and on AIDA Phase 1. The ensemble systems using cross-modal linear mapping establish an upper limit of 91.9 CoNLL F1 on ECB+, given the preprocessing assumptions used, and a novel baseline on AIDA Phase 1.

The results show that multimodal information helps with certain challenging event coreference problems, and they highlight the need for more multimodal resources in coreference resolution.
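To make the "simple linear map between vision and language models" more concrete, here is a minimal sketch of the general idea: fit a single matrix that carries vectors from one embedding space into another, then compare mapped vectors by cosine similarity. Everything in it is an assumption for illustration rather than the authors' procedure: the embeddings are synthetic, the dimensions are arbitrary, the map is fit by ordinary least squares, and the names `fit_linear_map` and `cosine` are hypothetical.

```python
import numpy as np


def fit_linear_map(src, tgt):
    """Return W minimizing ||src @ W - tgt||^2 (ordinary least squares)."""
    W, *_ = np.linalg.lstsq(src, tgt, rcond=None)
    return W


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Synthetic stand-ins for image and text embeddings of the same mentions.
# Real systems would take these from pretrained vision and language encoders.
rng = np.random.default_rng(0)
n, d_img, d_txt = 200, 16, 8
img = rng.normal(size=(n, d_img))
true_W = rng.normal(size=(d_img, d_txt))  # hidden relation, for the demo only
txt = img @ true_W + 0.01 * rng.normal(size=(n, d_txt))

# Fit the map on paired embeddings; no encoder weights are updated.
W = fit_linear_map(img, txt)

# Project an image embedding into the text space and score it there.
mapped = img @ W
score = cosine(mapped[0], txt[0])
```

The appeal of this kind of approach is that only the matrix `W` is learned, so neither encoder needs finetuning. How the paper trains its map and feeds the mapped vectors into a coreference scorer is not described in the text above.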