Abstract: External knowledge has been widely applied in image captioning tasks to enrich the generated sentences. However, existing methods retrieve knowledge by considering only semantic relevance while ignoring whether it is useful for captioning. For example, when querying “person” in external knowledge, the statistically most relevant concepts may be “wearing shirt” or “riding horse”, which are not consistent with the image contents and introduce noise into the generated sentences. Intuitively, humans can iteratively correlate visual clues with corresponding knowledge to distinguish useful clues from noise. Therefore, we propose an event-aware retrospective learning network for knowledge-based image captioning, which employs a retrospective validation mechanism on captioning models to align the retrieved knowledge with visual contents. This approach takes an event-aware perspective and helps select useful knowledge that corresponds to visual facts. To better align images and knowledge, 1) we design an event-aware retrieval algorithm that clusters word-centered knowledge into triplet-centered knowledge (i.e., from “<subject - predicate - object>” to “<triplet A> - edge - <triplet B>”), which provides an event context to facilitate knowledge retrieval and validation. 2) We revisit image contents to retrospectively validate retrieved knowledge by aligning the visual representation between knowledge and image. We summarize the visual characteristics of each knowledge event from the Visual Genome dataset to help learn which knowledge does not exist in the visual scene and should be discarded. 3) We adopt a dynamic knowledge fusion module that calibrates image and knowledge representations for sentence generation, which includes a knowledge-controlled gate unit that jointly calculates visual and semantic features in event-aware patterns.
Compared to current knowledge-based captioning methods, the proposed network retrospectively learns the visual facts by event-aware retrieval and knowledge-image visual alignment, which regularizes the knowledge-incorporated captioning with visual evidence. Extensive experiments on the MS-COCO dataset demonstrate the effectiveness of our method. Ablation studies and visualization demonstrate the advantages of each component of the proposed model.
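
To make the shift from word-centered to triplet-centered knowledge more concrete, the following is a minimal sketch and not the authors' algorithm. It assumes, purely for illustration, that two triplets are linked when they share a subject or object word; the abstract does not state the actual edge rule. The names `build_event_graph` and `event_context` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Iterable, List, Set, Tuple

Triplet = Tuple[str, str, str]          # (subject, predicate, object)
Edge = Tuple[Triplet, str, Triplet]     # (triplet A, shared word, triplet B)


def build_event_graph(triplets: Iterable[Triplet]) -> Set[Edge]:
    """Link triplets into "<triplet A> - edge - <triplet B>" pairs.

    Assumption (not from the paper): an edge exists when two triplets
    share a subject or object word; the shared word labels the edge.
    """
    by_entity = defaultdict(list)
    # dict.fromkeys removes duplicate triplets while keeping input order.
    for triplet in dict.fromkeys(triplets):
        subject, _, obj = triplet
        by_entity[subject].append(triplet)
        if obj != subject:
            by_entity[obj].append(triplet)

    edges: Set[Edge] = set()
    for entity, group in by_entity.items():
        for a, b in combinations(group, 2):
            edges.add((a, entity, b))
    return edges


def event_context(edges: Set[Edge], query: Triplet) -> List[Triplet]:
    """Return the triplets linked to `query`, sorted for stable output."""
    linked = set()
    for a, _, b in edges:
        if a == query:
            linked.add(b)
        elif b == query:
            linked.add(a)
    return sorted(linked)


if __name__ == "__main__":
    knowledge = [
        ("person", "riding", "horse"),
        ("horse", "in", "field"),
        ("person", "wearing", "shirt"),
    ]
    graph = build_event_graph(knowledge)
    print(event_context(graph, ("horse", "in", "field")))
```

Under this assumed rule, querying a triplet returns its neighbouring triplets as context, instead of every concept that merely co-occurs with a single word such as “person”. The retrospective visual validation described in the abstract would then decide which of those neighbours to keep; that step is not sketched here.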

Event Representation Learning Enhanced with External Commonsense Knowledge

Event causality extraction through external event knowledge learning and polyhedral word embedding

Improving Event Causality Identification via Self-Supervised Representation Learning on External Causal Statement

Modeling Event-Pair Relations in External Knowledge Graphs for Script Reasoning

CoolGust: knowledge representation learning with commonsense knowledge guidelines and constraints

Multi-level Connection Enhanced Representation Learning for Script Event Prediction

Embedding and Predicting the Event at Early Stage

EntailE: Introducing Textual Entailment in Commonsense Knowledge Graph Completion

Enhancing Video Event Recognition Using Automatically Constructed Semantic-Visual Knowledge Base

Event-Driven Learning of Systematic Behaviours in Stock Markets

Pingan Smart Health and SJTU at COIN - Shared Task: Utilizing Pre-trained Language Models and Common-sense Knowledge in Machine Reading Tasks

EventKGE: Event knowledge graph embedding with event causal transfer

A Multi-View Representation Learning Framework for Commonsense Knowledge Bases

Plausible-Parrots @ MSP2023: Enhancing Semantic Plausibility Modeling using Entity and Event Knowledge

Event2Mind: Commonsense Inference on Events, Intents, and Reactions

Knowledge-Based Topic Model for Multi-Modal Social Event Analysis

LearnDA: Learnable Knowledge-Guided Data Augmentation for Event Causality Identification

Visually Grounded Commonsense Knowledge Acquisition

EventLens: Leveraging Event-Aware Pretraining and Cross-modal Linking Enhances Visual Commonsense Reasoning

Event-aware Retrospective Learning for Knowledge-based Image Captioning

ONSEP: A Novel Online Neural-Symbolic Framework for Event Prediction Based on Large Language Model