Abstract: Large-scale commonsense knowledge bases empower a broad range of AI applications, where the automatic extraction of commonsense knowledge (CKE) is a fundamental and challenging problem. CKE from text is known to suffer from the inherent sparsity and reporting bias of commonsense in text. Visual perception, on the other hand, contains rich commonsense knowledge about real-world entities, e.g., (person, can_hold, bottle), which can serve as a promising source for acquiring grounded commonsense knowledge. In this work, we present CLEVER, which formulates CKE as a distantly supervised multi-instance learning problem, where models learn to summarize commonsense relations from a bag of images about an entity pair without any human annotation on image instances. To address the problem, CLEVER leverages vision-language pre-training models for deep understanding of each image in the bag, and selects informative instances from the bag to summarize commonsense entity relations via a novel contrastive attention mechanism. Comprehensive experimental results in held-out and human evaluation show that CLEVER can extract commonsense knowledge with promising quality, outperforming pre-trained language model-based methods by 3.9 AUC and 6.4 mAUC points. The predicted commonsense scores show strong correlation with human judgment, with a Spearman coefficient of 0.78. Moreover, the extracted commonsense can also be grounded into images with reasonable interpretability. The data and code can be obtained at https://github.com/thunlp/CLEVER.
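
The bag-level aggregation mentioned in the abstract can be illustrated with a minimal sketch of generic attention-based multi-instance pooling. This is not CLEVER's contrastive attention mechanism, whose details the abstract does not give; it only shows the general idea of weighting the images in a bag by their relevance to a relation and summarizing them into a single representation. The function names (`attention_pool`, `softmax`, `dot`), the toy two-dimensional feature vectors, and the relation query vector are all assumptions made for illustration. In practice the per-image features would presumably come from a vision-language pre-trained model.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention_pool(
    bag: Sequence[Sequence[float]], query: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Summarize a bag of image features into one vector.

    bag:   one feature vector per image of the entity pair (hypothetical).
    query: a vector representing the candidate relation (hypothetical).
    Returns the attention-weighted bag representation and the weights.
    """
    # Score each image by its similarity to the relation query.
    weights = softmax([dot(features, query) for features in bag])
    dim = len(bag[0])
    # Weighted sum of image features: informative images dominate.
    pooled = [
        sum(w * features[i] for w, features in zip(weights, bag))
        for i in range(dim)
    ]
    return pooled, weights


# Toy bag of three "images" with 2-d features (illustrative values only).
bag = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
query = [2.0, 0.0]
pooled, weights = attention_pool(bag, query)
print(weights)  # the first and third images receive most of the weight
```

Because only bag-level labels are available under distant supervision, a pooling step of this kind lets a model down-weight images that do not actually depict the relation, without needing per-image annotation.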

Know2Look: Commonsense Knowledge for Visual Search

OCTOPUS: aggressive search of multi-modality data using multifaceted knowledge base.

How a General-Purpose Commonsense Ontology can Improve Performance of Learning-Based Image Retrieval

Visually Grounded Commonsense Knowledge Acquisition

COFAR: Commonsense and Factual Reasoning in Image Search

Cross Domain Search by Exploiting Wikipedia.

Enhancing Image Retrieval: A Comprehensive Study on Photo Search using the CLIP Mode

What Looks Good with my Sofa: Multimodal Search Engine for Interior Design

DeepSeek: Content Based Image Search & Retrieval

Axiomatic Explanations for Visual Search, Retrieval, and Similarity Learning

Known-Item Search by MCG-ICT-CAS.

Knowledge Graph Based Visual Search Application

PhotoScout: Synthesis-Powered Multi-Modal Image Search

Commonsense Properties from Query Logs and Question Answering Forums

Visual query suggestion: Towards capturing user intent in internet image search

Use What You Have: Video Retrieval Using Representations from Collaborative Experts.

Learn and Search: An Elegant Technique for Object Lookup using Contrastive Learning

Capturing, Documenting and Visualizing Search Contexts for building Multimedia Corpora

Beyond visual semantics: Exploring the role of scene text in image understanding

Multi-Level Knowledge Injecting for Visual Commonsense Reasoning

Cross-modal Retrieval for Knowledge-based Visual Question Answering