Abstract: Word embeddings, i.e., semantically meaningful vector representations of words, are largely influenced by the distributional hypothesis "You shall know a word by the company it keeps" (Harris, 1954), whereas modern prediction-based neural network embeddings rely on design choices and hyperparameter optimization. Word embeddings such as Word2Vec and GloVe capture contextuality and real-world analogies well, but contemporary convolution-based image embeddings such as VGGNet and AlexNet do not capture contextual knowledge. The popular king-queen analogy does not hold for most commonly used vision embeddings. In this paper, we introduce a pre-trained joint embedding (JE), named IMAGINATOR, trained at the level of 21K distinct image objects from 1M image+text pairs. A JE is a way to encode multimodal data into a vector space where the text modality serves as the grounding key to which the complementary modality (in this case, the image) is anchored. IMAGINATOR encapsulates three individual representations: (i) object-object co-location, (ii) word-object co-location, and (iii) word-object correlation. These three representations capture complementary aspects of the two modalities and are combined to obtain the final JEs. The generated JEs are intrinsically evaluated to assess how well they capture contextuality and real-world analogies. We also evaluate pre-trained IMAGINATOR JEs on three downstream tasks: (i) image captioning, (ii) Image2Tweet, and (iii) text-based image retrieval. IMAGINATOR establishes a new standard on these downstream tasks by outperforming the current SoTA on all of them. IMAGINATOR will be made publicly available. The code is available at https://github.com/varunakk/IMAGINATOR
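The analogy property mentioned in the abstract (king is to queen as man is to woman) is commonly checked with vector arithmetic and cosine similarity. The following is a minimal sketch of such a check, not the evaluation code of the paper: the function names `cosine` and `solve_analogy` and the two-dimensional toy vectors are hypothetical and chosen only for illustration. Real embeddings would be loaded from a trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(emb, a, b, c):
    """Answer "a is to b as c is to ?" by vector offset.

    Builds the target vector emb[b] - emb[a] + emb[c] and returns the
    vocabulary entry (excluding a, b, c) with the highest cosine
    similarity to that target.
    """
    target = [y - x + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = (w for w in emb if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, emb[w]))


# Hypothetical toy vectors (assumption): axis 0 loosely encodes "royalty",
# axis 1 loosely encodes "gender". They are not taken from any real model.
toy = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [0.0, -1.0],
}

print(solve_analogy(toy, "man", "king", "woman"))  # prints "queen"
```

An embedding space in which this offset test succeeds for many word or object pairs is said to capture real-world analogies. The abstract's claim is that most commonly used vision embeddings fail this kind of test, while joint embeddings grounded in text are intended to pass it.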

Unveiling the Dreams of Word Embeddings: Towards Language-Driven Image Generation

Learn, Imagine and Create: Text-to-Image Generation from Prior Knowledge

Visual Word2Vec (vis-w2v): Learning Visually Grounded Word Embeddings Using Abstract Scenes

IMAGINATOR: Pre-Trained Image+Text Joint Embeddings using Word-Level Grounding of Images

BrainDreamer: Reasoning-Coherent and Controllable Image Generation from EEG Brain Signals via Language Guidance

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

3D-Aware Image Synthesis Via Learning Structural and Textural Representations

Generative View Synthesis: From Single-view Semantics to Novel-view Images

Exploration on Grounded Word Embedding: Matching Words and Images with Image-Enhanced Skip-Gram Model

DreamArtist: Towards Controllable One-Shot Text-to-Image Generation via Positive-Negative Prompt-Tuning

AI Illustrator: Translating Raw Descriptions into Images by Prompt-based Cross-Modal Generation

Learning to Imagine: Visually-Augmented Natural Language Generation

Deep Learning for Image-to-Text Generation: A Technical Overview

Seeing the advantage: visually grounding word embeddings to better capture human semantic knowledge

Towards Semantic Embedding In Visual Vocabulary

Unveiling Spaces: Architecturally meaningful semantic descriptions from images of interior spaces

HybridBooth: Hybrid Prompt Inversion for Efficient Subject-Driven Generation

Picture It In Your Mind: Generating High Level Visual Representations From Textual Descriptions

Learning semantic sentence representations from visually grounded language without lexical knowledge

DreamCatcher: Revealing the Language of the Brain with fMRI using GPT Embedding

Learning Multi-Modal Word Representation Grounded in Visual Context