Abstract: The Bag-of-visual-Words (BoW) image representation has been shown to be one of the most promising solutions for large-scale near-duplicated image retrieval. However, the traditional visual vocabulary is created in an unsupervised way by clustering a large number of local image features. This is not ideal because it largely ignores the semantic and spatial contexts between local features. In this paper, we first propose the geometric visual vocabulary, which captures the spatial context by quantizing local features in bi-space, i.e., in both descriptor space and orientation space. We then propose to capture the semantic context by learning a semantic-aware distance metric between local features, which reasonably measures the semantic similarity between the image patches from which the local features are extracted. The learned distance is then used to cluster the local features to generate the semantic visual vocabulary. Finally, we combine the spatial and semantic contexts in a unified framework by extracting local feature groups, computing the spatial configurations between the local features inside each group, and learning a semantic-aware distance between groups. The learned group distance is then used to cluster the extracted local feature groups to generate a novel visual vocabulary, i.e., the contextual visual vocabulary. The proposed visual vocabularies, i.e., the geometric visual vocabulary, the semantic visual vocabulary, and the contextual visual vocabulary, are tested in large-scale near-duplicated image retrieval applications. The geometric visual vocabulary and the semantic visual vocabulary both achieve better performance than the traditional visual vocabulary. Moreover, the contextual visual vocabulary, which combines both spatial and semantic clues, outperforms the state-of-the-art bundled feature in both retrieval precision and efficiency.
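To make the bi-space idea concrete, the following is a minimal sketch of quantizing a local feature in descriptor space and orientation space at once. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the quantizer, the number of orientation bins, or how the two indices are combined. Here we assume nearest-centroid assignment in descriptor space, equal-width orientation bins over [0, 2π), and a joint word ID formed as `descriptor_index * num_bins + orientation_index`. The function names `nearest_centroid`, `orientation_bin`, and `geometric_word` are hypothetical.

```python
import math


def nearest_centroid(descriptor, centroids):
    """Return the index of the centroid closest to the descriptor
    (squared Euclidean distance; an assumed choice of quantizer)."""
    best_index, best_dist = 0, float("inf")
    for i, centroid in enumerate(centroids):
        dist = sum((a - b) ** 2 for a, b in zip(descriptor, centroid))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index


def orientation_bin(theta, num_bins):
    """Map an angle in radians to one of num_bins equal-width bins
    covering [0, 2*pi). Equal-width binning is an assumption."""
    two_pi = 2.0 * math.pi
    theta = theta % two_pi
    # Clamp guards against floating-point rounding at the upper edge.
    return min(int(theta / two_pi * num_bins), num_bins - 1)


def geometric_word(descriptor, theta, centroids, num_bins):
    """Combine the descriptor-space index and the orientation-space
    index into a single word ID (hypothetical combination rule)."""
    d_index = nearest_centroid(descriptor, centroids)
    o_index = orientation_bin(theta, num_bins)
    return d_index * num_bins + o_index


# Toy example: two 2-D "descriptor" centroids and four orientation bins.
centroids = [[0.0, 0.0], [1.0, 1.0]]
num_bins = 4

word_a = geometric_word([0.9, 1.1], math.pi / 2 + 0.1, centroids, num_bins)
word_b = geometric_word([0.9, 1.1], math.pi + 0.1, centroids, num_bins)
word_c = geometric_word([0.1, -0.1], 0.1, centroids, num_bins)
```

In this toy setup, `word_a` and `word_b` share the same descriptor but receive different word IDs because their orientations fall into different bins. That is the sense in which a bi-space vocabulary can separate features that a descriptor-only vocabulary would merge.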
