Abstract: The bag-of-visual-words (BoWs) representation has been applied to various problems in the fields of multimedia and computer vision. The basic idea is to represent images as visual documents composed of repeatable and distinctive visual elements, which are comparable to text words. Notwithstanding its great success and wide adoption, a visual vocabulary created from single-image local descriptors is often shown to be less effective than desired. In this paper, descriptive visual words (DVWs) and descriptive visual phrases (DVPs) are proposed as the visual counterparts of text words and phrases, where visual phrases refer to frequently co-occurring visual word pairs. Since images are the carriers of visual objects and scenes, a descriptive visual element set can be composed of the visual words and their combinations that are effective in representing certain visual objects or scenes. Based on this idea, a general framework is proposed for generating DVWs and DVPs for image applications. In a large-scale image database containing 1506 object and scene categories, the visual words and visual word pairs descriptive of certain objects or scenes are identified and collected as the DVWs and DVPs. Experiments show that the DVWs and DVPs are informative and descriptive and, thus, are more comparable with text words than the classic visual words. We apply the identified DVWs and DVPs in several applications, including large-scale near-duplicated image retrieval, image search re-ranking, and object recognition. The combination of DVWs and DVPs performs better than the state of the art in large-scale near-duplicated image retrieval in terms of accuracy, efficiency, and memory consumption. The proposed image search re-ranking algorithm, DWPRank, outperforms the state-of-the-art algorithm by 12.4% in mean average precision and is about 11 times faster.

Representing Word Image Using Visual Word Embeddings and RNN for Keyword Spotting on Historical Document Images

Word Image Representation Based on Visual Embeddings and Spatial Constraints for Keyword Spotting on Historical Documents

Integrating Visual Word Embeddings into Translation Language Model for Keyword Spotting on Historical Mongolian Document Images

Deep Features Representation of Word Image for Keyword Spotting in Historical Mongolian Document Images

LDA-Based Word Image Representation for Keyword Spotting on Historical Mongolian Documents

A Hybrid Representation of Word Images for Keyword Spotting.

A Case Study of BoVW for Keyword Spotting on Historical Mongolian Document Images

Word and Document Embeddings based on Neural Network Approaches

Training and Evaluating Multimodal Word Embeddings with Large-scale Web Annotated Images

Embedded Representation of Relation Words with Visual Supervision

Visual Exploration and Comparison of Word Embeddings.

Predicting Visual Features from Text for Image and Video Caption Retrieval

Spatial Encoding of Visual Words for Image Classification.

Word Image Representation Based on Sequence to Sequence Model with Attention Mechanism for Out-of-Vocabulary Keyword Spotting.

Generating descriptive visual words and visual phrases for large-scale image applications

Towards Semantic Embedding in Visual Vocabulary

Expressing Objects just like Words: Recurrent Visual Embedding for Image-Text Matching

Word2VisualVec: Cross-Media Retrieval by Visual Feature Prediction.

SpottingNet: Learning the Similarity of Word Images with Convolutional Neural Network for Word Spotting in Handwritten Historical Documents

Accurate Word Representations with Universal Visual Guidance

Word2VisualVec: Image and Video to Sentence Matching by Visual Feature Prediction