Abstract: As one of the classic tasks in information retrieval, image retrieval aims to identify images that share similar features with a query image, so that users can conveniently find the information they need in a large image collection. Street view image retrieval, in particular, has wide applications, such as improving navigation and mapping services, formulating urban development plans, and analyzing the historical evolution of buildings. However, the intricate foreground and background details in street view images, coupled with a lack of attribute annotations, make it one of the most challenging problems in practice. Current image retrieval research mainly relies on either visual models that depend entirely on image visual features or multimodal learning models that require additional data sources (e.g., annotated text). Yet creating annotated datasets is expensive, and street view images, which themselves contain a large amount of scene text, are often unannotated. Therefore, this paper proposes a deep unsupervised learning algorithm that combines visual and text features extracted from the image data alone to improve the accuracy of street view image retrieval. Specifically, we employ text detection algorithms to identify scene text, use the Pyramidal Histogram of Characters (PHOC) encoding predictor model to extract text information from images, deploy deep convolutional neural networks for visual feature extraction, and incorporate a contrastive learning module for image retrieval. Tests on three street view image datasets show that our model holds certain advantages over state-of-the-art multimodal models pre-trained on extensive datasets, while using fewer parameters and fewer floating-point operations. Code and data are available at https://github.com/nwuSY/svtRetrieval.
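To make the text branch more concrete, the following is a minimal sketch of a Pyramidal Histogram of Characters (PHOC) style encoding for a word whose characters are already known. It is an illustration under stated assumptions, not the authors' implementation: the function name `phoc`, the 36-symbol lowercase alphanumeric alphabet, and the pyramid levels (2, 3, 4, 5) are choices made here, and the bigram levels used in some PHOC variants are omitted. In the paper the PHOC vector is predicted from the image by a model; this sketch only shows what such a target vector could look like.

```python
import string

# Assumed alphabet: lowercase letters plus digits (36 symbols).
ALPHABET = string.ascii_lowercase + string.digits


def phoc(word, levels=(2, 3, 4, 5), alphabet=ALPHABET):
    """Return a binary PHOC-style vector (unigrams only) for `word`."""
    # Keep only characters that belong to the alphabet.
    word = "".join(ch for ch in word.lower() if ch in alphabet)
    n = len(word)
    vector = []
    for level in levels:
        for region in range(level):
            bits = [0] * len(alphabet)
            # Work in integer units of 1 / (n * level) to avoid float rounding.
            r_start, r_end = region * n, (region + 1) * n
            for k, ch in enumerate(word):
                c_start, c_end = k * level, (k + 1) * level
                overlap = min(c_end, r_end) - max(c_start, r_start)
                # A character is assigned to a region when at least half
                # of its span (width `level` in these units) lies inside it.
                if 2 * overlap >= level:
                    bits[alphabet.index(ch)] = 1
            vector.extend(bits)
    return vector
```

With the assumed settings the vector has (2 + 3 + 4 + 5) × 36 = 504 entries. Because each pyramid level records which characters occur in which part of the word, words with similar spellings receive similar vectors, which is what makes this kind of encoding usable as a text feature alongside visual features.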

Visual Matching is Enough for Scene Text Retrieval

Visual and semantic guided scene text retrieval

SEMScene: Semantic-Consistency Enhanced Multi-Level Scene Graph Matching for Image-Text Retrieval

Scene-text aware cross-modal retrieval based on semantic matching (ChinaMM2024)

Image Retrieval for Visual Localization via Scene Text Detection and Logo Filtering

Partial Scene Text Retrieval

HSGMP: Heterogeneous Scene Graph Message Passing for Cross-modal Retrieval

Visual Semantic Reasoning for Image-Text Matching

Efficient Token-Guided Image-Text Retrieval With Consistent Multimodal Contrastive Training

Visual-Semantic Matching by Exploring High-Order Attention and Distraction

Unambiguous Scene Text Segmentation with Referring Expression Comprehension

Text-based Person Search in Full Images via Semantic-Driven Proposal Generation

Multimodal learning with only image data: A deep unsupervised model for street view image retrieval by fusing visual and scene text features of images

ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval

Boosting Semi-Supervised Scene Text Recognition via Viewing and Summarizing

StacMR: Scene-Text Aware Cross-Modal Retrieval

Multi-oriented Scene Text Detection via Corner Localization and Region Segmentation

Visual context learning based on textual knowledge for image-text retrieval

Beyond visual semantics: Exploring the role of scene text in image understanding

Text-Vision Relationship Alignment for Referring Image Segmentation

Image-text Retrieval via Preserving Main Semantics of Vision