Abstract: Advances in deep learning have enabled accurate language-based search and retrieval (e.g., over user photos) in the cloud. However, many users prefer to store their photos at home due to privacy concerns, which creates a need for models that can perform cross-modal search on resource-limited devices. State-of-the-art (SOTA) cross-modal retrieval models achieve high accuracy by learning entangled representations that enable fine-grained similarity calculation between a language query and an image, but at the expense of prohibitively high retrieval latency. Alternatively, a newer class of methods exhibits good performance with low latency, but requires far more computational resources and an order of magnitude more training data (i.e., large web-scraped datasets consisting of millions of image-caption pairs), making these methods infeasible in a commercial context. From a pragmatic perspective, none of the existing methods is suitable for developing commercial applications for low-latency cross-modal retrieval on low-resource devices. We propose CrispSearch, a cascaded approach that greatly reduces retrieval latency with minimal loss in ranking accuracy for on-device language-based image retrieval. The idea behind our approach is to combine a lightweight, runtime-efficient coarse model with a fine re-ranking stage. Given a language query, the coarse model filters out many of the irrelevant image candidates, so that only a handful of strong candidates are passed to the fine model for re-ranking. Extensive experiments on standard benchmark datasets, using two SOTA models for the fine re-ranking stage, show that CrispSearch achieves a speedup of up to 38 times over the SOTA fine methods with negligible performance degradation. Moreover, our method does not require millions of training instances, making it a pragmatic solution for on-device search and retrieval.
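
The coarse-then-fine cascade described in the abstract can be illustrated with a minimal sketch. This is not the CrispSearch implementation: the dot-product coarse scorer, the lookup-table fine scorer, and all names (`cascaded_search`, `toy_fine_score`, `shortlist_size`, the toy vectors and scores) are assumptions made purely for illustration. It only shows the control flow: a cheap score ranks every image, and an expensive score is evaluated on a short list alone.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Cheap similarity used by the (hypothetical) coarse stage."""
    return sum(x * y for x, y in zip(a, b))


def cascaded_search(
    query_vec: Vector,
    image_vecs: Sequence[Vector],
    fine_score: Callable[[int], float],
    shortlist_size: int,
) -> List[int]:
    """Return image ids from the shortlist, ordered by the fine score."""
    # Coarse stage: rank every image with the cheap score.
    coarse_ranking = sorted(
        range(len(image_vecs)),
        key=lambda i: dot(query_vec, image_vecs[i]),
        reverse=True,
    )
    # Keep only a handful of strong candidates.
    shortlist = coarse_ranking[:shortlist_size]
    # Fine stage: the expensive scorer runs on the shortlist only.
    return sorted(shortlist, key=fine_score, reverse=True)


# Toy data (assumed, not from the paper): five images with 2-d embeddings.
IMAGE_VECS = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.3], [-1.0, 0.0]]
# Stand-in for an expensive query-image model: a fixed score per image id.
FINE_SCORES = {0: 0.2, 1: 0.9, 2: 0.99, 3: 0.5, 4: 0.1}
fine_calls: List[int] = []


def toy_fine_score(image_id: int) -> float:
    fine_calls.append(image_id)  # record how often the fine model is invoked
    return FINE_SCORES[image_id]


result = cascaded_search([1.0, 0.0], IMAGE_VECS, toy_fine_score, shortlist_size=3)
print(result)           # [1, 3, 0]
print(len(fine_calls))  # 3 fine-model calls instead of 5
```

In this toy setup, image 2 has the highest fine score but never reaches the fine stage because the coarse stage filters it out. This is the trade-off the cascade accepts: the shortlist size controls how much latency is saved against how much ranking accuracy may be lost.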

Efficient cross-modal retrieval using social tag information towards mobile applications

Learning Salient Visual Word for Scalable Mobile Image Retrieval.

Multi-modal Tag Localization for Mobile Video Search.

Efficient Token-Guided Image-Text Retrieval With Consistent Multimodal Contrastive Training

An Efficient Cross-Modal Privacy-Preserving Image–Text Retrieval Scheme

Cross-Modal Pre-Aligned Method with Global and Local Information for Remote-Sensing Image and Text Retrieval

Cross-Modal Hashing Retrieval with Compatible Triplet Representation

Scalable Mobile Image Retrieval by Exploring Contextual Saliency

Improving Cross-Modal Image-Text Retrieval With Teacher-Student Learning

An Efficient Approach for Geo-Multimedia Cross-Modal Retrieval

Realizing Efficient On-Device Language-based Image Retrieval

Cross-Modal Image-Tag Relevance Learning for Social Images

Mobile Visual Search Compression with Grassmann Manifold Embedding

A Unified Optimal Transport Framework for Cross-Modal Retrieval with Noisy Labels

MCAD: Multi-teacher Cross-modal Alignment Distillation for efficient image-text retrieval

Cross-modal Contrastive Learning for Generalizable and Efficient Image-text Retrieval

Toward Intelligent Visual Sensing and Low-cost Analysis: A Collaborative Computing Approach

Integrating Multisubspace Joint Learning With Multilevel Guidance for Cross-Modal Retrieval of Remote Sensing Images

Context Aware Information Delivery for Mobile Devices.

Triplet-Based Deep Hashing Network for Cross-Modal Retrieval