Abstract: With the recent boom of video-based social platforms (e.g., YouTube and TikTok), video retrieval using sentence queries has become an important demand and attracts increasing research attention. Despite their decent performance, existing text-video retrieval models in the vision and language communities are impractical for large-scale Web search because they adopt brute-force search based on high-dimensional embeddings. To improve efficiency, Web search engines widely apply vector compression libraries (e.g., FAISS) to post-process the learned embeddings. Unfortunately, separating compression from feature encoding degrades the robustness of representations and incurs performance decay. To pursue a better balance between performance and efficiency, we propose the first quantized representation learning method for cross-view video retrieval, namely Hybrid Contrastive Quantization (HCQ). Specifically, HCQ learns both coarse-grained and fine-grained quantizations with transformers, which provide complementary understandings for texts and videos and preserve comprehensive semantic information. By performing Asymmetric-Quantized Contrastive Learning (AQ-CL) across views, HCQ aligns texts and videos at coarse-grained and multiple fine-grained levels. This hybrid-grained learning strategy serves as strong supervision on the cross-view video quantization model, where contrastive learning at different levels can be mutually promoted. Extensive experiments on three Web video benchmark datasets demonstrate that HCQ achieves competitive performance with state-of-the-art non-compressed retrieval methods while showing high efficiency in storage and computation. Code and configurations are available at https://github.com/gimpong/WWW22-HCQ.
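As an illustration of the compressed retrieval that the abstract contrasts with brute-force search over high-dimensional embeddings, the following is a minimal pure-Python sketch of generic product quantization with asymmetric distance computation (a raw query scored against quantized database codes). It is not the HCQ model, which learns its quantization jointly with transformer encoders, and it is not the FAISS API. All function names are hypothetical, the data is random, and the codebooks are simply sampled from the data rather than trained (an assumption made to keep the example short).

```python
import random

def split(vec, m):
    """Split a vector into m equal-length sub-vectors."""
    step = len(vec) // m
    return [vec[i * step:(i + 1) * step] for i in range(m)]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def sample_codebooks(vectors, m, k, seed=0):
    """Toy codebooks: pick k sub-vectors per subspace at random (no k-means)."""
    rng = random.Random(seed)
    subs = [split(v, m) for v in vectors]
    return [rng.sample([s[j] for s in subs], k) for j in range(m)]

def encode(vec, codebooks):
    """Replace each sub-vector by the index of its nearest codeword."""
    return [min(range(len(cb)), key=lambda c: sq_dist(sub, cb[c]))
            for sub, cb in zip(split(vec, len(codebooks)), codebooks)]

def decode(code, codebooks):
    """Reconstruct an approximate vector by concatenating codewords."""
    return [x for j, c in enumerate(code) for x in codebooks[j][c]]

def adc_table(query, codebooks):
    """Distances from each raw query sub-vector to every codeword."""
    return [[sq_dist(sub, cw) for cw in cb]
            for sub, cb in zip(split(query, len(codebooks)), codebooks)]

def adc_distance(table, code):
    """Approximate query-to-item distance: one table lookup per subspace."""
    return sum(table[j][c] for j, c in enumerate(code))

# Hypothetical data: 50 random 8-dimensional "embeddings".
rng = random.Random(1)
database = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(50)]

# 4 subspaces with 16 codewords each: every item is stored as 4 small integers.
codebooks = sample_codebooks(database, m=4, k=16)
codes = [encode(v, codebooks) for v in database]

# The query stays uncompressed; only the database side is quantized.
query = database[7]
table = adc_table(query, codebooks)
ranking = sorted(range(len(codes)), key=lambda i: adc_distance(table, codes[i]))
```

In this sketch each database item is stored as four codeword indices instead of eight floats, and scoring an item costs four table lookups rather than a full distance computation. Because these codebooks are fixed independently of any encoder, the sketch corresponds to the post-processing style of compression, not to the jointly learned quantization that the abstract proposes.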

Joint Optimization of Multi-vector Representation with Product Quantization

Jointly Optimizing Query Encoder and Product Quantization to Improve Retrieval Performance

ESPN: Memory-Efficient Multi-Vector Information Retrieval

Pse: Mixed Quantization Framework of Neural Networks for Efficient Deployment

Deep Product Quantization Module for Efficient Image Retrieval

Hessian-based Mixed-Precision Quantization with Transition Aware Training for Neural Networks

Differentiable Optimized Product Quantization and Beyond

Matching-oriented Embedding Quantization for Ad-hoc Retrieval

Beyond Product Quantization: Deep Progressive Quantization for Image Retrieval

Mean-Removed Product Quantization for Large-scale Image Retrieval

Matching-oriented Product Quantization for Ad-hoc Retrieval

LibVQ: A Toolkit for Optimizing Vector Quantization and Efficient Neural Retrieval

Scalable Image Retrieval by Sparse Product Quantization

Efficient Multi-Vector Dense Retrieval Using Bit Vectors

Progressive Similarity Preservation Learning for Deep Scalable Product Quantization

Collective Deep Quantization for Efficient Cross-Modal Retrieval

Hybrid Contrastive Quantization for Efficient Cross-View Video Retrieval

Entropy-Optimized Deep Weighted Product Quantization for Image Retrieval

Multi-stage vector quantization towards low bit rate visual search

Understanding the Multi-vector Dense Retrieval Models

Orthonormal Product Quantization Network for Scalable Face Image Retrieval