Abstract: The MPEG Compact Descriptors for Visual Search (CDVS) standard targets image matching and retrieval. Recent research has shown that extremely high-dimensional descriptors such as the Fisher vector (FV) and the vector of locally aggregated descriptors (VLAD) can achieve high retrieval accuracy over large-scale image and video datasets. Because the FV (or VLAD) offers high discriminability with only a small visual vocabulary, CDVS adopted it to construct a global compact descriptor. In this paper, we study the development of global compact descriptors in the completed CDVS standard and the emerging Compact Descriptors for Video Analysis (CDVA) standard, and we formulate FV (or VLAD) compression as a resource-constrained optimization problem. Accordingly, we propose a codebook-free aggregation method via dual selection to generate a global compact visual descriptor. The method supports fast and accurate feature matching without large visual codebooks, meeting the low memory requirement of mobile visual search at significantly reduced latency. Specifically, we exploit both sample-specific Gaussian component redundancy and bit dependency within a binary aggregated descriptor to produce compact binary codes. Our technique contributes to the scalable compressed Fisher vector (SCFV) adopted by the CDVS standard. Moreover, the SCFV descriptor currently serves as the frame-level hand-crafted video feature, which motivates the inheritance of CDVS descriptors by the emerging CDVA standard. Furthermore, we investigate the complementary effect of combining our standard-compliant compact descriptor with deep learning features extracted from convolutional neural networks, which yields significant gains in mean average precision. Extensive evaluation on benchmark databases shows the significant merits of the codebook-free binary codes for scalable visual search.
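To make the idea of a binary aggregated descriptor with sample-specific Gaussian component selection more concrete, the following is a minimal sketch and not the SCFV procedure defined by the standard. It assumes a diagonal-covariance Gaussian mixture, sign binarization of the Fisher gradient with respect to the means, component ranking by the standard deviation of each gradient sub-vector, and a normalized Hamming-based similarity over the components both descriptors kept. All function names, the ranking criterion, and the toy parameters are illustrative assumptions, and the bit-selection stage mentioned in the abstract is not modeled.

```python
import math
import random


def soft_assign(x, means, sigmas, weights):
    """Posterior probability of each diagonal Gaussian for one local descriptor."""
    logs = []
    for mu, sd, w in zip(means, sigmas, weights):
        ll = math.log(w)
        for xi, m, s in zip(x, mu, sd):
            ll += -0.5 * ((xi - m) / s) ** 2 - math.log(s)
        logs.append(ll)
    top = max(logs)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in logs]
    total = sum(exps)
    return [e / total for e in exps]


def spread(values):
    """Population standard deviation of a gradient sub-vector."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def binary_fisher_code(descs, means, sigmas, weights, keep):
    """Map kept component index -> tuple of sign bits (hypothetical helper)."""
    n_comp, dim = len(means), len(means[0])
    grads = [[0.0] * dim for _ in range(n_comp)]
    for x in descs:
        post = soft_assign(x, means, sigmas, weights)
        for k in range(n_comp):
            for d in range(dim):
                grads[k][d] += post[k] * (x[d] - means[k][d]) / sigmas[k][d]
    # Fisher normalization of the mean gradient: 1 / (N * sqrt(w_k)).
    for k in range(n_comp):
        scale = 1.0 / (len(descs) * math.sqrt(weights[k]))
        grads[k] = [g * scale for g in grads[k]]
    # Assumed selection rule: keep the components with the largest spread.
    ranked = sorted(range(n_comp), key=lambda k: spread(grads[k]), reverse=True)
    return {k: tuple(1 if g > 0 else 0 for g in grads[k]) for k in ranked[:keep]}


def hamming_similarity(code_a, code_b):
    """Similarity in [-1, 1] computed over components present in both codes."""
    shared = set(code_a) & set(code_b)
    if not shared:
        return 0.0
    dim = len(next(iter(code_a.values())))
    score = 0
    for k in shared:
        dist = sum(a != b for a, b in zip(code_a[k], code_b[k]))
        score += dim - 2 * dist  # agreeing bits minus disagreeing bits
    return score / (dim * math.sqrt(len(code_a) * len(code_b)))


# Toy data: 3 Gaussians in 4 dimensions, 50 random local descriptors.
random.seed(0)
means = [[0.0] * 4, [3.0] * 4, [-3.0] * 4]
sigmas = [[1.0] * 4 for _ in range(3)]
weights = [0.5, 0.25, 0.25]
image = [[random.gauss(0, 2) for _ in range(4)] for _ in range(50)]
code = binary_fisher_code(image, means, sigmas, weights, keep=2)
print(hamming_similarity(code, code))  # identical codes score 1.0
```

Because only sign bits and a small Gaussian mixture are stored, matching reduces to bit comparisons on the shared components, which is the property that removes the need for a large visual codebook at query time.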