Abstract:

Objective Content-based image retrieval uses features extracted from an image to retrieve similar images from a large-scale dataset accurately and with low memory and time consumption. The scale-invariant feature transform (SIFT) is robust to translation, scaling, rotation, viewpoint change, and occlusion, and it can be extracted quickly. Thus, SIFT is widely used in both research and practice. However, SIFT has some shortcomings, such as a lack of spatial geometric information and color information. Convolutional neural networks (CNNs) have good domain transferability, and deep features from a pre-trained CNN can be applied to various domains. CNN deep features have recently attracted considerable attention and exhibit superior performance over SIFT. However, in contrast to SIFT, CNN features lack shallow (low-level) information. SIFT is therefore usually fused with CNN features and other shallow features.

Method This report reviews the recent advances and challenges in image retrieval, both internationally and in China, covering shallow features, deep features, and feature fusion. Future development trends are also explored. For shallow features, we mainly review SIFT and its variants, the encoding methods, and the development of these methods. For deep features, we divide the descriptors into categories according to the type of CNN layer used: fully connected layer, convolutional layer, and softmax layer. Many features can be extracted from a convolutional layer, and many pooling methods have been proposed.

Result The encoding methods of SIFT mainly include bag of features (BOF), vector of locally aggregated descriptors (VLAD), Fisher vector (FV), and triangulation embedding (TE), and they mostly consist of two steps: embedding and aggregation (or pooling). For CNN features, features from the fully connected layer are typically used because of their good transferability and accuracy. However, deep features from the convolutional layer have recently become an increasingly attractive option because they can be effectively combined with a variety of pooling methods, such as sum-pooling, max-pooling, VLAD-pooling, and FV-pooling, and they perform well in image classification and retrieval. The fusion methods can mainly be divided into five types: concatenation, kernel fusion, graph fusion, index-level fusion, and score-level fusion. Concatenation, kernel fusion, and index-level fusion work directly on the different features, whereas graph fusion and score-level fusion work on the retrieval results of the different features. Fusion exploits the complementarity of different features and can effectively improve image retrieval accuracy.

Conclusion SIFT and CNN features are complementary to each other: SIFT contains rich low-level information, and CNN features contain rich high-level semantic information; SIFT has good invariance properties, which CNN features lack. Fusion is an effective way to maximize image information. However, time and space consumption will inevitably increase, and a good algorithm for distinguishing good features from bad ones is yet to be studied. At present, the generalizability and geometric invariance of CNN features are inferior to those of SIFT; this issue continues to be a challenge for image retrieval researchers. The generalizability of CNN features is limited by the domain and statistical differences between the source task (usually ImageNet) and the target task. Fine-tuning is a good strategy to solve this problem; however, this approach needs an additional labeled dataset similar to the target task. Enhancing the geometric invariance of CNN features inevitably increases descriptor storage and extraction time, and usually only scale invariance is considered for simplicity, ignoring other aspects of invariance. Moreover, the number of CNN features from one image is usually much smaller than that of SIFT; thus, the information captured may be insufficient for encoding. The most commonly used CNNs are designed for image classification tasks and not for image retrieval. However, image retrieval is a more fine-grained domain; a relevant algorithm needs to find similar images, not just images from the same class. Thus, a CNN trained for image retrieval may be a good future research direction. More work is still needed to strike a better balance among generalizability, invariance, memory consumption, and extraction time for an effective and efficient image retrieval descriptor.
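
As an illustration of two ideas mentioned in the abstract, the following minimal sketch shows how a convolutional feature map might be aggregated into a global descriptor by sum-pooling or max-pooling, and how score-level fusion could combine the similarity scores of two features. It is not taken from the report: the function names (`pool_feature_map`, `fuse_scores`, `rank`), the nested-list layout of the feature map, and the fixed fusion weight are all assumptions made for illustration only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def pool_feature_map(feature_map, mode="sum"):
    """Aggregate a conv feature map into one value per channel.

    feature_map: list of C channels, each a list of H rows of W activations
    (a hypothetical layout chosen for this sketch).
    mode: "sum" for sum-pooling, "max" for max-pooling.
    """
    if mode not in ("sum", "max"):
        raise ValueError("mode must be 'sum' or 'max'")
    reduce_fn = sum if mode == "sum" else max
    pooled = [reduce_fn(v for row in channel for v in row)
              for channel in feature_map]
    return l2_normalize(pooled)


def cosine(a, b):
    """Cosine similarity of two vectors that are already L2-normalized."""
    return sum(x * y for x, y in zip(a, b))


def fuse_scores(scores_a, scores_b, weight=0.5):
    """Score-level fusion: weighted sum of two per-image similarity lists.

    The fixed weight is an assumption; adaptive weighting is also possible.
    """
    return [weight * sa + (1.0 - weight) * sb
            for sa, sb in zip(scores_a, scores_b)]


def rank(scores):
    """Return database indices sorted by descending score."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])


if __name__ == "__main__":
    # Toy feature map with 2 channels of size 2x2.
    fmap = [[[1.0, 0.0], [2.0, 0.0]],
            [[4.0, 0.0], [0.0, 0.0]]]
    print(pool_feature_map(fmap, "sum"))
    print(pool_feature_map(fmap, "max"))

    # Hypothetical similarities of two database images to a query,
    # one list from a SIFT-based pipeline and one from CNN descriptors.
    sift_scores = [0.9, 0.2]
    cnn_scores = [0.5, 0.6]
    print(rank(fuse_scores(sift_scores, cnn_scores)))
```

In this sketch, score-level fusion operates only on retrieval results, so the two features never need to share a dimensionality or index. Concatenation, by contrast, would join the descriptors themselves before any search is run.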
