Multimedia technology 2016: advances and trends in image retrieval
Junqing Yu, Zebin Wu, Fei Wu, Lifeng Sun
DOI: https://doi.org/10.11834/jig.170503
2017-01-01
Abstract: Objective Content-based image retrieval uses features extracted from an image to retrieve similar images from a large-scale dataset accurately and with low memory and time consumption. Scale-invariant feature transform (SIFT) is robust to translation, scaling, rotation, viewpoint change, and occlusion, and it is fast to extract. Thus, SIFT is widely used in both theory and practice. However, SIFT has some shortcomings, such as a lack of spatial geometric information and color information. Convolutional neural networks (CNNs) have good domain transferability, and deep features from a pre-trained CNN can be applied to various domains. CNN deep features have recently attracted considerable attention and exhibit superior performance over SIFT. However, in contrast to SIFT, CNN features lack shallow (low-level) information. Thus, SIFT is usually fused with CNN features and other shallow features. Method This report reviews the recent advances and challenges in image retrieval worldwide and in China, covering shallow features, deep features, and feature fusion. Future development trends are also explored. For shallow features, we mainly review SIFT and its variants, the encoding methods, and the development of these methods. For deep features, we divide the descriptors into categories according to the type of CNN layer from which they are extracted: fully connected layer, convolutional layer, and softmax layer. Many features can be extracted from a convolutional layer, and many pooling methods have been proposed for them. Result The encoding methods of SIFT mainly include bag of features (BOF), vector of locally aggregated descriptors (VLAD), Fisher vector (FV), and triangulation embedding (TE), and they mostly consist of two steps: embedding and aggregation (or pooling). For CNN features, features from the fully connected layer are typically used because of their good transferability and accuracy. However, deep features from the convolutional layer have become an increasingly attractive
option recently because convolutional features can be effectively combined with a variety of pooling methods, such as sum-pooling, max-pooling, VLAD-pooling, and FV-pooling, and they perform well in image classification and retrieval. The fusion methods can mainly be divided into five types: concatenation, kernel fusion, graph fusion, index-level fusion, and score-level fusion. Concatenation, kernel fusion, and index-level fusion work directly on the different features, whereas graph fusion and score-level fusion work on the retrieval results of the different features. Fusion exploits the complementarity of different features and can effectively improve image retrieval accuracy. Conclusion SIFT and CNN features are complementary: SIFT contains rich low-level information, whereas CNN features contain rich high-level semantic information; SIFT has good invariance properties, which CNN features lack. Fusion is an effective way to make the most of the information in an image. However, time and space consumption will inevitably increase, and a good algorithm for distinguishing good features from bad ones has yet to be developed. At present, the generalizability and geometric invariance of CNN features are inferior to those of SIFT; this issue continues to be a challenge for image retrieval researchers. The generalizability of CNN features is limited by the domain and statistical differences between the source task (usually ImageNet) and the target task. Fine-tuning is a good strategy for this problem; however, it requires an additional labeled dataset similar to that of the target task. Enhancing the geometric invariance of CNN descriptors inevitably increases their space consumption and extraction time, and usually only scale invariance is considered for simplicity, ignoring other aspects of invariance. Moreover, the number of CNN features from one image is usually much smaller than that of SIFT features; thus, the information captured may be insufficient for encoding. The most commonly used CNNs are
designed for image classification tasks and not for image retrieval. However, image retrieval is a more fine-grained task; a retrieval algorithm needs to find similar images, not just images from the same class. Thus, a CNN trained specifically for image retrieval may be a good future research direction. More work is still needed to strike a better balance among generalizability, invariance, memory consumption, and extraction time for an effective and efficient image retrieval descriptor.
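
Illustrative sketch (not taken from the report): the abstract describes encoding as two steps, embedding and aggregation, and names pooling and fusion strategies without showing them. The toy Python code below, written with only the standard library, shows one possible reading of sum-pooling, max-pooling, VLAD-style encoding, concatenation fusion, and score-level fusion. All function names, the tiny codebook, the L2 normalization choices, and the fusion weight are assumptions made for illustration; real systems would typically learn the codebook with k-means and operate on SIFT descriptors or convolutional feature maps.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length; a zero vector is returned as-is."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def sum_pool(local_features):
    """Aggregate local descriptors by summing each dimension."""
    return [sum(column) for column in zip(*local_features)]


def max_pool(local_features):
    """Aggregate local descriptors by taking the maximum of each dimension."""
    return [max(column) for column in zip(*local_features)]


def vlad_encode(local_features, centroids):
    """VLAD-style encoding with a given (assumed pre-trained) codebook.

    Embedding: each descriptor becomes its residual to the nearest centroid.
    Aggregation: residuals are summed per centroid, concatenated, L2-normalized.
    """
    dim = len(centroids[0])
    residual_sums = [[0.0] * dim for _ in centroids]
    for feature in local_features:
        # Nearest centroid by squared Euclidean distance.
        nearest = min(
            range(len(centroids)),
            key=lambda k: sum((f - c) ** 2 for f, c in zip(feature, centroids[k])),
        )
        for d in range(dim):
            residual_sums[nearest][d] += feature[d] - centroids[nearest][d]
    return l2_normalize([x for row in residual_sums for x in row])


def concatenate_fusion(shallow, deep):
    """Feature-level fusion: normalize each descriptor, then concatenate."""
    return l2_normalize(shallow) + l2_normalize(deep)


def score_fusion(scores_a, scores_b, weight=0.5):
    """Score-level fusion: weighted sum of per-image similarity scores."""
    return [weight * a + (1.0 - weight) * b for a, b in zip(scores_a, scores_b)]


if __name__ == "__main__":
    local = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]  # toy local descriptors
    codebook = [[0.0, 0.0], [3.0, 2.0]]           # toy centroids (hypothetical)
    print(sum_pool(local))   # [4.0, 3.0]
    print(max_pool(local))   # [3.0, 2.0]
    print(vlad_encode(local, codebook))
    print(score_fusion([0.9, 0.2], [0.5, 0.8]))
```

In this sketch, concatenation operates directly on the two descriptors, whereas score fusion operates on the similarity scores that each descriptor produced separately, which mirrors the distinction the abstract draws between fusion on features and fusion on retrieval results.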