Abstract: Unsupervised near-duplicate detection has many practical applications, ranging from social media analysis and web-scale retrieval to digital image forensics. It entails running a threshold-limited query on a set of descriptors extracted from the images, with the goal of identifying all possible near-duplicates while limiting the false positives due to visually similar images. Since the rate of false alarms grows with the dataset size, a very high specificity is required, up to $1 - 10^{-9}$ for realistic use cases; this important requirement, however, is often overlooked in the literature. In recent years, descriptors based on deep convolutional neural networks have matched or surpassed traditional feature extraction methods in content-based image retrieval tasks. To the best of our knowledge, ours is the first attempt to establish the performance range of deep learning-based descriptors for unsupervised near-duplicate detection on a range of datasets encompassing a broad spectrum of near-duplicate definitions. We leverage both established and new benchmarks, such as the Mir-Flickr Near-Duplicate (MFND) dataset, in which a known ground truth is provided for all possible pairs over a general, large-scale image collection. To compare the specificity of different descriptors, we reduce the problem of unsupervised detection to that of binary classification of near-duplicate vs. not-near-duplicate image pairs. The latter can be conveniently characterized using the Receiver Operating Characteristic (ROC) curve. Our findings generally favor fine-tuning deep convolutional networks over using off-the-shelf features, but differences at high specificity settings depend on the dataset and are often small. The best performance was observed on the MFND benchmark, achieving 96\% sensitivity at a false positive rate of $1.43 \times 10^{-6}$.
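
The reduction to binary classification described in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names `expected_false_pairs` and `sensitivity_at_fpr`, the use of plain pairwise distances, and the assumption that every image pair is compared exhaustively are all illustrative choices. The sketch shows why the false positive rate must be so small (the number of pairs grows quadratically with the collection size) and how sensitivity can be read off at a fixed false positive rate, which corresponds to one operating point on the ROC curve.

```python
import numpy as np


def expected_false_pairs(num_images: int, fpr: float) -> float:
    """Expected number of falsely flagged pairs, assuming (hypothetically)
    that all num_images * (num_images - 1) / 2 pairs are compared."""
    num_pairs = num_images * (num_images - 1) // 2
    return num_pairs * fpr


def sensitivity_at_fpr(dup_dist, non_dup_dist, target_fpr):
    """Pick a distance threshold from the non-duplicate pairs and report
    (threshold, sensitivity, achieved_fpr).

    A pair is flagged as near-duplicate when its distance is <= threshold.
    """
    dup = np.asarray(dup_dist, dtype=float)
    non = np.sort(np.asarray(non_dup_dist, dtype=float))

    # Number of non-duplicate pairs we may flag without exceeding target_fpr.
    allowed = int(np.floor(target_fpr * non.size))
    if allowed == 0:
        # No false positive is allowed: place the threshold just below
        # the smallest non-duplicate distance.
        threshold = np.nextafter(non[0], -np.inf)
    else:
        threshold = non[allowed - 1]

    sensitivity = float(np.mean(dup <= threshold))
    # Recomputed from the data; tied distances can push it above target_fpr.
    achieved_fpr = float(np.mean(non <= threshold))
    return float(threshold), sensitivity, achieved_fpr


if __name__ == "__main__":
    # Toy distances, invented purely for illustration.
    dup = [0.1, 0.2, 0.3, 0.9]
    non = [0.25, 0.5, 0.6, 0.7, 0.8, 1.0, 1.1, 1.2, 1.3, 1.4]
    print(sensitivity_at_fpr(dup, non, target_fpr=0.1))
    print(expected_false_pairs(1_000_000, 1.43e-6))
```

Under the exhaustive-comparison assumption, a hypothetical collection of one million images contains about $5 \times 10^{11}$ pairs, so even a false positive rate of $1.43 \times 10^{-6}$ would still flag roughly $7 \times 10^{5}$ non-duplicate pairs.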

Efficient near-duplicate image detection by learning from examples

Query Oriented Subspace Shifting for Near-Duplicate Image Detection

Near duplicate detection of images with area and proposed pixel‐based feature extraction

Fine-search for image copy detection based on local affine-invariant descriptor and spatial dependent matching

Near-duplicate Keyframe Retrieval by Nonrigid Image Matching

Near-duplicate Keyframe Retrieval by Semi-Supervised Learning and Nonrigid Image Matching

Benchmarking unsupervised near-duplicate image detection

Fast and accurate near-duplicate image elimination for visual sensor networks

Near-Duplicate Image Detection System Using Coarse-to-Fine Matching Scheme Based on Global and Local CNN Features

Evolution of a Web-Scale Near Duplicate Image Detection System

Siamese coding network and pair similarity prediction for near-duplicate image detection

Near-duplicate image recognition

Efficient Feature Detection and Effective Post-Verification for Large Scale Near-Duplicate Image Search

Large-Scale Duplicate Detection for Web Image Search

Partial-Duplicate Image Retrieval via Saliency-Guided Visual Matching

Fast Image Retrieval Based on Equal-average Equal-variance K-Nearest Neighbour Search

Dataset and Case Studies for Visual Near-Duplicates Detection in the Context of Social Media

Squirrel Search Optimization-based near-duplicate image detection

Transductive Learning for Near-Duplicate Image Detection in Scanned Photo Collections

Representative local features mining for large-scale near-duplicates retrieval

Encoding Spatial Context for Large-Scale Partial-Duplicate Web Image Retrieval