Abstract: This paper presents a comparative study of near-duplicate image detection techniques in a real-world use case, where a document management company is commissioned to manually annotate a collection of scanned photographs. Detecting duplicate and near-duplicate photographs can reduce the time archivists spend on manual annotation. This use case differs from laboratory settings because the deployment dataset is available in advance, which allows the use of transductive learning. We propose a transductive learning approach that leverages state-of-the-art deep learning architectures such as convolutional neural networks (CNNs) and Vision Transformers (ViTs). Our approach involves pre-training a deep neural network on a large dataset and then fine-tuning the network on the unlabeled target collection with self-supervised learning. The results show that the proposed approach outperforms the baseline methods in the task of near-duplicate image detection on the UKBench dataset and an in-house private dataset.
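The self-supervised fine-tuning step can take several forms. As a minimal sketch, the block below computes a SimCLR-style contrastive loss (NT-Xent) on two augmented views of each unlabeled image. The function names, the toy 2-D embeddings, and the temperature value are illustrative assumptions, not the paper's training recipe.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def nt_xent(z_a, z_b, temperature=0.5):
    """Mean NT-Xent loss; z_a[i] and z_b[i] embed two views of image i."""
    views = [normalize(v) for v in list(z_a) + list(z_b)]
    n = len(z_a)
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)  # index of the other view of the same image
        logits = [dot(views[i], views[j]) / temperature
                  for j in range(2 * n) if j != i]
        positive_logit = dot(views[i], views[positive]) / temperature
        # Numerically stable log-sum-exp over all other views.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - positive_logit
    return total / (2 * n)


# Hypothetical embeddings: two images, two views each.
anchor_views = [[1.0, 0.0], [0.0, 1.0]]
aligned_views = [[1.0, 0.1], [0.1, 1.0]]   # each view stays close to its own image
swapped_views = [[0.1, 1.0], [1.0, 0.1]]   # each view resembles the other image

loss_aligned = nt_xent(anchor_views, aligned_views)
loss_swapped = nt_xent(anchor_views, swapped_views)
print(loss_aligned < loss_swapped)
```

The loss is lower when the two views of one image land close together and far from views of other images, which is the property a near-duplicate detector relies on at retrieval time.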

### What problem does this paper attempt to solve?

This paper addresses the problem of detecting near-duplicate images in a collection of scanned photos. Specifically, it focuses on a real-world scenario where a document management company is tasked with manually annotating a batch of scanned photos and wants to reduce the time archivists spend on annotation by detecting duplicates and near-duplicates. This application differs from laboratory research because the target dataset is available before deployment, which allows the use of transductive learning methods.

### Main contributions of the paper

1. **Application of transductive learning**: The paper proposes a transductive learning approach that uses state-of-the-art deep learning architectures, such as convolutional neural networks (CNNs) and Vision Transformers (ViTs), to detect near-duplicate images. The method pre-trains deep neural networks on large-scale datasets and then fine-tunes them on the unlabeled target set using self-supervised learning.
2. **Experimental validation**: The paper conducts experiments on two datasets: the publicly available UKBench dataset and an in-house private dataset. The results show that the proposed transductive learning method outperforms baseline methods in near-duplicate image detection.
3. **Practical application value**: The research applies not only to photo collections but also to the detection of near-duplicates in document images. Some samples used in the experiments contain text (scene text or handwriting), which supports the broader applicability of the method.

### Key technologies

- **Transductive learning**: Uses information from the test data to improve model performance. Compared with traditional inductive learning, transductive learning can adapt more effectively to a specific dataset.
- **Self-supervised learning**: Learns useful representations without labeled data.
- **Deep learning architectures**: Uses popular computer vision architectures such as ResNet and ViT, and considers both supervised and self-supervised training strategies.

### Experimental results

- **UKBench dataset**: Supervised learning methods performed best on this dataset. In particular, the ResNet50 model pre-trained and fine-tuned on UKBench achieved a mAP@4 of 0.943. Self-supervised learning methods (such as MAE and SimCLR) also performed well but were slightly inferior to supervised methods.
- **In-house private dataset**: Because sufficient labeled data was not available, only self-supervised learning methods could be used. MAE (ViT-L-16) achieved the best performance, with a Precision@10 of 0.218.

### Conclusion

The paper shows experimentally that, in the absence of labeled data, transductive learning combined with self-supervised learning can detect near-duplicate images effectively and improves detection performance over the baselines. Future research can further explore applying ViT foundation models pre-trained on larger-scale datasets, combined with self-supervised learning, to archival work.
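The retrieval metrics reported under Experimental results score how many of the top-ranked neighbours of a query image belong to its near-duplicate group. The following is a minimal sketch of Precision@k over cosine similarity, assuming each image has a fixed-length embedding and a group label. The helper names and toy vectors are hypothetical, and the paper's exact evaluation protocol may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def precision_at_k(embeddings, group_ids, k):
    """Mean fraction of the k nearest neighbours sharing the query's group."""
    total = 0.0
    for q, query in enumerate(embeddings):
        # Rank every other image by similarity to the query.
        scored = [(cosine(query, other), j)
                  for j, other in enumerate(embeddings) if j != q]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        hits = sum(1 for _, j in scored[:k] if group_ids[j] == group_ids[q])
        total += hits / k
    return total / len(embeddings)


# Toy collection: three near-duplicate groups with two images each.
embeddings = [
    [1.0, 0.0, 0.0], [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0], [0.1, 0.9, 0.0],
    [0.0, 0.0, 1.0], [0.0, 0.1, 0.9],
]
group_ids = [0, 0, 1, 1, 2, 2]

print(precision_at_k(embeddings, group_ids, k=1))
print(precision_at_k(embeddings, group_ids, k=2))
```

In this toy collection every image has exactly one near-duplicate, so Precision@k cannot exceed 1/k for k > 1. The same ceiling effect applies to any dataset whose groups are smaller than k, which is worth keeping in mind when reading low Precision@10 values.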

Transductive Learning for Near-Duplicate Image Detection in Scanned Photo Collections
