Transductive Learning for Near-Duplicate Image Detection in Scanned Photo Collections

Francesc Net, Marc Folia, Pep Casals, Lluis Gomez
2024-10-25
Abstract: This paper presents a comparative study of near-duplicate image detection techniques in a real-world use case scenario, where a document management company is commissioned to manually annotate a collection of scanned photographs. Detecting duplicate and near-duplicate photographs can reduce the time spent on manual annotation by archivists. This real use case differs from laboratory settings as the deployment dataset is available in advance, allowing the use of transductive learning. We propose a transductive learning approach that leverages state-of-the-art deep learning architectures such as convolutional neural networks (CNNs) and Vision Transformers (ViTs). Our approach involves pre-training a deep neural network on a large dataset and then fine-tuning the network on the unlabeled target collection with self-supervised learning. The results show that the proposed approach outperforms the baseline methods in the task of near-duplicate image detection on the UKBench dataset and an in-house private dataset.
Computer Vision and Pattern Recognition
### What problem does this paper attempt to solve?

This paper addresses the problem of detecting near-duplicate images in a collection of scanned photos. It focuses on a real-world scenario in which a document management company is tasked with manually annotating a batch of scanned photos and wants to reduce the time archivists spend on that annotation by detecting duplicates and near-duplicates. This setting differs from laboratory research because the target dataset is available before deployment, which allows the use of transductive learning methods.

### Main contributions of the paper

1. **Application of transductive learning**: The paper proposes a transductive learning approach that uses state-of-the-art deep learning architectures, such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), to detect near-duplicate images. The method pre-trains a deep neural network on a large-scale dataset and then fine-tunes it on the unlabeled target set using self-supervised learning.
2. **Experimental validation**: The paper reports experiments on two datasets: the publicly available UKBench dataset and an in-house private dataset. The results show that the proposed transductive learning method outperforms baseline methods in near-duplicate image detection.
3. **Practical application value**: The research applies not only to photo collections but also to near-duplicate detection in document images. Some samples used in the experiments contain textual information (scene or handwritten text), which supports the broader applicability of the method.

### Key techniques

- **Transductive learning**: Uses the unlabeled test (deployment) data itself during training. Compared to traditional inductive learning, transductive learning can adapt more effectively to a specific dataset.
- **Self-supervised learning**: Learns useful representations without labeled data.
- **Deep learning architectures**: Uses popular computer vision architectures such as ResNet and ViT, and considers both supervised and self-supervised training strategies.

### Experimental results

- **UKBench dataset**: On this dataset, supervised learning methods performed best; in particular, the ResNet50 model pre-trained and then fine-tuned on UKBench achieved a mAP@4 of 0.943. Self-supervised learning methods (such as MAE and SimCLR) also performed well but were slightly inferior to supervised methods.
- **Internal private dataset**: Because sufficient labeled data was not available, only self-supervised learning methods could be used. MAE (ViT-L-16) achieved the best performance, with a Precision@10 of 0.218.

### Conclusion

The paper shows experimentally that, in the absence of labeled data, transductive learning combined with self-supervised learning can effectively detect near-duplicate images and outperforms the baseline methods. Future research can further explore the use of ViT foundation models pre-trained on larger-scale datasets, combined with self-supervised learning, in archival work.
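To make the Precision@k figure reported above more concrete, the following is a minimal sketch of embedding-based near-duplicate retrieval and its evaluation. It is not the authors' code: the toy two-dimensional embeddings, the group labels, the use of cosine similarity for ranking, and the function names `cosine_similarity`, `rank_neighbours`, and `precision_at_k` are all assumptions made for illustration. The paper's exact metric definition may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_neighbours(embeddings, query_idx):
    """Return the indices of all other images, most similar to the query first."""
    query = embeddings[query_idx]
    others = [i for i in range(len(embeddings)) if i != query_idx]
    return sorted(
        others,
        key=lambda i: cosine_similarity(query, embeddings[i]),
        reverse=True,
    )


def precision_at_k(embeddings, group_ids, k):
    """Average, over all queries, of the fraction of the top-k retrieved
    images that belong to the same near-duplicate group as the query."""
    scores = []
    for q in range(len(embeddings)):
        top_k = rank_neighbours(embeddings, q)[:k]
        hits = sum(1 for i in top_k if group_ids[i] == group_ids[q])
        scores.append(hits / k)
    return sum(scores) / len(scores)


# Hypothetical toy data: four images in two near-duplicate groups.
# In practice the vectors would come from a fine-tuned network.
embeddings = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
]
group_ids = ["a", "a", "b", "b"]

print(precision_at_k(embeddings, group_ids, k=1))
print(precision_at_k(embeddings, group_ids, k=2))
```

In this toy setup each image has exactly one true near-duplicate, so Precision@2 cannot exceed 0.5 even with a perfect ranking. The same ceiling effect applies to any Precision@k score when groups contain fewer than k duplicates, which is worth keeping in mind when reading absolute values of such a metric.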