IDNet: A Novel Dataset for Identity Document Analysis and Fraud Detection

Hong Guan, Yancheng Wang, Lulu Xie, Soham Nag, Rajeev Goel, Niranjan Erappa Narayana Swamy, Yingzhen Yang, Chaowei Xiao, Jonathan Prisby, Ross Maciejewski, Jia Zou
2024-09-04
Abstract: Effective fraud detection and analysis of government-issued identity documents, such as passports, driver's licenses, and identity cards, are essential in thwarting identity theft and bolstering security on online platforms. The training of accurate fraud detection and analysis tools depends on the availability of extensive identity document datasets. However, current publicly available benchmark datasets for identity document analysis, including MIDV-500, MIDV-2020, and FMIDV, fall short in several respects: they offer a limited number of samples, cover insufficient varieties of fraud patterns, and seldom include alterations in critical personal identifying fields like portrait images, limiting their utility in training models capable of detecting realistic frauds while preserving privacy. In response to these shortcomings, our research introduces a new benchmark dataset, IDNet, designed to advance privacy-preserving fraud detection efforts. The IDNet dataset comprises 837,060 images of synthetically generated identity documents, totaling approximately 490 gigabytes, categorized into 20 types from 10 U.S. states and 10 European countries. We evaluate the utility and present use cases of the dataset, illustrating how it can aid in training privacy-preserving fraud detection methods, generating camera and video captures of identity documents, and testing schema unification and other identity document management functionalities.
Computer Vision and Pattern Recognition, Artificial Intelligence, Multimedia
What problem does this paper attempt to address?
This paper addresses the inadequacy of existing datasets for identity document analysis and fraud detection: they are too limited in size, diversity, and fraud patterns to support privacy-preserving fraud detection research. Specifically:

1. **Limited sample size**: Existing datasets such as MIDV-500 and MIDV-2020 contain a small number of identity documents, insufficient for training complex AI/ML models.
2. **Insufficient fraud patterns**: Most existing datasets include only simple fraud patterns, such as crop-and-move and inpaint-and-rewrite, and lack complex patterns like face morphing and portrait substitution.
3. **Insufficient protection of Personally Identifiable Information (PII)**: The PII in existing datasets (such as portrait photos) is usually not adequately processed and fails to meet privacy protection requirements.

To address these issues, the paper introduces a new benchmark dataset, IDNet, which contains 837,060 synthetically generated identity document images covering 20 document types. Besides its size, IDNet includes a variety of fraud patterns and accounts for privacy protection during generation. With these improvements, IDNet aims to enhance the accuracy and practicality of privacy-preserving fraud detection methods.