IDNet: A Novel Dataset for Identity Document Analysis and Fraud Detection

Hong Guan, Yancheng Wang, Lulu Xie, Soham Nag, Rajeev Goel, Niranjan Erappa Narayana Swamy, Yingzhen Yang, Chaowei Xiao, Jonathan Prisby, Ross Maciejewski, Jia Zou

2024-09-04

Abstract: Effective fraud detection and analysis of government-issued identity documents, such as passports, driver's licenses, and identity cards, are essential in thwarting identity theft and bolstering security on online platforms. The training of accurate fraud detection and analysis tools depends on the availability of extensive identity document datasets. However, current publicly available benchmark datasets for identity document analysis, including MIDV-500, MIDV-2020, and FMIDV, fall short in several respects: they offer a limited number of samples, cover insufficient varieties of fraud patterns, and seldom include alterations in critical personal identifying fields like portrait images, limiting their utility in training models capable of detecting realistic frauds while preserving privacy. In response to these shortcomings, our research introduces a new benchmark dataset, IDNet, designed to advance privacy-preserving fraud detection efforts. The IDNet dataset comprises 837,060 images of synthetically generated identity documents, totaling approximately 490 gigabytes, categorized into 20 types from 10 U.S. states and 10 European countries. We evaluate the utility and present use cases of the dataset, illustrating how it can aid in training privacy-preserving fraud detection methods, facilitating the generation of camera and video capturing of identity documents, and testing schema unification and other identity document management functionalities.

Computer Vision and Pattern Recognition, Artificial Intelligence, Multimedia

What problem does this paper attempt to address?

The problem this paper attempts to address is the inadequacy of existing identity document analysis and fraud detection datasets in terms of quantity, diversity, and fraud patterns, which cannot effectively support privacy-preserving fraud detection research. Specifically:

1. **Limited sample size**: Existing datasets such as MIDV-500 and MIDV-2020 contain a small number of identity documents, insufficient for training complex AI/ML models.
2. **Insufficient fraud patterns**: Most existing datasets only include simple fraud patterns, such as crop-and-move and inpaint-and-rewrite, lacking complex fraud patterns like face morphing and portrait substitution.
3. **Insufficient protection of Personally Identifiable Information (PII)**: The PII in existing datasets (such as portrait photos) is usually not adequately processed, failing to meet privacy protection requirements.

To address these issues, the paper introduces a new benchmark dataset, IDNet, which contains 837,060 synthetically generated identity document images, covering 20 different types of documents. The IDNet dataset is not only large in quantity but also includes various fraud patterns and considers privacy protection needs during the generation process. Through these improvements, IDNet aims to enhance the accuracy and practicality of privacy-preserving fraud detection methods.

MIDV-2020: A Comprehensive Benchmark Dataset for Identity Document Analysis

DocXPand-25k: a large and diverse benchmark dataset for identity documents analysis

Synthetic dataset of ID and Travel Document

Signature Detection, Restoration, and Verification: A Novel Chinese Document Signature Forgery Detection Benchmark

DocFace: Matching ID Document Photos to Selfies

IDTrust: Deep Identity Document Quality Detection with Bandpass Filtering

DocFace+: ID Document to Selfie Matching

Key-Guided Identity Document Classification Method by Graph Attention Network

Face Detection in Camera Captured Images of Identity Documents under Challenging Conditions

A Novel Approach to Enhancing Identity Document Authentication with Copy-Move Forgery Detection using CNN

Identifying fraudulent identity documents by analyzing imprinted guilloche patterns

IEIRNet: Inconsistency Exploiting Based Identity Rectification for Face Forgery Detection

Fraud Dataset Benchmark and Applications

An Intelligent Hybrid Model for Identity Document Classification

eKYC-DF: A Large-Scale Deepfake Dataset for Developing and Evaluating eKYC Systems

On Physically Occluded Fake Identity Document Detection

An efficient method to detect series of fraudulent identity documents based on digitised forensic data

Spritz-PS: Validation of Synthetic Face Images Using a Large Dataset of Printed Documents

DiffusionFace: Towards a Comprehensive Dataset for Diffusion-Based Face Forgery Analysis

Open-Set: ID Card Presentation Attack Detection Using Neural Style Transfer