Fraud Dataset Benchmark and Applications

Prince Grover, Julia Xu, Justin Tittelfitz, Anqi Cheng, Zheng Li, Jakub Zablocki, Jianbo Liu, Hao Zhou
2023-09-22
Abstract: Standardized datasets and benchmarks have spurred innovations in computer vision, natural language processing, multi-modal and tabular settings. We note that, compared to other well-researched fields, fraud detection has unique challenges: high class imbalance, diverse feature types, frequently changing fraud patterns, and the adversarial nature of the problem. Because of these, modeling approaches evaluated on datasets from other research fields may not work well for fraud detection. In this paper, we introduce Fraud Dataset Benchmark (FDB), a compilation of publicly available datasets tailored to fraud detection. FDB comprises a variety of fraud-related tasks, ranging from identifying fraudulent card-not-present transactions, detecting bot attacks, classifying malicious URLs, and estimating risk of loan default to content moderation. The Python-based library for FDB provides a consistent API for data loading with standardized training and testing splits. We demonstrate several applications of FDB that are of broad interest for fraud detection, including feature engineering, comparison of supervised learning algorithms, label noise removal, class-imbalance treatment and semi-supervised learning. We hope that FDB provides a common playground for researchers and practitioners in the fraud detection domain to develop robust and customized machine learning techniques targeting various fraud use cases.
Machine Learning, Cryptography and Security
What problem does this paper attempt to address?
The paper aims to address the following key issues:

### Research Background and Objectives

- **Importance of Standardized Datasets**: Standardized datasets and benchmarks have driven advancements in fields such as computer vision and natural language processing.
- **Challenges in Fraud Detection**: Compared to other research areas, fraud detection faces unique challenges, including high class imbalance, diverse feature types, frequent changes in fraud patterns, and the adversarial nature of the problem.

### Solution

- **Introducing FDB (Fraud Dataset Benchmark)**: To tackle these challenges, the authors introduce FDB, a collection of publicly available datasets specifically for fraud detection.
- **Features of FDB**:
  - Includes various fraud-related tasks such as identifying fraudulent card-not-present transactions, detecting bot attacks, and classifying malicious URLs.
  - Provides a Python library to support data loading and standard train/test splits.
  - Demonstrates the application of FDB in feature engineering, supervised learning algorithm comparison, label noise removal, class imbalance handling, and semi-supervised learning.

### Specific Contributions

- **Dataset Collection and Selection**: Collected 9 publicly available fraud-related datasets from multiple sources, covering different fraud scenarios.
- **Dataset Characteristics**:
  - Class Imbalance: Extremely low proportion of fraud samples (e.g., 0.0001).
  - High Cardinality Features: Such as IP addresses, phone numbers, etc.
  - Adversarial Nature: Fraudsters change behavior to evade model detection.
  - Non-Independent and Identically Distributed (non-IID) Data: Attribute values and behaviors depend on historical values.
- **Application Case Demonstrations**:
  - Impact of feature engineering and data enrichment on supervised learning.
  - Evaluation of label noise removal techniques.
  - Comparison of class imbalance handling methods.
  - Effectiveness evaluation of semi-supervised learning methods.

Through these efforts, the paper aims to provide a common platform for researchers and practitioners in the field of fraud detection to develop robust and customized machine learning techniques for various fraud use cases.
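
To make the class-imbalance characteristic listed above concrete, the following is a minimal sketch that uses only the Python standard library and a synthetic label vector. It is not the FDB API, and it is not taken from the paper. It assumes a fraud rate of 0.0001 (the example figure quoted above). It shows why plain accuracy is uninformative at that rate, and how inverse-frequency class weights, one common imbalance treatment, can be computed. The function names `accuracy`, `recall` and `inverse_frequency_weights` are hypothetical helpers introduced for illustration, and the paper may evaluate different imbalance treatments.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of true positives (fraud cases) that the model flags."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


def inverse_frequency_weights(y_true):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = len(y_true)
    counts = Counter(y_true)
    return {label: n_samples / (len(counts) * count) for label, count in counts.items()}


# Synthetic labels (assumption): 10,000 events, exactly one fraud -> rate 0.0001.
y_true = [0] * 9999 + [1]

# A trivial "model" that never predicts fraud.
y_pred = [0] * len(y_true)

print(f"accuracy: {accuracy(y_true, y_pred):.4f}")  # accuracy: 0.9999
print(f"recall:   {recall(y_true, y_pred):.4f}")    # recall:   0.0000

weights = inverse_frequency_weights(y_true)
print(weights[1])  # 5000.0: the single fraud example carries half the total weight
```

The trivial classifier reaches 99.99% accuracy while catching no fraud at all. At these fraud rates, ranking- or recall-oriented metrics are therefore more informative than accuracy, and some form of reweighting or resampling is usually needed.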