Abstract: Deep learning has been widely used for extracting value from big data. Like many other machine learning algorithms, deep learning requires a significant amount of training data. Experiments have shown that both the volume and the quality of training data can significantly affect the effectiveness of value extraction. In some cases, the volume of training data is too small to train a deep learning model effectively. In other cases, the quality of the training data is not high enough to achieve optimal performance. Many approaches have been proposed for augmenting training data to mitigate these deficiencies. However, whether the augmented data are “fit for purpose” for deep learning remains an open question, and a framework for comprehensively evaluating the effectiveness of augmented data for deep learning is not yet available. In this article, we first discuss a data augmentation approach for deep learning. The approach has two components: the first removes noisy data from a dataset using machine-learning-based classification to improve its quality, and the second increases the volume of the dataset so that a deep learning model can be trained effectively. To evaluate the quality of the augmented data in terms of fidelity, variety, and veracity, we propose a data quality evaluation framework. We demonstrate the effectiveness of the data augmentation approach and the data quality evaluation framework by studying automated classification of biological cell images using deep learning. The experimental results clearly demonstrate the impact of the volume and quality of training data on the performance of deep learning and the importance of data quality evaluation. The data augmentation approach and the data quality evaluation framework can be readily adapted to deep learning studies in other domains.

A Case Study of the Augmentation and Evaluation of Training Data for Deep Learning
