Application of deep learning models to generate rich, dynamic and production-like test data

Tan, Chao
DOI: https://doi.org/10.1007/s10664-024-10541-w
IF: 3.762
2024-10-19
Empirical Software Engineering
Abstract:Traditionally, software development teams in many industries have used copies of production databases or their masked, anonymized, or obfuscated versions for testing. However, privacy protection regulations, for example, the General Data Protection Regulation (GDPR), prohibit such practices. In such a situation, there is often a need to generate production-like test data, i.e., test data that is statistically representative of the production data and conforms to the domain's constraints. In this paper, we address this need by presenting a novel approach for generating production-like test data using deep learning techniques and studying the practical effectiveness of our proposed approach in industrial settings. We frame the problem of generating production-like test data as a Language Modeling problem. We then propose a general solution for test data generation and a framework for evaluating and comparing language models based on training effectiveness and the representativeness and validity of the generated data. To evaluate the practical effectiveness of our solution, we apply it to a case study: the Norwegian National Population Registry (NPR). Within the context of NPR, we experiment with three of the most successful Deep Learning algorithms for Language Modeling, namely Recurrent Neural Networks (RNN), Variational Autoencoders, and Generative Adversarial Networks (GANs). We furthermore evaluate and compare the effectiveness of these algorithms quantitatively, using the proposed evaluation framework. The results from our case study show that our approach can generate highly complex data that is statistically representative of the production data and conforms to the business rules of the domain. Moreover, test data generated with RNN, which outperforms the other two algorithms, are syntactically and semantically valid in more than 97% of the cases and are highly representative of the real NPR data. The practical applicability of our approach is evident from the fact that our approach was fully deployed in a test environment at the NPR, generating on-the-fly and scalable amounts of production-like test data that are used in the integration testing between NPR and its data consumers.
computer science, software engineering
What problem does this paper attempt to address?