Scaling While Privacy Preserving: A Comprehensive Synthetic Tabular Data Generation and Evaluation in Learning Analytics

Qinyi Liu, Mohammad Khalil, Ronas Shakya, Jelena Jovanovic
2024-01-13
Abstract: Privacy poses a significant obstacle to the progress of learning analytics (LA), presenting challenges like inadequate anonymization and data misuse that current solutions struggle to address. Synthetic data emerges as a potential remedy, offering robust privacy protection. However, prior LA research on synthetic data lacks thorough evaluation, essential for assessing the delicate balance between privacy and data utility. Synthetic data must not only enhance privacy but also remain practical for data analytics. Moreover, diverse LA scenarios come with varying privacy and utility needs, making the selection of an appropriate synthetic data approach a pressing challenge. To address these gaps, we propose a comprehensive evaluation of synthetic data, which encompasses three dimensions of synthetic data quality, namely resemblance, utility, and privacy. We apply this evaluation to three distinct LA datasets, using three different synthetic data generation methods. Our results show that synthetic data can maintain similar utility (i.e., predictive performance) as real data, while preserving privacy. Furthermore, considering different privacy and data utility requirements in different LA scenarios, we make customized recommendations for synthetic data generation. This paper not only presents a comprehensive evaluation of synthetic data but also illustrates its potential in mitigating privacy concerns within the field of LA, thus contributing to a wider application of synthetic data in LA and promoting a better practice for open science.
Cryptography and Security, Artificial Intelligence
### What problem does this paper attempt to address?

This paper focuses on privacy protection in learning analytics (LA). Specifically, it addresses the following key problems:

1. **Balance between privacy protection and data utility**
   - **Privacy issues**: Privacy is one of the main obstacles facing current LA projects. Existing anonymization and data protection methods are insufficient and can lead to data misuse and privacy leakage.
   - **Data utility**: Synthetic data must not only strengthen privacy protection but also remain useful enough to support data analysis tasks.
2. **Quality assessment of synthetic data**
   - **Multi-dimensional assessment**: Previous studies often considered only some quality dimensions (resemblance, utility, or privacy) when assessing synthetic data and lacked a comprehensive assessment. This paper proposes an evaluation that covers all three dimensions.
   - **Requirements of different scenarios**: Different LA application scenarios place different demands on privacy and data utility, so the synthetic data generation method needs to be chosen to match those demands.
3. **Promoting open science**
   - **Data sharing**: Synthetic data allows datasets to be shared without exposing information about individuals, which supports open science.

### Specific research questions

To pursue these goals, the paper poses the following research questions:

- **RQ1**: How can synthetic tabular data in LA be assessed comprehensively, covering resemblance, utility, and privacy?
- **RQ2**: To what extent can synthetic tabular data improve privacy protection in LA while maintaining the performance of prediction models?
- **RQ3**: How can the generation of synthetic tabular data be customized for different LA predictive modeling scenarios?

### Paper contributions

The main contributions of this paper include:

- Providing the first comprehensive three-dimensional (resemblance, utility, and privacy) assessment of synthetic tabular data in LA.
- Demonstrating that synthetic data can provide stronger privacy protection while maintaining predictive performance similar to that of real data.
- Discussing how to balance privacy and predictive modeling requirements in different LA scenarios and proposing how to customize synthetic data to fit specific requirements.

Through these contributions, the paper aims to give LA researchers more reliable guidance on using synthetic data and thereby promote its wider application in the field.
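
To make the three evaluation dimensions more concrete, the following is a minimal Python sketch of what one check per dimension could look like. It is not the paper's method: this summary does not say which metrics, generators, or models the authors used, so every choice here is an assumption made for illustration. The toy data, the `toy_generator` stand-in, the column-mean gap (resemblance), the train-on-synthetic, test-on-real comparison with a nearest-centroid classifier (utility), and the distance to the closest real record (privacy) are all hypothetical.

```python
import math
import random
from statistics import mean, stdev


def make_real(n, rng):
    """Toy 'real' table (hypothetical): two numeric features, binary label."""
    X, y = [], []
    for i in range(n):
        label = i % 2                    # alternate labels so classes are balanced
        center = 2.0 * label             # class 0 around 0.0, class 1 around 2.0
        X.append([rng.gauss(center, 0.5), rng.gauss(center, 0.5)])
        y.append(label)
    return X, y


def toy_generator(X, y, n, rng):
    """Placeholder generator (assumption): per-class, per-column Gaussians
    fitted to the real rows. Stands in for any real synthesizer."""
    params = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        params[label] = [(mean(col), stdev(col)) for col in zip(*rows)]
    labels = sorted(params)
    Xs, ys = [], []
    for i in range(n):
        label = labels[i % len(labels)]
        Xs.append([rng.gauss(mu, sd) for mu, sd in params[label]])
        ys.append(label)
    return Xs, ys


def column_mean_gap(real, synth):
    """Resemblance: average absolute gap between per-column means."""
    return mean(abs(mean(r) - mean(s)) for r, s in zip(zip(*real), zip(*synth)))


def fit_centroids(X, y):
    """Nearest-centroid classifier: one mean vector per class label."""
    return {
        label: [mean(col) for col in zip(*[x for x, t in zip(X, y) if t == label])]
        for label in set(y)
    }


def accuracy(centroids, X, y):
    """Share of rows whose nearest centroid carries the true label."""
    preds = [min(centroids, key=lambda k: math.dist(x, centroids[k])) for x in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)


def distance_to_closest_record(real, synth):
    """Privacy: distance from each synthetic row to its nearest real row."""
    return [min(math.dist(s, r) for r in real) for s in synth]


rng = random.Random(0)
X_real, y_real = make_real(300, rng)
X_train, y_train = X_real[:200], y_real[:200]    # rows the generator may see
X_test, y_test = X_real[200:], y_real[200:]      # held-out real rows
X_syn, y_syn = toy_generator(X_train, y_train, 200, rng)

resemblance_gap = column_mean_gap(X_train, X_syn)
acc_real = accuracy(fit_centroids(X_train, y_train), X_test, y_test)  # train real, test real
acc_syn = accuracy(fit_centroids(X_syn, y_syn), X_test, y_test)       # train synthetic, test real
dcr = distance_to_closest_record(X_train, X_syn)
exact_copies = sum(d == 0.0 for d in dcr)        # synthetic rows identical to a real row

print(f"resemblance gap: {resemblance_gap:.3f}")
print(f"accuracy, trained on real: {acc_real:.3f}; trained on synthetic: {acc_syn:.3f}")
print(f"closest-record distance: min {min(dcr):.3f}, mean {mean(dcr):.3f}; exact copies: {exact_copies}")
```

In this sketch, a small resemblance gap and a small difference between the two accuracies would suggest the synthetic table is a usable substitute, while closest-record distances near zero would flag synthetic rows that nearly reproduce real individuals. How to weigh the three against each other depends on the scenario, which is the question RQ3 raises.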