Abstract: Privacy poses a significant obstacle to the progress of learning analytics (LA), presenting challenges like inadequate anonymization and data misuse that current solutions struggle to address. Synthetic data emerges as a potential remedy, offering robust privacy protection. However, prior LA research on synthetic data lacks thorough evaluation, essential for assessing the delicate balance between privacy and data utility. Synthetic data must not only enhance privacy but also remain practical for data analytics. Moreover, diverse LA scenarios come with varying privacy and utility needs, making the selection of an appropriate synthetic data approach a pressing challenge. To address these gaps, we propose a comprehensive evaluation of synthetic data, which encompasses three dimensions of synthetic data quality, namely resemblance, utility, and privacy. We apply this evaluation to three distinct LA datasets, using three different synthetic data generation methods. Our results show that synthetic data can maintain similar utility (i.e., predictive performance) as real data, while preserving privacy. Furthermore, considering different privacy and data utility requirements in different LA scenarios, we make customized recommendations for synthetic data generation. This paper not only presents a comprehensive evaluation of synthetic data but also illustrates its potential in mitigating privacy concerns within the field of LA, thus contributing to a wider application of synthetic data in LA and promoting a better practice for open science.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper focuses on privacy protection in the field of Learning Analytics (LA). Specifically, it addresses the following key problems:

1. **Balance between privacy protection and data utility**:
   - **Privacy issues**: Privacy is one of the main obstacles facing current learning analytics projects. Existing anonymization and data protection methods are insufficient and can lead to data misuse and privacy leakage.
   - **Data utility**: Synthetic data must not only enhance privacy protection but also remain practical enough to support data analysis tasks.
2. **Quality assessment of synthetic data**:
   - **Multi-dimensional assessment**: Previous studies often considered only some quality dimensions (resemblance, utility, or privacy) when assessing synthetic data and lacked a comprehensive assessment. This paper proposes a comprehensive assessment method covering all three dimensions.
   - **Requirements of different scenarios**: Different learning analytics scenarios have different requirements for privacy and data utility, so the synthetic data generation method needs to be customized to meet them.
3. **Promoting open science**:
   - **Data sharing**: Synthetic data allows data to be shared without exposing individual information, which supports open science.

### Specific research questions

To achieve these goals, the paper poses the following research questions:

- **RQ1**: How can synthetic tabular data in learning analytics be assessed comprehensively, covering resemblance, utility, and privacy?
- **RQ2**: To what extent can synthetic tabular data improve privacy protection in learning analytics while maintaining the performance of prediction models?
- **RQ3**: How can the generation of synthetic tabular data be customized for different predictive modeling scenarios in learning analytics?

### Paper contributions

The main contributions of this paper include:

- Providing the first comprehensive three-dimensional (resemblance, utility, and privacy) assessment of synthetic tabular data in the field of learning analytics.
- Demonstrating that synthetic data can provide stronger privacy protection while maintaining prediction performance similar to that of real data.
- Discussing how to balance privacy and predictive modeling requirements in different learning analytics scenarios, and proposing how to customize synthetic data to fit specific requirements.

Through these contributions, the paper aims to give learning analytics researchers more reliable guidelines for using synthetic data, thereby promoting its wider application in the field.
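To make the three evaluation dimensions concrete, here is a minimal sketch of what such an evaluation could look like on toy data. Everything in it is an assumption for illustration, not the paper's method: the two features (hypothetical "hours active" and "quiz score"), the per-class Gaussian `toy_generator`, the Kolmogorov-Smirnov statistic for resemblance, a nearest-centroid classifier for utility, and the minimum distance to the closest real record for privacy. The paper's own datasets, generators, and metrics may differ.

```python
import math
import random
import statistics

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    for v in sorted(set(a + b)):
        while i < len(a) and a[i] <= v:
            i += 1
        while j < len(b) and b[j] <= v:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def resemblance(real, synth):
    """Mean of (1 - KS) over feature columns; 1.0 means identical marginals."""
    n_cols = len(real[0][0])
    scores = [1 - ks_statistic([x[c] for x, _ in real], [x[c] for x, _ in synth])
              for c in range(n_cols)]
    return sum(scores) / n_cols

def fit_centroids(rows):
    """Nearest-centroid classifier (assumed stand-in model): one mean vector per label."""
    by_label = {}
    for x, y in rows:
        by_label.setdefault(y, []).append(x)
    return {y: [statistics.fmean(col) for col in zip(*xs)] for y, xs in by_label.items()}

def accuracy(centroids, rows):
    """Share of rows whose nearest centroid carries the true label."""
    hits = 0
    for x, y in rows:
        pred = min(centroids, key=lambda lab: math.dist(x, centroids[lab]))
        hits += pred == y
    return hits / len(rows)

def min_distance_to_real(real, synth):
    """Smallest synthetic-to-real distance; 0.0 would mean a real record was copied."""
    return min(math.dist(s, r) for s, _ in synth for r, _ in real)

def make_real(rng, n):
    """Toy 'real' data: (hours_active, quiz_score) with a pass/fail label (hypothetical)."""
    rows = []
    for _ in range(n):
        label = int(rng.random() < 0.5)
        center = (8.0, 75.0) if label else (3.0, 50.0)
        rows.append(((rng.gauss(center[0], 1.5), rng.gauss(center[1], 8.0)), label))
    return rows

def toy_generator(rng, train, n):
    """Stand-in generator: independent per-class Gaussians fitted to the training rows."""
    by_label = {}
    for x, y in train:
        by_label.setdefault(y, []).append(x)
    params = {y: [(statistics.fmean(col), statistics.stdev(col)) for col in zip(*xs)]
              for y, xs in by_label.items()}
    labels = [y for _, y in train]
    return [(tuple(rng.gauss(m, s) for m, s in params[y]), y)
            for y in (rng.choice(labels) for _ in range(n))]

rng = random.Random(0)
real = make_real(rng, 400)
train, test = real[:300], real[300:]          # hold out real rows for testing
synth = toy_generator(rng, train, 300)

report = {
    "resemblance": resemblance(train, synth),
    "utility_real": accuracy(fit_centroids(train), test),    # train on real, test on real
    "utility_synth": accuracy(fit_centroids(synth), test),   # train on synthetic, test on real
    "privacy_min_dcr": min_distance_to_real(train, synth),
}
print(report)
```

Reading the report side by side mirrors the trade-off described above: utility is judged by how close `utility_synth` comes to `utility_real`, while the privacy figure checks that no synthetic row sits on top of a real one. A scenario with stricter privacy needs might accept a lower resemblance score in exchange for larger distances.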

Scaling While Privacy Preserving: A Comprehensive Synthetic Tabular Data Generation and Evaluation in Learning Analytics
