A Comprehensive Study on GDPR-Oriented Analysis of Privacy Policies: Taxonomy, Corpus and GDPR Concept Classifiers

Peng Tang, Xin Li, Yuxin Chen, Weidong Qiu, Haochen Mei, Allison Holmes, Fenghua Li, Shujun Li
2024-10-07
Abstract: Machine learning based classifiers that take a privacy policy as the input and predict relevant concepts are useful in different applications such as (semi-)automated compliance analysis against requirements of the EU GDPR. In all past studies, such classifiers produce a concept label per segment (e.g., sentence or paragraph), and their performance was evaluated on a dataset of labeled segments without considering the privacy policy each segment belongs to. However, such an approach could overestimate the performance in real-world settings, where all segments in a new privacy policy are supposed to be unseen. Additionally, we observed other research gaps, including the lack of a more complete GDPR taxonomy and the limited consideration of hierarchical information in privacy policies. To fill these research gaps, we developed a more complete GDPR taxonomy, created the first corpus of labeled privacy policies with hierarchical information, and conducted the most comprehensive performance evaluation of GDPR concept classifiers for privacy policies. Our work leads to multiple novel findings, including the confirmed inappropriateness of splitting training and test sets at the segment level, the benefits of considering hierarchical information, the limitations of the "one size fits all" approach, and the significance of testing cross-corpus generalizability.
Cryptography and Security
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses several key issues in privacy policy analysis:

1. **Shortcomings of Existing Research Methods**:
   - **Limitations of Segment-Level Evaluation**: Previous studies typically split privacy policies into segments (sentences or paragraphs) and divide those segments into training and test sets without regard to the policy they come from. Segments from the same policy can then appear in both sets, which may overestimate performance in real-world applications, where every segment of a new privacy policy is unseen.
   - **Lack of a Complete GDPR Taxonomy**: Existing research lacks a comprehensive GDPR taxonomy, which limits in-depth analysis of privacy policies.
   - **Ignoring Hierarchical Information**: Most studies do not fully consider the hierarchical structure of privacy policies, which may affect classifier performance.
2. **Improving Automated Analysis of Privacy Policy Compliance**:
   - **Developing a New Evaluation Framework**: Proposing a document-level performance evaluation framework that more accurately reflects a classifier's performance in real-world applications.
   - **Building a New Dataset**: Creating a privacy policy corpus (GoPPC-150) that includes hierarchical information to support more comprehensive analysis.
   - **Improving the Taxonomy**: Expanding and refining the GDPR taxonomy to make it more comprehensive and detailed.
   - **Comparative Study in Multiple Aspects**: Providing an in-depth understanding of GDPR concept classifier performance through a comprehensive comparison of various features, architectures, and classifiers.

### Main Contributions

1. **Document-Level Performance Evaluation Framework**:
   - Proposing a document-level performance evaluation framework in which the training and test sets never contain segments from the same document, so that the results more accurately reflect a classifier's performance in real-world applications.
   - Demonstrating, by comparing document-level and segment-level results, that the segment-level evaluation used in previous studies overestimates real-world performance.
2. **Comprehensive Comparative Study of GDPR Concept Classifiers**:
   - Considering contextual features based on the hierarchical nature of privacy policies.
   - Comparing two hierarchical classifier architectures: local classifier per node (LCN) and local classifier per parent node (LCPN).
   - Gaining new insights through extensive experiments, such as the finding that "one classifier fits all" does not hold and that the LCPN classifier suffers from error propagation.
3. **New Privacy Policy Corpus**:
   - Constructing GoPPC-150, the first fully hierarchically encoded GDPR-oriented privacy policy corpus, comprising 150 privacy policies collected from top websites on Alexa.com.
   - The corpus includes expert-annotated GDPR concept labels and preserves each policy's hierarchical structure, allowing the development of more context-aware GDPR concept classifiers and other related tools.
4. **Expanded GDPR Taxonomy**:
   - Proposing the most comprehensive GDPR privacy policy taxonomy, based on existing smaller GDPR taxonomies, legal expert opinions, important documents from ICO and IAPP, and the work of the W3C Data Privacy Vocabulary and Controls Community Group (DPVCG).
   - The expanded taxonomy includes 96 nodes, an increase of 39.1% compared to the taxonomy proposed by Torre et al.

Through these contributions, this paper not only fills gaps in existing research but also provides important foundations and tools for future GDPR-oriented privacy policy analysis.
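
The document-level split described under the first contribution can be illustrated with a short sketch. It is not the authors' code: the `(policy_id, text, label)` tuple layout, the function name `document_level_split`, and the toy data are all assumptions made for illustration. The only point it demonstrates is that whole policies, rather than individual segments, are assigned to the training or test set.

```python
import random
from collections import defaultdict


def document_level_split(segments, test_fraction=0.2, seed=0):
    """Split labeled segments so that no policy contributes to both sets.

    `segments` is a list of (policy_id, text, label) tuples. This layout is
    a hypothetical one chosen for the sketch, not the GoPPC-150 format.
    """
    # Group segments by the policy they come from.
    by_policy = defaultdict(list)
    for policy_id, text, label in segments:
        by_policy[policy_id].append((policy_id, text, label))

    if len(by_policy) < 2:
        raise ValueError("need at least two policies to split by document")

    # Shuffle policy ids (not segments) with a fixed seed for repeatability.
    policy_ids = sorted(by_policy)
    random.Random(seed).shuffle(policy_ids)

    # Hold out whole policies for testing, leaving at least one for training.
    n_test = max(1, round(len(policy_ids) * test_fraction))
    n_test = min(n_test, len(policy_ids) - 1)
    test_ids = set(policy_ids[:n_test])

    train = [s for pid in policy_ids if pid not in test_ids for s in by_policy[pid]]
    test = [s for pid in policy_ids if pid in test_ids for s in by_policy[pid]]
    return train, test


# Toy data: 5 hypothetical policies with 3 segments each.
toy_segments = [
    (f"policy_{p}", f"segment {s} of policy {p}", "SomeConcept")
    for p in range(5)
    for s in range(3)
]

train, test = document_level_split(toy_segments, test_fraction=0.2, seed=0)
train_policies = {pid for pid, _, _ in train}
test_policies = {pid for pid, _, _ in test}

# No policy appears on both sides of the split.
print(len(train), len(test), train_policies.isdisjoint(test_policies))
```

A segment-level split would instead shuffle `toy_segments` directly, so segments from the same policy could land in both sets. In practice, a library utility such as scikit-learn's `GroupShuffleSplit` (with the policy id passed as the group) serves the same purpose.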