Automated Detection and Analysis of Data Practices Using A Real-World Corpus

Mukund Srinath, Pranav Venkit, Maria Badillo, Florian Schaub, C. Lee Giles, Shomir Wilson
2024-02-17
Abstract: Privacy policies are crucial for informing users about data practices, yet their length and complexity often deter users from reading them. In this paper, we propose an automated approach to identify and visualize data practices within privacy policies at different levels of detail. Leveraging crowd-sourced annotations from the ToS;DR platform, we experiment with various methods to match policy excerpts with predefined data practice descriptions. We further conduct a case study to evaluate our approach on a real-world policy, demonstrating its effectiveness in simplifying complex policies. Experiments show that our approach accurately matches data practice descriptions with policy excerpts, facilitating the presentation of simplified privacy information to users.
Cryptography and Security, Machine Learning
What problem does this paper attempt to address?
This paper aims to simplify and automate the identification and visualization of data practices in privacy policies, so that users can understand these long, complex documents more easily. Specifically:

1. **Complexity and readability of privacy policies**:
   - Privacy policies are usually long and hard to understand, so users are unwilling or unable to read them carefully.
   - Reading every privacy policy a user encounters would take about 200 hours per year.
   - Policies rely on complex legal terminology that ordinary users struggle to follow.
2. **Automated data practice matching**:
   - Using crowd-sourced annotations from the ToS;DR platform, the researchers propose an automated method that matches paragraphs in privacy policies with predefined data practice descriptions.
   - The method aims to simplify how privacy information is presented, so that users can more easily understand how their online data is processed.
3. **Multi-level privacy label design**:
   - The researchers design a privacy label system that gives users information about their data privacy at different levels of detail.
   - The label system groups data practices into rating categories by quality (such as block, poor, neutral, good), each shown with color-coding (red, yellow, gray, green).

### Core contributions of the paper

- **Manual analysis and clustering**: The data practice descriptions on the ToS;DR platform were analyzed manually, and similar data practices were clustered together.
- **Automatic matching method**: A method was developed to automatically match privacy policy paragraphs with data practice descriptions.
- **Privacy label design**: A multi-level privacy label was designed to provide privacy information at different levels of detail.

Through these methods, the researchers hope to lower the barrier for users to understand and evaluate privacy policies, enabling them to make better-informed privacy decisions.
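
To make the matching and labeling steps concrete, here is a minimal Python sketch. It is not the paper's method: the summary only says the authors experimented with various matching methods, so this example swaps in a simple bag-of-words cosine similarity as a stand-in. The practice descriptions and their ratings are hypothetical examples written for illustration, not entries from ToS;DR, and the rating-to-color mapping assumes the colors are listed in the same order as the rating categories.

```python
import math
import re
from collections import Counter

# Assumed mapping: ratings and colors paired in the order the summary lists them.
RATING_COLORS = {"block": "red", "poor": "yellow", "neutral": "gray", "good": "green"}

# Hypothetical data practice descriptions with ratings (illustrative only,
# not taken from ToS;DR).
PRACTICES = [
    ("The service shares your personal data with third parties", "poor"),
    ("You can delete your account and your data", "good"),
    ("The service uses cookies to track you", "neutral"),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_excerpt(excerpt, practices=PRACTICES):
    """Return the most similar practice description, its rating, and its label color."""
    excerpt_vec = Counter(tokenize(excerpt))
    scored = [
        (cosine(excerpt_vec, Counter(tokenize(description))), description, rating)
        for description, rating in practices
    ]
    score, description, rating = max(scored, key=lambda item: item[0])
    return {
        "description": description,
        "rating": rating,
        "color": RATING_COLORS[rating],
        "score": score,
    }


if __name__ == "__main__":
    sample = "We may share personal data with third parties such as advertisers."
    print(match_excerpt(sample))
```

With the sample excerpt above, the best match is the hypothetical "shares your personal data with third parties" description, so the label would be rated poor and shown in yellow. A real system would likely replace the word-count vectors with a stronger text representation and add a similarity threshold so that unrelated excerpts receive no label at all.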