A Public and Reproducible Assessment of the Topics API on Real Data

Yohan Beugin, Patrick McDaniel
2024-08-16
Abstract: The Topics API for the web is Google's privacy-enhancing alternative to replace third-party cookies. Results of prior work have led to an ongoing discussion between Google and research communities about the capability of Topics to trade off both utility and privacy. The central point of contention is largely around the realism of the datasets used in these analyses and their reproducibility; researchers using data collected on a small sample of users or generating synthetic datasets, while Google's results are inferred from a private dataset. In this paper, we complement prior research by performing a reproducible assessment of the latest version of the Topics API on the largest and publicly available dataset of real browsing histories. First, we measure how unique and stable real users' interests are over time. Then, we evaluate if Topics can be used to fingerprint the users from these real browsing traces by adapting methodologies from prior privacy studies. Finally, we call on web actors to perform and enable reproducible evaluations by releasing anonymized distributions. We find that for the 1207 real users in this dataset, the probability of being re-identified across websites is of 2%, 3%, and 4% after 1, 2, and 3 observations of their topics by advertisers, respectively. This paper shows on real data that Topics does not provide the same privacy guarantees to all users and that the information leakage worsens over time, further highlighting the need for public and reproducible evaluations of the claims made by new web proposals.
Cryptography and Security
What problem does this paper attempt to address?
The paper evaluates the privacy and utility of Google's Topics API on real data. Specifically, it focuses on the following aspects:

1. **Privacy evaluation**: Using the largest publicly available dataset of real browsing histories, the paper evaluates whether the Topics API can protect users' privacy while still providing advertising utility. Most previous studies relied on small samples of users or on synthetic datasets, while Google's evaluation used a private dataset, so its results could not be reproduced.
2. **Stability and uniqueness of user interests**: The researchers measure how real users' interests change over time, to understand whether those interests are unique and stable enough to be used for fingerprinting.
3. **Fingerprinting risk**: The paper examines whether the Topics API can be used to track and re-identify users across websites. Specifically, the researchers simulate how advertisers could link the same user on different websites through the topics observed for that user.
4. **Identification of noise topics**: The researchers also attempt to identify which of the topics returned by the API are noise topics (i.e., randomly added topics), to evaluate whether advertisers can distinguish real topics from noise.
5. **k-anonymity evaluation**: The paper experimentally measures the probability of re-identifying users over multiple observations and discusses the level of k-anonymity that the Topics API provides.

### Main findings

- **Uniqueness and stability of user interests**: More than 93% of users have a unique top-5 topic profile every week. At least 47% of users share 3 or more topics from one week to the next, while fewer than 6% of users share no topics at all.
- **Possibility of fingerprinting**: For the 1,207 real users in the dataset, the probability of being re-identified across websites is 2%, 3%, and 4% after 1, 2, and 3 observations of their topics, respectively. The Topics API therefore does expose some users to fingerprinting.
- **Information leakage and privacy issues**: As the number of observations increases, information leakage grows and more users become re-identifiable. This further highlights the importance of public and reproducible evaluations of new web proposals.

### Conclusion

The paper emphasizes the importance of public and reproducible evaluation of any new web technology proposal, so that limitations are identified at the design stage rather than after deployment. It also shows that, although the Topics API aims to improve user privacy, in practice it does not provide the same privacy protection to all users, and the information leakage worsens over time.
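
The uniqueness and week-to-week stability figures listed under Main findings come down to set comparisons over users' weekly topic profiles. The following is a minimal Python sketch of such measurements on toy data. It is not the paper's code: the function names, user IDs, and topic IDs are all hypothetical, and the paper's actual pipeline (deriving topics from browsing histories, handling noise topics) is not reproduced here.

```python
from collections import Counter


def unique_profile_fraction(profiles):
    """Fraction of users whose topic set is shared with no other user.

    `profiles` maps a (hypothetical) user ID to that user's top topics
    for one week.
    """
    counts = Counter(frozenset(topics) for topics in profiles.values())
    unique = sum(1 for topics in profiles.values()
                 if counts[frozenset(topics)] == 1)
    return unique / len(profiles)


def overlap_fractions(week_a, week_b, threshold=3):
    """Compare each user's topics across two weeks.

    Returns (fraction of users sharing >= threshold topics,
             fraction of users sharing no topic at all),
    computed over users present in both weeks.
    """
    users = week_a.keys() & week_b.keys()
    shared = [len(set(week_a[u]) & set(week_b[u])) for u in users]
    at_least = sum(1 for s in shared if s >= threshold) / len(shared)
    none = sum(1 for s in shared if s == 0) / len(shared)
    return at_least, none


# Toy data with made-up topic IDs (assumption: 5 topics per user per week).
week1 = {
    "u1": [1, 2, 3, 4, 5],
    "u2": [1, 2, 3, 4, 5],       # same profile as u1, so neither is unique
    "u3": [6, 7, 8, 9, 10],
    "u4": [11, 12, 13, 14, 15],
}
week2 = {
    "u1": [1, 2, 3, 20, 21],     # shares 3 topics with week 1
    "u2": [30, 31, 32, 33, 34],  # shares none
    "u3": [6, 7, 8, 9, 10],      # shares all 5
    "u4": [11, 40, 41, 42, 43],  # shares 1
}

print(unique_profile_fraction(week1))    # 0.5
print(overlap_fractions(week1, week2))   # (0.5, 0.25)
```

Profiles are compared as sets, so the order in which topics are listed does not affect the result. A profile shared by several users forms an anonymity set, which is the quantity that the k-anonymity discussion above is concerned with.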