Abstract: We live in an increasingly mobile world, which leads to the duplication of information across domains. Though organizations attempt to obscure the identities of their constituents when sharing information for worthwhile purposes, such as basic research, the uncoordinated nature of such environments can lead to privacy vulnerabilities. For instance, disparate healthcare providers can collect information on the same patient. Federal policy requires that such providers share "de-identified" sensitive data, such as biomedical (e.g., clinical and genomic) records. At the same time, such providers can share identified information, devoid of sensitive biomedical data, for administrative functions. On a provider-by-provider basis, the biomedical and identified records appear unrelated; however, links can be established when multiple providers' databases are studied jointly. The problem, known as trail disclosure, is a generalized phenomenon and occurs because an individual's location access pattern can be matched across the shared databases. Due to technical and legal constraints, it is often difficult to coordinate between providers, and thus it is critical to assess the disclosure risk in distributed environments so that we can develop techniques to mitigate such risks. Research on privacy protection has so far focused on developing technologies to suppress or encrypt identifiers associated with sensitive information. There is a growing body of work on the formal assessment of the disclosure risk of database entries in publicly shared databases, but less attention has been paid to the distributed setting. In this research, we review the trail disclosure problem in several domains with known vulnerabilities and show that disclosure risk is influenced by the distribution of how people visit service providers. Based on empirical evidence, we propose an entropy metric for assessing such risk in shared databases prior to their release.
This metric assesses risk by leveraging the statistical characteristics of a visit distribution, as opposed to person-level data. It is computationally efficient and superior to existing risk assessment methods, which rely on ad hoc assessments that are often computationally expensive and unreliable. We evaluate our approach on a range of location access patterns in simulated environments. Our results demonstrate that the approach is effective at estimating trail disclosure risks and that the amount of self-information contained in a distributed system is one of the main driving factors.
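
The abstract does not give the exact form of the proposed metric, so the following is only an illustrative sketch of the general idea: computing the Shannon entropy of a visit distribution from aggregate counts, without any person-level records. The function name `visit_entropy` and the example counts are assumptions made for illustration, not taken from the paper.

```python
import math


def visit_entropy(visit_counts):
    """Shannon entropy, in bits, of a distribution of visits across providers.

    visit_counts is a hypothetical list of aggregate visit counts, one per
    provider. This is the textbook entropy formula, not necessarily the
    metric defined in the paper.
    """
    total = sum(visit_counts)
    if total <= 0:
        raise ValueError("visit_counts must contain at least one visit")
    # Providers with zero visits contribute nothing to the entropy.
    probs = [c / total for c in visit_counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)


# Visits spread evenly over four providers versus concentrated on one.
uniform = visit_entropy([25, 25, 25, 25])
skewed = visit_entropy([97, 1, 1, 1])
print(f"uniform: {uniform:.3f} bits, skewed: {skewed:.3f} bits")
```

An evenly spread visit distribution yields higher entropy than one concentrated on a single provider. How such a value maps onto trail disclosure risk is the subject of the paper's evaluation and is not assumed here.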

Disclosure risk assessment with Bayesian non-parametric hierarchical modelling

DPHMM: Customizable Data Release with Differential Privacy Via Hidden Markov Model.

Assessing Statistical Disclosure Risk for Differentially Private, Hierarchical Count Data, with Application to the 2020 U.S. Decennial Census

A bayesian hierarchical spatial model for dental caries assessment using non-gaussian markov random fields.

A Novel Microdata Privacy Disclosure Risk Measure

Bayesian Models for Heterogeneous Personalized Health Data

Bayesian Data Synthesis and Disclosure Risk Quantification: An Application to the Consumer Expenditure Surveys

Optimal disclosure risk assessment

Distributed Learning from Multi-Site Observational Health Data for Zero-Inflated Count Outcomes

A Bayesian hierarchical small-area population model accounting for data source specific methodologies from American Community Survey, Population Estimates Program, and Decennial Census data

A Tutorial in Assessing Disclosure Risk in Microdata

Bayesian Approaches to Collaborative Data Analysis with Strict Privacy Restrictions

Bayesian Analysis of Population Health Data

A Bayesian nonparametric approach to correct for underreporting in count data

Bayesian Estimation of Attribute Disclosure Risks in Synthetic Data with the $\texttt{AttributeRiskCalculation}$ R Package

Data Augmentation MCMC for Bayesian Inference from Privatized Data

An Entropy Approach to Disclosure Risk Assessment: Lessons from Real Applications and Simulated Domains

Nonparametric Bayes models for mixed-scale longitudinal surveys

Bayesian modelling for spatially misaligned health areal data: a multiple membership approach

Bayesian nonparametric hierarchical modeling for multiple membership data in grouped attendance interventions

Bayesian Analysis of Generalized Hierarchical Indian Buffet Processes for Within and Across Group Sharing of Latent Features