CaPC Learning: Confidential and Private Collaborative Learning

Christopher A. Choquette-Choo, Natalie Dullerud, Adam Dziedzic, Yunxiang Zhang, Somesh Jha, Nicolas Papernot, Xiao Wang
DOI: https://doi.org/10.48550/arXiv.2102.05188
2021-03-20
Abstract: Machine learning benefits from large training datasets, which may not always be possible to collect by any single entity, especially when using privacy-sensitive data. In many contexts, such as healthcare and finance, separate parties may wish to collaborate and learn from each other's data but are prevented from doing so due to privacy regulations. Some regulations prevent explicit sharing of data between parties by joining datasets in a central location (confidentiality). Others also limit implicit sharing of data, e.g., through model predictions (privacy). There is currently no method that enables machine learning in such a setting, where both confidentiality and privacy need to be preserved, to prevent both explicit and implicit sharing of data. Federated learning only provides confidentiality, not privacy, since gradients shared still contain private information. Differentially private learning assumes unreasonably large datasets. Furthermore, both of these learning paradigms produce a central model whose architecture was previously agreed upon by all parties rather than enabling collaborative learning where each party learns and improves their own local model. We introduce Confidential and Private Collaborative (CaPC) learning, the first method provably achieving both confidentiality and privacy in a collaborative setting. We leverage secure multi-party computation (MPC), homomorphic encryption (HE), and other techniques in combination with privately aggregated teacher models. We demonstrate how CaPC allows participants to collaborate without having to explicitly join their training sets or train a central model. Each party is able to improve the accuracy and fairness of their model, even in settings where each party has a model that performs well on their own dataset or when datasets are not IID and model architectures are heterogeneous across parties.
Machine Learning, Cryptography and Security
What problem does this paper attempt to address?
The paper addresses the following problem: when multiple entities (such as hospitals or financial institutions) want to collaborate and learn from each other's data, how can they do so while preserving both the confidentiality and the privacy of that data? Specifically, the paper targets these challenges:

1. **Data privacy and confidentiality**: In many fields (such as healthcare and finance), entities may want to use each other's data to improve their own models, but privacy regulations prevent them from sharing data either explicitly or implicitly through model predictions.
2. **Limitations of existing methods**:
   - **Federated learning**: It provides confidentiality but not privacy, because the shared gradients still contain private information.
   - **Differentially private learning**: It assumes unreasonably large datasets. Like federated learning, it also produces a single central model whose architecture all parties agreed on beforehand, rather than letting each party improve its own local model.
3. **Heterogeneous models and a small number of participants**: Existing decentralized methods (such as federated learning) usually require a large number of participants to achieve differential privacy, but in practice the number of participants is limited and each participant's model architecture may differ.

To solve these problems, the paper proposes **Confidential and Private Collaborative (CaPC) learning**, a collaborative learning framework that lets different entities improve their local models while protecting data confidentiality and privacy. CaPC achieves this by combining secure multi-party computation (MPC), homomorphic encryption (HE), and other techniques with privately aggregated teacher models.
### Specific problem description

The problem can be formalized as follows: there are \( K \) entities \( \{P_i\}_{i = 1}^K \). Each entity \( P_i \) holds a private dataset \( D_i=\{(x_j, y_j \text{ or } \emptyset)\}_{j = 1}^{N_i} \), in which some inputs may be unlabeled (\( \emptyset \)), and can fit a prediction model \( M_i \) on that dataset. The entities want to improve the performance of their respective models through cooperation, but because the data is private they cannot directly share data or data derivatives (such as model weights). Instead, they cooperate by querying each other for labels on inputs of interest.

### Threat model

To obtain strong confidentiality and privacy guarantees, the paper introduces a semi-trusted third party, the Privacy Guardian (PG). The PG is assumed not to collude with any entity, while an attacker may corrupt any subset of \( C \) entities \( \{P_i\}_{i = 1}^C \). Corrupting more than one entity does not affect the confidentiality guarantee, but the privacy budget \( \epsilon \) degrades because the sensitivity of the aggregation mechanism increases.

### Solution

The paper proposes a private aggregation protocol for teacher models that builds on the work of Papernot et al. (2017), adding two-party confidential inference and secret sharing to ensure confidentiality. The steps are as follows:

1. The querying party \( P_{i^*} \) sends the encrypted input \( x \) to all answering parties \( P_i \) (\( i\neq i^* \)).
2. Each answering party runs a secure two-party protocol with the querying party to generate a prediction, and the predicted label is computed through Yao's garbled circuit protocol.
3. The PG aggregates the labels and adds noise to ensure differential privacy.
4. The querying party improves its local model using the aggregated labels.

In this way, CaPC lets each entity cooperate effectively and improve its own model while protecting data confidentiality and privacy.
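The aggregation in steps 2 to 4 can be illustrated with a minimal Python sketch. It is not the paper's implementation: it only shows how one-hot votes can be split into additive secret shares, summed separately, and turned into a noisy label. The names `share`, `add_mod`, `noisy_argmax`, and `aggregate_query`, the ring size `MODULUS`, and the choice of Gaussian noise with scale `sigma` are all assumptions made for illustration. The sketch also reconstructs the vote histogram in the clear and uses a non-cryptographic random generator, whereas the protocol described above relies on garbled circuits so that no single party sees the votes.

```python
import random

MODULUS = 2**32  # hypothetical ring size for additive secret sharing


def one_hot(label, num_classes):
    """Encode a predicted label as a one-hot vote vector."""
    return [1 if c == label else 0 for c in range(num_classes)]


def share(vector, rng):
    """Split an integer vector into two additive shares modulo MODULUS."""
    mask = [rng.randrange(MODULUS) for _ in vector]
    rest = [(v - m) % MODULUS for v, m in zip(vector, mask)]
    return mask, rest


def add_mod(a, b):
    """Element-wise addition modulo MODULUS."""
    return [(x + y) % MODULUS for x, y in zip(a, b)]


def noisy_argmax(votes, sigma, rng):
    """Return the index of the largest vote count after adding Gaussian noise."""
    noisy = [v + rng.gauss(0.0, sigma) for v in votes]
    return max(range(len(noisy)), key=noisy.__getitem__)


def aggregate_query(teacher_labels, num_classes, sigma, seed=0):
    """Simulate one query: share each vote, sum the shares, release a noisy label.

    random.Random is NOT cryptographically secure; it is used here only so
    the sketch is reproducible.
    """
    rng = random.Random(seed)
    pg_sum = [0] * num_classes       # shares accumulated on the PG side
    querier_sum = [0] * num_classes  # shares accumulated on the querier side
    for label in teacher_labels:
        s_hat, s = share(one_hot(label, num_classes), rng)
        pg_sum = add_mod(pg_sum, s_hat)
        querier_sum = add_mod(querier_sum, s)
    # Assumption: reconstructed in the clear for illustration only.
    votes = add_mod(pg_sum, querier_sum)
    return votes, noisy_argmax(votes, sigma, rng)
```

For example, `aggregate_query([2, 2, 1, 2, 0], num_classes=3, sigma=0.0)` returns the histogram `[1, 1, 3]` and the label `2`. With `sigma > 0`, the released label can differ from the plain majority vote, which is the accuracy cost of the differential privacy guarantee.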