Federated t-SNE and UMAP for Distributed Data Visualization

Dong Qiao, Xinxian Ma, Jicong Fan
2024-12-18
Abstract: High-dimensional data visualization is crucial in the big data era, and techniques such as t-SNE and UMAP have been widely used in science and engineering. Big data, however, is often distributed across multiple data centers and subject to security and privacy concerns, which creates difficulties for the standard t-SNE and UMAP algorithms. To tackle this challenge, this work proposes Fed-tSNE and Fed-UMAP, which provide high-dimensional data visualization under the framework of federated learning, without exchanging data across clients or sending data to the central server. The main idea of Fed-tSNE and Fed-UMAP is to implicitly learn the distribution information of the data in a federated manner and then estimate the global distance matrix for t-SNE and UMAP. To further enhance the protection of data privacy, we propose Fed-tSNE+ and Fed-UMAP+. We also extend our idea to federated spectral clustering, yielding algorithms for clustering distributed data. In addition to these new algorithms, we offer theoretical guarantees on optimization convergence, distance and similarity estimation, and differential privacy. Experiments on multiple datasets demonstrate that, compared to the original algorithms, the accuracy drops of our federated algorithms are tiny.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses how to visualize distributed high-dimensional data within a federated learning framework, without exchanging data across clients or sending data to a central server. Specifically, the standard t-SNE and UMAP algorithms face two main challenges when dealing with large-scale distributed data:

1. **Data distribution problem**: In many practical applications (such as mobile devices, Internet of Things networks, medical records, and social media platforms), high-dimensional data are distributed across multiple data centers and are subject to security and privacy restrictions. Running the standard algorithms would require the data centers or clients to share data with each other or send it to a common central server, which risks privacy leakage and compromises information security.
2. **Computational complexity problem**: The t-SNE and UMAP algorithms need the pairwise distances or similarities between all data points, including pairs whose two points are held by different data sources. The data must therefore be shared between sources or uploaded to the central server for global computation. This requirement increases the communication cost and may be infeasible when the amount of data is very large.

To solve these problems, the paper proposes the following methods:

- **Fed-tSNE and Fed-UMAP**: Implicitly learn the distribution information of the data through federated learning and use it to estimate the global distance matrix, thereby achieving distributed visualization of high-dimensional data.
- **Fed-tSNE+ and Fed-UMAP+**: Further enhance the protection of data privacy by introducing noise.
- **Extension to federated spectral clustering**: Apply the same idea to clustering distributed data.

The paper also provides theoretical guarantees covering optimization convergence, distance and similarity estimation, and differential privacy.
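To make the pairwise-distance requirement concrete, the following minimal sketch builds the global squared-distance matrix for two clients. The client names, array sizes, and the helper `pairwise_sq_dists` are illustrative assumptions and do not come from the paper. The sketch only shows that the cross-client block cannot be computed by either client alone.

```python
import numpy as np

def pairwise_sq_dists(a, b):
    """Squared Euclidean distances between rows of a and rows of b."""
    a2 = np.sum(a * a, axis=1)[:, None]
    b2 = np.sum(b * b, axis=1)[None, :]
    # Clip tiny negative values caused by floating-point round-off.
    return np.maximum(a2 + b2 - 2.0 * a @ b.T, 0.0)

rng = np.random.default_rng(0)
x_a = rng.normal(size=(5, 8))  # hypothetical data held by client A
x_b = rng.normal(size=(7, 8))  # hypothetical data held by client B

d_aa = pairwise_sq_dists(x_a, x_a)  # computable locally on client A
d_bb = pairwise_sq_dists(x_b, x_b)  # computable locally on client B
d_ab = pairwise_sq_dists(x_a, x_b)  # needs raw data from BOTH clients

# The matrix that standard t-SNE/UMAP would consume.
d_global = np.block([[d_aa, d_ab],
                     [d_ab.T, d_bb]])
```

Only the diagonal blocks `d_aa` and `d_bb` are available without moving data. A federated method has to estimate the off-diagonal block `d_ab` from something other than the raw points.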
Through these methods, the paper aims to visualize distributed high-dimensional data efficiently without violating data privacy.
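One way to picture how cross-client similarities could be estimated without exchanging raw data is a Nyström-style low-rank approximation through a set of shared reference points. The sketch below is an assumption-laden illustration, not the paper's algorithm: the summary does not specify how the global distance matrix is estimated. The reference points `refs`, the Gaussian kernel, the `noise_std` option, and all function names are hypothetical. The added noise is not calibrated to any differential privacy budget.

```python
import numpy as np

def gaussian_kernel(a, b, gamma=0.1):
    """Gaussian kernel matrix between rows of a and rows of b."""
    sq = (np.sum(a * a, axis=1)[:, None]
          + np.sum(b * b, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))

def client_message(x_local, refs, noise_std=0.0, rng=None):
    """What a client shares: kernel values to the reference points only.

    noise_std > 0 adds Gaussian noise (illustrative, uncalibrated).
    """
    c = gaussian_kernel(x_local, refs)
    if noise_std > 0.0:
        rng = np.random.default_rng() if rng is None else rng
        c = c + rng.normal(scale=noise_std, size=c.shape)
    return c

def server_estimate(messages, refs):
    """Nystrom-style estimate K ~ C W^+ C^T from stacked client messages."""
    c = np.vstack(messages)
    w_pinv = np.linalg.pinv(gaussian_kernel(refs, refs))
    return c @ w_pinv @ c.T

rng = np.random.default_rng(0)
x_a = rng.normal(size=(5, 8))    # hypothetical data on client A
x_b = rng.normal(size=(7, 8))    # hypothetical data on client B
refs = rng.normal(size=(20, 8))  # hypothetical shared reference points

k_est = server_estimate(
    [client_message(x_a, refs), client_message(x_b, refs)], refs)

# Ground truth, computable here only because the sketch holds all data.
x_all = np.vstack([x_a, x_b])
k_true = gaussian_kernel(x_all, x_all)
max_abs_err = np.abs(k_est - k_true).max()
```

The quality of the estimate depends on how well the reference points cover the data, which is why learning them from the data distribution matters. For a Gaussian kernel, the squared distance in the kernel's feature space is 2 - 2k(x, y), so a similarity estimate of this kind also yields a distance estimate.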