Efficient Matrix Sketching over Distributed Data
Zengfeng Huang,Xuemin Lin,Wenjie Zhang,Ying Zhang
DOI: https://doi.org/10.1145/3034786.3056119
2017-01-01
Abstract: A sketch or synopsis of a large dataset captures vital properties of the original data while typically occupying much less space. In this paper, we consider the problem of computing a sketch of a massive data matrix $A \in \mathbb{R}^{n \times d}$ that is distributed across a large number $s$ of servers. Our goal is to output a matrix $B \in \mathbb{R}^{\ell \times d}$ that is significantly smaller than $A$ but still approximates it well in terms of covariance error, i.e., $\|A^T A - B^T B\|_2$. Here, for a matrix $A$, $\|A\|_2$ is the spectral norm of $A$, defined as its largest singular value. Following previous works, we call $B$ a covariance sketch of $A$. We focus mainly on minimizing the communication cost, which is arguably the most valuable resource in distributed computations. We show a gap between the deterministic and randomized communication complexity of computing a covariance sketch. More specifically, we first prove a tight deterministic lower bound, then show how to bypass this lower bound using randomization. In Principal Component Analysis (PCA), the goal is to find a low-dimensional subspace that captures as much of the variance of a dataset as possible. Based on a well-known connection between covariance sketches and PCA, we give a new algorithm for distributed PCA with improved communication cost. Moreover, in our algorithms, each server needs to make only one pass over its data with limited working space.
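To make the covariance error concrete, the following is a minimal NumPy sketch that evaluates $\|A^T A - B^T B\|_2$ for a small synthetic matrix. The sketch $B$ here is built with a centralized truncated SVD, which is a stand-in chosen purely for illustration and is not the distributed algorithm of the paper; the function names `covariance_error` and `truncated_svd_sketch`, the matrix sizes, and the random data are all assumptions.

```python
import numpy as np

def covariance_error(A, B):
    """Spectral norm of A^T A - B^T B, i.e. its largest singular value."""
    return np.linalg.norm(A.T @ A - B.T @ B, ord=2)

def truncated_svd_sketch(A, ell):
    """Illustrative centralized sketch (NOT the paper's method):
    keep the top-ell right singular vectors, scaled by their singular values."""
    _, s, Vt = np.linalg.svd(A, full_matrices=False)
    return s[:ell, None] * Vt[:ell]

# Hypothetical data: n = 200 rows, d = 10 columns, sketch size ell = 4.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 10))
B = truncated_svd_sketch(A, ell=4)

err = covariance_error(A, B)
sigma = np.linalg.svd(A, compute_uv=False)  # singular values, descending
```

For this particular stand-in sketch, $A^T A - B^T B$ retains only the discarded singular directions, so the covariance error equals the square of the $(\ell+1)$-th singular value of $A$ (`sigma[4] ** 2` above). A sketch with $B = A$ has zero covariance error but saves no space.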