Communication-Efficient Distributed Covariance Sketch, with Application to Distributed PCA

Zengfeng Huang, Xuemin Lin, Wenjie Zhang, Ying Zhang
2021-01-01
Journal of Machine Learning Research
Abstract: A sketch of a large data set captures vital properties of the original data while typically occupying much less space. In this paper, we consider the problem of computing a sketch of a massive data matrix $A \in \mathbb{R}^{n \times d}$ that is distributed across $s$ machines. Our goal is to output a matrix $B \in \mathbb{R}^{\ell \times d}$ that is significantly smaller than $A$ but still approximates it well in terms of covariance error, i.e., $\|A^T A - B^T B\|_2$. Such a matrix $B$ is called a covariance sketch of $A$. We focus mainly on minimizing the communication cost, which is arguably the most valuable resource in distributed computations. We show that there is a nontrivial gap between deterministic and randomized communication complexity for computing a covariance sketch. More specifically, we first prove an almost tight deterministic communication lower bound, then provide a new randomized algorithm with communication cost smaller than the deterministic lower bound. Based on a well-known connection between covariance sketch and approximate principal component analysis, we obtain better communication bounds for the distributed PCA problem. Moreover, we also give an improved distributed PCA algorithm for sparse input matrices, which uses our distributed sketching algorithm as a key building block.
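To make the covariance error concrete, the following is a minimal Python sketch of how the spectral norm of A^T A - B^T B could be measured for a candidate sketch B. It is not the paper's algorithm: the helper names `covariance_error` and `svd_sketch` are hypothetical, and the centralized truncated-SVD sketch is only an illustrative baseline, chosen because its error is known to equal the square of the (l+1)-th singular value of A.

```python
import numpy as np

def covariance_error(A, B):
    """Spectral norm of A^T A - B^T B (the covariance error of sketch B)."""
    return np.linalg.norm(A.T @ A - B.T @ B, ord=2)

def svd_sketch(A, ell):
    """Illustrative baseline only (an assumption, not the paper's method):
    keep the top-ell right singular vectors of A, scaled by their singular values."""
    _, S, Vt = np.linalg.svd(A, full_matrices=False)
    return S[:ell, None] * Vt[:ell]

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 10))   # n = 200 rows, d = 10 columns
ell = 4                              # number of rows kept in the sketch
B = svd_sketch(A, ell)               # B has shape (ell, d)

err = covariance_error(A, B)
sigma = np.linalg.svd(A, compute_uv=False)
# For this particular sketch, the error equals sigma[ell] ** 2.
print(B.shape, np.isclose(err, sigma[ell] ** 2))
```

This baseline needs the whole matrix in one place, which is exactly the cost the distributed setting aims to avoid; it serves only as a reference point for the error measure.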