Abstract: Count-sketch is a popular matrix sketching algorithm that can produce a sketch of an input data matrix X in O(nnz(X)) time, where nnz(X) denotes the number of non-zero entries in X. The sketched matrix is much smaller than X while preserving most of its properties. Count-sketch is therefore widely used to address the high-dimensionality challenge in machine learning. However, count-sketch has two main limitations: (1) The sketching matrix used by count-sketch is generated randomly and does not consider any intrinsic data properties of X. This data-oblivious sketching method could produce a poor sketched matrix, resulting in low accuracy for subsequent machine learning tasks (e.g., classification); (2) For highly sparse input data, count-sketch could produce a dense sketched data matrix. This dense sketched matrix could make subsequent machine learning tasks more computationally expensive than on the original sparse data X. To address these two limitations, we first show an interesting connection between count-sketch and k-means clustering by analyzing the reconstruction error of the count-sketch method. Based on our analysis, we propose to reduce the reconstruction error of count-sketch by using the k-means clustering algorithm to obtain the low-dimensional sketched matrix. In addition, we propose to solve k-means clustering using gradient descent with L1-ball projection to produce a sparse sketched matrix. Our experimental results on six real-life classification datasets demonstrate that our proposed method achieves higher accuracy than the original count-sketch and other popular matrix sketching algorithms. Our results also demonstrate that our method produces a sparser sketched data matrix than other methods, and therefore its prediction cost will be smaller than that of other matrix sketching methods.
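For readers unfamiliar with the baseline, the standard data-oblivious count-sketch described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the method proposed in the paper: it assumes X is stored with samples as rows and features as columns, and the function name `count_sketch`, the parameter `m` (number of sketched features), and the use of NumPy's random generator in place of hash functions are all assumptions made for this example.

```python
import numpy as np

def count_sketch(X, m, seed=0):
    """Sketch the d columns of X (shape n x d) down to m columns.

    Each original column j gets one random bucket h[j] in [0, m) and one
    random sign s[j] in {-1, +1}. Sketched column b is the signed sum of
    all original columns assigned to bucket b.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    h = rng.integers(0, m, size=d)                 # bucket per feature
    s = rng.choice(np.array([-1.0, 1.0]), size=d)  # sign per feature
    sketch = np.zeros((n, m))
    # Visit only the non-zero entries of X, so the work is O(nnz(X)).
    rows, cols = np.nonzero(X)
    np.add.at(sketch, (rows, h[cols]), X[rows, cols] * s[cols])
    return sketch, h, s

# Usage: reduce 4 features to 2 sketched features.
X = np.array([[1.0, 0.0, 2.0, 0.0],
              [0.0, 3.0, 0.0, 4.0]])
Z, h, s = count_sketch(X, m=2, seed=1)
print(Z.shape)
```

Because the buckets and signs are drawn without looking at X, unrelated features can be merged into one bucket, and a sparse row can map to a row with most buckets filled. These are the two limitations the abstract sets out to address.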

Efficient Matrix Sketching over Distributed Data

Communication-Efficient Distributed Covariance Sketch, with Application to Distributed PCA

Distributed High-Dimension Matrix Operation Optimization on Spark

Distributed Least Squares in Small Space via Sketching and Bias Reduction

Near Optimal Frequent Directions for Sketching Dense and Sparse Matrices

Robust Covariance Estimation for Distributed Principal Component Analysis

Seeing the Forest from the Trees in Two Looks: Matrix Sketching by Cascaded Bilateral Sampling

Localized sketching for matrix multiplication and ridge regression

Matrix Sketching in Bandits: Current Pitfalls and New Framework

Randomization or Condensation?: Linear-Cost Matrix Sketching Via Cascaded Compression Sampling

Effective and Sparse Count-Sketch via k-means clustering

Optimal Matrix Sketching over Sliding Windows

On Sketching Quadratic Forms

Distributed estimation of principal eigenspaces

Tight Bounds for the Subspace Sketch Problem with Applications

Sketching for First Order Method: Efficient Algorithm for Low-Bandwidth Channel and Vulnerability

Statistical properties of sketching algorithms

Scalable and Robust Community Detection with Randomized Sketching

Efficient Sparse PCA via Block-Diagonalization

Distributed Estimation for Principal Component Analysis: An Enlarged Eigenspace Analysis

Approximate Multiplication of Sparse Matrices with Limited Space