Abstract: Count-sketch is a popular matrix sketching algorithm that can produce a sketch of an input data matrix X in O(nnz(X)) time, where nnz(X) denotes the number of non-zero entries in X. The sketched matrix is much smaller than X while preserving most of its properties. Count-sketch is therefore widely used to address the high-dimensionality challenge in machine learning. However, count-sketch has two main limitations: (1) The sketching matrix used by count-sketch is generated randomly and does not consider any intrinsic data properties of X. This data-oblivious sketching method can produce a poor sketched matrix, which results in low accuracy for subsequent machine learning tasks (e.g., classification); (2) For highly sparse input data, count-sketch can produce a dense sketched data matrix. This dense sketched matrix can make subsequent machine learning tasks more computationally expensive than they would be on the original sparse data X. To address these two limitations, we first show an interesting connection between count-sketch and k-means clustering by analyzing the reconstruction error of the count-sketch method. Based on our analysis, we propose to reduce the reconstruction error of count-sketch by using the k-means clustering algorithm to obtain the low-dimensional sketched matrix. In addition, we propose to solve k-means clustering using gradient descent with L1-ball projection to produce a sparse sketched matrix. Our experimental results on six real-life classification datasets demonstrate that our proposed method achieves higher accuracy than the original count-sketch and other popular matrix sketching algorithms. Our results also demonstrate that our method produces a sparser sketched data matrix than other methods, and therefore the prediction cost of our method will be smaller than that of other matrix sketching methods.
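
As a rough illustration of the baseline the abstract describes, the following is a minimal sketch of standard (data-oblivious) count-sketch applied to the feature dimension, not the method proposed in the paper. It assumes each row of X is stored as a `{column: value}` dictionary; the names `make_count_sketch` and `count_sketch` and the example data are hypothetical and chosen only for this illustration.

```python
import random


def make_count_sketch(d, k, seed=0):
    """Draw a random bucket in [0, k) and a random sign in {-1, +1}
    for each of the d input features (data-oblivious: X is never inspected)."""
    rng = random.Random(seed)
    buckets = [rng.randrange(k) for _ in range(d)]
    signs = [rng.choice((-1.0, 1.0)) for _ in range(d)]
    return buckets, signs


def count_sketch(rows, buckets, signs, k):
    """Sketch sparse rows ({column: value} dicts) into length-k rows.

    Each non-zero entry is visited exactly once, so the total work is
    proportional to nnz(X) plus the cost of allocating the output rows."""
    sketched = []
    for row in rows:
        out = [0.0] * k
        for j, value in row.items():
            # Feature j is added, with its random sign, to its assigned bucket.
            out[buckets[j]] += signs[j] * value
        sketched.append(out)
    return sketched


# Hypothetical toy input: 3 samples, 10 features, sketched down to 4 columns.
X = [{0: 1.0, 7: 2.0}, {3: -1.5}, {1: 0.5, 2: 0.5, 9: 4.0}]
buckets, signs = make_count_sketch(d=10, k=4, seed=0)
X_sketch = count_sketch(X, buckets, signs, k=4)
```

Because the buckets and signs are drawn without looking at X, two unrelated features can land in the same bucket, which is the data-oblivious behavior named in limitation (1). Assigning buckets can also be read as grouping features into k clusters, which gives some intuition for the connection to k-means that the abstract mentions.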

Scalable Kernel $K$-Means with Randomized Sketching: from Theory to Algorithm

Self-representative kernel concept factorization

Effective and Sparse Count-Sketch via k-means clustering

Accumulations of Projections--A Unified Framework for Random Sketches in Kernel Ridge Regression

Scalable Kernel Clustering: Approximate Kernel k-means

Scalable and Robust Community Detection with Randomized Sketching

Deterministic Clustering in High Dimensional Spaces: Sketches and Approximation

Nonparametric Testing under Randomized Sketching

Randomized sketch descent methods for non-separable linearly constrained optimization

Sparse Multiple Kernel Learning: Minimax Rates with Random Projection

JoinSketch: A Sketch Algorithm for Accurate and Unbiased Inner-Product Estimation

Regularized Simple Multiple Kernel $k$-Means With Kernel Average Alignment

Sketching Algorithms for Sparse Dictionary Learning: PTAS and Turnstile Streaming

Sketch In, Sketch Out: Accelerating both Learning and Inference for Structured Prediction with Kernels

Learning the Positions in CountSketch

Sketching for Convex and Nonconvex Regularized Least Squares with Sharp Guarantees

Iterative Hessian sketch: Fast and accurate solution approximation for constrained least-squares

Efficient Matrix Sketching over Distributed Data

Inference in Randomized Least Squares and PCA via Normality of Quadratic Forms

Kernel k'-means algorithm for clustering analysis

Communication-efficient k-Means for Edge-based Machine Learning