Low-rank Label Propagation for Semi-supervised Learning with 100 Millions Samples

Raphael Petegrosso, Wei Zhang, Zhuliu Li, Yousef Saad, Rui Kuang
DOI: https://doi.org/10.48550/arXiv.1702.08884
2017-03-01
Abstract: The success of semi-supervised learning crucially relies on the scalability to a huge amount of unlabelled data that are needed to capture the underlying manifold structure for better classification. Since computing the pairwise similarity between the training data is prohibitively expensive in most kinds of input data, currently, there is no general ready-to-use semi-supervised learning method/tool available for learning with tens of millions or more data points. In this paper, we adopted the idea of two low-rank label propagation algorithms, GLNP (Global Linear Neighborhood Propagation) and Kernel Nyström Approximation, and implemented the parallelized version of the two algorithms accelerated with Nesterov's accelerated projected gradient descent for Big-data Label Propagation (BigLP). The parallel algorithms are tested on five real datasets ranging from 7000 to 10,000,000 in size and a simulation dataset of 100,000,000 samples. In the experiments, the implementation can scale up to datasets with 100,000,000 samples and hundreds of features and the algorithms also significantly improved the prediction accuracy when only a very small percentage of the data is labeled. The results demonstrate that the BigLP implementation is highly scalable to big data and effective in utilizing the unlabeled data for semi-supervised learning.
Machine Learning
What problem does this paper attempt to address?
The paper addresses the scalability of semi-supervised learning on datasets containing tens or even hundreds of millions of unlabeled data points. Existing semi-supervised methods cannot be applied effectively at this scale because computing the pairwise similarity between training points is too expensive, so no general, ready-to-use tool or method is available for such datasets. To meet this challenge, the paper presents BigLP (Big-data Label Propagation), a parallelized implementation of two low-rank label propagation algorithms, GLNP and Kernel Nyström Approximation, accelerated with Nesterov's accelerated projected gradient descent. Experimental results show that the implementation scales to datasets with up to 100 million samples and significantly improves prediction accuracy when only a small fraction of the data is labeled.

### Key Problem Summary

1. **Large-scale Data Processing**: Traditional semi-supervised learning methods struggle with datasets containing tens or even hundreds of millions of unlabeled data points.
2. **High Computational Cost**: Computing the pairwise similarity matrix \( W \) is too expensive, which limits the range of applications of existing methods.
3. **Lack of Efficient Tools**: No off-the-shelf tool or method can effectively handle datasets of this size.

### Solution Overview

- **Low-rank Approximation**: Reduce the storage and computational requirements of the similarity matrix through low-rank approximation.
- **Parallel Implementation**: Use parallel computing to speed up the algorithms and handle larger datasets.
- **Accelerated Optimization**: Apply Nesterov's accelerated projected gradient descent to further improve efficiency.

Together, these improvements make BigLP scale to, and perform well on, large datasets.
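To make the low-rank idea concrete, here is a minimal NumPy sketch, not the BigLP implementation. It assumes a Gaussian (RBF) similarity, uniformly sampled landmarks for the Nyström step, and a simple fixed-point label propagation update in place of the paper's Nesterov-accelerated projected gradient descent; it is also serial rather than parallel. The names `rbf_kernel`, `nystrom_factor`, `propagate`, `gamma`, `alpha`, and `rel_tol` are illustrative and do not come from the paper. The point is that the \( n \times n \) matrix \( W \) is never formed: only an \( n \times r \) factor is stored, and each propagation step costs \( O(nr) \).

```python
import numpy as np


def rbf_kernel(A, B, gamma):
    """Gaussian similarity exp(-gamma * ||a - b||^2) between rows of A and B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * (A @ B.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))


def nystrom_factor(X, k, gamma, rng, rel_tol=1e-6):
    """Return F (n x r, r <= k) with F @ F.T approximating the n x n kernel matrix."""
    landmarks = rng.choice(X.shape[0], size=k, replace=False)
    C = rbf_kernel(X, X[landmarks], gamma)   # n x k: similarities to landmarks only
    W_kk = C[landmarks]                      # k x k: landmark-to-landmark block
    vals, vecs = np.linalg.eigh(W_kk)
    keep = vals > rel_tol * vals.max()       # drop near-null directions for stability
    return C @ (vecs[:, keep] / np.sqrt(vals[keep]))


def propagate(F, y, alpha=0.9, iters=50):
    """Iterate f <- alpha * S f + (1 - alpha) * y, with S = D^-1/2 (F F^T) D^-1/2.

    Assumption: the approximate row sums d are positive, which holds when the
    landmarks cover the data reasonably well.
    """
    d = F @ (F.T @ np.ones(F.shape[0]))      # approximate row sums, no n x n matrix
    G = F / np.sqrt(np.maximum(d, 1e-12))[:, None]
    f = y.astype(float).copy()
    for _ in range(iters):
        f = alpha * (G @ (G.T @ f)) + (1.0 - alpha) * y   # O(n * r) per step
    return f
```

In this sketch, `y` holds `+1` or `-1` for the few labeled points and `0` elsewhere, and the sign of the returned `f` gives the predicted class of each unlabeled point. Storing `F` takes \( O(nr) \) memory instead of \( O(n^2) \), which is what makes the approach plausible at the scale the paper targets.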