Low-rank Label Propagation for Semi-supervised Learning with 100 Millions Samples

Raphael Petegrosso, Wei Zhang, Zhuliu Li, Yousef Saad, Rui Kuang

DOI: https://doi.org/10.48550/arXiv.1702.08884

2017-03-01

Abstract: The success of semi-supervised learning crucially relies on the scalability to a huge amount of unlabelled data that are needed to capture the underlying manifold structure for better classification. Since computing the pairwise similarity between the training data is prohibitively expensive in most kinds of input data, currently, there is no general ready-to-use semi-supervised learning method/tool available for learning with tens of millions or more data points. In this paper, we adopted the idea of two low-rank label propagation algorithms, GLNP (Global Linear Neighborhood Propagation) and Kernel Nyström Approximation, and implemented the parallelized version of the two algorithms accelerated with Nesterov's accelerated projected gradient descent for Big-data Label Propagation (BigLP). The parallel algorithms are tested on five real datasets ranging from 7000 to 10,000,000 in size and a simulation dataset of 100,000,000 samples. In the experiments, the implementation can scale up to datasets with 100,000,000 samples and hundreds of features and the algorithms also significantly improved the prediction accuracy when only a very small percentage of the data is labeled. The results demonstrate that the BigLP implementation is highly scalable to big data and effective in utilizing the unlabeled data for semi-supervised learning.

Machine Learning

What problem does this paper attempt to address?

The problem this paper addresses is the scalability of semi-supervised learning on datasets containing tens or even hundreds of millions of unlabeled data points. Existing semi-supervised learning methods cannot be applied effectively at this scale because computing the pairwise similarity between training samples is too expensive, so no general, ready-to-use tool exists for such datasets. To meet this challenge, the paper proposes BigLP (Big-data Label Propagation), a parallelized label propagation implementation based on low-rank approximation. It builds on two low-rank label propagation algorithms, GLNP (Global Linear Neighborhood Propagation) and Kernel Nyström Approximation, and accelerates them with Nesterov's accelerated projected gradient descent. In the experiments, the implementation scales to datasets with up to 100 million samples and significantly improves prediction accuracy when only a small portion of the data is labeled.

### Key Problem Summary

1. **Large-scale Data Processing**: Traditional semi-supervised learning methods struggle with datasets containing tens or even hundreds of millions of unlabeled data points.
2. **High Computational Cost**: Computing the pairwise similarity matrix \( W \) is too expensive, which limits where existing methods can be applied.
3. **Lack of Efficient Tools**: There are currently no off-the-shelf tools or methods that can effectively handle datasets of this size.

### Solution Overview

- **Low-rank Approximation**: Reduce the storage and computational requirements of the similarity matrix through low-rank approximation.
- **Parallelization**: Use parallel computing to speed up the algorithms and extend them to larger datasets.
- **Accelerated Optimization**: Apply Nesterov's accelerated projected gradient descent to further improve efficiency.

Together, these changes make BigLP more scalable and improve its performance on large-scale datasets.
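To illustrate why a low-rank similarity matrix makes label propagation tractable, here is a minimal serial sketch in Python. It is not the BigLP implementation. It assumes an RBF kernel, a Nyström factor built from a hand-picked set of landmark points, and the standard closed-form propagation \( F = (1-\alpha)(I - \alpha S)^{-1} Y \) evaluated with the Woodbury identity, instead of the paper's parallel solver based on Nesterov's accelerated projected gradient descent. The names `rbf`, `nystrom_factor`, and `low_rank_label_propagation`, and the parameters `gamma`, `alpha`, and `eps`, are hypothetical and chosen for illustration only.

```python
import numpy as np

def rbf(A, B, gamma):
    """RBF kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystrom_factor(X, landmarks, gamma, eps=1e-8):
    """Return G (n x r) with G @ G.T approximating the n x n kernel matrix.

    Only the n x k and k x k kernel blocks are formed, never the full matrix.
    """
    C = rbf(X, landmarks, gamma)                # n x k
    W_kk = rbf(landmarks, landmarks, gamma)     # k x k
    vals, vecs = np.linalg.eigh(W_kk)
    keep = vals > eps                           # drop near-null directions
    return C @ (vecs[:, keep] / np.sqrt(vals[keep]))

def low_rank_label_propagation(G, Y, alpha=0.9):
    """Compute (1 - alpha) * (I - alpha * S)^{-1} Y with S = D^{-1/2} G G^T D^{-1/2}.

    Assumes the approximate degrees are positive and I - alpha * S is invertible.
    """
    d = G @ (G.T @ np.ones(G.shape[0]))         # approximate node degrees
    H = G / np.sqrt(np.maximum(d, 1e-12))[:, None]   # S is approximated by H H^T
    # Woodbury: (I - a H H^T)^{-1} Y = Y + a H (I_r - a H^T H)^{-1} H^T Y
    M = np.eye(H.shape[1]) - alpha * (H.T @ H)  # small r x r system
    return (1.0 - alpha) * (Y + alpha * H @ np.linalg.solve(M, H.T @ Y))

# Toy data (an assumption for illustration): two separated 2-D clusters,
# with a single labeled sample in each cluster.
rng = np.random.default_rng(0)
n_per = 100
X = np.vstack([rng.normal([-2.0, 0.0], 0.5, (n_per, 2)),
               rng.normal([2.0, 0.0], 0.5, (n_per, 2))])
truth = np.repeat([0, 1], n_per)
Y = np.zeros((2 * n_per, 2))
Y[0, 0] = 1.0
Y[n_per, 1] = 1.0

G = nystrom_factor(X, X[::10], gamma=0.5)       # every 10th point as a landmark
F = low_rank_label_propagation(G, Y, alpha=0.9)
pred = F.argmax(axis=1)
print("accuracy:", (pred == truth).mean())
```

With n samples and a rank-r factor, this sketch stores O(nr) numbers and solves only an r x r linear system, so the n x n similarity matrix is never materialized. Because a Nyström approximation can produce small negative similarities, the clipping of degrees and the invertibility of the small system are assumptions that a production implementation would need to check.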

Graph Learning on Millions of Data in Seconds: Label Propagation Acceleration on Graph Using Data Distribution

Lightweight Label Propagation for Large-Scale Network Data

Hypergraph Label Propagation Network.

Label Propagation Through Linear Neighborhoods

Large-Scale Multilabel Propagation Based on Efficient Sparse Graph Construction

Label Propagated Nonnegative Matrix Factorization for Clustering

Semi-supervised imbalanced multi-label classification with label propagation

Joint Label Propagation, Graph and Latent Subspace Estimation for Semi-supervised Classification

Using Cluster Information to Improve Label Propagation

Semi-Supervised Learning with Close-Form Label Propagation Using a Bipartite Graph

ST-LP: self-training and label propagation for semi-supervised classification

Semi-supervised deep learning based on label propagation in a 2D embedded space

A Semi-supervised Kernel Learning Method Based on Label Propagation

Semi-supervised Image Classification Via Nonnegative Least-Squares Regression

Efficient large-scale image annotation by probabilistic collaborative multi-label propagation.

Towards Robust Graph Neural Networks against Label Noise

Hybrid Approach for Inductive Semi Supervised Learning using Label Propagation and Support Vector Machine

Graph-based Semi-Supervised Learning by Mixed Label Propagation with a Soft Constraint.

Semi-Supervised Classification Using Linear Neighborhood Propagation

Efficient Region-Aware Large Graph Construction Towards Scalable Multi-Label Propagation