Abstract: Support Vector Machine (SVM) regression is an important technique in data mining. SVM training is expensive, and its cost is dominated by (i) kernel value computation and (ii) a search operation that finds extreme training data points for adjusting the regression function in every training iteration. Existing training algorithms for SVM regression do not scale to large datasets because (i) each training iteration repeatedly performs expensive kernel value computations, which is inefficient and requires holding the whole training dataset in memory, and (ii) the search operation in each training iteration considers the whole search space, which is very expensive. In this article, we significantly improve the scalability and efficiency of SVM regression by exploiting the high performance of Graphics Processing Units (GPUs) and solid state drives (SSDs). Our key ideas are as follows. (i) To reduce the cost of repeated kernel value computations and avoid holding the whole training dataset in GPU memory, we precompute all the kernel values and store them in CPU memory extended by the SSD; combined with an efficient strategy for reading them back, reusing the precomputed kernel values is much faster than computing them on the fly. This also alleviates the restriction that the training dataset has to fit into GPU memory, and hence makes our algorithm scalable to large datasets, especially those with very high dimensionality. (ii) To speed up the frequently used search operation, we design an algorithm that minimizes the search space and the number of accesses to GPU global memory; this optimized search algorithm also avoids branch divergence (one cause of poor performance) among GPU threads to achieve high utilization of GPU resources. Together, our proposed techniques form a scalable solution to SVM regression, which we call SIGMA.
Our extensive experimental results show that SIGMA is highly efficient and can handle very large datasets that the state-of-the-art GPU-based algorithm cannot. On datasets small enough for the state-of-the-art GPU-based algorithm to handle, SIGMA consistently outperforms it by an order of magnitude and achieves a speedup of up to 86 times.
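
The first key idea in the abstract (precompute kernel values once, store them outside GPU memory, then fetch them instead of recomputing) can be illustrated with a minimal CPU-only sketch. It is not SIGMA's implementation: the Gaussian (RBF) kernel, the `gamma` value, the one-row-per-record file layout, and the function names `rbf_kernel`, `precompute_kernel_rows`, and `read_kernel_row` are all assumptions made for illustration, and an ordinary file stands in for SSD-backed storage.

```python
import math
import os
import struct
import tempfile


def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel value; the kernel choice is an assumption."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def precompute_kernel_rows(points, path, gamma=0.5):
    """Write the full n x n kernel matrix to `path`, one row per record."""
    n = len(points)
    with open(path, "wb") as f:
        for xi in points:
            row = [rbf_kernel(xi, xj, gamma) for xj in points]
            # Little-endian doubles, 8 bytes each, so rows have a fixed size.
            f.write(struct.pack(f"<{n}d", *row))
    return n


def read_kernel_row(path, n, i):
    """Fetch row i with one seek and one read instead of recomputing it."""
    row_bytes = n * struct.calcsize("<d")
    with open(path, "rb") as f:
        f.seek(i * row_bytes)
        return list(struct.unpack(f"<{n}d", f.read(row_bytes)))


# Hypothetical toy dataset: three 2-dimensional training points.
points = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
path = os.path.join(tempfile.mkdtemp(), "kernel.bin")
n = precompute_kernel_rows(points, path)
row = read_kernel_row(path, n, 1)  # kernel values of point 1 against all points
```

Because every row has the same byte length, retrieving the kernel values for one training point costs a single seek plus a sequential read, regardless of the dimensionality of the original data. This is one plausible reason a precomputed store helps most on very high-dimensional datasets, where each on-the-fly kernel evaluation is costly.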

Support vector machine in big data: smoothing strategy and adaptive distributed inference

A Parallel and Scalable Digital Architecture for Training Support Vector Machines

A Novel Svm Modeling Approach For Highly Imbalanced And Overlapping Classification

Distributed Bootstrap Simultaneous Inference for High-Dimensional Quantile Regression

Incremental batch learning with support vector machines

Distributed Online Semi-Supervised Support Vector Machine

Learning Performance of Weighted Distributed Learning With Support Vector Machines

Using Support Vector Machines for Mining Regression Classes in Large Data Sets

A Load-Balancing Divide-and-Conquer SVM Solver.

Approximate Approach to Train SVM on Very Large Data Sets

Improvement of Support Vector Machine Algorithm in Big Data Background

A Fast-Convergence Distributed Support Vector Machine in Small-Scale Strongly Connected Networks

Scaling Support Vector Machines on Modern HPC Platforms

An Online Incremental Learning Support Vector Machine for Large-Scale Data

Distributed Estimation and Inference for Semi-parametric Binary Response Models

GADGET SVM: A Gossip-bAseD sub-GradiEnT Solver for Linear SVMs

Learning concepts from large scale imbalanced data sets using support cluster machines.

Parallelizing Support Vector Machines on Distributed Computers

Parallel and Distributed Structured SVM Training

Scalable and Fast SVM Regression Using Modern Hardware.

A sparse semismooth Newton based augmented Lagrangian method for large-scale support vector machines