Abstract: Support Vector Machines (SVMs) are increasingly applied to data streams as classification tasks such as anomaly detection and real-time image processing demand real-time results. However, the high volume and fast arrival rate of live stream data make SVMs difficult to apply in stream processing. Existing SVM implementations are mostly designed for batch processing, and their inherent complexity makes it hard for them to meet the efficiency requirements of stream processing. To address these challenges, we propose a high-efficiency distributed SVM framework over data streams (HDSVM), which consists of two main algorithms: an incremental learning algorithm and a distributed algorithm. First, we propose a partial support vectors reserving incremental learning algorithm (PSVIL). Instead of updating the SVM with the full set of support vectors, PSVIL selects a subset of them based on their distances to the classification hyperplane, which lowers time overhead while preserving accuracy. Second, we propose a distribution remaining partition and fast aggregation distributed algorithm (DRPFA) for SVM. Real-time data is partitioned according to its original distribution using clustering rather than at random, and historical support vectors are partitioned based on their distances to the classification hyperplane. Thanks to this partition strategy, the global hyperplane can be obtained by averaging the parameters of the local hyperplanes. Extensive experiments on Apache Storm show that HDSVM achieves lower time overhead than the state of the art with similar accuracy: the speed-up ratio increases by 2-8 times with an accuracy deviation within 1%.
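
The two ideas named in the abstract, distance-based selection of support vectors and aggregation by parameter averaging, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a linear hyperplane w·x + b = 0, assumes that "selecting a subset" means keeping the `keep` support vectors nearest the hyperplane, and assumes an unweighted mean for aggregation. The function names and the `keep` parameter are hypothetical.

```python
import math


def distance_to_hyperplane(x, w, b):
    """Geometric distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def select_support_vectors(support_vectors, w, b, keep):
    """Keep the `keep` support vectors nearest the hyperplane.

    Assumption: the abstract only says selection is distance-based;
    the nearest-first rule used here is illustrative.
    """
    ranked = sorted(support_vectors,
                    key=lambda x: distance_to_hyperplane(x, w, b))
    return ranked[:keep]


def average_hyperplanes(local_models):
    """Average local (w, b) pairs into one global (w, b).

    Assumption: an unweighted mean over linear models.
    """
    n = len(local_models)
    dim = len(local_models[0][0])
    w = [sum(model[0][i] for model in local_models) / n for i in range(dim)]
    b = sum(model[1] for model in local_models) / n
    return w, b
```

For example, `select_support_vectors(points, w, b, keep=2)` would retain two historical support vectors for the next incremental update, and `average_hyperplanes` would combine the `(w, b)` pairs returned by the workers.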

HDSVM: A High Efficiency Distributed SVM Framework over Data Stream.
