A Framework of Sparse Online Learning and Its Applications

Dayong Wang, Pengcheng Wu, Peilin Zhao, Steven C.H. Hoi
DOI: https://doi.org/10.48550/arXiv.1507.07146
2015-07-26
Abstract: The amount of data in our society has been exploding in the era of big data today. In this paper, we address several open challenges of big data stream classification, including high volume, high velocity, high dimensionality, high sparsity, and high class-imbalance. Many existing studies in data mining literature solve data stream classification tasks in a batch learning setting, which suffers from poor efficiency and scalability when dealing with big data. To overcome the limitations, this paper investigates an online learning framework for big data stream classification tasks. Unlike some existing online data stream classification techniques that are often based on first-order online learning, we propose a framework of Sparse Online Classification (SOC) for data stream classification, which includes some state-of-the-art first-order sparse online learning algorithms as special cases and allows us to derive a new effective second-order online learning algorithm for data stream classification. In addition, we also propose a new cost-sensitive sparse online learning algorithm by extending the framework with application to tackle online anomaly detection tasks where class distribution of data could be very imbalanced. We also analyze the theoretical bounds of the proposed method, and finally conduct an extensive set of experiments, in which encouraging results validate the efficacy of the proposed algorithms in comparison to a family of state-of-the-art techniques on a variety of data stream classification tasks.
Machine Learning
What problem does this paper attempt to address?
This paper addresses several open challenges in big data stream classification: high volume, high velocity, high dimensionality, high sparsity, and high class-imbalance. Most methods in the traditional data mining literature solve data stream classification in a batch-learning setting, which is inefficient and scales poorly on big data. To overcome these limitations, the paper proposes an online learning framework called Sparse Online Classification (SOC).

### Main contributions of the paper:

1. **A generalized online learning framework**: Both first-order and second-order algorithms can be derived from it.
2. **Theoretical analysis**: General regret bounds and error bounds are provided.
3. **Evaluation on multiple high-dimensional, large-scale benchmark datasets**: The experimental results show that the proposed algorithms achieve state-of-the-art performance.

### Specific problem descriptions:

- **High Volume**: Training data must be processed on the scale of millions or even billions of instances.
- **High Velocity**: New data arrives sequentially at very high speed. For example, about 182.9 billion emails are sent and received globally every day.
- **High Dimensionality**: The number of features is huge. For example, in spam classification the vocabulary size can reach 10,000 to 500,000 or more.
- **High Sparsity**: Many feature values are zero, and the proportion of active features is usually very small.
- **High Class-Imbalance**: Some classes have far more samples than others. For example, non-spam emails far outnumber spam emails.

### Solutions:

- **Sparse online learning framework**: Introducing sparsity into the model improves computational and memory efficiency.
- **First-order and second-order algorithms**: The framework covers existing first-order sparse online classification algorithms as special cases and also yields a new second-order online learning algorithm.
- **Cost-sensitive sparse online learning algorithm**: An extension of the framework that handles data stream classification with imbalanced class distributions, such as online anomaly detection.

### Experimental verification:

The paper validates the proposed algorithms through extensive experiments, which show superior performance compared with a family of state-of-the-art techniques.

### Summary:

The paper tackles the key challenges of big data stream classification with a sparse online learning framework and supports its effectiveness through both theoretical analysis and experiments.
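To make the ideas listed under Solutions more concrete, here is a minimal sketch of a generic first-order sparse online learner: online gradient descent on the hinge loss, followed by L1 soft-thresholding, with a per-class cost that scales the update for the rarer class. It is an illustration only and is not the SOC algorithm from the paper. The class name `SparseOnlineClassifier`, the parameters `eta`, `lam`, `cost_pos`, `cost_neg`, and the dictionary-based sparse vector format are all assumptions introduced for this example.

```python
from typing import Dict

# Assumption: a sparse vector is a dict mapping feature index -> non-zero value.
SparseVec = Dict[int, float]


def soft_threshold(value: float, threshold: float) -> float:
    """Shrink value toward zero by threshold; return 0.0 if it crosses zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


class SparseOnlineClassifier:
    """Hypothetical first-order sparse online learner (not the paper's SOC)."""

    def __init__(self, eta: float = 0.1, lam: float = 0.01,
                 cost_pos: float = 1.0, cost_neg: float = 1.0) -> None:
        self.eta = eta            # learning rate
        self.lam = lam            # L1 regularization strength
        self.cost_pos = cost_pos  # misclassification cost for label +1
        self.cost_neg = cost_neg  # misclassification cost for label -1
        self.w: SparseVec = {}    # only non-zero weights are stored

    def score(self, x: SparseVec) -> float:
        return sum(self.w.get(i, 0.0) * v for i, v in x.items())

    def predict(self, x: SparseVec) -> int:
        return 1 if self.score(x) >= 0.0 else -1

    def update(self, x: SparseVec, y: int) -> None:
        """Process one (x, y) pair from the stream, with y in {+1, -1}."""
        cost = self.cost_pos if y == 1 else self.cost_neg
        # Cost-weighted hinge loss: take a gradient step only on a margin violation.
        if y * self.score(x) < 1.0:
            for i, v in x.items():
                self.w[i] = self.w.get(i, 0.0) + self.eta * cost * y * v
        # L1 proximal step: shrink every stored weight and drop exact zeros.
        for i, wi in list(self.w.items()):
            shrunk = soft_threshold(wi, self.eta * self.lam)
            if shrunk == 0.0:
                del self.w[i]
            else:
                self.w[i] = shrunk


# Usage: feature 0 carries the label; feature 1 is weak noise seen once.
clf = SparseOnlineClassifier(eta=0.1, lam=0.1)
clf.update({0: 1.0, 1: 0.05}, 1)
for _ in range(50):
    clf.update({0: 1.0}, 1)
    clf.update({0: -1.0}, -1)
```

Each instance is seen once and then discarded, so memory grows with the number of non-zero weights rather than with the length of the stream. Setting `cost_pos` larger than `cost_neg` is one simple way to give more weight to a rare positive class, as in anomaly detection. A second-order variant would also maintain per-feature confidence information, which this sketch omits.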