Decentralized Online Big Data Classification - a Bandit Framework

Cem Tekin, Mihaela van der Schaar
DOI: https://doi.org/10.48550/arXiv.1308.4565
2013-08-25
Abstract: Distributed, online data mining systems have emerged as a result of applications requiring analysis of large amounts of correlated and high-dimensional data produced by multiple distributed data sources. We propose a distributed online data classification framework where data is gathered by distributed data sources and processed by a heterogeneous set of distributed learners which learn online, at run-time, how to classify the different data streams either by using their locally available classification functions or by helping each other by classifying each other's data. Importantly, since the data is gathered at different locations, sending the data to another learner to process incurs additional costs such as delays, and hence this will be only beneficial if the benefits obtained from a better classification will exceed the costs. We assume that the classification functions available to each processing element are fixed, but their prediction accuracy for various types of incoming data are unknown and can change dynamically over time, and thus they need to be learned online. We model the problem of joint classification by the distributed and heterogeneous learners from multiple data sources as a distributed contextual bandit problem where each data is characterized by a specific context. We develop distributed online learning algorithms for which we can prove that they have sublinear regret. Compared to prior work in distributed online data mining, our work is the first to provide analytic regret results characterizing the performance of the proposed algorithms.
Machine Learning, Multiagent Systems
What problem does this paper attempt to address?
This paper addresses how multiple distributed, heterogeneous learners can classify data effectively in a distributed online big data setting. Specifically, it aims to design a framework and algorithms that let these learners learn at run-time (i.e., online) how to classify high-dimensional data streams from different data sources. Each learner can either use its locally available classification functions or get help from other learners (for example, by sending data to them for classification). The key challenges include:
- Data is collected by data sources at different locations, so sending data to another learner for processing incurs additional costs (such as delay). Forwarding is worthwhile only when the benefit of better classification outweighs these costs.
- The classification functions available to each processing element are fixed, but their prediction accuracy for different types of incoming data is unknown and may change over time, so it must be learned online.
To optimize the overall performance of the distributed data mining system, the authors model the problem as a cooperative contextual bandit problem and develop distributed online learning algorithms with provably sublinear regret, meaning that their average reward converges to the optimal average reward.
In short, the main problem this paper addresses is how to perform online big data classification efficiently in a distributed environment while minimizing the losses caused by uncertainty and communication costs. The goal of each learner \(i\) is to maximize its long-term expected total reward, that is, the expected number of correct labels minus the classification cost.
This objective can be written as:
\[ \max \mathbb{E}\left[\sum_{t = 1}^{T}\left(\pi_k(x_i(t)) - d_k\right)\right] \]
where the maximization is over the choice \(k\) made at each time \(t\), and:
- \(\pi_k(x_i(t))\) is the probability of correct classification when classification function \(k\) is used under context \(x_i(t)\).
- \(d_k\) is the cost of using classification function \(k\).
- \(x_i(t)\) is the context of the data arriving at learner \(i\) at time \(t\).
In addition, the paper addresses the following sub-problems:
- How to gradually improve classification accuracy through online learning without knowing the accuracy of the classification functions in advance.
- How to decide when to forward data to other learners to obtain better classification results while accounting for communication and computational costs.
- How to handle the time-varying characteristics of data streams and context information (i.e., concept drift).
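The objective above can be illustrated with a small simulation. The sketch below is not the algorithm from the paper: it is a generic epsilon-greedy learner over a uniform partition of a one-dimensional context space, used only to show how "accuracy minus cost" drives the choice between a local classification function and another learner. The accuracies, the costs, and the names `COSTS`, `true_accuracy`, and `simulate` are all made up for illustration, and the costs are assumed to be known to the learner.

```python
import random

# Hypothetical setup (not from the paper): contexts are scalars in [0, 1).
# Choice 0 is a local classification function, choice 1 is another learner.
COSTS = [0.0, 0.1]  # d_k: assumed known cost of each choice


def true_accuracy(k, x):
    """pi_k(x): made-up accuracies, hidden from the learner."""
    if k == 0:
        return 0.9 if x < 0.5 else 0.5
    return 0.6 if x < 0.5 else 0.95


def simulate(T=5000, n_bins=2, epsilon=0.1, seed=0):
    """Run epsilon-greedy per context bin; return (avg expected net reward, counts)."""
    rng = random.Random(seed)
    n_arms = len(COSTS)
    counts = [[0] * n_arms for _ in range(n_bins)]   # times each choice was used per bin
    correct = [[0] * n_arms for _ in range(n_bins)]  # correct labels observed per bin
    total_expected = 0.0
    for _ in range(T):
        x = rng.random()                         # context x_i(t)
        b = min(int(x * n_bins), n_bins - 1)     # bin of the uniform partition
        untried = [k for k in range(n_arms) if counts[b][k] == 0]
        if untried:
            k = untried[0]                       # try every choice once per bin
        elif rng.random() < epsilon:
            k = rng.randrange(n_arms)            # explore
        else:
            # exploit: highest estimated accuracy minus cost in this bin
            k = max(range(n_arms),
                    key=lambda a: correct[b][a] / counts[b][a] - COSTS[a])
        label_ok = rng.random() < true_accuracy(k, x)  # 0/1 feedback
        counts[b][k] += 1
        correct[b][k] += int(label_ok)
        total_expected += true_accuracy(k, x) - COSTS[k]
    return total_expected / T, counts


if __name__ == "__main__":
    avg, counts = simulate()
    print("average expected net reward per step:", avg)
    print("selection counts per context bin:", counts)
```

For these made-up numbers, an oracle that knows the accuracies would use the local function for contexts below 0.5 and forward to the other learner otherwise, for an average net reward of 0.875 per step. The gap between that value and what the learner achieves is the per-step regret that the paper's analysis bounds for its own algorithms.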