Online Learning From Incomplete and Imbalanced Data Streams
Dianlong You, Jiawei Xiao, Yang Wang, Huigui Yan, Di Wu, Zhen Chen, Limin Shen, Xindong Wu
DOI: https://doi.org/10.1109/tkde.2023.3250472
IF: 9.235
2023-01-01
IEEE Transactions on Knowledge and Data Engineering
Abstract: Learning with streaming data has attracted extensive research interest in recent years. Existing online learning approaches make specific assumptions about data streams, such as requiring fixed or varying feature spaces with explicit patterns and balanced class distributions. However, the data streams generated in many real scenarios commonly have arbitrarily incomplete feature spaces and dynamically imbalanced class distributions, making existing approaches unsuitable for real applications. To address this issue, this paper proposes a novel Online Learning from Incomplete and Imbalanced Data Streams (OLI$^{2}$DS) algorithm. OLI$^{2}$DS has a two-fold main idea: 1) it follows the empirical risk minimization principle to identify the most informative features of incomplete feature spaces, and 2) it develops a dynamic cost strategy to handle imbalanced class distributions in real time by transforming F-measure optimization into a weighted surrogate loss minimization. To evaluate OLI$^{2}$DS, we compare it with state-of-the-art related algorithms in three kinds of experiments. First, we adopt 14 real datasets to simulate three scenarios of incomplete feature spaces, i.e., trapezoidal, feature evolvable, and capricious data streams. Second, based on a benchmark online analyzer, we generate 13 datasets to simulate incomplete data streams with different imbalance ratios. Third, we analyze concept drift in two simulated scenarios, i.e., online learning and data stream mining, and verify the adaptability of OLI$^{2}$DS to repeated concept drifts and variable imbalance ratios. The results demonstrate that OLI$^{2}$DS achieves significantly better performance than its rivals. In addition, a real-world case study on movie review classification is conducted to further illustrate the effectiveness of OLI$^{2}$DS. Code is released at https://github.com/youdianlong/OLI2DS.
computer science, information systems, artificial intelligence, engineering, electrical & electronic
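
The abstract describes two ingredients: learning over incomplete feature spaces and a dynamic, class-dependent cost applied to a surrogate loss. The sketch below is not the OLI$^{2}$DS algorithm (the released code at the URL above is the authoritative reference). It is only a minimal illustration of the general idea, under these assumptions: each instance is a dict containing only the features observed at that time step, the surrogate loss is a hinge loss, and the F-measure-derived cost of the paper is replaced by a simple inverse-class-frequency heuristic. The class name `CostSensitiveOnlineLearner` and all of its methods and parameters are hypothetical.

```python
class CostSensitiveOnlineLearner:
    """Illustrative online linear classifier; NOT the OLI2DS algorithm."""

    def __init__(self, lr=0.1):
        self.lr = lr      # learning rate (assumed value)
        self.w = {}       # weights, created lazily for features seen so far
        self.n_pos = 0    # running count of positive labels
        self.n_neg = 0    # running count of negative labels

    def score(self, x):
        # Only observed features contribute; unseen features have weight 0.
        return sum(self.w.get(f, 0.0) * v for f, v in x.items())

    def predict(self, x):
        return 1 if self.score(x) >= 0 else -1

    def class_cost(self, y):
        # Assumed heuristic: inverse class frequency, so the rarer class
        # receives the larger cost. The paper instead derives its cost from
        # F-measure optimization.
        total = self.n_pos + self.n_neg
        count = self.n_pos if y == 1 else self.n_neg
        return total / (2.0 * count)

    def update(self, x, y):
        # Refresh class counts first so the cost tracks the stream.
        if y == 1:
            self.n_pos += 1
        else:
            self.n_neg += 1
        cost = self.class_cost(y)
        # Cost-weighted hinge loss: take a gradient step on a margin violation.
        if y * self.score(x) < 1.0:
            for f, v in x.items():
                self.w[f] = self.w.get(f, 0.0) + self.lr * cost * y * v
        return cost


# Toy stream: one positive instance, then nine negatives, with
# different observed features per instance.
learner = CostSensitiveOnlineLearner(lr=0.1)
learner.update({"a": 1.0}, 1)
for _ in range(9):
    learner.update({"b": 1.0}, -1)
```

Representing each instance as a dict means a feature that disappears from the stream simply stops being updated, while a newly appearing feature starts from weight zero. This is one simple way to tolerate an arbitrarily changing feature space, and it does not attempt the informative-feature selection that the abstract attributes to OLI$^{2}$DS.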