Globalized Multiple Balanced Subsets With Collaborative Learning for Imbalanced Data

Zonghai Zhu, Zhe Wang, Dongdong Li, Wenli Du
DOI: https://doi.org/10.1109/tcyb.2020.3001158
IF: 11.8
2022-04-01
IEEE Transactions on Cybernetics
Abstract: The skewed distribution of data makes it difficult to classify minority and majority samples in imbalanced problems. Balanced bagging randomly undersamples the majority samples several times and combines each selection with the minority samples to form several balanced subsets, in which the numbers of minority and majority samples are roughly equal. However, balanced bagging lacks a unified learning framework. Moreover, it fails to consider the connection among the subsets and the global information of the entire data distribution. To this end, this article puts several balanced subsets into an effective learning framework with a criterion function. In the learning framework, one regularization term, called R_S, establishes the connection and realizes the collaborative learning of all subsets by requiring consistent outputs for the minority samples in different subsets. Besides, another regularization term, called R_W, provides global information to each basic classifier by reducing the difference between the direction of the solution vector in each subset and that in the entire dataset. The proposed learning framework is called globalized multiple balanced subsets with collaborative learning (GMBSCL). The experimental results validate the effectiveness of the proposed GMBSCL.
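The balanced-subset construction described in the abstract can be sketched in a few lines. This is only an illustration of random undersampling, not the authors' code: the function name `make_balanced_subsets`, its parameters, and the toy label list are assumptions made for this example, and it uses only the Python standard library.

```python
import random

def make_balanced_subsets(labels, n_subsets=5, minority_label=1, seed=0):
    """Return index lists of balanced subsets built by random undersampling.

    Each subset contains every minority index plus an equally sized random
    draw (without replacement) of majority indices.
    Hypothetical helper for illustration; not taken from the paper.
    """
    rng = random.Random(seed)
    minority = [i for i, lab in enumerate(labels) if lab == minority_label]
    majority = [i for i, lab in enumerate(labels) if lab != minority_label]
    subsets = []
    for _ in range(n_subsets):
        # Draw as many majority samples as there are minority samples.
        drawn = rng.sample(majority, len(minority))
        subsets.append(minority + drawn)
    return subsets

# Toy labels (assumed): 90 majority samples (0) and 10 minority samples (1).
labels = [0] * 90 + [1] * 10
subsets = make_balanced_subsets(labels, n_subsets=3)
for idx in subsets:
    n_minority = sum(labels[i] == 1 for i in idx)
    print(len(idx), n_minority)
```

Every subset shares the same minority indices and differs only in its majority draw, which is the property a consistency requirement on minority-sample outputs relies on.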
automation & control systems, computer science, cybernetics, artificial intelligence
What problem does this paper attempt to address?
### The Problem Addressed by the Paper

This paper aims to address the issue of poor classifier performance when dealing with imbalanced datasets. Specifically:

1. **Imbalanced Dataset Problem**:
   - In real life, imbalanced datasets are common, such as the disparity between the number of healthy individuals and patients in medical diagnosis, or the small proportion of actual fraud cases in fraud detection.
   - Traditional classifiers typically use overall accuracy as the evaluation standard, which tends to overlook the classification of minority class samples, leading to poor classification results.
2. **Limitations of Existing Methods**:
   - The Balanced Bagging method, although it forms balanced subsets by randomly undersampling majority class samples and combining them with minority class samples, lacks a unified learning framework. There is a lack of correlation between subsets, and it ignores global information.
3. **Proposed New Method**:
   - The paper proposes a new framework called "Globalized Multiple Balanced Subsets with Collaborative Learning" (GMBSCL) to address the classification problem of imbalanced datasets.
   - GMBSCL places multiple balanced subsets into a unified learning framework and introduces two regularization terms (R_S and R_W) to achieve collaborative learning among basic classifiers of subsets and to utilize global information.

### Summary

The main goal of this paper is to overcome the limitations of existing Balanced Bagging methods and improve classification performance on imbalanced datasets by proposing a new learning framework (GMBSCL).
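To make the roles of the two regularization terms more concrete, the sketch below computes one plausible form of each for linear classifiers. The exact definitions of R_S and R_W are not given in this summary, so both formulas are assumptions: `r_s` penalizes the spread of the subset classifiers' outputs on each minority sample, and `r_w` penalizes one minus the cosine similarity between each subset's solution vector and a global one. All names and toy values are hypothetical.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def r_s(weights, minority_samples):
    """Assumed consistency penalty: for each minority sample, the squared
    deviation of every subset classifier's output from the mean output."""
    total = 0.0
    for x in minority_samples:
        outputs = [dot(w, x) for w in weights]
        mean = sum(outputs) / len(outputs)
        total += sum((o - mean) ** 2 for o in outputs)
    return total

def r_w(weights, w_global):
    """Assumed direction penalty: one minus the cosine similarity between
    each subset's solution vector and the global solution vector."""
    norm_g = math.sqrt(dot(w_global, w_global))
    total = 0.0
    for w in weights:
        cos = dot(w, w_global) / (math.sqrt(dot(w, w)) * norm_g)
        total += 1.0 - cos
    return total

# Toy values (assumed): two subset classifiers, one global classifier,
# and two minority samples in a 2-D feature space.
weights = [[1.0, 0.0], [0.0, 1.0]]
w_global = [1.0, 1.0]
minority = [[1.0, 1.0], [2.0, 0.0]]

print(r_s(weights, minority))
print(r_w(weights, w_global))
```

Both penalties vanish when the subset classifiers agree: identical outputs on the minority samples give `r_s` equal to zero, and solution vectors parallel to the global one give `r_w` equal to zero. In a full criterion function such terms would be added, with trade-off coefficients, to each subset's own classification loss.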