MSEC: Multi-Scale Erasure and Confusion for fine-grained image classification

Yan Zhang, Yongsheng Sun, Nian Wang, Zijian Gao, Feng Chen, Chenfei Wang, Jun Tang
DOI: https://doi.org/10.1016/j.neucom.2021.03.114
IF: 6
2021-08-01
Neurocomputing
Abstract: With the rapid development of deep learning, the performance of fine-grained image classification has improved substantially. However, quickly and effectively focusing on the subtle discriminative details that distinguish sub-classes from each other has always been challenging. In this paper, we propose a novel Multi-Scale Erasure and Confusion (MSEC) method to tackle this challenge. Firstly, the input image is divided into several sub-regions, and the confidence score of each sub-region is calculated by a confidence function. The sub-regions with lower confidence scores are then erased by the Region Erasure Module (REM), and the erased image is confused once by the Multi-scale Region Confusion Module (Multi-scale RCM). Secondly, the sub-regions with higher confidence scores are divided and confused again by the Multi-scale RCM, generating an image with multi-scale information. Finally, features of the erased image and the "destructed" image are extracted by the backbone network, and the whole network is optimized with a multi-loss function to perform classification. Extensive experiments on three standard fine-grained benchmark datasets, Stanford Dogs, CUB-200-2011 and FGVC-Aircraft, show that MSEC can improve the accuracy of fine-grained image classification.
computer science, artificial intelligence
What problem does this paper attempt to address?
The paper addresses the challenges of fine-grained image classification by proposing a new method called Multi-Scale Erasure and Confusion (MSEC).

### Research Background and Problem

In fine-grained image classification, identifying the subtle differences between subcategories is the key challenge. The categories exhibit small inter-class differences and large intra-class differences, as with different breeds of dogs or species of birds. Current methods fall mainly into two categories: one requires manual annotation of key areas in the images, which is resource-intensive and difficult to scale; the other uses attention mechanisms to locate discriminative regions automatically, but this increases the computational load of the network.

### Overview of the MSEC Method

The MSEC method aims to address the above issues through two key modules:

1. **Region Erasure Module (REM)**: The input image is evenly divided into multiple sub-regions, and each sub-region is scored by a confidence function. Sub-regions with lower scores are considered to contain redundant information and are erased. This helps the network extract detailed features of the target object.
2. **Multi-scale Region Confusion Module (Multi-scale RCM)**: The erased image is randomly confused, disrupting the overall structure of the original image. In addition, high-scoring sub-regions are further divided and confused to generate images containing information at different scales. This multi-scale confusion helps the network focus on discriminative local details of the target object.

### Main Contributions

1. REM removes redundant information from the image, retains the information useful for classification, and enhances the network's ability to learn representative features of the target object.
2. Multi-scale RCM confuses images and high-scoring sub-regions at different scales, highlighting the detailed textures of the target object and further improving the network's ability to mine discriminative visual cues.
3. The MSEC network adds almost no extra parameters. Experimental results on three standard fine-grained datasets show that the method achieves competitive classification results compared with existing techniques.

### Conclusion

In summary, the MSEC method is a concise and effective solution for fine-grained image classification. It neither requires additional part or object annotations nor relies on attention models, thereby reducing computational cost while improving classification performance.
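
The erasure and confusion steps described in the overview can be illustrated with a minimal NumPy sketch. It is one plausible reading of the summary, not the authors' implementation. The function names (`split_grid`, `merge_grid`, `confidence`, `confuse`, `msec_augment`) and the parameters (`n`, `n_erase`, `n_keep`, `sub_n`) are hypothetical, and because the paper's confidence function is not given here, pixel variance is used purely as a stand-in.

```python
import numpy as np


def split_grid(image, n):
    """Split an (H, W, C) image into n x n equal patches, row-major.
    Assumes H and W are divisible by n."""
    h, w = image.shape[0] // n, image.shape[1] // n
    return [image[r * h:(r + 1) * h, c * w:(c + 1) * w]
            for r in range(n) for c in range(n)]


def merge_grid(patches, n):
    """Inverse of split_grid: reassemble n * n row-major patches."""
    rows = [np.concatenate(patches[r * n:(r + 1) * n], axis=1) for r in range(n)]
    return np.concatenate(rows, axis=0)


def confidence(patch):
    # ASSUMPTION: stand-in score (pixel variance). The paper's actual
    # confidence function is not specified in this summary.
    return float(patch.var())


def confuse(image, n, rng):
    """Randomly permute the n x n patches of an image."""
    patches = split_grid(image, n)
    order = rng.permutation(len(patches))
    return merge_grid([patches[i] for i in order], n)


def msec_augment(image, n=4, n_erase=4, n_keep=4, sub_n=2, seed=0):
    """Return (erased-and-confused image, multi-scale confused image).
    Assumes n_erase + n_keep <= n * n so the two groups do not overlap."""
    rng = np.random.default_rng(seed)
    patches = split_grid(image, n)
    scores = np.array([confidence(p) for p in patches])
    ranked = np.argsort(scores, kind="stable")  # ascending confidence

    # Erasure step: blank out the lowest-scoring sub-regions.
    for i in ranked[:n_erase]:
        patches[i] = np.zeros_like(patches[i])

    # Coarse confusion: shuffle the sub-regions of the erased image.
    order = rng.permutation(len(patches))
    erased_confused = merge_grid([patches[i] for i in order], n)

    # Fine confusion: shuffle the interior of the highest-scoring
    # sub-regions at a smaller scale, keeping the same coarse layout.
    for i in ranked[len(ranked) - n_keep:]:
        patches[i] = confuse(patches[i], sub_n, rng)
    multi_scale = merge_grid([patches[i] for i in order], n)
    return erased_confused, multi_scale


if __name__ == "__main__":
    demo = np.random.default_rng(1).integers(0, 256, (64, 64, 3), dtype=np.uint8)
    erased, multi = msec_augment(demo)
    print(erased.shape, multi.shape)
```

Both outputs keep the input's shape, so in a training pipeline they could be passed to the same backbone as the original image. The loss terms used to train on them are not covered by this sketch.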