MC-MLP: Multiple Coordinate Frames in all-MLP Architecture for Vision

Zhimin Zhu, Jianguo Zhao, Tong Mu, Yuliang Yang, Mengyu Zhu
2023-04-08
Abstract: In deep learning, Multi-Layer Perceptrons (MLPs) have once again garnered attention from researchers. This paper introduces MC-MLP, a general MLP-like backbone for computer vision that is composed of a series of fully-connected (FC) layers. In MC-MLP, we propose that the same semantic information has varying levels of difficulty in learning, depending on the coordinate frame of features. To address this, we perform an orthogonal transform on the feature information, equivalent to changing the coordinate frame of features. Through this design, MC-MLP is equipped with multi-coordinate frame receptive fields and the ability to learn information across different coordinate frames. Experiments demonstrate that MC-MLP outperforms most MLPs in image classification tasks, achieving better performance at the same parameter level. The code will be available at: https://github.com/ZZM11/MC-MLP
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The main objective of this paper is to propose a new Multi-Coordinate Frame Transformation Architecture (MC-MLP) that addresses limitations of existing deep learning models in image classification. Specifically, the paper targets the following issues:

1. **Multi-Coordinate Frame Learning**: Traditional deep learning models, especially Transformer models based on self-attention, typically acquire information in only one coordinate frame, which may make certain semantic information inefficient to learn. The paper hypothesizes that different semantic information may be easier to learn in different coordinate frames. It therefore changes the coordinate frame of features through orthogonal transformations to enhance the model's learning ability.
2. **Simplified Model Structure**: Compared to the more complex Transformer structure, MC-MLP adopts a more straightforward design. It performs feature extraction and information interaction through simple fully connected layers and multi-coordinate frame transformations, such as the Discrete Cosine Transform (DCT) and the Hadamard Transform. This design simplifies the model structure and also improves computational efficiency.
3. **Cross-Domain Information Interaction**: By combining spatial domain features with transformed domain features, MC-MLP can view image data from multiple perspectives, which improves the overall performance of the model. This cross-domain interaction helps the model capture relationships within the image more comprehensively.
4. **Efficiency and Robustness**: Experimental results show that MC-MLP outperforms recent ViT and MLP models on the CIFAR-100 dataset while remaining competitive at similar parameter counts and computational complexity. This suggests that MC-MLP is an effective alternative for vision tasks, with advantages in efficiency, generalization ability, and robustness.
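
The idea of treating an orthogonal transform as a change of coordinate frame can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names (`dct_matrix`, `hadamard_matrix`, `multi_frame_mix`), the choice to transform along the token axis, and the way the three branches are summed are all assumptions made for illustration. It builds orthonormal DCT-II and Hadamard matrices, applies a fully connected mixing step in each frame, and maps the result back with the transpose, which equals the inverse for an orthogonal matrix.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Return the orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    i = np.arange(n)[None, :]  # sample index (columns)
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)  # rescale the DC row so all rows have unit norm
    return m


def hadamard_matrix(n: int) -> np.ndarray:
    """Return an orthonormal Hadamard matrix (Sylvester construction)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)


def multi_frame_mix(x: np.ndarray, weights: dict) -> np.ndarray:
    """Hypothetical mixing step: one FC layer per coordinate frame, summed.

    x       : (tokens, channels) feature matrix
    weights : maps "spatial", "dct", "hadamard" to (tokens, tokens) matrices
    """
    n = x.shape[0]
    frames = {
        "spatial": np.eye(n),           # identity: the original frame
        "dct": dct_matrix(n),           # frequency-like frame
        "hadamard": hadamard_matrix(n), # Walsh-Hadamard frame
    }
    out = np.zeros(x.shape, dtype=float)
    for name, t in frames.items():
        z = t @ x               # change the coordinate frame along the token axis
        z = weights[name] @ z   # fully connected mixing inside that frame
        out += t.T @ z          # transform back; t.T is the inverse of t
    return out


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))
identity_weights = {name: np.eye(8) for name in ("spatial", "dct", "hadamard")}
y = multi_frame_mix(x, identity_weights)
```

Because each transform is orthogonal, it preserves norms and inner products, so the same information is present in every frame; only the basis in which the FC layer sees it changes. With identity weights each branch returns `x` unchanged, so `y` equals `3 * x`.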