Abstract: In deep learning, Multi-Layer Perceptrons (MLPs) have once again garnered attention from researchers. This paper introduces MC-MLP, a general MLP-like backbone for computer vision that is composed of a series of fully-connected (FC) layers. In MC-MLP, we propose that the same semantic information has varying levels of difficulty in learning, depending on the coordinate frame of features. To address this, we perform an orthogonal transform on the feature information, equivalent to changing the coordinate frame of features. Through this design, MC-MLP is equipped with multi-coordinate frame receptive fields and the ability to learn information across different coordinate frames. Experiments demonstrate that MC-MLP outperforms most MLPs in image classification tasks, achieving better performance at the same parameter level. The code will be available at: https://github.com/ZZM11/MC-MLP.

What problem does this paper attempt to address?

The main objective of this paper is to propose a new Multi-Coordinate Frame Transformation Architecture (MC-MLP) to address the limitations of existing deep learning models in image classification. Specifically, the paper aims to address the following issues:

1. **Multi-Coordinate Frame Learning**: Traditional deep learning models, especially Transformer models based on self-attention, typically acquire information in only one coordinate frame, which may make certain semantic information inefficient to learn. The paper hypothesizes that different semantic information may be easier to learn in different coordinate frames. It therefore proposes changing the coordinate frame of features through orthogonal transformations to enhance the model's learning ability.
2. **Simplified Model Structure**: Compared to the more complex Transformer structure, MC-MLP adopts a more straightforward design, performing feature extraction and information interaction through simple fully connected layers and multi-coordinate frame transformations (such as the Discrete Cosine Transform (DCT) and the Hadamard Transform). This design both simplifies the model structure and improves computational efficiency.
3. **Cross-Domain Information Interaction**: By combining spatial-domain features with transform-domain features, MC-MLP can interpret image data from multiple perspectives, which improves the overall performance of the model. This cross-domain interaction helps the model capture relationships within the image more comprehensively.
4. **Efficiency and Robustness**: Experimental results show that MC-MLP outperforms recent ViT and MLP models on the CIFAR-100 dataset and is competitive at similar parameter counts and computational complexity. This suggests that MC-MLP is an effective alternative for vision tasks, with advantages in efficiency, generalization ability, and robustness.
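To make the idea of "changing the coordinate frame with an orthogonal transform" concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the function names (`dct_matrix`, `hadamard_matrix`, `multi_frame_mix`) are hypothetical, and the branch layout (transform, apply one FC weight matrix, transform back, sum the branches) is an assumption chosen only for illustration. It builds an orthonormal DCT-II matrix and a normalized Hadamard matrix, then mixes tokens in each frame.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II matrix, returned as a list of n rows."""
    rows = []
    for k in range(n):
        # The k = 0 (DC) row needs a smaller scale so that all rows have unit norm.
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def hadamard_matrix(n):
    """Normalized Hadamard matrix (Sylvester construction); n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    mat = [[1.0]]
    while len(mat) < n:
        # H_2m = [[H_m, H_m], [H_m, -H_m]]
        mat = ([row + row for row in mat]
               + [row + [-v for v in row] for row in mat])
    scale = 1.0 / math.sqrt(n)
    return [[scale * v for v in row] for row in mat]


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Plain matrix product of two list-of-rows matrices."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def multi_frame_mix(x, frames, weights):
    """Sum of per-frame token mixing: for each frame Q and weight W, Q^T (W (Q x)).

    x:       tokens x channels feature matrix
    frames:  orthogonal tokens x tokens matrices (assumed branch set)
    weights: one tokens x tokens FC weight matrix per frame (hypothetical)
    """
    out = [[0.0] * len(x[0]) for _ in x]
    for q, w in zip(frames, weights):
        # Move to the new frame, mix tokens there, then map back with Q^T = Q^-1.
        branch = matmul(transpose(q), matmul(w, matmul(q, x)))
        out = [[o + b for o, b in zip(row_o, row_b)]
               for row_o, row_b in zip(out, branch)]
    return out
```

Because each matrix `Q` is orthogonal, `Q^T Q = I`, so the transform loses no information and `Q^T` undoes it exactly. With identity weights every branch returns the input unchanged, so any difference between branches comes only from the weights learned in each frame. A practical model would use a tensor library and fast transforms instead of explicit matrix products.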

MC-MLP: Multiple Coordinate Frames in all-MLP Architecture for Vision

ConvMLP: Hierarchical Convolutional MLPs for Vision

MLP Architectures for Vision-and-Language Modeling: An Empirical Study

R2-MLP: Round-Roll MLP for Multi-View 3D Object Recognition

X-MLP: A Patch Embedding-Free MLP Architecture for Vision

Hire-MLP: Vision MLP Via Hierarchical Rearrangement

MAXIM: Multi-Axis MLP for Image Processing

Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition

Geometrical Interpretation and Design of Multilayer Perceptrons

Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs

RepMLPNet: Hierarchical Vision MLP with Re-parameterized Locality

Video-Mlp: Convolution-Free, Attention-Free Architecture for Video Classification

Multi-Scale MLP-Mixer for image classification

Mesh-MLP: an All-Mlp Architecture for Mesh Classification and Semantic Segmentation

Rethinking Token-Mixing MLP for MLP-based Vision Backbone

SpiralMLP: A Lightweight Vision MLP Architecture

MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing

Full-resolution MLPs Empower Medical Dense Prediction

S$^2$-MLP: Spatial-Shift MLP Architecture for Vision

SS-MLP: A Novel Spectral-Spatial MLP Architecture for Hyperspectral Image Classification

CS-Mixer: A Cross-Scale Vision MLP Model with Spatial-Channel Mixing