Abstract: Recently, the Vision Transformer (ViT) model has replaced the classical Convolutional Neural Network (ConvNet) in various computer vision tasks due to its superior performance. Even in the field of hyperspectral image (HSI) classification, ViT-based methods show promising potential. Nevertheless, ViT encounters notable difficulties in processing HSI data. Its self-attention mechanism, which exhibits quadratic complexity, escalates computational costs. Additionally, ViT's substantial demand for training samples does not align with the practical constraints posed by the expensive labeling of HSI data. To overcome these challenges, we propose a 3D relational ConvNet named 3D-RCNet, which inherits the strengths of both ConvNet and ViT, resulting in high performance in HSI classification. We embed the self-attention mechanism of Transformer into the convolutional operation of ConvNet to design a 3D relational convolutional operation and use it to build the final 3D-RCNet. The proposed 3D-RCNet maintains the high computational efficiency of ConvNet while enjoying the flexibility of ViT. Additionally, the proposed 3D relational convolutional operation is a plug-and-play operation, which can be inserted seamlessly into previous ConvNet-based HSI classification methods. Empirical evaluations on three representative benchmark HSI datasets show that the proposed model outperforms previous ConvNet-based and ViT-based HSI approaches.
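To make the quadratic-cost concern concrete, here is a back-of-the-envelope sketch in Python. It assumes one token per voxel of a 3D patch and compares the number of pairwise scores computed by global self-attention with the number of (voxel, neighbour) pairs visited by a sliding window. The function names and the `kernel=3` default are illustrative assumptions, not values taken from the paper.

```python
def attention_pair_count(depth, height, width):
    """Token pairs scored by global self-attention over a 3D patch."""
    n = depth * height * width  # assumption: one token per voxel
    return n * n                # every token attends to every token


def local_window_pair_count(depth, height, width, kernel=3):
    """(voxel, neighbour) pairs visited by a kernel^3 sliding window."""
    n = depth * height * width
    return n * kernel ** 3      # linear in n for a fixed kernel size


if __name__ == "__main__":
    for side in (8, 16, 32):
        global_pairs = attention_pair_count(side, side, side)
        local_pairs = local_window_pair_count(side, side, side)
        print(side, global_pairs, local_pairs, global_pairs / local_pairs)
```

Under these assumptions the ratio between the two counts is `n / kernel**3`, so the gap widens as the patch grows.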

What problem does this paper attempt to address?

This paper addresses several key challenges in hyperspectral image (HSI) classification, in particular the difficulty of combining the strengths of convolutional neural networks (ConvNets) and Vision Transformers (ViTs). Specifically, it targets the following problems:

1. **Computational complexity**:
   - The self-attention mechanism of ViT has quadratic complexity (\(O(n^2)\)) in the number of tokens \(n\), which makes it computationally expensive on hyperspectral image data.
   - Hyperspectral image data usually has a high resolution and a complex three-dimensional structure, so a method is needed that processes three-dimensional data efficiently at lower computational cost.
2. **Training sample requirements**:
   - ViT needs a large number of training samples to perform well, but labeling hyperspectral image data is expensive, so large labeled datasets are hard to obtain in practice.
   - A model is therefore needed that can still extract features and classify effectively when samples are limited.
3. **Limitations of a single architecture**:
   - A 3D ConvNet used alone handles local features well but captures long-distance dependencies poorly.
   - A ViT used alone captures global features well, but its computational complexity and demand for training samples limit its use in hyperspectral image classification.

To solve these problems, the paper proposes a new model, the **3D Relational ConvNet (3D-RCNet)**, which embeds the self-attention mechanism of Transformer into the convolution operation to form a 3D relational convolutional operation. This design inherits the efficiency of ConvNet and the flexibility of ViT, and thereby achieves better performance in hyperspectral image classification.

### Main contributions

1. **Proposing the 3D Relational Convolutional Block (3D-RCBlock)**:
   - The self-attention mechanism is embedded into the convolution operation to form a new HSI feature extraction operation that inherits the advantages of both ConvNet and ViT.
2. **Constructing a hybrid network framework**:
   - A hybrid network framework is built on the proposed 3D-RCBlock, which is integrated seamlessly into the classical 3D ConvNet.
3. **Conducting exhaustive ablation experiments**:
   - Each module is analyzed in detail, yielding guidance for optimizing the model structure.

With these improvements, 3D-RCNet shows excellent classification performance on three representative benchmark HSI datasets, surpassing previous ConvNet-based and ViT-based methods.
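The idea of embedding self-attention into a convolution can be illustrated with a minimal sketch. The code below is one plausible reading, not the authors' formulation: it slides a `kernel`-sized cubic window over a 3D volume and replaces the fixed convolution weights with softmax-normalised dot-product scores between the centre voxel and its neighbours. The function name `relational_conv3d`, the absence of learned projections, and the border handling are all assumptions made for illustration.

```python
import math
from itertools import product


def relational_conv3d(volume, kernel=3):
    """Hypothetical local-attention 'convolution'.

    volume[d][h][w] is a feature vector (list of floats); the output has
    the same shape. No learned weights are used in this sketch.
    """
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    channels = len(volume[0][0][0])
    radius = kernel // 2
    offsets = list(product(range(-radius, radius + 1), repeat=3))
    scale = 1.0 / math.sqrt(channels)

    output = [[[None] * width for _ in range(height)] for _ in range(depth)]
    for d, h, w in product(range(depth), range(height), range(width)):
        center = volume[d][h][w]
        # Collect in-bounds neighbours (border voxels simply see fewer).
        neighbours = [
            volume[d + dd][h + dh][w + dw]
            for dd, dh, dw in offsets
            if 0 <= d + dd < depth
            and 0 <= h + dh < height
            and 0 <= w + dw < width
        ]
        # Relation scores: scaled dot product of centre with each neighbour.
        scores = [
            scale * sum(c * n for c, n in zip(center, nb)) for nb in neighbours
        ]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Data-dependent kernel: weighted sum of neighbour features.
        output[d][h][w] = [
            sum(wt * nb[c] for wt, nb in zip(weights, neighbours))
            for c in range(channels)
        ]
    return output
```

Because the weights in each window sum to one, a constant input volume passes through unchanged, which is a quick sanity check. The cost stays linear in the number of voxels for a fixed `kernel`, as with an ordinary convolution.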

3D-RCNet: Learning from Transformer to Build a 3D Relational ConvNet for Hyperspectral Image Classification

A Novel Transformer Network with a CNN-Enhanced Cross-Attention Mechanism for Hyperspectral Image Classification

Multiscale 3-D-2-D Mixed CNN and Lightweight Attention-Free Transformer for Hyperspectral and LiDAR Classification

Channel and band attention embedded 3D CNN for model development of hyperspectral image in object-scale analysis

Learning a 3D-CNN and Convolution Transformers for Hyperspectral Image Classification

3D-Convolution Guided Spectral-Spatial Transformer for Hyperspectral Image Classification

RDTN: Residual Densely Transformer Network for hyperspectral image classification

Deep Hierarchical Vision Transformer for Hyperspectral and LiDAR Data Classification

Hierarchical Attention Transformer for Hyperspectral Image Classification

Random Convolutional Network for Hyperspectral Image Classification

Hybrid Conv-ViT Network for Hyperspectral Image Classification

3D Convolutional Siamese Network for Few-Shot Hyperspectral Classification

H-RNet: Hybrid Relation Network for Few-Shot Learning-Based Hyperspectral Image Classification

Three-dimensional Densely Connected Convolutional Network for Hyperspectral Remote Sensing Image Classification

Hyperspectral Image Classification Using Groupwise Separable Convolutional Vision Transformer Network

Hyperspectral Image Transformer Classification Networks

Hyperspectral Image Classification Based on Multibranch Attention Transformer Networks

Hyperspectral Image Classification Based on 3D Coordination Attention Mechanism Network

A Joint Convolutional Cross ViT Network for Hyperspectral and Light Detection and Ranging Fusion Classification

End-to-End Convolutional Network and Spectral-Spatial Transformer Architecture for Hyperspectral Image Classification

A Center-Masked Transformer for Hyperspectral Image Classification