Abstract: The missing modality issue is critical but non-trivial to be solved by multi-modal models. Current methods aiming to handle the missing modality problem in multi-modal tasks, either deal with missing modalities only during evaluation or train separate models to handle specific missing modality settings. In addition, these models are designed for specific tasks, so for example, classification models are not easily adapted to segmentation tasks and vice versa. In this paper, we propose the Shared-Specific Feature Modelling (ShaSpec) method that is considerably simpler and more effective than competing approaches that address the issues above. ShaSpec is designed to take advantage of all available input modalities during training and evaluation by learning shared and specific features to better represent the input data. This is achieved from a strategy that relies on auxiliary tasks based on distribution alignment and domain classification, in addition to a residual feature fusion procedure. Also, the design simplicity of ShaSpec enables its easy adaptation to multiple tasks, such as classification and segmentation. Experiments are conducted on both medical image segmentation and computer vision classification, with results indicating that ShaSpec outperforms competing methods by a large margin. For instance, on BraTS2018, ShaSpec improves the SOTA by more than 3% for enhancing tumour, 5% for tumour core and 3% for whole tumour. The code repository address is https://github.com/billhhh/ShaSpec/.
What problem does this paper attempt to address?
### Problems the paper attempts to solve
This paper addresses the problem of missing modalities in multi-modal learning. Current methods either handle missing modalities only at evaluation time or train a separate model for each specific missing-modality setting. They are also usually designed for a single task: classification models are difficult to adapt to segmentation, and vice versa. The paper therefore proposes Shared-Specific Feature Modelling (ShaSpec), which uses all available input modalities during both training and evaluation and represents the input data better by learning shared and specific features. ShaSpec's simple design also makes it easy to adapt to multiple tasks, such as classification and segmentation.
### Main contributions
1. **Proposed an extremely simple and effective multi-modal learning method**: Based on modelling and fusing shared and specific features, ShaSpec can handle missing modalities in training and evaluation, and supports both dedicated and non-dedicated training.
2. **Achieved multi-task adaptation for the first time**: As far as the authors know, ShaSpec is the first missing-modality multi-modal method that can easily adapt to both classification and segmentation tasks.
### Experimental results
The paper reports experiments on computer vision classification and medical image segmentation benchmarks, with results showing that ShaSpec achieves state-of-the-art performance. In particular, on the BraTS2018 dataset, ShaSpec improves on recently proposed competing methods by more than 3% for enhancing tumors, 5% for tumor cores, and 3% for the whole tumor.
### Method overview
#### 3.1 Overall architecture
The ShaSpec model consists of a shared encoder \( f_{\theta_{\text{sha}}} \), modality-specific encoders \( f_{\theta_{\text{spec}}}^{(i)} \), a feature projection layer \( f_{\theta_{\text{proj}}} \) and a decoder \( f_{\theta_{\text{dec}}} \). For a sample \( M_j=\{x^{(i)}_j\}_{i = 1}^N \) with \( N \) modalities, each modality \( x^{(i)}_j\in X \) is passed through the shared encoder to obtain shared features \( r^{(i)} \) and through its own specific encoder to obtain specific features \( s^{(i)} \). The shared and specific features are then combined by a residual fusion step, and the decoder produces the prediction from the fused features.
#### 3.2 Evaluation of complete and missing modalities
- **Complete modalities**:
\[
r^{(i)}=f_{\theta_{\text{sha}}}(x^{(i)}),\quad s^{(i)}=f_{\theta_{\text{spec}}}^{(i)}(x^{(i)})
\]
\[
f^{(i)}=f_{\theta_{\text{proj}}}(r^{(i)}, s^{(i)})+r^{(i)}
\]
\[
\tilde{y}=f_{\theta_{\text{dec}}}(f^{(1)},\ldots,f^{(N)})
\]
- **Missing modalities**:
Suppose the \( n \)-th modality is missing. Features for the remaining modalities are extracted and fused exactly as above, while the feature \( f^{(n)} \) of the missing modality is replaced by the average of the shared features of the available modalities:
\[
f^{(n)}=\frac{1}{N - 1}\sum_{i = 1, i\neq n}^N r^{(i)}
\]
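
The two cases above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the names `fuse`, `mean_shared` and `fused_features` are hypothetical, the toy encoders and projection are arbitrary stand-ins for learned networks, and the decoder is omitted. Averaging over all available modalities when more than one is missing is an assumption that reduces to the \( \frac{1}{N-1} \) formula when exactly one modality is missing.

```python
from typing import Callable, Dict, List, Optional, Sequence

Vector = List[float]
Encoder = Callable[[Vector], Vector]
Projection = Callable[[Vector, Vector], Vector]


def fuse(shared: Vector, specific: Vector, project: Projection) -> Vector:
    """Residual fusion: f = proj(r, s) + r."""
    projected = project(shared, specific)
    return [p + r for p, r in zip(projected, shared)]


def mean_shared(shared_feats: Sequence[Vector]) -> Vector:
    """Element-wise mean of the shared features of the available modalities."""
    count = len(shared_feats)
    return [sum(values) / count for values in zip(*shared_feats)]


def fused_features(
    inputs: Sequence[Optional[Vector]],
    shared_enc: Encoder,
    specific_encs: Sequence[Encoder],
    project: Projection,
) -> List[Vector]:
    """Return one feature per modality; None marks a missing modality."""
    shared: Dict[int, Vector] = {}
    fused: Dict[int, Vector] = {}
    for i, x in enumerate(inputs):
        if x is None:
            continue
        r = shared_enc(x)            # shared features r^(i)
        s = specific_encs[i](x)      # specific features s^(i)
        shared[i] = r
        fused[i] = fuse(r, s, project)
    if not shared:
        raise ValueError("at least one modality must be available")
    # Missing modalities receive the mean of the available shared features.
    substitute = mean_shared(list(shared.values()))
    return [fused.get(i, substitute) for i in range(len(inputs))]


# Toy stand-ins for the learned networks (assumptions for illustration only).
def toy_shared_enc(x: Vector) -> Vector:
    return [2.0 * v for v in x]


def make_toy_specific_enc(offset: float) -> Encoder:
    return lambda x: [v + offset for v in x]


def toy_project(r: Vector, s: Vector) -> Vector:
    return [0.5 * (a + b) for a, b in zip(r, s)]


toy_specific_encs = [make_toy_specific_enc(k) for k in (1.0, 2.0, 3.0)]

# Three modalities, the third one missing.
features = fused_features(
    [[1.0, 2.0], [3.0, 4.0], None],
    toy_shared_enc,
    toy_specific_encs,
    toy_project,
)
```

In this sketch `features` would be passed on to the decoder; the entry for the missing third modality is built only from the shared features of the first two.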
#### 3.3 Training process
Training optimizes the main task (classification or segmentation) together with two auxiliary tasks: domain classification, which is used to learn the specific feature representations, and distribution alignment, which is used to learn the shared feature representations.
- **Domain classification objective**:
\[
\ell_{\text{dco}}(D,\theta_{\text{s}})