Can representation learning for multimodal image registration be improved by supervision of intermediate layers?

Elisabeth Wetzer, Joakim Lindblad, Nataša Sladoje
2023-03-01
Abstract: Multimodal imaging and correlative analysis typically require image alignment. Contrastive learning can generate representations of multimodal images, reducing the challenging task of multimodal image registration to a monomodal one. Previously, additional supervision on intermediate layers in contrastive learning has improved biomedical image classification. We evaluate if a similar approach improves representations learned for registration to boost registration performance. We explore three approaches to add contrastive supervision to the latent features of the bottleneck layer in the U-Nets encoding the multimodal images and evaluate three different critic functions. Our results show that representations learned without additional supervision on latent features perform best in the downstream task of registration on two public biomedical datasets. We investigate the performance drop by exploiting recent insights in contrastive learning in classification and self-supervised learning. We visualize the spatial relations of the learned representations by means of multidimensional scaling, and show that additional supervision on the bottleneck layer can lead to partial dimensional collapse of the intermediate embedding space.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

This paper explores whether supervising intermediate layers (specifically the bottleneck layer) during representation learning improves multimodal image registration. The authors evaluate methods that add contrastive supervision to the bottleneck features of the U-Nets within a contrastive learning framework, with the goal of producing better contrastive multimodal image representations (CoMIRs) and thereby better registration performance.

### Background and Motivation

Multimodal imaging techniques capture complementary information about a sample, which is valuable in digital pathology. However, images produced by different sensors differ greatly in appearance, which makes automatic multimodal image registration very challenging. Manual registration is time-consuming, labor-intensive, and costly. Reliable automated multimodal registration methods are therefore crucial for both research and clinical applications.

Recent studies have shown that additional supervision on intermediate layers in contrastive learning can improve representation learning for biomedical image classification. Based on this finding, the authors investigate whether a similar approach carries over to multimodal image registration and further improves the quality of CoMIRs.

### Methods and Experiments

The authors propose three methods to add a contrastive loss to the bottleneck layer features of the U-Net:

1. **Alternating Loss**: Alternate between the contrastive loss of the final output layer and that of the bottleneck layer from one iteration to the next.
2. **Weighted Loss**: Compute the contrastive losses of the final output layer and the bottleneck layer in every iteration and combine them, weighted by a hyperparameter.
3. **Pre-training**: Pre-train with the bottleneck layer loss for 50 epochs, then train with the final output layer loss for another 50 epochs.
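The three schedules listed above differ only in how the two contrastive losses are combined at each training step. The following is a minimal sketch of that bookkeeping, not the authors' implementation. The function names, the `mode` strings, the parameter `alpha`, the choice of which loss is used on even iterations, and the exact form of the weighted sum are all assumptions made for illustration.

```python
def loss_weights(mode, iteration, epoch, alpha=0.5, pretrain_epochs=50):
    """Return (w_output, w_bottleneck) for the current training step.

    All names and defaults are hypothetical; only the 50-epoch
    pre-training length is taken from the summary above.
    """
    if mode == "baseline":
        # No additional supervision on the bottleneck layer.
        return 1.0, 0.0
    if mode == "alternating":
        # Assumption: even iterations use the output loss,
        # odd iterations use the bottleneck loss.
        return (1.0, 0.0) if iteration % 2 == 0 else (0.0, 1.0)
    if mode == "weighted":
        # Assumption: the bottleneck loss is added, scaled by alpha.
        return 1.0, alpha
    if mode == "pretraining":
        # Bottleneck loss only during pre-training, output loss afterwards.
        return (0.0, 1.0) if epoch < pretrain_epochs else (1.0, 0.0)
    raise ValueError(f"unknown mode: {mode!r}")


def total_loss(loss_output, loss_bottleneck, mode, iteration, epoch, **kwargs):
    """Combine the two contrastive losses according to the chosen schedule."""
    w_out, w_bn = loss_weights(mode, iteration, epoch, **kwargs)
    return w_out * loss_output + w_bn * loss_bottleneck
```

In a real training loop, `loss_output` and `loss_bottleneck` would be the contrastive losses computed on the final representations and on the bottleneck features, respectively, and the returned value would be the quantity that is backpropagated.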
The authors conducted experiments on two public biomedical datasets, the SHG & BF dataset and the QPI & FM dataset. Evaluation metrics included the registration success rate (RSR) and various image similarity/distance measures.

### Results and Discussion

The experimental results show that the baseline method without additional supervision achieved the best registration performance on both datasets. Specifically:

- On the SHG & BF dataset, the weighted loss method using the L1 norm as the similarity function performed best among the supervised variants but still did not surpass the baseline.
- On the QPI & FM dataset, the pre-training method outperformed the alternating loss method but still fell short of the baseline.

Further analysis revealed that additional supervision of the bottleneck layer can cause some dimensions of the feature space to collapse, which affects registration performance. Moreover, visualizing the feature embedding space through multidimensional scaling (MDS) showed that additional supervision caused features to cluster by modality rather than by cross-modality similarity.

### Conclusion

This study indicates that, for multimodal image registration, CoMIRs generated without additional supervision perform best in the downstream task. This contrasts with previous observations in biomedical image classification and suggests that different tasks place different requirements on representation learning. Future work can further explore how to optimize representation learning for multimodal image registration.
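As a supplement to the dimensional collapse discussed under Results and Discussion, a common diagnostic is to inspect the singular value spectrum of the centered embedding matrix: dimensions whose singular values are close to zero carry almost no variance. The sketch below assumes this diagnostic and is not necessarily the analysis the authors performed. The function name `effective_dimensions` and the threshold `rel_tol` are hypothetical.

```python
import numpy as np


def effective_dimensions(embeddings, rel_tol=1e-3):
    """Count embedding directions that carry non-negligible variance.

    embeddings: array of shape (n_samples, n_features).
    rel_tol: hypothetical threshold relative to the largest singular value.
    """
    x = np.asarray(embeddings, dtype=float)
    # Center the features so the singular values reflect variance.
    x = x - x.mean(axis=0, keepdims=True)
    singular_values = np.linalg.svd(x, compute_uv=False)
    if singular_values[0] == 0:
        # All samples are identical: the embedding is fully collapsed.
        return 0
    return int(np.sum(singular_values / singular_values[0] > rel_tol))


# Synthetic illustration (not data from the paper).
rng = np.random.default_rng(0)
full = rng.normal(size=(200, 8))   # all 8 dimensions are used
collapsed = full.copy()
collapsed[:, 5:] = 0.0             # 3 of 8 dimensions carry no signal
```

On the synthetic data, `effective_dimensions(full)` counts all eight dimensions, while `effective_dimensions(collapsed)` counts only the five that still vary, which is the signature of partial collapse.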