Residual-based Language Models are Free Boosters for Biomedical Imaging

Zhixin Lai, Jing Wu, Suiyao Chen, Yucheng Zhou, Naira Hovakimyan
2024-03-29
Abstract: In this study, we uncover the unexpected efficacy of residual-based large language models (LLMs) as part of encoders for biomedical imaging tasks, a domain traditionally devoid of language or textual data. The approach diverges from established methodologies by utilizing a frozen transformer block, extracted from pre-trained LLMs, as an innovative encoder layer for the direct processing of visual tokens. This strategy represents a significant departure from the standard multi-modal vision-language frameworks, which typically hinge on language-driven prompts and inputs. We found that these LLMs could boost performance across a spectrum of biomedical imaging applications, including both 2D and 3D visual classification tasks, serving as plug-and-play boosters. More interestingly, as a byproduct, we found that the proposed framework achieved superior performance, setting new state-of-the-art results on extensive, standardized datasets in MedMNIST-2D and 3D. Through this work, we aim to open new avenues for employing LLMs in biomedical imaging and enriching the understanding of their potential in this specialized domain.
Computer Vision and Pattern Recognition, Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper examines how a residual-based large language model (R-LLM) can serve as part of the encoder for biomedical imaging tasks, a field that traditionally involves little or no language or text data. The authors extract a frozen transformer block from a pre-trained large language model (LLM) and use it as an encoder layer that processes visual tokens directly. This differs from conventional multimodal vision-language frameworks, which usually rely on language-driven prompts and inputs. The results show that these frozen LLM blocks improve performance across a range of biomedical imaging applications, including 2D and 3D visual classification, and set new state-of-the-art results on the extensive, standardized MedMNIST-2D and MedMNIST-3D datasets. According to the experiments, these gains do not require substantially more training data or significantly more computation. The paper also highlights two major challenges in training models for this domain: the need for large amounts of carefully annotated data and the complexity of model optimization. To address them, it proposes using the LLM transformer block as an effective encoder for visual data, which improves performance with a simple structure and without relying on any language input. In conclusion, the paper introduces a new way of applying LLMs to biomedical imaging, improving model efficiency and accuracy and opening new avenues for using LLMs in this specialized field.
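The core idea described above, a frozen transformer block placed inside a visual encoder, can be illustrated with a minimal PyTorch sketch. It is not the authors' implementation. The class name `FrozenBlockBooster`, the helper `make_stand_in_block`, the two linear projections, and all dimensions are assumptions made for illustration. The additive shortcut is one plausible reading of "residual-based". A randomly initialised `nn.TransformerEncoderLayer` stands in for the pre-trained LLM block, which in practice would be loaded from a real checkpoint.

```python
import torch
from torch import nn


class FrozenBlockBooster(nn.Module):
    """Hypothetical wrapper: frozen transformer block + trainable projections + residual."""

    def __init__(self, vis_dim: int, llm_dim: int, frozen_block: nn.Module):
        super().__init__()
        # Trainable projections (an assumption) that match the visual token
        # width to the width the frozen block expects, and back again.
        self.proj_in = nn.Linear(vis_dim, llm_dim)
        self.proj_out = nn.Linear(llm_dim, vis_dim)
        self.block = frozen_block
        # Freeze the block: its weights receive no gradient updates.
        for param in self.block.parameters():
            param.requires_grad = False

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, vis_dim), e.g. patch embeddings from a ViT.
        boosted = self.proj_out(self.block(self.proj_in(tokens)))
        # Residual shortcut around the frozen block.
        return tokens + boosted


def make_stand_in_block(llm_dim: int) -> nn.Module:
    # Stand-in only: a randomly initialised encoder layer, NOT a pre-trained
    # LLM block. A real setup would load one block from an LLM checkpoint.
    return nn.TransformerEncoderLayer(
        d_model=llm_dim,
        nhead=4,
        dim_feedforward=4 * llm_dim,
        dropout=0.0,
        batch_first=True,
    )


# Usage: 2 images, 16 visual tokens each, 32-dimensional token embeddings.
booster = FrozenBlockBooster(vis_dim=32, llm_dim=64, frozen_block=make_stand_in_block(64))
visual_tokens = torch.randn(2, 16, 32)
output_tokens = booster(visual_tokens)
print(output_tokens.shape)
```

Because the module returns tokens of the same shape it receives, it can be placed between existing encoder layers and the classification head without changing either, which is what makes a plug-and-play use possible. Only the two projections and the surrounding encoder are trained; the wrapped block stays fixed.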