Universal Medical Image Representation Learning with Compositional Decoders

Kaini Wang, Ling Yang, Siping Zhou, Guangquan Zhou, Wentao Zhang, Bin Cui, Shuo Li
2024-10-07
Abstract: Visual-language models have advanced the development of universal models, yet their application in medical imaging remains constrained by specific functional requirements and limited data. Current general-purpose models are typically designed with task-specific branches and heads, which restricts the shared feature space and the flexibility of the model. To address these challenges, we have developed a decomposed-composed universal medical imaging paradigm (UniMed) that supports tasks at all levels. To this end, we first propose a decomposed decoder that can predict two types of outputs, pixel and semantic, based on a defined input queue. Additionally, we introduce a composed decoder that unifies the input and output spaces and standardizes task annotations across different levels into a discrete token format. The coupled design of these two components enables the model to combine tasks flexibly so that they benefit one another. Moreover, our joint representation learning strategy leverages large amounts of unlabeled data and an unsupervised loss, achieving efficient one-stage pretraining for more robust performance. Experimental results show that UniMed achieves state-of-the-art performance on eight datasets across all three tasks and exhibits strong zero-shot and 100-shot transferability. We will release the code and trained models upon the paper's acceptance.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper aims to solve several key challenges currently faced by general-purpose medical image analysis models:

1. **Task Diversity and Flexibility**:
   - Existing general-purpose models are usually designed with task-specific branches and heads, which limit the flexibility of the shared feature space and make it difficult to handle semantic understanding and visual tasks simultaneously (such as locating lesions and identifying their types).
   - A model is needed that can switch seamlessly between tasks, allowing users to customize functions for specific scenarios (for example, switching between detection and segmentation for lesion screening or resection procedures).
2. **Data Volume and Annotation Diversity**:
   - Medical image data is relatively limited, and annotation methods are diverse. Different tasks require different levels of annotation (for example, classification tasks require image-level annotation, segmentation tasks require pixel-level annotation, and referring segmentation tasks combine text and pixel-level annotation).
   - Existing methods usually handle multi-task learning by adding extra branches or heads, which increases model complexity and makes the tasks harder to balance.
3. **Cross-task Collaboration and Knowledge Sharing**:
   - Annotation content varies greatly among datasets, making it difficult to integrate and use them directly. Mainstream methods split datasets with different annotations into multiple subsets for training, which significantly increases computational complexity and limits knowledge sharing between different annotations.
4. **Transferability**:
   - The model needs strong transferability so that it can still provide high-quality predictions when faced with new data.
To solve these problems, the authors propose a new general-purpose medical image analysis model, **UniMed**, which has the following features:

- **Decomposed-Composed Decoders**: A decomposed decoder predicts two types of outputs (pixel and semantic) based on the defined input queue, and a composed decoder unifies the input and output spaces and standardizes the annotations of tasks at different levels into a discrete token format.
- **Joint Representation Learning Strategy**: A large amount of unlabeled data and an unsupervised loss are used to achieve efficient one-stage pretraining and make the model more robust.
- **Cross-task Collaboration**: Through the coupled design of the decomposed and composed decoders, the model can flexibly combine tasks so that they benefit one another, supporting various task interactions.

Experimental results show that UniMed achieves state-of-the-art performance on eight datasets and demonstrates strong zero-shot and few-shot (100-shot) transferability.
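To make the idea of a shared discrete token format more concrete, the sketch below serializes annotations from three different levels (an image-level label, a bounding box, and a polygon outline) into token sequences drawn from one vocabulary. It is only an illustration, not UniMed's actual scheme: the summary does not specify the tokenization, so the vocabulary layout, `NUM_BINS`, the special tokens, the class names, and every function name here are hypothetical, and the coordinate quantization is borrowed from the general approach used in sequence-based detection models.

```python
from typing import List, Tuple

# All names and values below are hypothetical, chosen for illustration only.
NUM_BINS = 1000  # assumed number of bins used to quantize coordinates

# Assumed vocabulary layout: [special tokens | class tokens | coordinate bins]
SPECIAL = {"<cls>": 0, "<det>": 1, "<seg>": 2, "<eos>": 3}
CLASS_NAMES = ["benign", "malignant"]  # placeholder label set
CLASS_OFFSET = len(SPECIAL)
COORD_OFFSET = CLASS_OFFSET + len(CLASS_NAMES)


def quantize(value: float, size: int) -> int:
    """Map a pixel coordinate in [0, size] to a bin index in [0, NUM_BINS - 1]."""
    ratio = min(max(value / size, 0.0), 1.0)
    return min(int(ratio * NUM_BINS), NUM_BINS - 1)


def class_token(name: str) -> int:
    """Token id for a class name."""
    return CLASS_OFFSET + CLASS_NAMES.index(name)


def coord_token(value: float, size: int) -> int:
    """Token id for a quantized coordinate."""
    return COORD_OFFSET + quantize(value, size)


def encode_classification(label: str) -> List[int]:
    """Image-level annotation: task token, class token, end token."""
    return [SPECIAL["<cls>"], class_token(label), SPECIAL["<eos>"]]


def encode_detection(box: Tuple[float, float, float, float], label: str,
                     width: int, height: int) -> List[int]:
    """Box-level annotation: task token, four corner coordinates, class, end."""
    x0, y0, x1, y1 = box
    return [
        SPECIAL["<det>"],
        coord_token(x0, width), coord_token(y0, height),
        coord_token(x1, width), coord_token(y1, height),
        class_token(label),
        SPECIAL["<eos>"],
    ]


def encode_segmentation(polygon: List[Tuple[float, float]], label: str,
                        width: int, height: int) -> List[int]:
    """Region-level annotation approximated by a polygon outline."""
    tokens = [SPECIAL["<seg>"]]
    for x, y in polygon:
        tokens += [coord_token(x, width), coord_token(y, height)]
    tokens += [class_token(label), SPECIAL["<eos>"]]
    return tokens


if __name__ == "__main__":
    print(encode_classification("malignant"))
    print(encode_detection((0, 0, 256, 512), "benign", 512, 512))
    print(encode_segmentation([(0, 0), (512, 0), (512, 512)], "benign", 512, 512))
```

Because every annotation level ends up in the same vocabulary, a single sequence decoder could in principle be trained on datasets with mixed annotation types without adding per-task heads, which is the kind of sharing the summary attributes to the composed decoder.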