ELIXR: Towards a general purpose X-ray artificial intelligence system through alignment of large language models and radiology vision encoders

Shawn Xu, Lin Yang, Christopher Kelly, Marcin Sieniek, Timo Kohlberger, Martin Ma, Wei-Hung Weng, Atilla Kiraly, Sahar Kazemzadeh, Zakkai Melamed, Jungyeon Park, Patricia Strachan, Yun Liu, Chuck Lau, Preeti Singh, Christina Chen, Mozziyar Etemadi, Sreenivasa Raju Kalidindi, Yossi Matias, Katherine Chou, Greg S. Corrado, Shravya Shetty, Daniel Tse, Shruthi Prabhakara, Daniel Golden, Rory Pilgrim, Krish Eswaran, Andrew Sellergren
2023-09-08
Abstract: In this work, we present an approach, which we call Embeddings for Language/Image-aligned X-Rays, or ELIXR, that leverages a language-aligned image encoder combined or grafted onto a fixed LLM, PaLM 2, to perform a broad range of chest X-ray tasks. We train this lightweight adapter architecture using images paired with corresponding free-text radiology reports from the MIMIC-CXR dataset. ELIXR achieved state-of-the-art performance on zero-shot chest X-ray (CXR) classification (mean AUC of 0.850 across 13 findings), data-efficient CXR classification (mean AUCs of 0.893 and 0.898 across five findings (atelectasis, cardiomegaly, consolidation, pleural effusion, and pulmonary edema) for 1% (~2,200 images) and 10% (~22,000 images) training data), and semantic search (0.76 normalized discounted cumulative gain (NDCG) across nineteen queries, including perfect retrieval on twelve of them). Compared to existing data-efficient methods including supervised contrastive learning (SupCon), ELIXR required two orders of magnitude less data to reach similar performance. ELIXR also showed promise on CXR vision-language tasks, demonstrating overall accuracies of 58.7% and 62.5% on visual question answering and report quality assurance tasks, respectively. These results suggest that ELIXR is a robust and versatile approach to CXR AI.
Computer Vision and Pattern Recognition, Image and Video Processing
What problem does this paper attempt to address?
This paper addresses the limited ability of existing medical imaging AI systems to generalize to new tasks. Such systems are typically built for highly specific tasks and perform inconsistently when extended to new problems. The paper proposes combining a large language model (LLM) with a radiology vision encoder so that multimodal models can be trained efficiently on routinely collected medical images and their associated text. The resulting model is intended to support multiple tasks: zero-shot classification, data-efficient classification, semantic search, visual question answering (VQA), and radiology report quality assurance (QA). This approach is expected to unlock a new generation of medical AI applications and workflows built on these capabilities.
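The semantic search result in the abstract is reported as normalized discounted cumulative gain (NDCG). As a rough illustration of how that metric behaves, the sketch below computes NDCG for a single ranked result list. The function names, the linear gain (relevance used directly), the log2 rank discount, and the absence of a rank cutoff are all assumptions made for illustration; the source does not state which NDCG variant or cutoff the authors used.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain of a ranked list.

    relevances[i] is the graded relevance of the result at rank i + 1.
    Assumes linear gain and a log2(rank + 1) discount.
    """
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(relevances: Sequence[float]) -> float:
    """DCG divided by the DCG of the ideal (descending) ordering.

    Returns 0.0 when no result is relevant, to avoid dividing by zero.
    """
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical relevance grades for one query's top results.
    print(ndcg([2, 1, 0]))  # already in ideal order
    print(ndcg([0, 1]))     # the relevant item is ranked second
```

Under these assumptions, a query whose results are already in the best possible order scores 1.0, which is what "perfect retrieval" would correspond to; averaging the per-query scores gives a single figure comparable in form to the 0.76 quoted in the abstract.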