PathAlign: A vision-language model for whole slide images in histopathology

Faruk Ahmed, Andrew Sellergren, Lin Yang, Shawn Xu, Boris Babenko, Abbi Ward, Niels Olson, Arash Mohtashamian, Yossi Matias, Greg S. Corrado, Quang Duong, Dale R. Webster, Shravya Shetty, Daniel Golden, Yun Liu, David F. Steiner, Ellery Wulczyn
2024-06-28
Abstract: Microscopic interpretation of histopathology images underlies many important diagnostic and treatment decisions. While advances in vision-language modeling raise new opportunities for analysis of such images, the gigapixel-scale size of whole slide images (WSIs) introduces unique challenges. Additionally, pathology reports simultaneously highlight key findings from small regions while also aggregating interpretation across multiple slides, often making it difficult to create robust image-text pairs. As such, pathology reports remain a largely untapped source of supervision in computational pathology, with most efforts relying on region-of-interest annotations or self-supervision at the patch-level. In this work, we develop a vision-language model based on the BLIP-2 framework using WSIs paired with curated text from pathology reports. This enables applications utilizing a shared image-text embedding space, such as text or image retrieval for finding cases of interest, as well as integration of the WSI encoder with a frozen large language model (LLM) for WSI-based generative text capabilities such as report generation or AI-in-the-loop interactions. We utilize a de-identified dataset of over 350,000 WSIs and diagnostic text pairs, spanning a wide range of diagnoses, procedure types, and tissue types. We present pathologist evaluation of text generation and text retrieval using WSI embeddings, as well as results for WSI classification and workflow prioritization (slide-level triaging). Model-generated text for WSIs was rated by pathologists as accurate, without clinically significant error or omission, for 78% of WSIs on average. This work demonstrates exciting potential capabilities for language-aligned WSI embeddings.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper studies vision-language modeling for gigapixel whole slide images (WSIs) in histopathology. Two obstacles make reliable image-text pairs hard to build: WSIs are extremely large, and pathology reports both highlight key findings from small regions and aggregate interpretation across multiple slides. As a result, most existing methods rely on region-of-interest annotations or patch-level self-supervision, and pathology reports remain a largely untapped source of supervision. The paper proposes PathAlign, a model based on the BLIP-2 framework and trained on WSIs paired with curated text from pathology reports. The resulting shared image-text embedding space supports text or image retrieval for finding cases of interest, and the WSI encoder can be integrated with a frozen large language model (LLM) for WSI-based text generation, such as report generation or AI-in-the-loop interaction. The study uses a de-identified dataset of over 350,000 WSI and diagnostic text pairs, covering a wide range of diagnoses, procedure types, and tissue types. Evaluation includes pathologist review of text generation and text retrieval using WSI embeddings, along with results for WSI classification and workflow prioritization (slide-level triaging). On average, pathologists rated the model-generated text as accurate, without clinically significant error or omission, for 78% of WSIs. The paper also discusses related work, including advances in vision-language models for digital pathology, existing data sources, and challenges such as creating image-text pairs and the limited availability of WSI-level text descriptions. Finally, it describes the training method, which uses a pre-trained WSI encoder and an LLM for WSI-level text generation.
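To make the retrieval use case concrete, the following is a minimal sketch of ranking WSIs against a text query in a shared embedding space. It is not PathAlign's implementation: the function names (`cosine_similarity`, `retrieve_top_k`), the toy 3-dimensional vectors, and the choice of cosine similarity as the scoring function are all assumptions made for illustration. In practice the embeddings would come from the trained WSI and text encoders.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_embedding, candidate_embeddings, k=3):
    """Return up to k (index, score) pairs, best match first.

    Hypothetical helper: scores every candidate against the query and
    sorts by descending cosine similarity.
    """
    scored = [
        (index, cosine_similarity(query_embedding, candidate))
        for index, candidate in enumerate(candidate_embeddings)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-ins for WSI embeddings and a text-query embedding (assumed values).
wsi_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
text_query_embedding = [0.9, 0.1, 0.0]

results = retrieve_top_k(text_query_embedding, wsi_embeddings, k=2)
for index, score in results:
    print(f"WSI {index}: similarity {score:.3f}")
```

The same function covers the reverse direction (retrieving report text for a given WSI) by passing a WSI embedding as the query and text embeddings as the candidates.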
In conclusion, PathAlign aims to address the problem of effectively utilizing information from pathology reports for semantic understanding and analysis of WSIs, providing new avenues for automated report generation and case-level analysis in pathology.