Abstract: Microscopic interpretation of histopathology images underlies many important diagnostic and treatment decisions. While advances in vision-language modeling raise new opportunities for analysis of such images, the gigapixel-scale size of whole slide images (WSIs) introduces unique challenges. Additionally, pathology reports simultaneously highlight key findings from small regions while also aggregating interpretation across multiple slides, often making it difficult to create robust image-text pairs. As such, pathology reports remain a largely untapped source of supervision in computational pathology, with most efforts relying on region-of-interest annotations or self-supervision at the patch-level. In this work, we develop a vision-language model based on the BLIP-2 framework using WSIs paired with curated text from pathology reports. This enables applications utilizing a shared image-text embedding space, such as text or image retrieval for finding cases of interest, as well as integration of the WSI encoder with a frozen large language model (LLM) for WSI-based generative text capabilities such as report generation or AI-in-the-loop interactions. We utilize a de-identified dataset of over 350,000 WSIs and diagnostic text pairs, spanning a wide range of diagnoses, procedure types, and tissue types. We present pathologist evaluation of text generation and text retrieval using WSI embeddings, as well as results for WSI classification and workflow prioritization (slide-level triaging). Model-generated text for WSIs was rated by pathologists as accurate, without clinically significant error or omission, for 78% of WSIs on average. This work demonstrates exciting potential capabilities for language-aligned WSI embeddings.

What problem does this paper attempt to address?

The paper studies vision-language modeling for gigapixel whole slide images (WSIs) in pathology. Building reliable image-text pairs is difficult for two reasons: WSIs are extremely large, and pathology reports both highlight key findings from small regions and aggregate interpretation across multiple slides. As a result, most current methods rely on region-of-interest annotations or patch-level self-supervision rather than on report text. The paper proposes a model called PathAlign, which is based on the BLIP-2 framework and trained on WSIs paired with curated text from pathology reports. The resulting shared image-text embedding space supports cross-modal retrieval (text or image retrieval for finding cases of interest), and the WSI encoder can be integrated with a frozen large language model (LLM) for WSI-based text generation such as report generation or AI-in-the-loop interaction. The study uses a de-identified dataset of over 350,000 WSI and diagnostic text pairs, covering a wide range of diagnoses, procedure types, and tissue types. The authors report pathologist evaluation of text generation and of text retrieval using WSI embeddings, along with results for WSI classification and workflow prioritization (slide-level triaging). On average, pathologists rated the model-generated text as accurate, without clinically significant error or omission, for 78% of WSIs. The paper also discusses related work, including advances in vision-language models for digital pathology, and existing data sources and their challenges, such as creating image-text pairs and the limited availability of WSI-level text descriptions. Finally, it describes the training method, including the use of pre-trained WSI encoders and an LLM for WSI-level text generation.
In conclusion, PathAlign aims to make effective use of the information in pathology reports for semantic understanding and analysis of WSIs, opening new avenues for automated report generation and case-level analysis in pathology.
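
To make the retrieval use case concrete, the following is a minimal sketch of ranking candidate report texts against a WSI embedding by cosine similarity in a shared embedding space. It is not the paper's implementation: the function names, the three-dimensional toy vectors, and the choice of cosine similarity as the scoring function are all assumptions made for illustration, and real embeddings would come from trained image and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        return [0.0 for _ in vec]
    return [v / norm for v in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def retrieve_top_k(query, candidates, k=3):
    """Return (index, score) pairs for the k candidates most similar to the query."""
    scored = [(i, cosine_similarity(query, c)) for i, c in enumerate(candidates)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings (assumption): one WSI query and three report texts.
wsi_query = [0.9, 0.1, 0.0]
text_candidates = [
    [0.0, 1.0, 0.0],  # text 0
    [1.0, 0.0, 0.0],  # text 1
    [0.7, 0.7, 0.0],  # text 2
]

ranked = retrieve_top_k(wsi_query, text_candidates, k=3)
for index, score in ranked:
    print(f"text {index}: similarity {score:.3f}")
```

The same function covers the reverse direction (a text query against WSI embeddings) by swapping which side supplies the query, which is the practical appeal of a shared embedding space.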

PathAlign: A vision-language model for whole slide images in histopathology
