CellViT: Vision Transformers for Precise Cell Segmentation and Classification

Fabian Hörst, Moritz Rempe, Lukas Heine, Constantin Seibold, Julius Keyl, Giulia Baldini, Selma Ugurel, Jens Siveke, Barbara Grünwald, Jan Egger, Jens Kleesiek
2023-10-06
Abstract: Nuclei detection and segmentation in hematoxylin and eosin-stained (H&E) tissue images are important clinical tasks and crucial for a wide range of applications. However, it is a challenging task due to nuclei variances in staining and size, overlapping boundaries, and nuclei clustering. While convolutional neural networks have been extensively used for this task, we explore the potential of Transformer-based networks in this domain. Therefore, we introduce CellViT, a new method for automated instance segmentation of cell nuclei in digitized tissue samples using a deep learning architecture based on the Vision Transformer. CellViT is trained and evaluated on the PanNuke dataset, which is one of the most challenging nuclei instance segmentation datasets, consisting of nearly 200,000 nuclei annotated in 5 clinically important classes across 19 tissue types. We demonstrate the superiority of large-scale in-domain and out-of-domain pre-trained Vision Transformers by leveraging the recently published Segment Anything Model and a ViT encoder pre-trained on 104 million histological image patches, achieving state-of-the-art nuclei detection and instance segmentation performance on the PanNuke dataset with a mean panoptic quality of 0.50 and an F1-detection score of 0.83. The code is publicly available at https://github.com/TIO-IKIM/CellViT
Image and Video Processing, Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The paper addresses nuclei detection and segmentation in hematoxylin and eosin (H&E) stained tissue images, an important clinical task that is crucial for a wide range of applications. The task is challenging because nuclei vary in staining and size, have overlapping boundaries, and tend to cluster. Although convolutional neural networks (CNNs) have been widely used for this task, the authors explore the potential of Transformer-based networks in this field. Specifically, the paper proposes CellViT, a method for automated instance segmentation and classification of cell nuclei in digitized tissue samples. CellViT is based on the Vision Transformer (ViT) architecture and, through large-scale pre-training followed by fine-tuning on task-specific data, achieves state-of-the-art performance on the PanNuke dataset, which contains nearly 200,000 annotated nuclei across 19 tissue types and 5 clinically important nuclei classes.

The main contributions include:

1. A novel U-Net-shaped encoder-decoder network that uses a Vision Transformer as the encoder. It significantly outperforms existing nuclei detection methods and achieves segmentation results comparable to other state-of-the-art methods on the PanNuke dataset.
2. The first application of Vision Transformers to nuclei instance segmentation on the PanNuke dataset, demonstrating their effectiveness in this field. The method combines a pre-trained ViT encoder with a decoder network connected through skip connections.
3. A framework for fast inference on gigapixel whole-slide images (WSIs) that uses large inference patches of 1024×1024 pixels, which is 1.85 times faster than conventional 256×256-pixel patches.

Through these contributions, CellViT not only improves the accuracy of nuclei detection and segmentation but also provides a reliable feature extraction tool for downstream tasks.
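To make the patch-size point concrete, the following minimal sketch counts how many square tiles are needed to cover a slide at the two patch sizes mentioned above. It is an illustration only and is not taken from the CellViT code base: the function `count_tiles`, its `overlap` parameter, and the slide dimensions are all hypothetical.

```python
import math


def count_tiles(width: int, height: int, tile: int, overlap: int = 0) -> int:
    """Number of square tiles needed to cover a width x height image.

    Adjacent tiles share `overlap` pixels, so the stride is tile - overlap.
    (Hypothetical helper, not part of CellViT.)
    """
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    stride = tile - overlap

    def along(extent: int) -> int:
        # The first tile covers `tile` pixels; each further tile adds `stride`.
        return 1 if extent <= tile else 1 + math.ceil((extent - tile) / stride)

    return along(width) * along(height)


# Hypothetical slide size, chosen only for illustration.
slide_w, slide_h = 102_400, 81_920

small = count_tiles(slide_w, slide_h, tile=256)    # 400 * 320 tiles
large = count_tiles(slide_w, slide_h, tile=1024)   # 100 * 80 tiles
ratio = small / large                              # 16x fewer tiles at 1024
```

Fewer tiles does not mean proportionally less compute, because each 1024×1024 tile holds 16 times as many pixels as a 256×256 tile. The reported 1.85x speedup is an end-to-end figure from the paper; this sketch only shows the reduction in the number of forward passes and does not attempt to reproduce that number.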