Multimodal learning of transcriptomes and text enables interactive single-cell RNA-seq data exploration with natural-language chats

Moritz Schaefer, Peter Peneder, Daniel Malzl, Mihaela Peycheva, Jake Burton, Anna Hakobyan, Varun Sharma, Thomas Krausgruber, Joerg Menche, Eleni M Tomazou, Christoph Bock
DOI: https://doi.org/10.1101/2024.10.15.618501
2024-10-18
Abstract: Single-cell RNA-seq characterizes biological samples at unprecedented scale and detail, but data interpretation remains challenging. Here we introduce CellWhisperer, a multimodal machine learning model and software that connects transcriptomes and text for interactive single-cell RNA-seq data analysis. CellWhisperer enables chat-based interrogation of transcriptome data in English. To train our model, we created an AI-curated dataset with over a million pairs of RNA-seq profiles and matched textual annotations across a broad range of human biology, and we established a multimodal embedding of matched transcriptomes and text using contrastive learning. Our model enables free-text search and annotation of transcriptome datasets by cell types, states, and other properties in a zero-shot manner and without the need for reference datasets. Moreover, CellWhisperer answers questions about cells and genes in natural-language chats, using a biologically fluent large language model that we fine-tuned to analyze bulk and single-cell transcriptome data across various biological applications. We integrated CellWhisperer with the widely used CELLxGENE browser, allowing users to interactively explore RNA-seq data through an integrated graphical and chat interface. Our method demonstrates a new way of working with transcriptome data, leveraging the power of natural language for single-cell data analysis and establishing an important building block for future AI-based bioinformatics research assistants.
Bioinformatics
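
The abstract describes two ideas that a short sketch may make more concrete: a shared embedding of transcriptomes and text trained with contrastive learning, and zero-shot annotation by free-text query. The NumPy code below is only an illustration under stated assumptions, not CellWhisperer's implementation. It assumes a CLIP-style symmetric cross-entropy loss, a temperature of 0.07, and that both encoders already produce fixed-length vectors. The function names (`contrastive_loss`, `zero_shot_annotate`) and the toy embeddings are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    x = np.asarray(x, dtype=float)
    norm = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norm, eps)


def contrastive_loss(rna_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over a similarity matrix (CLIP-style; an assumption).

    Row i of rna_emb and row i of text_emb are treated as a matched pair;
    every other row in the batch serves as a negative example.
    """
    logits = l2_normalize(rna_emb) @ l2_normalize(text_emb).T / temperature

    def matched_pair_nll(scores):
        # Numerically stable log-softmax over each row, then pick the diagonal.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the transcriptome-to-text and text-to-transcriptome directions.
    return 0.5 * (matched_pair_nll(logits) + matched_pair_nll(logits.T))


def zero_shot_annotate(cell_emb, label_emb, labels):
    """Assign each cell the free-text label with the highest cosine similarity."""
    sims = l2_normalize(cell_emb) @ l2_normalize(label_emb).T
    best = sims.argmax(axis=1)
    return [labels[i] for i in best], sims


# Toy data: three-dimensional embeddings standing in for real encoder outputs.
labels = ["T cell", "B cell", "hepatocyte"]
label_emb = np.eye(3)
cell_emb = np.array([[0.9, 0.1, 0.0],
                     [0.2, 0.1, 2.0]])

predicted, sims = zero_shot_annotate(cell_emb, label_emb, labels)

# Matched pairs should score a lower loss than deliberately mismatched pairs.
matched = contrastive_loss(np.eye(3), np.eye(3))
mismatched = contrastive_loss(np.eye(3), np.roll(np.eye(3), 1, axis=0))
```

In this sketch no reference dataset is involved at query time: the label set is just a list of strings embedded into the same space as the cells, which is what makes the annotation zero-shot.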