Abstract: INTRODUCTION: Our understanding of gene properties has advanced through representation learning methods such as AlphaFold. Representation learning encodes the relationships between genes by embedding them in a numerical space. These embeddings, which capture complex genetic interactions and characteristics, can then be used by machine learning models to predict various gene properties. Current embeddings derive from transcriptome or sequence data. Over the past 150 years, numerous experimental assays have uncovered gene functions and interactions; this comprehensive knowledge is documented in the literature but is not always evident in transcriptome or sequence data. It has been posited, however, that leveraging this knowledge to create gene embeddings could result in machine learning models biased towards well-studied genes.

METHODS AND RESULTS: We tested this hypothesis by developing a novel knowledge-embedding framework, GeneLLM. During training, GeneLLM learns to comprehend summaries of every gene, a compressed form of published knowledge, using Large Language Models (LLMs) fine-tuned for downstream tasks mapping cellular properties and biochemical processes. Despite the expected bias towards well-known genes, GeneLLM showed surprisingly high predictive power for an array of gene properties. Compared to baseline models, GeneLLM increased performance by 20.3% in correlation for gene conservation across species, and by 8.6% and 57.2% in prediction accuracy for subcellular localization and gene ontology, respectively. GeneLLM also showed competitive results on solubility prediction, with an accuracy of 0.91, and achieved a correlation of 0.71 for tissue-specific expression levels across 1001 cell lines. We also showed that the bias toward well-known genes could be mitigated by combining GeneLLM representations with transcriptome- or sequence-based embeddings. The combined embeddings outperformed their individual components, which suggests that GeneLLM extracts views complementary to existing embedding methods.

CONCLUSION: The GeneLLM framework demonstrates the ability of LLMs to extract information from the rich knowledge available about the nexus of genes and their cellular traits. It also illustrates that knowledge-based representations, despite their bias, are complementary to transcriptome- and sequence-based information. The ability of GeneLLM to advance our understanding of genes, their roles in cellular processes, and their impact on oncogenesis, as well as on response and resistance mechanisms, highlights its potential in cancer research.

Citation Format: Ala Jararweh, Kushal Virupakshappa, Oladimeji S. Macaulay, Aaron Segura, Olufunmilola M. Oyebamiji, Yue Hu, Avinash D. Sahu. GeneLLM: Unveiling gene functions through literature-driven transformer embeddings [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 3534.
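
The abstract reports that combining literature-derived embeddings with transcriptome- or sequence-based embeddings outperformed either alone, but it does not say how the combination was done. The following is a minimal sketch of one plausible approach, assuming simple feature concatenation after per-column scaling; it is not the authors' method. All function names, the synthetic data, and the nearest-centroid classifier are hypothetical stand-ins chosen for illustration.

```python
import random
import statistics


def zscore_columns(matrix):
    """Scale each column to zero mean and unit variance (constant columns become 0)."""
    columns = list(zip(*matrix))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) or 1.0 for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in matrix]


def combine_embeddings(text_emb, other_emb):
    """Concatenate two per-gene embedding matrices (same gene order) feature-wise.

    Assumption: plain concatenation; the abstract does not specify the scheme.
    """
    if len(text_emb) != len(other_emb):
        raise ValueError("both matrices need one row per gene, in the same order")
    return [t + o for t, o in zip(zscore_columns(text_emb), zscore_columns(other_emb))]


def fit_centroids(rows, labels):
    """Compute the mean vector of each class (nearest-centroid classifier)."""
    centroids = {}
    for label in set(labels):
        members = [r for r, l in zip(rows, labels) if l == label]
        centroids[label] = [statistics.fmean(c) for c in zip(*members)]
    return centroids


def predict(centroids, row):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(row, centroids[lab])),
    )


# Synthetic stand-ins for real embeddings: random vectors, not gene data.
rng = random.Random(0)
n_genes = 120
text_emb = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(n_genes)]
seq_emb = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(n_genes)]
# Toy label that depends on one feature from each source.
labels = [int(t[0] + s[0] > 0) for t, s in zip(text_emb, seq_emb)]

# For brevity, scaling uses all genes; a real evaluation would fit the
# scaling statistics on the training split only.
combined = combine_embeddings(text_emb, seq_emb)
centroids = fit_centroids(combined[:90], labels[:90])
hits = sum(predict(centroids, r) == l for r, l in zip(combined[90:], labels[90:]))
accuracy = hits / 30
print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

Scaling each block before concatenation keeps one embedding source from dominating distance computations purely because of its numeric range; any downstream model could replace the nearest-centroid classifier used here.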
