Enformation Theory: A Framework for Evaluating Genomic AI

Eyes S. Robson, Nilah M. Ioannidis
DOI: https://doi.org/10.1101/2024.09.03.611127
2024-09-07
Abstract: The nascent field of genomic AI is rapidly expanding with new models, benchmarks, and findings. As the field diversifies, there is an increased need for a common set of measurement tools and perspectives to standardize model evaluation. Here, we present a statistically grounded framework for performance evaluation, visualization, and interpretation using the prominent genomic AI model Enformer as a case study. The Enformer model has been used for a range of applications from mechanism discovery to variant effect prediction, but what makes it better or worse than precedent models at particular tasks? Our goal is not merely to answer these questions for Enformer, but to propose how we should think about new models in general. We start by reporting Enformer’s few-shot performance on the GUANinE benchmark, which emphasizes complex genome interpretation tasks, and discuss its gains and deficits compared to precedent models. We follow this analysis with visualizations of Enformer’s embeddings in low-dimensional space, where, among other insights, we diagnose features of the embeddings that may limit model generalization to synthetic biology tasks. Finally, we present a novel, theory-backed probe of Enformer embeddings, where variance decomposition allows for holistic interpretation and partial ‘backtracking’ to explanatory causal features. Through this case study, we illustrate a new framework, Enformation Theory, for analyzing and interpreting genomic AI models.
Genomics
What problem does this paper attempt to address?
The paper addresses a problem in genomic AI: as new models, benchmarks, and findings proliferate, how can evaluation tools and perspectives be standardized so that different models are compared fairly and evaluated effectively? Specifically, the paper proposes a statistically grounded framework, Enformation Theory, for evaluating, visualizing, and interpreting genomic AI models, using the Enformer model as a case study.

### Main Objectives of the Paper

1. **Performance Evaluation**:
   - Evaluate the performance of Enformer on the GUANinE benchmark and compare it with previous models.
   - Analyze the advantages and disadvantages of Enformer on different tasks.
2. **Visualization Analysis**:
   - Project the benchmark sequences into a low-dimensional space to better understand the model's embedding features.
   - Diagnose problems in the embedding features that may limit the model's ability to generalize to synthetic biology tasks.
3. **Model Explanation**:
   - Propose a new method based on variance decomposition for holistic interpretation and partial "backtracking" to explanatory causal features.
   - Use this method to identify the confounding factors that Enformer has learned or failed to learn, and confirm the relevance of regional-level and chromosome-level features to basic sequence characteristics such as DNA accessibility.

### Specific Problems

1. **How does Enformer perform on the GUANinE benchmark?**
   - Report the performance of Enformer's embeddings and human output head on the GUANinE benchmark using L2-regularized linear evaluation.
   - Discuss its advantages and disadvantages relative to previous models.
2. **How do Enformer's embedding features behave in low-dimensional space?**
   - Use principal component analysis (PCA) to project the sequences in the GUANinE tasks into the embedding space of Enformer's training data.
   - Through visualization, identify why certain tasks (such as the GPRA task) perform poorly in Enformer's embedding space.
3. **How interpretable is Enformer?**
   - Use the variance-decomposition method to analyze the confounding factors in Enformer's embeddings.
   - Identify the features that Enformer has learned or failed to learn, and confirm the impact of these features on basic sequence characteristics such as DNA accessibility.

### Method Overview

1. **Feature Extraction**:
   - Run inference with a PyTorch implementation of Enformer, and extract the single-bin embedding and the three-bin average embedding for each task.
   - Cache the 3,072-dimensional embedding and the 5,313-dimensional output for subsequent linear evaluation.
2. **Linear Evaluation**:
   - Use ridge regression in scikit-learn for L2-regularized linear evaluation.
   - Search for the optimal regularization coefficient, and select whichever of the three-bin average or single-bin predictions performs best on the development set.
3. **Dimensionality Reduction**:
   - Use incremental PCA to estimate the first 30 principal components, which reduces the amount of computation.
   - Use t-SNE for further dimensionality reduction to improve running efficiency.
4. **Variance Decomposition Based on Confounding Factors**:
   - Use the post hoc confound analysis method proposed by Dinga et al. to decompose the explained variance.
   - Use the Pythagorean property of the residual sum of squares (RSS) to perform the variance decomposition.

Through these methods, the paper aims to provide a systematic and transferable framework for analyzing and interpreting genomic AI models, with Enformer as the worked example. The framework helps explain the performance of existing models and offers guidance for developing future ones.
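The Linear Evaluation step in the Method Overview can be illustrated with a minimal sketch. It is not the authors' code: the random arrays stand in for cached embeddings, the names `fit_best_ridge` and `select_embedding`, the alpha grid, the reduced dimensionality, and the use of R² as the development-set metric are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge


def fit_best_ridge(X_train, y_train, X_dev, y_dev, alphas):
    """Fit one ridge model per alpha and keep the best dev-set R^2."""
    best_score, best_alpha, best_model = -np.inf, None, None
    for alpha in alphas:
        model = Ridge(alpha=alpha).fit(X_train, y_train)
        score = model.score(X_dev, y_dev)  # R^2 on the development set (assumed metric)
        if score > best_score:
            best_score, best_alpha, best_model = score, alpha, model
    return best_score, best_alpha, best_model


def select_embedding(features, y_train, y_dev, alphas=(0.1, 1.0, 10.0, 100.0)):
    """features maps a name to (X_train, X_dev); returns the best name and all results."""
    results = {
        name: fit_best_ridge(X_tr, y_train, X_dv, y_dev, alphas)
        for name, (X_tr, X_dv) in features.items()
    }
    best_name = max(results, key=lambda name: results[name][0])
    return best_name, results


# Synthetic stand-ins for cached embeddings (real ones would be 3,072-dimensional).
rng = np.random.default_rng(0)
n_train, n_dev, dim = 200, 100, 32
latent = rng.normal(size=(n_train + n_dev, dim))
y = latent @ rng.normal(size=dim) + 0.1 * rng.normal(size=n_train + n_dev)
single_bin = latent + 1.0 * rng.normal(size=latent.shape)      # noisier view
three_bin_mean = latent + 0.3 * rng.normal(size=latent.shape)  # smoother view

features = {
    "single_bin": (single_bin[:n_train], single_bin[n_train:]),
    "three_bin_mean": (three_bin_mean[:n_train], three_bin_mean[n_train:]),
}
y_train, y_dev = y[:n_train], y[n_train:]

best_name, results = select_embedding(features, y_train, y_dev)
print(best_name, results[best_name][:2])  # chosen feature set, (dev score, alpha)
```

Selecting both the regularization strength and the feature variant on the development set keeps the test set untouched until the final report.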
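The Dimensionality Reduction step can be sketched along the same lines. The snippet below is an illustration under stated assumptions, not the paper's pipeline: the arrays are random placeholders for training and task embeddings, and the batch size, perplexity, and the choice to run t-SNE on the PCA-projected task embeddings are hypothetical.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
train_embeddings = rng.normal(size=(2000, 64))  # placeholder for training-data embeddings
task_embeddings = rng.normal(size=(300, 64))    # placeholder for one benchmark task

# Fit the first 30 principal components in batches, so the full
# embedding matrix never has to be decomposed in one pass.
ipca = IncrementalPCA(n_components=30)
batch_size = 500  # each batch must contain at least n_components rows
for start in range(0, len(train_embeddings), batch_size):
    ipca.partial_fit(train_embeddings[start:start + batch_size])

# Project the task sequences into the PCA space learned from training data.
task_pcs = ipca.transform(task_embeddings)

# Reduce the 30 components to 2 dimensions for plotting.
tsne = TSNE(n_components=2, perplexity=30.0, init="pca", random_state=0)
task_2d = tsne.fit_transform(task_pcs)

print(task_pcs.shape, task_2d.shape)
```

Fitting PCA on the training embeddings and only transforming the task embeddings means the plot shows where benchmark sequences fall relative to the data the model was trained on.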
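The variance decomposition in step 4 rests on a property of nested least-squares fits: when the confound-only model is nested inside the model with confounds plus predictions, the RSS of the smaller model equals the RSS of the larger model plus the squared distance between their fitted values. The sketch below shows one common way to partition explained variance on that basis. It is a simplified illustration rather than the exact procedure of Dinga et al. or of the paper; the function names and the synthetic "confound" variable are hypothetical.

```python
import numpy as np


def ols_fitted(design, y):
    """Fitted values from ordinary least squares with an intercept."""
    X = np.column_stack([np.ones(len(y)), design])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ coef


def rss(fitted, y):
    """Residual sum of squares."""
    return float(np.sum((y - fitted) ** 2))


def decompose(y, predictions, confounds):
    """Split the variance of y into parts tied to predictions and confounds."""
    tss = rss(np.full_like(y, y.mean()), y)  # total sum of squares

    def r2(design):
        return 1.0 - rss(ols_fitted(design, y), y) / tss

    r2_pred = r2(predictions)
    r2_conf = r2(confounds)
    r2_full = r2(np.column_stack([predictions, confounds]))
    return {
        "unique_to_predictions": r2_full - r2_conf,
        "unique_to_confounds": r2_full - r2_pred,
        "shared": r2_pred + r2_conf - r2_full,
        "unexplained": 1.0 - r2_full,
    }


# Synthetic example: the target and the model predictions share one confound.
rng = np.random.default_rng(0)
n = 500
confound = rng.normal(size=n)  # hypothetical confound, e.g. a sequence-level covariate
signal = rng.normal(size=n)
y = 0.8 * confound + signal + 0.5 * rng.normal(size=n)
predictions = 0.5 * confound + signal + 0.5 * rng.normal(size=n)

parts = decompose(y, predictions, confound)
print(parts)

# Pythagorean check for the nested pair (confounds only) inside (predictions + confounds).
fit_conf = ols_fitted(confound, y)
fit_full = ols_fitted(np.column_stack([predictions, confound]), y)
gap = float(np.sum((fit_full - fit_conf) ** 2))
print(np.isclose(rss(fit_conf, y), rss(fit_full, y) + gap))
```

The four parts sum to 1 by construction. The two unique parts are non-negative because each compares a model with a larger model that contains it, while the shared part can be negative when one variable suppresses the other.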