CADD v1.7: using protein language models, regulatory CNNs and other nucleotide-level scores to improve genome-wide variant predictions

Max Schubach, Thorben Maass, Lusiné Nazaretyan, Sebastian Röner, Martin Kircher
DOI: https://doi.org/10.1093/nar/gkad989
IF: 14.9
2024-01-06
Nucleic Acids Research
Abstract: Machine Learning-based scoring and classification of genetic variants aids the assessment of clinical findings and is employed to prioritize variants in diverse genetic studies and analyses. Combined Annotation-Dependent Depletion (CADD) is one of the first methods for the genome-wide prioritization of variants across different molecular functions and has been continuously developed and improved since its original publication. Here, we present our most recent release, CADD v1.7. We explored and integrated new annotation features, among them state-of-the-art protein language model scores (Meta ESM-1v), regulatory variant effect predictions (from sequence-based convolutional neural networks) and sequence conservation scores (Zoonomia). We evaluated the new version on data sets derived from ClinVar, ExAC/gnomAD and 1000 Genomes variants. For coding effects, we tested CADD on 31 Deep Mutational Scanning (DMS) data sets from ProteinGym and, for regulatory effect prediction, we used saturation mutagenesis reporter assay data of promoter and enhancer sequences. The inclusion of new features further improved the overall performance of CADD. As with previous releases, all data sets, genome-wide CADD v1.7 scores, scripts for on-site scoring and an easy-to-use webserver are readily provided via https://cadd.bihealth.org/ or https://cadd.gs.washington.edu/ to the community.
biochemistry & molecular biology
What problem does this paper attempt to address?
The main objective of this paper is to introduce CADD v1.7, the latest release of the Combined Annotation-Dependent Depletion (CADD) tool, and to demonstrate its improvements in the assessment of genetic variants. Specifically, the research team explored and integrated new annotation features to enhance CADD's performance in genome-wide variant prediction. These new features include:

1. **Protein language model scores** (Meta ESM-1v): Used to assess the impact of variants in coding regions.
2. **Regulatory variant effect predictions** from sequence-based convolutional neural networks (CNNs): Used to predict the effects of variants in regulatory regions.
3. **Sequence conservation scores** (Zoonomia): Used to measure how strongly a sequence is conserved throughout evolution.

By integrating these features, the research team aims to further improve CADD's overall performance in predicting the functional impact of both coding and non-coding variants. They evaluated the new version on data sets derived from ClinVar, ExAC/gnomAD and 1000 Genomes variants. For coding effects, they also tested it on 31 Deep Mutational Scanning (DMS) data sets from ProteinGym; for regulatory effects, they used saturation mutagenesis reporter assay data of promoter and enhancer sequences. The authors report that including the new features further improved overall performance.

In summary, the paper aims to improve the accuracy of functional assessment of genetic variants, particularly those that may be disease-related but are difficult to evaluate directly. By introducing a protein language model, regulatory sequence models, and updated conservation information, CADD v1.7 is intended to better identify variants with potential clinical significance.
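
The idea of integrating several annotation features into one variant score can be illustrated with a toy example. The sketch below is not CADD's model: it assumes a simple logistic combination, and the feature names, weights, intercept, and variant values are all hypothetical and chosen only for illustration. It also assumes that a more negative protein language model score indicates a less tolerated substitution.

```python
import math

# Hypothetical per-variant annotations; names are illustrative only.
FEATURES = ("esm1v_score", "regulatory_cnn_delta", "conservation")

# Made-up weights and intercept, NOT the coefficients of any CADD release.
# The negative weight encodes the assumption that more negative protein
# language model scores mean a less tolerated substitution.
WEIGHTS = {"esm1v_score": -0.8, "regulatory_cnn_delta": 1.5, "conservation": 0.6}
INTERCEPT = -1.0


def combined_score(annotations: dict[str, float]) -> float:
    """Map a variant's annotations to a value in (0, 1) with a logistic model.

    Missing annotations default to 0.0, e.g. a non-coding variant has no
    protein language model score.
    """
    z = INTERCEPT + sum(WEIGHTS[name] * annotations.get(name, 0.0) for name in FEATURES)
    return 1.0 / (1.0 + math.exp(-z))


def prioritize(variants: dict[str, dict[str, float]]) -> list[str]:
    """Return variant identifiers sorted from highest to lowest combined score."""
    return sorted(variants, key=lambda v: combined_score(variants[v]), reverse=True)


# Toy variants with invented annotation values.
toy_variants = {
    "missense_conserved": {"esm1v_score": -6.0, "conservation": 2.5},
    "enhancer_variant": {"regulatory_cnn_delta": 0.9, "conservation": 0.4},
    "intergenic_neutral": {"conservation": -0.2},
}
ranking = prioritize(toy_variants)
```

The point of the sketch is that coding and non-coding variants end up on one shared scale even though each draws on different annotations, which is what allows a single genome-wide ranking.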