Geometric deep learning framework for genome assembly

Lovro Vrček, Xavier Bresson, Thomas Laurent, Martin Schmitz, Kenji Kawaguchi, Mile Šikić
DOI: https://doi.org/10.1101/2024.03.11.584353
2024-03-13
Abstract: The critical stage of every genome assembler is identifying paths in assembly graphs that correspond to the reconstructed genomic sequences. The existing algorithmic methods struggle with this, primarily due to repetitive regions causing complex graph tangles, leading to fragmented assemblies. Here, we introduce GNNome, a framework for path identification based on geometric deep learning that enables training models on assembly graphs without relying on existing assembly strategies. By leveraging symmetries inherent to the problem, GNNome reconstructs assemblies with similar or superior contiguity compared to the state-of-the-art tools across several species, sequenced with PacBio HiFi or Oxford Nanopore. With every new genome assembled telomere-to-telomere, the amount of reliable training data at our disposal increases. Combining the straightforward generation of abundant simulated data for diverse genomic structures with the AI approach makes the proposed framework a plausible cornerstone for future work on reconstructing complex genomes with different ploidy and aneuploidy degrees. To facilitate such developments, we make the framework and the best-performing model publicly available, provided as a tool that can directly be used to assemble new haploid genomes.
Bioinformatics
What problem does this paper attempt to address?
This paper addresses a key problem in genome assembly: identifying the paths in the assembly graph that correspond to the reconstructed genomic sequences. Existing algorithmic approaches struggle with the complex graph tangles caused by repetitive regions, which leads to fragmented assemblies. To address this, the researchers propose GNNome, a framework based on geometric deep learning that trains models to identify paths in the graph without relying on existing assembly strategies. By leveraging the symmetries inherent to the problem, GNNome achieves assembly contiguity similar or superior to that of state-of-the-art tools on PacBio HiFi and Oxford Nanopore (ONT) sequencing data from several species.
The main contributions of the paper are:
1. Contiguity: The model produces assemblies with contiguity comparable to or better than the best existing tools, even without implementing any algorithmic simplification steps.
2. Transferability: The framework is agnostic to the underlying sequencing technology and works on both PacBio HiFi and ONT assembly graphs without any modifications.
3. Development: Moving from traditional C/C++ programming to a Python/PyTorch implementation enables faster development cycles, and the code repository has been made publicly available to support the development of new tools.
Experimental results show that GNNome achieves similar or higher NG50 and NGA50 values compared to hifiasm on multiple haploid genomes, with significant improvements in the assembly of human chromosomes 21 and 22. Furthermore, although the model was trained only on HiFi data, it performs well on ONT data, indicating that it generalizes across sequencing technologies.
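For readers unfamiliar with the NG50 metric mentioned above, the following is a minimal sketch of how it is commonly computed; it is not taken from the GNNome code base, and the function name `ng50` and its parameters are chosen here for illustration only. NG50 is the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the reference (or estimated) genome size. NGA50 is usually defined the same way, but over aligned blocks obtained after breaking contigs at misassemblies, so it could be computed by passing those block lengths to the same function.

```python
def ng50(contig_lengths, genome_size):
    """Return NG50 for a set of contig lengths (illustrative helper).

    contig_lengths: iterable of contig lengths in base pairs.
    genome_size: reference or estimated genome size in base pairs.
    Returns 0 if the contigs cover less than half of the genome.
    """
    half = genome_size / 2
    running_total = 0
    # Walk through contigs from longest to shortest.
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        # The first contig that pushes the total past half the genome is NG50.
        if running_total >= half:
            return length
    return 0


if __name__ == "__main__":
    # Toy example with made-up lengths, not data from the paper.
    print(ng50([40, 30, 20, 10], genome_size=100))
```

Because NG50 is normalized by genome size rather than by total assembly length, it allows assemblies of different total length to be compared on the same genome.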
In summary, this paper addresses the problem of path identification in genome assembly by introducing geometric deep learning techniques. It improves assembly contiguity, as measured by NG50 and NGA50, and opens up possibilities for reconstructing complex genomes with different ploidy and aneuploidy degrees in the future.