Rawsamble: Overlapping and Assembling Raw Nanopore Signals using a Hash-based Seeding Mechanism

Can Firtina, Maximilian Mordig, Harun Mustafa, Sayan Goswami, Nika Mansouri Ghiasi, Stefano Mercogliano, Furkan Eris, Joël Lindegger, Andre Kahles, Onur Mutlu
2024-10-23
Abstract: Raw nanopore signal analysis is a common approach in genomics to provide fast and resource-efficient analysis without translating the signals to bases (i.e., without basecalling). However, existing solutions cannot interpret raw signals directly if a reference genome is unknown, because they lack accurate mechanisms to handle the increased noise in pairwise raw signal comparison. Our goal is to enable the direct analysis of raw signals without a reference genome. To this end, we propose Rawsamble, the first mechanism that can 1) identify regions of similarity between all raw signal pairs, known as all-vs-all overlapping, using a hash-based search mechanism and 2) use these overlaps to construct genomes from scratch, called de novo assembly. Our extensive evaluations across multiple genomes of varying sizes show that Rawsamble provides a significant speedup (on average by 16.36x and up to 41.59x) and reduces peak memory usage (on average by 11.73x and up to 41.99x) compared to a conventional genome assembly pipeline using the state-of-the-art tools for basecalling (Dorado's fastest mode) and overlapping (minimap2) on a CPU. We find that 36.57% of overlapping pairs generated by Rawsamble are identical to those generated by minimap2. Using the overlaps from Rawsamble, we construct the first de novo assemblies directly from raw signals without basecalling. We show that we can construct contiguous assembly segments (unitigs) up to 2.7 million bases in length (half the genome length of E. coli). We identify previously unexplored directions that can be enabled by finding overlaps and constructing de novo assemblies. Rawsamble is available at https://github.com/CMU-SAFARI/RawHash. We also provide the scripts to fully reproduce our results on our GitHub page.
Genomics
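To make the idea of hash-based all-vs-all overlapping concrete, the following is a minimal toy sketch, not Rawsamble's actual implementation. It assumes signals are already normalized, quantizes each value into a few coarse buckets to absorb noise, packs every window of `k` consecutive buckets into an integer seed key, and reports read pairs that share at least `min_shared` distinct seeds. The function names, the bucket range, and the parameters `levels`, `k`, and `min_shared` are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def quantize(signal, levels=4, lo=-2.0, hi=2.0):
    """Map each (assumed normalized) signal value to an integer bucket in [0, levels)."""
    width = (hi - lo) / levels
    buckets = []
    for x in signal:
        b = int((x - lo) / width)
        buckets.append(min(max(b, 0), levels - 1))  # clamp out-of-range values
    return buckets


def seeds(signal, k=3, levels=4):
    """Yield (seed_key, position) for every window of k consecutive buckets."""
    q = quantize(signal, levels)
    for i in range(len(q) - k + 1):
        key = 0
        for b in q[i:i + k]:
            key = key * levels + b  # pack the window into one integer key
        yield key, i


def overlap_candidates(reads, k=3, min_shared=2):
    """Return {(read_a, read_b): number of distinct shared seeds} for candidate pairs."""
    index = defaultdict(set)  # seed_key -> ids of reads containing that seed
    for rid, sig in reads.items():
        for key, _pos in seeds(sig, k):
            index[key].add(rid)

    counts = defaultdict(int)
    for rids in index.values():
        for a, b in combinations(sorted(rids), 2):
            counts[(a, b)] += 1
    return {pair: c for pair, c in counts.items() if c >= min_shared}


# Hypothetical toy signals: r1 and r2 share a stretch, r3 is unrelated.
reads = {
    "r1": [-1.5, -0.5, 0.5, 1.5, 0.5, -0.5],
    "r2": [0.5, 1.5, 0.5, -0.5, -1.5],
    "r3": [-1.5, -1.5, -1.5, -1.5],
}
candidates = overlap_candidates(reads)
```

In this sketch, coarse quantization is what lets two noisy copies of the same signal region produce identical seeds. A practical overlapper would also need to use seed positions to check that shared seeds are colinear before reporting an overlap, which this toy version omits.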