Self-Supervised Representation Learning for Basecalling Nanopore Sequencing Data

Carlos Vintimilla, Sangheum Hwang
DOI: https://doi.org/10.1109/access.2024.3440882
IF: 3.9
2024-08-18
IEEE Access
Abstract: Basecalling is a complex task that involves translating noisy raw electrical signals into their corresponding DNA sequences. Several deep learning architectures have been successful in improving basecalling accuracy, but all of them rely on a supervised training scheme and require large annotated datasets to achieve high accuracy. However, obtaining labeled data for some species can be extremely challenging, making it difficult to generate a large amount of ground truth labels for training basecalling models. Self-supervised representation learning (SSL) has been shown to alleviate the need for large annotated datasets and, in some cases, enhance model performance. In this work, we investigate the effectiveness of self-supervised representation learning frameworks on the basecalling task. We consider SSL basecallers based on two well-known SSL frameworks, SimCLR and wav2vec2.0, and show that the self-supervised trained basecaller outperforms its supervised counterparts in both low and high data regimes, showing up to a 3% increase in performance when trained on only 1% of the total labeled data. Our results suggest that learning strong representations from unlabeled data can improve basecalling accuracy compared to state-of-the-art models across different architectures. Furthermore, we provide insights into representation learning for the basecalling task and discuss the role of continuous representations during SSL pretraining. Our code is publicly available at https://github.com/carlosvint/SSLBasecalling.
computer science, information systems; telecommunications; engineering, electrical & electronic