SACall: A Neural Network Basecaller for Oxford Nanopore Sequencing Data Based on Self-Attention Mechanism
Neng Huang,Fan Nie,Peng Ni,Feng Luo,Jianxin Wang
DOI: https://doi.org/10.1109/TCBB.2020.3039244
2022-01-01
IEEE/ACM Transactions on Computational Biology and Bioinformatics
Abstract: The highly portable Oxford Nanopore sequencer, which produces long reads in real time at low cost, has enabled many breakthroughs in genomics studies. However, a major limitation of nanopore sequencing is its high error rate when deciphering DNA sequences from noisy and complex raw data. In this paper, we develop an end-to-end basecaller, SACall, based on convolution layers, transformer self-attention layers, and a CTC decoder. In SACall, the convolution layers downsample the signals and capture local patterns. To capture the contextual relevance of the signals, self-attention layers compute the similarity between the signals at any two positions in the raw signal sequence. Finally, the CTC decoder generates the DNA sequence using a beam search algorithm. We use a benchmark consisting of nine isolated genomes to test the quality of different basecallers, including SACall, Albacore, and Guppy. The performance of the basecallers is evaluated in terms of read accuracy, assembly quality, and consensus accuracy. For most of the genomes in the benchmark, the reads basecalled by SACall have fewer errors than the reads basecalled by the other basecallers. When the basecalled reads of each genome are assembled, the assembly from SACall-basecalled reads achieves a higher assembly identity. In addition, the polished assembly from reads basecalled by SACall contains fewer errors than those from reads basecalled by Albacore and Guppy. Overall, SACall outperforms the official Nanopore basecallers Albacore and Guppy on the benchmark. Moreover, SACall is open source and freely available, which allows researchers to train their own basecalling models on specific data and to basecall Nanopore reads.
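To make the two central ideas of the abstract concrete, the following is a minimal, dependency-free sketch and not SACall's actual implementation. It assumes identity query/key/value projections in place of learned weights, a hypothetical label set `-ACGT` with index 0 as the CTC blank, and it substitutes greedy best-path decoding for the beam search that the paper uses. All function names are illustrative.

```python
import math

ALPHABET = "-ACGT"  # assumed label set; index 0 is the CTC blank
BLANK = 0


def attention_weights(x):
    """Softmax-normalised scaled dot-product similarity between every pair of positions.

    x is a list of feature vectors, one per (downsampled) signal position.
    Assumption: queries and keys are the inputs themselves (no learned projections).
    """
    d = len(x[0])
    weights = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def self_attention(x):
    """Replace each position by the attention-weighted average of all positions."""
    d = len(x[0])
    return [
        [sum(w * v[j] for w, v in zip(row, x)) for j in range(d)]
        for row in attention_weights(x)
    ]


def ctc_greedy_decode(frame_probs):
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks.

    This is a simpler stand-in for beam search, shown only to illustrate
    how per-frame label probabilities collapse into a base sequence.
    """
    bases = []
    prev = BLANK
    for probs in frame_probs:
        idx = max(range(len(probs)), key=probs.__getitem__)
        if idx != prev and idx != BLANK:
            bases.append(ALPHABET[idx])
        prev = idx
    return "".join(bases)


if __name__ == "__main__":
    # One-hot frames spelling the path "AA-ACC--G".
    path = [1, 1, 0, 1, 2, 2, 0, 0, 3]
    frames = [[1.0 if j == i else 0.0 for j in range(5)] for i in path]
    print(ctc_greedy_decode(frames))  # prints "AACG"
```

The blank label is what lets the decoder emit the same base twice in a row: the two `A` runs separated by a blank yield `AA`, whereas the repeated `C` frames merge into a single `C`.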