MSVF: Multi-task Structure Variation Filter with Transfer Learning in High-throughput Sequencing
Weiming Xiang, Yingbo Cui, Yaning Yang, Ang Zhang, Boya Ji, Shaoliang Peng
DOI: https://doi.org/10.1109/bibm55620.2022.9995307
2022-01-01
Abstract: Single-molecule real-time sequencing technologies, such as PacBio and Nanopore, offer higher throughput and produce longer reads, enabling the discovery of structure variations that cannot be detected with second-generation sequencing data. However, unlike second-generation sequencing data, PacBio data lacks paired-end sequencing information, so traditional structure variation filters fail to process the new data. To solve this problem, this paper proposes MSVF, a universal multi-task structure variation filtering model. MSVF adopts the CIGAR string defined in the SAM format. CIGAR is not tied to any particular sequencing technology or alignment algorithm, so MSVF is suitable for both second-generation and third-generation sequencing data. Moreover, the CIGAR string preserves the complete sequence alignment information, which makes MSVF a highly precise model. In addition, MSVF uses deep learning methods, which allows it to support more structure variation types, including deletion and insertion. We trained and tested the models on open-access NCBI datasets. The experiments show that ShuffleNet, MobileNet, and ResNet transfer learning models achieve better classification results on the SV task. The average AUC exceeds 90%, and the AUC of each category exceeds 87%. The accuracy and AUC for deletion and insertion structure variations were above 90% and above 92%, respectively. The code and data are available at https://github.com/weimingxiang/MSVF.
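
To illustrate why the CIGAR string carries enough alignment detail to flag candidate insertions and deletions, the following is a minimal Python sketch that parses a CIGAR string using the operation codes from the SAM specification and lists long insertion and deletion operations with their reference positions. It is not the MSVF encoding or pipeline: the function names `parse_cigar` and `indel_events` and the `min_len=50` threshold are assumptions introduced here for illustration only.

```python
import re

# One CIGAR element: a run length followed by a SAM operation code.
_CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, op) pairs."""
    if cigar == "*":  # SAM uses "*" when no alignment is available
        return []
    pairs = _CIGAR_RE.findall(cigar)
    # Reject strings containing characters the pattern did not consume.
    if "".join(n + op for n, op in pairs) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in pairs]


def indel_events(cigar, ref_start, min_len=50):
    """Return (type, ref_pos, length) for I/D operations of at least min_len bases.

    min_len is a hypothetical threshold chosen for this sketch.
    """
    events = []
    pos = ref_start
    for length, op in parse_cigar(cigar):
        if op in "ID" and length >= min_len:
            events.append(("INS" if op == "I" else "DEL", pos, length))
        if op in "MDN=X":  # only these operations consume reference bases
            pos += length
    return events


if __name__ == "__main__":
    print(indel_events("100M60D20M5I30M70I10S", ref_start=1000))
```

Because the parser relies only on the CIGAR field, the same code applies to alignments from short-read and long-read data alike, which is the property the abstract attributes to CIGAR-based filtering.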