FilterLAP: Filtering False-Positive Mutation Calls via a Label Propagation Framework
Xuwen Wang, Xiaoyan Zhu, Xiao Xiao, Shenjie Wang, Xuanping Zhang, Xin Lai, Jiayin Wang
DOI: https://doi.org/10.1109/BIBM47256.2019.8983354
2019-01-01
Abstract: Benefiting from recent advances in genomic sequencing, detecting genomic mutations has become routine in the precise diagnosis and treatment of cancers. In clinical practice, many factors, such as tumor purity and clonal structure, interfere with mutation calling. Computational pipelines therefore tend to report candidate calls sensitively, and a filter is then applied to remove the false-positive calls. Existing filters rely on whole-genome or whole-exome sequencing data, which provide sufficient samples for training. Gene-panel sequencing is more popular in clinical practice, however, and no practical filter exists for such limited training samples. In light of this, we develop FilterLAP, a semi-supervised learning filter for gene-panel sequencing data, implemented via a label propagation framework. Given a few labeled samples and a set of unlabeled ones, the basic idea is to predict the labels of the unlabeled nodes from those of the labeled nodes: a complete graph model is built from the relationships between samples, and transductive inference is combined with a label propagation algorithm. For each node in the network, labels are propagated to adjacent nodes according to similarity, so that similar nodes tend toward similar probability distributions and can be assigned to the same class. We perform multiple sets of experiments on gene-panel sequencing data from the Illumina platform. FilterLAP performs well on both SNV and INDEL filtering, with AUCs of 0.90-0.97 and average accuracies above 90% on overall mutation calls. Compared with GATK hard filters, FilterLAP presents a 5% improvement in accuracy. These results demonstrate that the proposed method can better reduce false-positive mutation calls on gene-panel sequencing data. In addition, it is stable and efficient, and can serve as a practical tool for filtering mutation calls from gene-panel sequencing data.
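The label propagation idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it is a generic label propagation routine on a complete graph, and the function name `label_propagation`, the Gaussian similarity, the `sigma` parameter, the two toy features, and the class encoding (0 = false positive, 1 = true mutation) are all assumptions made for illustration.

```python
import numpy as np

def label_propagation(X, y, sigma=1.0, max_iter=1000, tol=1e-6):
    """Generic label propagation sketch (hypothetical helper, not FilterLAP itself).

    X: (n, d) feature matrix, one row per candidate mutation call.
    y: (n,) integer labels, -1 for unlabeled, otherwise a class in 0..k-1.
    Returns an (n, k) matrix of class probabilities.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    labeled = y >= 0
    n = X.shape[0]
    n_classes = int(y[labeled].max()) + 1

    # Complete graph: every pair of samples is linked by a Gaussian similarity.
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)

    # Row-normalize so each row is a transition distribution over neighbors.
    # (Assumes every node has at least one neighbor with nonzero similarity.)
    T = W / W.sum(axis=1, keepdims=True)

    # Unlabeled nodes start uniform; labeled nodes are clamped to one-hot rows.
    F = np.full((n, n_classes), 1.0 / n_classes)
    clamp = np.eye(n_classes)[y[labeled]]
    F[labeled] = clamp

    for _ in range(max_iter):
        F_new = T @ F            # each node averages its neighbors' distributions
        F_new[labeled] = clamp   # known labels are reset after every step
        converged = np.abs(F_new - F).max() < tol
        F = F_new
        if converged:
            break
    return F

# Toy data: two well-separated clusters, one labeled call in each.
# The two columns stand in for hypothetical per-call features.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [3.0, 3.0], [3.1, 2.9], [2.8, 3.2]])
y = np.array([0, -1, -1, 1, -1, -1])  # assumed: 0 = false positive, 1 = true mutation

F = label_propagation(X, y)
pred = F.argmax(axis=1)
print(pred.tolist())  # [0, 0, 0, 1, 1, 1]
```

In this sketch the unlabeled calls inherit the label of the labeled call they most resemble, which mirrors the statement that similar nodes tend toward similar probability distributions. The choice of similarity function and features would need to follow the paper itself.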