A powerful random forest-based pipeline for accurately identifying the structural variants on panel sequencing data.
Guiwu Zhuang,Yu-Long Chen,Hongzhen Tang,Wei Wang,Peng Wang,Edmond Gao,Shencun Fang
DOI: https://doi.org/10.1200/jco.2023.41.16_suppl.e18898
IF: 45.3
2023-06-01
Journal of Clinical Oncology
Abstract: e18898 Background: Numerous studies have reported that structural variants (SVs) can provide ideal targets for targeted cancer therapy. However, current algorithms perform differently across SV types and sizes, and whole-genome/exome sequencing for SV detection is costly, while only a portion of SVs can be used to guide the clinical treatment of cancer patients. Therefore, a powerful, highly accurate, and cost-effective pipeline for detecting SVs in clinical applications is urgently required. Methods: Four common SV detection tools (Delly, Lumpy, SvABA, and Manta) were compared on simulated sequencing data, and the best-performing tool was selected for detecting SVs in panel sequencing data. Integrative Genomics Viewer (IGV) was then used to annotate the detected SVs as true positive (TP) or false positive (FP). The annotated TP and FP SVs were used as input to a random forest classifier (RFC) to train a model that predicts whether an SV is a true or false positive. Two independent testing cohorts and standards were used to validate the constructed pipeline. In addition, IHC/FISH experiments were performed to further validate the predicted TP SVs. Results: In the simulated data, Delly had the highest SV identification sensitivity. Delly detected a total of 1,303 SVs in 384 tumor samples. Based on the annotated TP and FP SVs, an RFC model was constructed in the training cohort, which predicted 334 TP and 560 FP SVs. The accuracy rates (AR) were 99.85% for TP SVs and 100% for FP SVs. The TP and FP SVs predicted by the constructed pipeline also showed high accuracy rates in cohort 1 (98.96% for TP and 99.24% for FP SVs), cohort 2 (98.48% for TP and 100% for FP SVs), and the standards (mean accuracy rate of 90%).
Finally, FISH/IHC experiments on the TP SVs predicted by the constructed pipeline further verified its robustness (mean accuracy rate of 95.83%). Conclusions: The constructed random forest-based pipeline can robustly and accurately identify SVs in panel sequencing data, which will be helpful for clinical decision-making.
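The abstract reports separate accuracy rates for predicted TP and predicted FP SVs but does not define how they were computed. As a minimal sketch, assuming the accuracy rate for a class is the fraction of calls predicted as that class whose IGV annotation agrees, the calculation could look like the following. The function name `per_class_accuracy` and the toy labels are hypothetical and are not taken from the study.

```python
from collections import Counter


def per_class_accuracy(annotated, predicted):
    """Per-class accuracy rate (assumed definition, not stated in the abstract).

    For each predicted label, return the fraction of calls given that label
    whose reference annotation carries the same label.
    """
    if len(annotated) != len(predicted):
        raise ValueError("annotated and predicted must have the same length")
    # How many calls the classifier assigned to each class.
    predicted_counts = Counter(predicted)
    # How many of those calls match the reference annotation.
    agreeing = Counter(p for a, p in zip(annotated, predicted) if a == p)
    return {label: agreeing[label] / n for label, n in predicted_counts.items()}


# Toy labels for illustration only; these are not data from the study.
annotated = ["TP", "TP", "FP", "FP", "FP", "TP", "FP", "TP"]
predicted = ["TP", "TP", "FP", "FP", "TP", "TP", "FP", "TP"]

rates = per_class_accuracy(annotated, predicted)
for label in sorted(rates):
    print(f"{label}: {rates[label]:.2%}")
```

Reporting the two classes separately, as the abstract does, keeps a large FP class from masking errors among the smaller set of predicted TP calls that would go on to guide treatment.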
oncology