Ultra-deep sequencing with unique molecular identifier (UMI) for detection of ctDNA by fragment profiling using machine learning.
Hu Yukai, Hu Nan, Liao Rui, Wang Bing, Duan Xiaohong, Yang Chunyan, Wang Lifen, Qianhui Wan, Zhihua Pei, Zhou Qiming, Dongliang Wang
DOI: https://doi.org/10.1200/jco.2022.40.16_suppl.e15508
IF: 45.3
2022-06-01
Journal of Clinical Oncology
Abstract: e15508 Background: Liquid biopsy is well known for its potential in cancer detection, non-invasive tumor genotyping, and disease surveillance. However, in most tumors ctDNA levels are low at early stages and during postoperative monitoring for progression, which complicates ctDNA detection and analysis. Methods: Paired-end sequencing with fixed-sequence UMIs and a 1123-gene panel was performed on blood samples from 200 healthy donors and on paired tissue and plasma samples from 1000 colorectal cancer (CRC) patients. First, GATK mutation detection was performed on tissue and plasma samples from 700 CRC patients to obtain reliable positive mutation sites present in the dbSNP138 database; in parallel, SAMtools mpileup analysis was performed on the 200 healthy samples to obtain negative sites with allele frequencies below 1%. Then, six features were extracted from the variant-supporting reads at both types of loci: IS (insert size), VBQ (variant base quality), MRBQ (mean read base quality), PIR (position in read), MQ (mapping quality), and R1&R2 (read 1 or read 2 of the paired-end sequencing). These data formed the training set; SNV background noise was then modeled with a support vector machine (SVM) and optimized through five-fold cross-validation. Last, plasma samples from the remaining 300 CRC patients were used as a validation set to verify the accuracy of the model. Results: In the validation set of 300 CRC patients, the effectiveness of the model in plasma was 95% for mutations with a variant allele frequency (VAF) above 0.2% and 80% for mutations with a VAF above 0.02%. Conclusions: We established a specific method based on UMI paired-end capture with machine learning modeling, which was significantly superior to any available method in eliminating background noise errors and filtering false-positive low-frequency mutations.
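As an illustration of the feature-extraction step in Methods, the following minimal sketch shows how the six per-locus features could be summarized from variant-supporting reads before being passed to a classifier. It is not the authors' implementation. The `SupportingRead` record, the `locus_features` and `variant_allele_fraction` helpers, the use of per-locus means, the normalization of PIR by read length, and the encoding of R1&R2 as the fraction of supporting reads that are read 1 are all assumptions made for this example; the abstract does not state how the features were aggregated or encoded.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class SupportingRead:
    """Hypothetical record for one read that supports a candidate variant."""
    insert_size: int               # IS: fragment insert size in bp
    variant_base_quality: int      # VBQ: Phred quality of the variant base
    mean_read_base_quality: float  # MRBQ: mean Phred quality over the read
    position_in_read: int          # PIR: 0-based offset of the variant base
    read_length: int               # used only to normalize PIR (assumption)
    mapping_quality: int           # MQ: aligner-reported mapping quality
    is_read1: bool                 # R1&R2: True for read 1, False for read 2


def locus_features(reads):
    """Return the six features [IS, VBQ, MRBQ, PIR, MQ, R1 fraction] for one locus.

    Each feature is the mean over the supporting reads (an assumed aggregation).
    """
    if not reads:
        raise ValueError("at least one supporting read is required")
    return [
        mean(r.insert_size for r in reads),
        mean(r.variant_base_quality for r in reads),
        mean(r.mean_read_base_quality for r in reads),
        mean(r.position_in_read / r.read_length for r in reads),
        mean(r.mapping_quality for r in reads),
        mean(1.0 if r.is_read1 else 0.0 for r in reads),
    ]


def variant_allele_fraction(alt_count, depth):
    """VAF = variant-supporting molecules / total molecules at the locus."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    return alt_count / depth


# Toy example: two supporting reads at a locus covered by 1000 molecules.
example_reads = [
    SupportingRead(166, 38, 36.5, 30, 150, 60, True),
    SupportingRead(170, 40, 37.5, 75, 150, 60, False),
]
example_features = locus_features(example_reads)
example_vaf = variant_allele_fraction(len(example_reads), 1000)
```

In this toy case, two supporting reads at a depth of 1000 give a VAF of 0.2%, the upper threshold reported in Results. Feature vectors of this kind, labeled positive (tumor-confirmed sites) or negative (low-frequency sites in healthy donors), would be the input to the SVM described in Methods.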
oncology