Automatic Extraction of Genomic Variants for Locating Precision Oncology Clinical Trials

Hui Chen, Huyan Xiaoyuan, Danqing Hu, Huilong Duan, Xudong Lu
DOI: https://doi.org/10.1007/978-981-19-9865-2_8
2023-01-01
Abstract: The number of precision oncology clinical trials has increased dramatically in the era of precision medicine, and locating these trials can help researchers, physicians, and patients learn about the latest cancer treatment options or participate in such trials. However, unstructured and non-standardized genomic variants embedded in narrative clinical trial documents make it difficult to search for precision oncology clinical trials. This study aims to extract and standardize genomic variants automatically for locating precision oncology clinical trials. Patients with genomic variants, including individual variants and category variants that represent a class of individual variants, are included or excluded in accordance with the eligibility criteria of precision oncology clinical trials. To extract both individual variants and category variants, we designed five classes of entities (variation, gene, exon, qualifier, and negation), four types of relations for composing variants, and four types of relations for representing semantics between variants. Further, we developed an information extraction system with two modules: (1) a cascade extraction module based on the pre-trained model BERT, including sentence classification (SC), named entity recognition (NER), and relation classification (RC), and (2) a variant normalization module based on rules and dictionaries, including entity normalization (EN) and post-processing (PP). The system was developed and evaluated on the eligibility criteria texts of 400 non-small cell lung cancer clinical trials downloaded from ClinicalTrials.gov. The experimental results showed an end-to-end extraction F1 score of 0.84. The system was further evaluated on an additional 50 multi-cancer clinical trial texts and achieved an F1 score of 0.71, which demonstrated the generalizability of our system.
In conclusion, we developed an information extraction system for genomic variants in clinical trial texts that is capable of extracting both individual variants and category variants, and the experimental results demonstrate that the extracted variants have significant potential for locating precision oncology clinical trials.
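As a rough illustration of the rule- and dictionary-based entity normalization (EN) step described above, the following minimal sketch maps a gene mention and a variation mention to one standardized string. It is not the authors' implementation: the five entity classes come from the abstract, while the alias dictionary `GENE_ALIASES`, the pattern `PROTEIN_CHANGE`, and all function names are hypothetical and chosen only for this example.

```python
import re
from dataclasses import dataclass
from enum import Enum


class EntityType(Enum):
    # The five entity classes named in the abstract.
    VARIATION = "variation"
    GENE = "gene"
    EXON = "exon"
    QUALIFIER = "qualifier"
    NEGATION = "negation"


@dataclass(frozen=True)
class Entity:
    text: str
    type: EntityType


# Toy dictionary (assumption): maps a few gene aliases to a canonical symbol.
GENE_ALIASES = {
    "EGFR": "EGFR", "HER1": "EGFR", "ERBB1": "EGFR",
    "HER2": "ERBB2", "ERBB2": "ERBB2",
}

# Standard three-letter to one-letter amino acid codes.
AA3_TO_AA1 = {
    "ala": "A", "arg": "R", "asn": "N", "asp": "D", "cys": "C",
    "gln": "Q", "glu": "E", "gly": "G", "his": "H", "ile": "I",
    "leu": "L", "lys": "K", "met": "M", "phe": "F", "pro": "P",
    "ser": "S", "thr": "T", "trp": "W", "tyr": "Y", "val": "V",
}

# Toy rule (assumption): matches substitutions such as "p.Leu858Arg".
PROTEIN_CHANGE = re.compile(r"^(?:p\.)?([A-Za-z]{3})(\d+)([A-Za-z]{3})$")


def normalize_gene(entity: Entity) -> str:
    """Look the mention up in the alias dictionary; fall back to upper case."""
    key = entity.text.strip().upper()
    return GENE_ALIASES.get(key, key)


def normalize_variation(entity: Entity) -> str:
    """Rewrite three-letter substitutions in one-letter form; else keep the text."""
    text = entity.text.strip()
    match = PROTEIN_CHANGE.match(text)
    if match:
        ref, pos, alt = match.groups()
        ref1 = AA3_TO_AA1.get(ref.lower())
        alt1 = AA3_TO_AA1.get(alt.lower())
        if ref1 and alt1:
            return f"{ref1}{pos}{alt1}"
    return text


def normalize_variant(gene: Entity, variation: Entity) -> str:
    """Compose a standardized individual variant from a gene and a variation."""
    return f"{normalize_gene(gene)} {normalize_variation(variation)}"


if __name__ == "__main__":
    gene = Entity("HER1", EntityType.GENE)
    variation = Entity("p.Leu858Arg", EntityType.VARIATION)
    print(normalize_variant(gene, variation))
```

In this sketch, mentions written differently in different trials (for example a gene alias or a three-letter amino acid substitution) collapse to the same key, which is what makes extracted variants searchable. Category variants, qualifiers, negation, and the relation types are outside its scope.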