Use of natural language processing to extract and classify papillary thyroid cancer features from surgical pathology reports

Ricardo Loor-Torres, Yuqi Wu, Esteban Cabezas, Mariana Borras, David Toro-Tobon, Mayra Duran, Misk Al Zahidy, Maria Mateo Chavez, Cristian Soto Jacome, Jungwei W. Fan, Naykky M. Singh Ospina, Yonghui Wu, Juan P. Brito
2024-05-23
Abstract: Background: We aim to use natural language processing (NLP) to automate the extraction and classification of thyroid cancer risk factors from pathology reports. Methods: We analyzed 1,410 surgical pathology reports from adult papillary thyroid cancer patients at Mayo Clinic, Rochester, MN, from 2010 to 2019. Structured reports followed standardized formats, while non-structured reports were narrative. Both types were used to create a consensus-based ground truth dictionary and were categorized into modified recurrence risk levels. We then developed ThyroPath, a rule-based NLP pipeline, to extract thyroid cancer features and classify them into risk categories. Training involved 225 reports (150 structured, 75 unstructured), and testing used 170 reports (120 structured, 50 unstructured). The pipeline's performance was assessed using both strict and lenient criteria for accuracy, precision, recall, and F1 score. Results: In extraction tasks, ThyroPath achieved overall strict F1 scores of 93% for structured reports and 90% for unstructured reports, covering 18 thyroid cancer pathology features. In classification tasks, ThyroPath-extracted information yielded an overall accuracy of 93% in categorizing reports by their guideline-based risk of recurrence: 76.9% for high-risk, 86.8% for intermediate-risk, and 100% for both low-risk and very-low-risk cases. With human-extracted pathology information, however, ThyroPath achieved 100% accuracy across all thyroid cancer risk categories. Conclusions: ThyroPath shows promise in automating the extraction and recurrence risk classification of thyroid pathology reports at large scale. It offers an alternative to laborious manual reviews and could advance virtual registries. However, it requires further validation before implementation.
Computation and Language
What problem does this paper attempt to address?
This paper addresses the problem of automatically extracting and classifying papillary thyroid carcinoma (PTC) features from surgical pathology reports using natural language processing (NLP). The research team analyzed 1,410 surgical pathology reports, both structured and unstructured, from adult patients at Mayo Clinic between 2010 and 2019. They developed a rule-based NLP pipeline called ThyroPath to extract 18 pathological features of thyroid cancer and to classify recurrence risk according to guidelines. In the extraction task, ThyroPath achieved strict F1 scores of 93% for structured reports and 90% for unstructured reports. In the classification task, it achieved an overall accuracy of 93% in assigning recurrence risk categories from its own extracted information, and 100% accuracy across all risk categories when given manually extracted pathology information. ThyroPath is intended to reduce the labor of manually reviewing pathology reports and to support the development of virtual registries. The paper points out that the focus of thyroid cancer research has shifted from survival rates to predicting recurrence risk, which makes automated processing of pathology report data critical for risk stratification. Most pathology reports, however, are only weakly structured, which limits comprehensive information extraction and classification. ThyroPath demonstrates the potential of NLP for large-scale automated extraction and classification of thyroid pathology reports, which could improve research efficiency and help standardize clinical decision-making. Before practical use, it requires external validation and fine-tuning on data from other institutions to improve its performance and applicability. Future versions of ThyroPath are also planned to cover other types of thyroid cancer.
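To make the strict versus lenient evaluation concrete, the following is a minimal sketch of how span-level precision, recall, and F1 could be computed under both criteria. The abstract does not define the two criteria, so this sketch assumes a common convention: a strict match requires the same feature label and identical character offsets, while a lenient match requires the same label and any overlap. The `Span` class, the `prf1` function, the feature names, and the offsets are all hypothetical and are not taken from ThyroPath.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One extracted feature mention (hypothetical representation)."""
    feature: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset


def _matches(pred: Span, gold: Span, strict: bool) -> bool:
    """Assumed criteria: strict = exact offsets, lenient = any overlap."""
    if pred.feature != gold.feature:
        return False
    if strict:
        return pred.start == gold.start and pred.end == gold.end
    return pred.start < gold.end and gold.start < pred.end


def prf1(predicted, gold, strict=True):
    """Return (precision, recall, F1); each gold span is matched at most once."""
    unmatched_gold = list(gold)
    true_pos = 0
    for p in predicted:
        for g in unmatched_gold:
            if _matches(p, g, strict):
                unmatched_gold.remove(g)  # safe: we break right after removing
                true_pos += 1
                break
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up annotations for a single report (illustrative only).
gold = [Span("tumor_size", 10, 16), Span("margin_status", 30, 38)]
predicted = [
    Span("tumor_size", 10, 16),      # exact match
    Span("margin_status", 32, 38),   # overlaps gold, offsets differ
    Span("lymph_node", 50, 55),      # spurious extraction
]

strict_scores = prf1(predicted, gold, strict=True)
lenient_scores = prf1(predicted, gold, strict=False)
print("strict:  P=%.2f R=%.2f F1=%.2f" % strict_scores)   # P=0.33 R=0.50 F1=0.40
print("lenient: P=%.2f R=%.2f F1=%.2f" % lenient_scores)  # P=0.67 R=1.00 F1=0.80
```

In this toy case the partially overlapping `margin_status` span counts as an error under the strict criterion but as a hit under the lenient one, which is why the two F1 scores differ.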