Classification of cancer pathology reports: a large-scale comparative study

Stefano Martina, Leonardo Ventura, Paolo Frasconi
DOI: https://doi.org/10.1109/JBHI.2020.3005016
2020-06-30
Abstract: We report on the application of state-of-the-art deep learning techniques to the automatic and interpretable assignment of ICD-O3 topography and morphology codes to free-text cancer reports. We present results on a large dataset (more than 80 000 labeled and 1 500 000 unlabeled anonymized reports written in Italian and collected from hospitals in Tuscany over more than a decade) and with a large number of classes (134 morphological classes and 61 topographical classes). We compare alternative architectures in terms of prediction accuracy and interpretability and show that our best model achieves a multiclass accuracy of 90.3% on topography site assignment and 84.8% on morphology type assignment. We found that in this context hierarchical models are not better than flat models and that an element-wise maximum aggregator is slightly better than attentive models on site classification. Moreover, the maximum aggregator offers a way to interpret the classification process.
Machine Learning, Computation and Language, Image and Video Processing
What problem does this paper attempt to address?
The paper addresses the automatic and interpretable assignment of ICD-O3 (International Classification of Diseases for Oncology, 3rd Edition) topography (anatomical site) and morphology codes to free-text cancer pathology reports. The authors apply state-of-the-art deep learning techniques to this task and evaluate alternative model architectures on a large dataset in terms of prediction accuracy and interpretability. The reports are written in Italian and were collected from hospitals in Tuscany over more than a decade; the dataset contains more than 80,000 labeled and 1,500,000 unlabeled reports.

The main objectives of the paper include:

1. **Improve the level of automation**: Reduce the time and resources spent on manual review of pathology reports through machine learning techniques, thereby speeding up the definition of cancer cases and supporting public health decision-making.
2. **Enhance classification accuracy**: Compare the performance of different deep learning models (such as GRU, GRU with attention mechanism, BERT, and CNN) on multiclass classification tasks, specifically the classification of topography sites (61 classes) and morphology types (134 classes).
3. **Strengthen the interpretability of the model**: Explore how model structure design (for example, using an element-wise maximum aggregator instead of an attention mechanism) can make predictions easier to interpret, so that human experts can review the automatic classification results.

Through these studies, the paper aims to provide an effective tool that can automatically extract key information from free-text pathology reports to support cancer registration and public health surveillance.
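To make the idea of an element-wise maximum aggregator more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The function names `max_aggregate` and `token_relevance`, the toy tokens, and the feature values are all hypothetical. Counting how many pooled features each token "wins" is only one plausible way to read such an aggregator as an explanation, assumed here for illustration.

```python
def max_aggregate(token_vectors):
    """Element-wise maximum over a sequence of equal-length token vectors.

    Returns (pooled, winners): pooled[j] is the largest value of feature j
    across all tokens, and winners[j] is the index of the first token that
    attains it.
    """
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    pooled, winners = [], []
    for j in range(dim):
        column = [vec[j] for vec in token_vectors]
        best = max(range(len(column)), key=column.__getitem__)
        pooled.append(column[best])
        winners.append(best)
    return pooled, winners


def token_relevance(tokens, winners):
    """Count, for each token position, how many pooled features it supplied.

    This counting rule is an illustrative assumption, not the paper's method.
    """
    counts = [0] * len(tokens)
    for idx in winners:
        counts[idx] += 1
    return list(zip(tokens, counts))


# Toy input: four tokens, each with a made-up 3-dimensional feature vector.
tokens = ["carcinoma", "duttale", "della", "mammella"]
vectors = [
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.1],
    [0.0, 0.1, 0.0],
    [0.2, 0.4, 0.7],
]

pooled, winners = max_aggregate(vectors)
relevance = token_relevance(tokens, winners)
print(pooled)     # [0.9, 0.8, 0.7]
print(relevance)  # [('carcinoma', 1), ('duttale', 1), ('della', 0), ('mammella', 1)]
```

Because every pooled feature is copied from exactly one token, the document representation can be traced back to specific words, which an attention-weighted average does not allow as directly. In this toy example the function word "della" supplies no feature at all.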