Abstract:

Background: Cancer poses a serious threat to human health, the volume of data in the cancer field is growing rapidly, and interdisciplinary and collaborative research makes fine-grained classification increasingly important. Existing classifiers operate at the journal level; their low resolution fails to meet advanced search demands, and a single label cannot adequately characterize literature originating from interdisciplinary research. There is thus a need for a higher-resolution multilabel classifier to support literature retrieval for cancer research and reduce the burden of screening papers for clinical relevance.

Objective: The primary objective of this research was to address the low resolution of cancer literature classification caused by the ambiguity of the existing journal-level classifier, in order to support the retrieval of highly relevant evidence for clinical consideration and comprehensive results for literature searches.

Methods: We trained a scalable multilabel classifier that classifies cancer research literature directly at the publication level and assigns content-derived labels, based on a "Bidirectional Encoder Representations from Transformers (BERT) + X" model, and identified the best option for X. First, a corpus of 70,599 cancer publications retrieved from the Dimensions database was divided into a training set and a testing set in a ratio of 7:3. Second, using the International Cancer Research Partnership classification of cancer types as the label terminology, we compared the performance of classifiers developed using BERT and 5 classical deep learning models, including the text recurrent neural network (TextRNN) and FastText, and analyzed their metrics.

Results: After comparing the combined deep learning models, we obtained a classifier based on the optimal combination, "BERT + TextRNN," with a precision of 93.09%, a recall of 87.75%, and an F1-score of 90.34%. Moreover, we quantified the distinctive characteristics of the text structure and the multilabel distribution in order to generalize the model to other fields with similar characteristics.

Conclusions: The "BERT + TextRNN" model was trained for high-resolution classification of cancer literature at the publication level to support accurate retrieval and academic statistics. The model automatically assigns 1 or more labels to each cancer paper, as required. Quantitative comparison showed that the "BERT + TextRNN" model fits multilabel classification of cancer literature better than the other models tested. In the future, more data from diverse fields will be collected to test the scalability and extensibility of the proposed model.
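The 7:3 split and the reported metrics can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the random seed, and the choice of micro-averaging over label sets are all assumptions made for illustration, since the abstract does not state how the split was drawn or how the metrics were averaged. The sketch also shows that the reported F1-score is consistent with the reported precision and recall.

```python
import random


def split_corpus(items, train_fraction=0.7, seed=0):
    """Shuffle items and split them into train/test parts (7:3 by default).

    The seed and the shuffling strategy are illustrative assumptions.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def micro_prf(true_labels, pred_labels):
    """Micro-averaged precision, recall, and F1 for multilabel predictions.

    Each element of true_labels / pred_labels is the set of labels
    for one publication. Micro-averaging is an assumption here.
    """
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, pred_labels):
        truth, pred = set(truth), set(pred)
        tp += len(truth & pred)   # labels correctly assigned
        fp += len(pred - truth)   # labels assigned but not in the truth
        fn += len(truth - pred)   # true labels that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from_precision_recall(precision, recall)


if __name__ == "__main__":
    # Split a stand-in corpus of the same size as the one in the abstract.
    train, test = split_corpus(range(70599))
    print(len(train), len(test))

    # Check the reported F1-score against the reported precision and recall.
    print(round(f1_from_precision_recall(0.9309, 0.8775) * 100, 2))

    # Toy multilabel example with hypothetical cancer-type labels.
    truth = [{"breast"}, {"lung", "liver"}]
    preds = [{"breast"}, {"lung"}]
    print(micro_prf(truth, preds))
```

In the toy example, the second paper carries two true labels but receives only one, so precision stays at 1.0 while recall drops to 2/3. This is the same pattern as in the reported results, where precision exceeds recall.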

A Multi-Label Text Classifier at Publication Level Based on "PubMedBERT + TextRNN" for Cancer Literature

A Multilabel Text Classifier of Cancer Literature at the Publication Level: Methods Study of Medical Text Classification

Improving Cancer Hallmark Classification with BERT-based Deep Learning Approach

CancerBERT: a BERT model for Extracting Breast Cancer Phenotypes from Electronic Health Records

Medical-GAT: Cancer Document Classification Leveraging Graph-Based Residual Network for Scenarios with Limited Data

Multi-Label Classification of Research Papers Using Multi-Label K-Nearest Neighbour Algorithm

Multi-label Classification for Clinical Text with Feature-level Attention

Automatic semantic classification of scientific literature according to the hallmarks of cancer

Multi-label annotation of text reports from computed tomography of the chest, abdomen, and pelvis using deep learning

A Cross-institutional Evaluation on Breast Cancer Phenotyping NLP Algorithms on Electronic Health Records

AI-assisted Knowledge Discovery in Biomedical Literature to Support Decision-making in Precision Oncology

Multi-Label Classification For Colon Cancer Using Histopathological Images

Generalizable and automated classification of TNM stage from pathology reports with external validation

Highly accurate classification of chest radiographic reports using a deep learning natural language model pre-trained on 3.8 million text reports

Class imbalance in out-of-distribution datasets: Improving the robustness of the TextCNN for the classification of rare cancer types

MCICT: Graph convolutional network-based end-to-end model for multi-label classification of imbalanced clinical text

Extracting comprehensive clinical information for breast cancer using deep learning methods

Multi-class classification of COVID-19 documents using machine learning algorithms

Cancer hallmark analysis using semantic classification with enhanced topic modelling on biomedical literature

Empirical evaluation of language modeling to ascertain cancer outcomes from clinical text reports