Empowering Interdisciplinary Research with BERT-Based Models: An Approach Through SciBERT-CNN with Topic Modeling

Darya Likhareva, Hamsini Sankaran, Sivakumar Thiyagarajan
2024-04-23
Abstract: Researchers must stay current in their fields by regularly reviewing academic literature, a task complicated by the daily publication of thousands of papers. Traditional multi-label text classification methods often ignore semantic relationships and fail to address the inherent class imbalances. This paper introduces a novel approach using the SciBERT model and CNNs to systematically categorize academic abstracts from the Elsevier OA CC-BY corpus. We use a multi-segment input strategy that processes abstracts, body text, titles, and keywords obtained via BERT topic modeling through SciBERT. Here, the [CLS] token embeddings capture the contextual representation of each segment, concatenated and processed through a CNN. The CNN uses convolution and pooling to enhance feature extraction and reduce dimensionality, optimizing the data for classification. Additionally, we incorporate class weights based on label frequency to address the class imbalance, significantly improving the classification F1 score and enhancing text classification systems and literature review efficiency.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper addresses multi-label classification of academic literature. With thousands of papers published daily, researchers struggle to keep up with developments across fields. Traditional multi-label text classification methods often ignore semantic relationships and do not handle class imbalance effectively. The paper proposes an approach that combines the SciBERT model with a Convolutional Neural Network (CNN) to systematically classify academic abstracts from the Elsevier OA CC-BY corpus, using keywords obtained via BERT topic modeling as an additional input. The main points are:
1. The SciBERT-CNN model uses a multi-segment input strategy: abstracts, body text, titles, and BERT topic keywords are each passed through SciBERT. The [CLS] token embedding captures the contextual representation of each segment; the embeddings are concatenated, and a CNN then applies convolution and pooling for feature extraction and dimensionality reduction before classification.
2. Class imbalance is addressed with class weights based on label frequency, which the paper reports significantly improves the F1 score.
3. The Elsevier OA CC-BY corpus is used as the dataset because it covers open-access articles from 27 disciplines, providing a good basis for training and evaluating the model.
4. The paper analyzes limitations of existing methods, such as those of the BERT model in academic text classification, and how the SciBERT-CNN model and the balancing strategy address them.
In its experiments, the paper reports that the proposed model significantly reduces misclassifications and improves accuracy and efficiency, especially for interdisciplinary research papers. Future work will explore data augmentation and the integration of domain-specific keywords to further improve model performance.
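The summary says class weights are derived from label frequency but does not give the formula. As an illustration only, the sketch below assumes a common inverse-frequency scheme (weight = number of samples / (number of labels × label count)); the function name `label_frequency_weights` and the toy discipline labels are hypothetical and not taken from the paper.

```python
from collections import Counter
from typing import Dict, Sequence


def label_frequency_weights(label_sets: Sequence[Sequence[str]]) -> Dict[str, float]:
    """Return one weight per label, inversely proportional to its frequency.

    ASSUMPTION: inverse-frequency weighting, n_samples / (n_labels * count).
    The paper only states that weights are "based on label frequency".
    """
    # Count each label at most once per document (multi-label setting).
    counts = Counter(label for labels in label_sets for label in set(labels))
    n_samples = len(label_sets)
    n_labels = len(counts)
    return {label: n_samples / (n_labels * count) for label, count in counts.items()}


# Hypothetical toy data: four abstracts, each tagged with one or more disciplines.
toy_labels = [
    ["medicine"],
    ["medicine", "biochemistry"],
    ["medicine"],
    ["physics"],
]

weights = label_frequency_weights(toy_labels)
for label, weight in sorted(weights.items()):
    print(f"{label}: {weight:.3f}")
```

Rare labels receive larger weights than frequent ones, so errors on under-represented disciplines contribute more to a weighted loss. How the weights enter the loss function in the actual model is not described in this summary.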