Hierarchical Classification of Research Fields in the "Web of Science" Using Deep Learning

Susie Xi Rao, Peter H. Egger, Ce Zhang
2024-07-25
Abstract: This paper presents a hierarchical classification system that automatically categorizes a scholarly publication using its abstract into a three-tier hierarchical label set (discipline, field, subfield) in a multi-class setting. This system enables a holistic categorization of research activities in the mentioned hierarchy in terms of knowledge production through articles and impact through citations, permitting those activities to fall into multiple categories. The classification system distinguishes 44 disciplines, 718 fields and 1,485 subfields among 160 million abstract snippets in Microsoft Academic Graph (version 2018-05-17). We used batch training in a modularized and distributed fashion to address and allow for interdisciplinary and interfield classifications in single-label and multi-label settings. In total, we have conducted 3,140 experiments in all considered models (Convolutional Neural Networks, Recurrent Neural Networks, Transformers). The classification accuracy is > 90% in 77.13% and 78.19% of the single-label and multi-label classifications, respectively. We examine the advantages of our classification by its ability to better align research texts and output with disciplines, to adequately classify them in an automated way, and to capture the degree of interdisciplinarity. The proposed system (a set of pre-trained models) can serve as a backbone to an interactive system for indexing scientific publications in the future.
Digital Libraries, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The main goal of this paper is to develop a hierarchical classification system that automatically classifies academic publications, based on their abstracts, into a three-tier label set (discipline, field, subfield). The system aims at an overall categorization of academic activities in terms of knowledge production (through articles) and impact (through citations), and it allows an activity to fall into multiple categories. Specifically, the paper attempts to address the following issues:

1. Establish a hierarchical text classification system that automatically assigns academic literature to disciplines, fields, and subfields.
2. Better align research texts and output with disciplines, classify them adequately in an automated way, and capture the degree of interdisciplinarity.
3. Provide a set of pre-trained models that can serve as the backbone of a future interactive system for indexing scientific publications.
4. Address the lack of a unified classification system, which makes it hard to compare the "narrowness" of work across disciplines, its interdisciplinarity or range of impact, and the relative performance of scholars with similar interests.

To achieve these goals, the paper adopts a supervised machine learning approach and uses a large amount of data from the Microsoft Academic Graph (MAG) database. The paper also discusses how to combine existing disciplinary classification information, such as the list of academic fields on Wikipedia and discipline-specific classification systems (e.g., the JEL classification for economics and the ACM classification for computer science), into a global classification system with roughly the same level of granularity across disciplines. Finally, the paper runs experiments with Convolutional Neural Networks, Recurrent Neural Networks, and Transformers and evaluates classification performance in both single-label and multi-label settings.
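
The three-tier label set and the difference between the single-label and multi-label settings can be illustrated with a minimal sketch. It is not the authors' implementation: the toy hierarchy, the label names, and the helper functions `all_paths`, `is_valid`, `encode_labels`, and `disciplines_of` are all hypothetical, and encoding targets as a multi-hot vector over full (discipline, field, subfield) paths is only one of several possible conventions.

```python
from typing import Dict, List, Tuple

# Hypothetical toy hierarchy: discipline -> field -> subfields.
# The names are illustrative and not taken from the paper's label set.
HIERARCHY: Dict[str, Dict[str, List[str]]] = {
    "economics": {
        "labor economics": ["wage determination", "labor mobility"],
        "international economics": ["trade policy"],
    },
    "computer science": {
        "machine learning": ["text classification"],
    },
}

Path = Tuple[str, str, str]  # (discipline, field, subfield)


def all_paths(hierarchy: Dict[str, Dict[str, List[str]]]) -> List[Path]:
    """Enumerate every (discipline, field, subfield) path in sorted order."""
    return [
        (d, f, s)
        for d in sorted(hierarchy)
        for f in sorted(hierarchy[d])
        for s in sorted(hierarchy[d][f])
    ]


def is_valid(path: Path, hierarchy: Dict[str, Dict[str, List[str]]]) -> bool:
    """Check that the subfield sits under the field, and the field under the discipline."""
    d, f, s = path
    return s in hierarchy.get(d, {}).get(f, [])


def encode_labels(paths: List[Path], hierarchy: Dict[str, Dict[str, List[str]]]) -> List[int]:
    """Encode assigned paths as a 0/1 vector over all paths.

    One assigned path gives a one-hot vector (single-label case);
    several assigned paths give a multi-hot vector (multi-label case).
    """
    for p in paths:
        if not is_valid(p, hierarchy):
            raise ValueError(f"path not in hierarchy: {p}")
    chosen = set(paths)
    return [1 if p in chosen else 0 for p in all_paths(hierarchy)]


def disciplines_of(paths: List[Path]) -> List[str]:
    """Distinct disciplines touched by a publication's labels."""
    return sorted({d for d, _, _ in paths})


# Single-label: the abstract belongs to exactly one path.
single = [("economics", "international economics", "trade policy")]

# Multi-label: the abstract spans two disciplines.
multi = [
    ("economics", "labor economics", "wage determination"),
    ("computer science", "machine learning", "text classification"),
]

single_vec = encode_labels(single, HIERARCHY)
multi_vec = encode_labels(multi, HIERARCHY)
```

In this sketch, a publication whose labels touch more than one discipline (here, `disciplines_of(multi)` has two entries) is a simple stand-in for an interdisciplinary publication; the paper's own measure of interdisciplinarity may be defined differently.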