Artificial Intuition: Efficient Classification of Scientific Abstracts

Harsh Sakhrani, Naseela Pervez, Anirudh Ravi Kumar, Fred Morstatter, Alexandra Graddy Reed, Andrea Belz
2024-07-09
Abstract: It is desirable to coarsely classify short scientific texts, such as grant or publication abstracts, for strategic insight or research portfolio management. These texts efficiently transmit dense information to experts possessing a rich body of knowledge to aid interpretation. Yet this task is remarkably difficult to automate because of brevity and the absence of context. To address this gap, we have developed a novel approach to generate and appropriately assign coarse domain-specific labels. We show that a Large Language Model (LLM) can provide metadata essential to the task, in a process akin to the augmentation of supplemental knowledge representing human intuition, and propose a workflow. As a pilot study, we use a corpus of award abstracts from the National Aeronautics and Space Administration (NASA). We develop new assessment tools in concert with established performance metrics.
Artificial Intelligence
What problem does this paper attempt to address?
This paper examines how to classify scientific abstracts at a coarse-grained level. The task is difficult because abstracts are short, information-dense, and lack the context that expert readers supply from their own knowledge. The researchers propose an approach they call "Artificial Intuition", which generates domain-specific coarse labels and assigns them to documents. A large language model (LLM) provides the necessary metadata, in a process the authors liken to adding the supplemental knowledge that represents human intuition. In the proposed workflow, a keyword extraction algorithm first extracts key terms from each abstract, an LLM then generates background information for those keywords, and the augmented documents are clustered for classification. The pilot study uses NASA SBIR project abstracts, and the authors develop new evaluation tools alongside standard performance metrics.
The paper states two main requirements: (1) a unified, coarse-grained, non-overlapping classification system that assigns each document in a collection to exactly one category; and (2) an unsupervised method that does not rely on manual annotation and that handles the characteristics of scientific text, particularly abstracts. The researchers generate the label space through k-means clustering and propose a coverage measure to evaluate whether the labels comprehensively describe the document space. They also analyze how the number of clusters affects redundancy and coverage, and how threshold selection during label prediction can achieve high precision and recall.
Finally, the paper discusses future directions: validation on a wider range of document sets, handling of longer documents, generation of multiple labels per document, and applications in business and public policy, such as tracking research trends or classifying industries using the label-generated metadata.
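The workflow summarized above (keyword extraction, augmentation with background knowledge, then k-means clustering) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and every name in it is hypothetical. The frequency-based keyword extractor, the BACKGROUND lookup table standing in for the LLM call, the bag-of-words vectors, and the deterministic farthest-point initialization are all assumptions chosen to keep the example self-contained. The toy abstracts are invented for illustration and share no words within a topic until the background text is added.

```python
import math
import re
from collections import Counter

STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}

# Hypothetical stand-in for the LLM step: keyword -> background text.
# A real system would query a language model here instead.
BACKGROUND = {
    "propulsion": "rocket engine thrust spacecraft",
    "thruster": "rocket engine thrust spacecraft",
    "sensor": "instrument detector measurement imaging",
    "spectrometer": "instrument detector measurement imaging",
}

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def extract_keywords(text, top_n=5):
    """Assumed extractor: the most frequent non-stopword tokens."""
    return [word for word, _ in Counter(tokenize(text)).most_common(top_n)]

def augment(text):
    """Append background text for each keyword that has an entry."""
    extra = [BACKGROUND[k] for k in extract_keywords(text) if k in BACKGROUND]
    return " ".join([text] + extra)

def vectorize(docs):
    """Unit-length bag-of-words vectors over a shared vocabulary."""
    vocab = sorted({t for d in docs for t in tokenize(d)})
    vectors = []
    for d in docs:
        counts = Counter(tokenize(d))
        v = [float(counts[t]) for t in vocab]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        vectors.append([x / norm for x in v])
    return vectors

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=10):
    """Plain k-means with deterministic farthest-point initialization."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each vector to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors]
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def cluster_abstracts(abstracts, k):
    """Augment each abstract, vectorize, and return one cluster id per abstract."""
    return kmeans(vectorize([augment(a) for a in abstracts]), k)

# Invented toy abstracts, not taken from the NASA corpus.
ABSTRACTS = [
    "Compact electric propulsion for small satellites",
    "High efficiency thruster system",
    "Infrared sensor for planetary imaging",
    "Miniature spectrometer for atmospheric measurement",
]

labels = cluster_abstracts(ABSTRACTS, k=2)
print(labels)
```

In this toy setup the first two abstracts have no words in common, and neither do the last two. The added background text is what gives each pair shared vocabulary, which is the role the summary attributes to the LLM-generated metadata.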