Abstract:

Introduction and Motivation

There is a concerted effort to study the science of science in multiple spheres. However, a clear gap exists in how to incorporate digital outputs, such as software, as an integral component of scholarly communication. This tension has grown in recent years because software can be the end product of many scientific inquiries. Therefore, a framework is needed to assess the impact of software in science. One cornerstone of such a framework is the design of text-based methods to identify software entities in full-text corpora, because these entities are largely mentioned in the text rather than formally cited in the way their publication counterparts are. This research-in-progress paper serves this purpose by developing and evaluating a bootstrapping method to automatically extract software entities from a full-text data set.

Despite efforts to index digital outputs, such as Thomson Reuters’ Data Citation Index or SageCite by the University of Bath, U.K., the use of full-text data is necessary to identify patterns of software references because these digital outputs are referenced in unsystematic ways in the scientific literature. They can be embedded in documents through digital object identifiers (DOIs) or hyperlinks, featured on dedicated websites, or simply mentioned in paragraphs, footnotes, endnotes, acknowledgements, or supplementary materials. A 2014 citation study on three oceanographic data sets showed that these digital outputs are more likely to be mentioned in the text than formally cited (Belter, 2014). Intuitively, one would think of curating a list of software names; however, this is not feasible because of the velocity, variety, and volume with which software is developed and applied. Thus, metadata or static listings alone cannot capture the full extent of the impact of software.
Instead, full-text publication data provide the crucial context for this purpose. This study uses a bootstrapping method to identify software use in a full-text data set. This will allow us to extend impact and attribution mechanisms by assessing the impact of software.

Methods

The bootstrapping method is used to extract software entities from full-text papers. It is a self-sustaining technique that iteratively improves a classifier’s performance starting from seed terms (Riloff & Jones, 1999; Riloff, Wiebe, & Wilson, 2003). The bootstrapping process contains the following steps:

(1) Label seed terms or learned entities in the text. Seed terms are used in the first iteration, and learned entities are used in later iterations.
(2) Generate contextual patterns of seed terms in the first iteration, and of learned entities in later iterations.
(3) Score these contextual patterns and select the top-ranked N patterns as candidate patterns.
(4) Score the entities extracted by the candidate patterns and select the top-ranked M entities as learned entities.
(5) Go back to the first step until the system cannot learn any new positive entities.

The calculation of pattern scores and entity scores determines the effectiveness of the bootstrapping method. If a pattern receives a high score, it is selected into the candidate pattern pool. Entities extracted by these candidate patterns are considered candidate entities. To boost performance, we incorporated three heuristic rules, expressed as four features, into the calculation of pattern scores. The first feature checks whether an unlabeled entity contains at least one uppercase letter: the entity gets a score of 1 if it contains one or more uppercase alphabetic letters; otherwise, it gets a score less than 1. The second feature focuses on version numbers: an entity gets a score of 1 if a version number is collocated with it.
The third and fourth features deal with the presence of trigger words: an entity gets a score of 1 if its left context (third feature) or right context (fourth feature) contains trigger words.
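
The five steps and four features described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation. The source does not give the exact scoring formulas, the trigger-word list, the context window, or the pattern representation, so everything below is hypothetical: the trigger words, the version-number regular expression, the (previous word, next word) pattern shape, the precision-times-log-frequency pattern score, the value 0 for an absent feature, and the names heuristic_features and bootstrap.

```python
import math
import re
from collections import defaultdict

# Assumption: the source does not list its trigger words; these are placeholders.
TRIGGERS = {"software", "package", "tool", "program"}
# Assumption: a version number looks like "20.0", "0.8" or "v2.1".
VERSION_RE = re.compile(r"^v?\d+(\.\d+)*$", re.IGNORECASE)


def heuristic_features(tokens, i, window=3):
    """Four features for the candidate entity tokens[i] (1.0 if present, else 0.0)."""
    left = [t.lower() for t in tokens[max(0, i - window):i]]
    right = [t.lower() for t in tokens[i + 1:i + 1 + window]]
    nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
    return {
        "has_uppercase": float(any(c.isupper() for c in tokens[i])),
        "has_version": float(bool(VERSION_RE.match(nxt))),
        "left_trigger": float(any(t in TRIGGERS for t in left)),
        "right_trigger": float(any(t in TRIGGERS for t in right)),
    }


def _norm(token):
    """Normalize a context word so that all version numbers share one pattern slot."""
    return "<ver>" if VERSION_RE.match(token) else token.lower()


def bootstrap(sentences, seeds, top_n=2, top_m=2, max_iter=10):
    """Grow a set of entities from seed terms over tokenized sentences."""
    known = set(seeds)
    for _ in range(max_iter):
        # Steps 1-2: collect what every (previous word, next word) pattern extracts.
        extracted = defaultdict(set)
        heur = {}
        for toks in sentences:
            for i in range(1, len(toks) - 1):
                pattern = (_norm(toks[i - 1]), _norm(toks[i + 1]))
                extracted[pattern].add(toks[i])
                h = sum(heuristic_features(toks, i).values()) / 4
                heur[toks[i]] = max(h, heur.get(toks[i], 0.0))
        # Step 3: score patterns that match known entities; keep the top N.
        scores = {}
        for pattern, ents in extracted.items():
            pos = len(ents & known)
            if pos:
                scores[pattern] = pos / len(ents) * math.log2(1 + pos)
        candidates = sorted(scores, key=scores.get, reverse=True)[:top_n]
        # Step 4: score unlabeled entities found by candidate patterns; keep the top M.
        entity_scores = defaultdict(float)
        for pattern in candidates:
            for ent in extracted[pattern] - known:
                entity_scores[ent] += scores[pattern] * heur[ent]
        ranked = sorted(entity_scores, key=entity_scores.get, reverse=True)[:top_m]
        learned = [e for e in ranked if entity_scores[e] > 0]
        # Step 5: stop once no new entity is learned.
        if not learned:
            break
        known.update(learned)
    return known


# Toy corpus, invented for illustration only.
EXAMPLE_SENTENCES = [s.split() for s in [
    "We used SPSS software for the analysis",
    "We used Stata software for the regression",
    "Data were analyzed with SPSS 20.0 .",
    "Data were analyzed with Gephi 0.8 .",
]]

if __name__ == "__main__":
    print(sorted(bootstrap(EXAMPLE_SENTENCES, {"SPSS"})))
```

On this toy corpus, starting from the single seed SPSS, the two contexts "used _ software" and "with _ <version>" become candidate patterns and the sketch learns Stata and Gephi before stopping. A real corpus would need multi-word entities, a richer pattern language, and tuned values of N and M.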

A Bootstrapping Method to Assess Software Impact in Full-Text Papers.
