Text mining arXiv: a look through quantitative finance papers

Michele Leonardo Bianchi
2024-04-05
Abstract: This paper explores articles hosted on the arXiv preprint server with the aim of uncovering valuable insights hidden in this vast collection of research. Employing text mining techniques and natural language processing methods, we examine the contents of quantitative finance papers posted on arXiv from 1997 to 2022. We extract and analyze crucial information from the entire documents, including the references, to understand topic trends over time and to identify the most cited researchers and journals in this domain. Additionally, we compare numerous algorithms to perform topic modeling, including state-of-the-art approaches.
Digital Libraries, Information Retrieval, General Finance
What problem does this paper attempt to address?
The paper applies text mining techniques to the quantitative finance literature on the arXiv preprint server in order to uncover information and trends that are not visible from individual documents. It has two objectives:

1. **Topic Trend Analysis**: Using natural language processing methods, the paper performs topic modeling on quantitative finance papers posted on arXiv from 1997 to 2022, to identify the research topics and describe how they evolved over these years. This includes evaluating several clustering algorithms and using the best-performing one to group the papers into 30 topics, which shows which research directions were popular in different periods.
2. **Key Authors and Journals Identification**: The paper also identifies the most cited authors and journals in quantitative finance. This is done by mining the full text of the papers, including their reference lists, so the analysis does not require anyone to read the papers themselves.

To achieve these objectives, the author first collected approximately 16,000 quantitative finance papers from arXiv and preprocessed them in detail, including text cleaning, lemmatization, and other steps. The performance of different topic modeling approaches (such as K-means, LDA, Word2Vec, Doc2Vec, Top2Vec, and BERTopic) was then compared, and the most effective one was selected for the topic analysis. The resulting study reveals the main research trends, the most cited contributors, and the most cited journals in quantitative finance, which may help orient future work in the field.
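The preprocessing and clustering steps described above can be illustrated with a minimal, standard-library-only sketch. It is not the paper's implementation. The function names (`preprocess`, `tfidf`, `kmeans`), the stopword list, and the four sample titles are all assumptions made for illustration. Lemmatization is omitted, and the clustering is a simplified spherical K-means (cosine similarity on TF-IDF vectors) with a deterministic initialization from the first `k` documents.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list, far smaller than one used in practice.
STOPWORDS = {"the", "of", "a", "an", "and", "in", "for", "to", "on", "with", "we", "is", "are"}


def preprocess(text):
    """Lowercase, keep alphabetic tokens, drop stopwords and tokens shorter than 3 letters."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]


def tfidf(docs_tokens):
    """Return one L2-normalized sparse TF-IDF vector (a dict) per document."""
    n = len(docs_tokens)
    df = Counter(term for tokens in docs_tokens for term in set(tokens))
    vectors = []
    for tokens in docs_tokens:
        if not tokens:
            vectors.append({})
            continue
        tf = Counter(tokens)
        vec = {
            t: (c / len(tokens)) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t, c in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(u, v):
    """Dot product of two sparse vectors; equals cosine similarity when both are unit length."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def kmeans(vectors, k, iterations=10):
    """Simplified spherical K-means; the first k documents seed the centroids (assumption)."""
    centroids = [dict(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each document goes to its most similar centroid.
        labels = [max(range(k), key=lambda j: cosine(v, centroids[j])) for v in vectors]
        # Update step: each centroid becomes the normalized mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if not members:
                continue
            mean = {}
            for v in members:
                for t, w in v.items():
                    mean[t] = mean.get(t, 0.0) + w / len(members)
            norm = math.sqrt(sum(w * w for w in mean.values()))
            centroids[j] = {t: w / norm for t, w in mean.items()}
    return labels


# Made-up titles, used only to exercise the sketch.
docs = [
    "Option pricing under stochastic volatility models",
    "Portfolio optimization with risk constraints",
    "Pricing American option with stochastic volatility",
    "Risk parity portfolio optimization",
]
labels = kmeans(tfidf([preprocess(d) for d in docs]), k=2)
print(labels)  # [0, 1, 0, 1]
```

On a real corpus one would more likely rely on an established library and a proper lemmatizer, and compare several models as the paper does. The sketch only shows how cleaned text becomes vectors and how those vectors are grouped into topics.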