Corporate fraud detection based on linguistic readability vector: Application to financial companies in China
Yi Zhang, Tianxiang Liu, Weiping Li
DOI: https://doi.org/10.1016/j.irfa.2024.103405
IF: 8.235
2024-06-27
International Review of Financial Analysis
Abstract: Existing research on corporate fraud identification mainly builds models from text data disclosed by companies. However, semantic information in the text is lost when the text data are vectorized with natural language processing methods. Based on the linguistic features of Chinese texts, we construct four new readability indices for Chinese, at the character, word, sentence, and paragraph levels, and combine them to define, for the first time, a linguistic readability vector for Chinese text. Taking A-share financial companies listed on the Shanghai and Shenzhen stock exchanges from 2005 to 2019 as the research sample, we use the natural language processing method Word2Vec to vectorize the management's discussion and analysis (MD&A) sections of the companies' annual reports. We then use machine learning algorithms to construct fraud identification models in which the readability vector supplements the MD&A vectors with semantic information. The empirical results show that the performance of all three types of machine learning models improves once the readability vector is added. The support vector machine improves the most, with gains of 31.17%, 2.56%, 26.33%, and 2.45% in accuracy, recall, F1-score, and AUC, respectively. The approach not only enriches the semantic interpretation of Chinese annual reports but also improves the empirical effectiveness of fraud identification models.
business, finance
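
The abstract describes a four-component readability vector (character, word, sentence, and paragraph level) that is combined with a Word2Vec representation of the MD&A text. The abstract does not give the formulas for the four indices, so the sketch below uses simple stand-in proxies (character diversity, mean word length, mean words per sentence, mean sentences per paragraph) that are assumptions for illustration only and are not the authors' definitions. The function names `readability_vector` and `augment` are likewise hypothetical, and the input is assumed to be pre-segmented text with words separated by spaces, paragraphs separated by newlines, and non-terminal punctuation already removed.

```python
import re

# Sentence-ending punctuation (Chinese and ASCII); an assumption for this sketch.
SENTENCE_END = re.compile(r"[。！？!?]")


def readability_vector(segmented_text):
    """Return [char, word, sentence, paragraph] level proxies.

    These four values are illustrative stand-ins, not the indices
    defined in the paper.
    """
    paragraphs = [p for p in segmented_text.split("\n") if p.strip()]
    sentences = []
    for paragraph in paragraphs:
        parts = [s.strip() for s in SENTENCE_END.split(paragraph)]
        sentences.extend(s for s in parts if s)
    words = [w for s in sentences for w in s.split()]
    chars = [c for w in words for c in w]
    if not chars:
        return [0.0, 0.0, 0.0, 0.0]
    char_level = len(set(chars)) / len(chars)         # share of distinct characters
    word_level = len(chars) / len(words)              # mean characters per word
    sentence_level = len(words) / len(sentences)      # mean words per sentence
    paragraph_level = len(sentences) / len(paragraphs)  # mean sentences per paragraph
    return [char_level, word_level, sentence_level, paragraph_level]


def augment(embedding, readability):
    """Concatenate a document embedding with its readability vector."""
    return list(embedding) + list(readability)


if __name__ == "__main__":
    text = "公司 收入 增长。 风险 可控。\n管理层 保持 谨慎。"
    placeholder_embedding = [0.1, 0.2]  # stands in for a Word2Vec document vector
    print(augment(placeholder_embedding, readability_vector(text)))
```

In a pipeline like the one the abstract outlines, the concatenated feature vector would then be passed to a classifier such as a support vector machine. Concatenation is only one plausible way to combine the two representations; the abstract does not state how the authors merge them.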