Identifying Banking Transaction Descriptions via Support Vector Machine Short-Text Classification Based on a Specialized Labelled Corpus

Silvia García-Méndez, Milagros Fernández-Gavilanes, Jonathan Juncal-Martínez, Francisco J. González-Castaño, Oscar Barba Seara
DOI: https://doi.org/10.1109/ACCESS.2020.2983584
2024-03-29
Abstract: Short texts are omnipresent in real-time news, social network commentaries, etc. Traditional text representation methods have been successfully applied to self-contained documents of medium size. However, information in short texts is often insufficient, due, for example, to the use of mnemonics, which makes them hard to classify. Therefore, the particularities of specific domains must be exploited. In this article we describe a novel system that combines Natural Language Processing techniques with Machine Learning algorithms to classify banking transaction descriptions for personal finance management, a problem that was not previously considered in the literature. We trained and tested that system on a labelled dataset with real customer transactions that will be available to other researchers on request. Motivated by existing solutions in spam detection, we also propose a short text similarity detector to reduce training set size based on the Jaccard distance. Experimental results with a two-stage classifier combining this detector with an SVM indicate a high accuracy in comparison with alternative approaches, taking into account complexity and computing time. Finally, we present a use case with a personal finance application, CoinScrap, which is available at Google Play and App Store.
Information Retrieval; Artificial Intelligence; Computational Engineering, Finance, and Science; Computation and Language; Machine Learning
What problem does this paper attempt to address?
The paper addresses the problem of automatically classifying bank transaction (BT) descriptions for personal finance management. Specifically, the authors propose a new system that combines natural language processing (NLP) techniques and machine learning (ML) algorithms to classify the short text descriptions of bank transactions. According to the authors, this problem had not been considered in previous literature.

### Main Issues

1. **Insufficient Information**: Bank transaction descriptions are usually very short and carry limited information (for example, because they rely on mnemonics), which makes them hard to classify.
2. **Domain-Specific Characteristics**: The vocabulary of bank transaction descriptions is specific to the domain, so classification must exploit these domain-specific characteristics.
3. **Real-Time Generation**: Bank transaction descriptions are generated in real time, so classification methods must be efficient enough to handle large volumes of data.

### Solutions

1. **Feature Extraction**: Use features such as character and word n-grams to represent short texts.
2. **Support Vector Machine (SVM)**: Use an SVM as the classifier, trained on these features.
3. **Similarity Detection**: Introduce a short-text similarity detector based on the Jaccard distance to reduce the size of the training set and improve efficiency.

### Experimental Results

- In cross-validation, the system showed high accuracy across different splits of training and test data.
- Compared with alternative approaches, the two-stage classifier (similarity detector plus SVM) achieves high accuracy once complexity and computation time are taken into account.

### Application Case

- The system has been applied in a personal finance management application called CoinScrap, which is available for download on Google Play and the App Store.
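
The similarity detection step listed under Solutions can be illustrated with a minimal sketch. It is not the authors' implementation: this summary does not specify the features, threshold, or selection strategy the paper uses, so the character trigram representation, the threshold of 0.3, the greedy filtering loop, and the function names `char_ngrams`, `jaccard_distance`, and `filter_near_duplicates` are all assumptions made for illustration. The only fixed element is the standard Jaccard distance between two sets, 1 - |A ∩ B| / |A ∪ B|.

```python
from typing import List, Set


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Return the set of character n-grams of a lowercased text.

    Whitespace runs are collapsed to a single space first. Texts shorter
    than n characters are represented by the whole text (assumption).
    """
    normalized = " ".join(text.lower().split())
    if not normalized:
        return set()
    if len(normalized) < n:
        return {normalized}
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """Jaccard distance 1 - |A ∩ B| / |A ∪ B|.

    Two empty sets are treated as identical (distance 0.0).
    """
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def filter_near_duplicates(
    texts: List[str], threshold: float = 0.3, n: int = 3
) -> List[str]:
    """Greedily keep a text only if it is farther than `threshold`
    (in Jaccard distance) from every text already kept.

    The threshold value is a hypothetical choice, not taken from the paper.
    """
    kept: List[str] = []
    kept_features: List[Set[str]] = []
    for text in texts:
        features = char_ngrams(text, n)
        if all(jaccard_distance(features, f) > threshold for f in kept_features):
            kept.append(text)
            kept_features.append(features)
    return kept


if __name__ == "__main__":
    # Made-up transaction descriptions, for illustration only.
    samples = [
        "card payment supermarket 0123",
        "card payment supermarket 0124",
        "salary transfer",
    ]
    print(filter_near_duplicates(samples))
```

With these assumed settings, the second sample differs from the first in a single trigram and is dropped, so only distinct descriptions would reach the classifier's training set. The greedy loop compares each text against every kept text, which is quadratic in the worst case; a production system would likely need an indexing scheme for large datasets.
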
### Summary

The paper proposes a novel system that classifies bank transaction descriptions for personal finance management, combining a Jaccard-distance similarity detector with an SVM classifier and demonstrating it in the CoinScrap application.