Golden Touchstone: A Comprehensive Bilingual Benchmark for Evaluating Financial Large Language Models

Xiaojun Wu, Junxi Liu, Huanyi Su, Zhouchi Lin, Yiyan Qi, Chengjin Xu, Jiajun Su, Jiajie Zhong, Fuwei Wang, Saizhuo Wang, Fengrui Hua, Jia Li, Jian Guo
2024-11-10
Abstract: As large language models become increasingly prevalent in the financial sector, there is a pressing need for a standardized method to comprehensively assess their performance. However, existing finance benchmarks often suffer from limited language and task coverage, as well as challenges such as low-quality datasets and inadequate adaptability for LLM evaluation. To address these limitations, we propose "Golden Touchstone", the first comprehensive bilingual benchmark for financial LLMs, which incorporates representative datasets from both Chinese and English across eight core financial NLP tasks. Developed from extensive open-source data collection and industry-specific demands, this benchmark includes a variety of financial tasks aimed at thoroughly assessing models' language understanding and generation capabilities. Through comparative analysis of major models on the benchmark, such as GPT-4o, Llama3, FinGPT and FinMA, we reveal their strengths and limitations in processing complex financial information. Additionally, we open-sourced Touchstone-GPT, a financial LLM trained through continual pre-training and financial instruction tuning, which demonstrates strong performance on the bilingual benchmark but still has limitations in specific tasks. This research not only provides financial large language models with a practical evaluation tool but also guides the development and optimization of future research. The source code for Golden Touchstone and the model weights of Touchstone-GPT have been made publicly available at https://github.com/IDEA-FinAI/Golden-Touchstone, contributing to the ongoing evolution of FinLLMs and fostering further research in this critical area.
Computation and Language; Computational Engineering, Finance, and Science
What problem does this paper attempt to address?
The paper addresses several shortcomings of existing evaluation benchmarks for large language models (LLMs) in the financial field, including but not limited to:

1. **Insufficient language and task coverage**: Existing financial benchmarks often cover only a limited number of languages (mainly English) and task types. They lack support for multilingual settings, and their coverage of Chinese is especially thin.
2. **Low-quality datasets**: Many existing benchmarks rely on low-quality datasets, which directly affects the accuracy and reliability of model evaluation.
3. **Insufficient adaptability**: Existing benchmarks may not have been designed with the characteristics of large language models in mind, so their results cannot fully reflect the models' true performance.

To address these problems, the paper proposes "Golden Touchstone", a comprehensive bilingual benchmark for evaluating large language models in the financial field. It contains representative Chinese and English datasets covering eight core financial natural language processing (NLP) tasks, providing a more systematic, higher-quality evaluation tool that supports the development and optimization of financial LLMs.