Abstract: As large language models become increasingly prevalent in the financial sector, there is a pressing need for a standardized method to comprehensively assess their performance. However, existing finance benchmarks often suffer from limited language and task coverage, as well as challenges such as low-quality datasets and inadequate adaptability for LLM evaluation. To address these limitations, we propose "Golden Touchstone", the first comprehensive bilingual benchmark for financial LLMs, which incorporates representative datasets from both Chinese and English across eight core financial NLP tasks. Developed from extensive open-source data collection and industry-specific demands, this benchmark includes a variety of financial tasks aimed at thoroughly assessing models' language understanding and generation capabilities. Through comparative analysis of major models on the benchmark, such as GPT-4o, Llama3, FinGPT and FinMA, we reveal their strengths and limitations in processing complex financial information. Additionally, we open-sourced Touchstone-GPT, a financial LLM trained through continual pre-training and financial instruction tuning, which demonstrates strong performance on the bilingual benchmark but still has limitations in specific tasks. This research not only provides a practical evaluation tool for financial large language models but also guides the development and optimization of future research. The source code for Golden Touchstone and the model weights of Touchstone-GPT have been made publicly available at https://github.com/IDEA-FinAI/Golden-Touchstone, contributing to the ongoing evolution of FinLLMs and fostering further research in this critical area.

What problem does this paper attempt to address?

This paper addresses several key shortcomings of existing evaluation benchmarks for large language models (LLMs) in the financial field, including but not limited to:

1. **Insufficient language and task coverage**: Existing financial evaluation benchmarks often cover only a limited number of languages (mainly English) and task types. They lack support for multilingual settings, and their coverage of Chinese is especially thin.
2. **Low-quality datasets**: Many existing benchmarks use low-quality datasets, which directly affects the accuracy and reliability of model evaluation.
3. **Insufficient adaptability**: Existing benchmarks may not fully account for the characteristics of large language models in their design, so their evaluation results cannot fully reflect the true performance of the models.

To address these problems, the paper proposes "Golden Touchstone", a comprehensive bilingual benchmark for evaluating large language models in the financial field. The benchmark contains representative datasets in Chinese and English and covers eight core financial natural language processing (NLP) tasks. It thereby provides a more systematic, higher-quality evaluation tool that helps promote the development and optimization of financial large language models.

Golden Touchstone: A Comprehensive Bilingual Benchmark for Evaluating Financial Large Language Models
