Abstract: In recent years, Large Language Models (LLMs) have demonstrated remarkable versatility across applications such as natural language understanding and domain-specific knowledge tasks. However, applying LLMs to complex, high-stakes domains like finance requires rigorous evaluation to ensure reliability, accuracy, and compliance with industry standards. To address this need, we conduct a comparative study of three state-of-the-art LLMs, GLM-4, Mistral-NeMo, and LLaMA3.1, focusing on their effectiveness in generating automated financial reports. Our primary motivation is to explore how these models can be harnessed within finance, a field demanding precision, contextual relevance, and robustness against erroneous or misleading information. By examining each model's capabilities, we aim to assess their strengths and limitations. Our paper offers benchmarks for financial report analysis, using metrics such as ROUGE-1, BERTScore, and LLM Score. We introduce an evaluation framework that integrates both quantitative metrics (e.g., precision, recall) and qualitative analyses (e.g., contextual fit, consistency) to provide a holistic view of each model's output quality. Additionally, we make our financial dataset publicly available, inviting researchers and practitioners to use, scrutinize, and build on our findings. Our dataset is available on Hugging Face.

What problem does this paper attempt to address?

This paper addresses how large language models (LLMs) can be applied to the automatic generation of financial reports. Specifically, the authors evaluate the effectiveness of three state-of-the-art LLMs (GLM-4, Mistral-NeMo, and LLaMA3.1) in generating automated financial reports, in order to explore how such models can be used effectively in finance. The financial domain requires model outputs to be highly precise, contextually relevant, and resistant to false or misleading information. The study therefore considers not only the technical performance of the models but also their practical value in financial applications. To this end, the researchers designed an evaluation framework that combines quantitative metrics (such as precision and recall) with qualitative analysis (such as contextual fit and consistency) to assess the output quality of each model. They also released the financial dataset used in the study, encouraging other researchers and practitioners to scrutinize and extend the work through community collaboration. Overall, the paper uses a systematic evaluation method to identify the strengths and limitations of different LLMs on financial report generation, providing guidance for future research and practical applications.
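To make the quantitative side of the framework concrete, here is a minimal sketch of how ROUGE-1 precision, recall, and F1 can be computed from unigram overlap between a generated summary and a reference. The function name `rouge1`, the lowercase whitespace tokenization, and the example sentences are assumptions for illustration; the summary above does not describe the authors' exact preprocessing, and established ROUGE implementations typically add steps such as stemming.

```python
from collections import Counter


def rouge1(candidate: str, reference: str) -> dict:
    """Unigram-overlap ROUGE-1 scores (illustrative, whitespace tokens)."""
    # Assumption: lowercase + whitespace split; real toolkits may stem.
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Clipped overlap: each token counts at most as often as it
    # appears in both the candidate and the reference.
    overlap = sum((cand_counts & ref_counts).values())

    cand_total = sum(cand_counts.values())
    ref_total = sum(ref_counts.values())

    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical example sentences, not taken from the paper's dataset.
reference = "revenue rose 12 percent in the third quarter"
candidate = "revenue rose 12 percent last quarter"
scores = rouge1(candidate, reference)
print(scores)
```

In this example five of the six candidate tokens appear in the eight-token reference, so precision is 5/6 and recall is 5/8. Overlap metrics like this capture lexical agreement only, which is one reason the study pairs them with qualitative checks such as contextual fit and consistency.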

Evaluating Large Language Models on Financial Report Summarization: An Empirical Study

A Survey of Large Language Models for Financial Applications: Progress, Prospects and Challenges

Data-centric financial large language models

Large Language Models in Finance: A Survey

Enhancing Financial Sentiment Analysis via Retrieval Augmented Large Language Models

A Survey of Large Language Models in Finance (FinLLMs)

Large Language Model Adaptation for Financial Sentiment Analysis

Report Cards: Qualitative Evaluation of Language Models Using Natural Language Summaries

Towards reducing hallucination in extracting information from financial reports using Large Language Models

Auto-Generating Earnings Report Analysis via a Financial-Augmented LLM

Leveraging LLMs for KPIs Retrieval from Hybrid Long-Document: A Comprehensive Framework and Dataset

Large Language Models as Financial Data Annotators: A Study on Effectiveness and Efficiency

Is ChatGPT a Financial Expert? Evaluating Language Models on Financial Natural Language Processing

LLMEval: A Preliminary Study on How to Evaluate Large Language Models

CatMemo at the FinLLM Challenge Task: Fine-Tuning Large Language Models using Data Fusion in Financial Applications

A Survey on Evaluation of Large Language Models

FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for Large Language Models