Can GPT models be Financial Analysts? An Evaluation of ChatGPT and GPT-4 on mock CFA Exams

Ethan Callanan, Amarachi Mbakwe, Antony Papadimitriou, Yulong Pei, Mathieu Sibue, Xiaodan Zhu, Zhiqiang Ma, Xiaomo Liu, Sameena Shah

2023-10-13

Abstract: Large Language Models (LLMs) have demonstrated remarkable performance on a wide range of Natural Language Processing (NLP) tasks, often matching or even beating state-of-the-art task-specific models. This study aims at assessing the financial reasoning capabilities of LLMs. We leverage mock exam questions of the Chartered Financial Analyst (CFA) Program to conduct a comprehensive evaluation of ChatGPT and GPT-4 in financial analysis, considering Zero-Shot (ZS), Chain-of-Thought (CoT), and Few-Shot (FS) scenarios. We present an in-depth analysis of the models' performance and limitations, and estimate whether they would have a chance at passing the CFA exams. Finally, we outline insights into potential strategies and improvements to enhance the applicability of LLMs in finance. In this perspective, we hope this work paves the way for future studies to continue enhancing LLMs for financial reasoning through rigorous evaluation.

Computation and Language, Artificial Intelligence, General Finance

What problem does this paper attempt to address?

This paper evaluates how well large language models (LLMs) perform on financial analysis tasks, and in particular whether they can handle the complex financial reasoning required by the Chartered Financial Analyst (CFA) exams. The researchers use CFA mock exam questions to evaluate ChatGPT and GPT-4 under three prompting scenarios: Zero-Shot (ZS), Chain-of-Thought (CoT), and Few-Shot (FS). The main objectives of the study are:

1. **First comprehensive evaluation**: Evaluate the capabilities of ChatGPT and GPT-4 on financial reasoning problems, using CFA mock exam questions and covering all three prompting methods (ZS, CoT, and FS).
2. **Performance and limitation analysis**: Analyze in depth how the models perform and where they fall short on financial reasoning problems, and estimate their likelihood of passing the CFA Level 1 and Level 2 exams.
3. **Strategies and improvement suggestions**: Propose potential strategies and improvements to make LLMs more applicable in finance, and point to directions for future research and development.

Through this evaluation, the researchers aim to show the practical potential of LLMs in finance and to provide guidance for future improvements.
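The three prompting scenarios can be illustrated with a small sketch. This is not the authors' evaluation code: the `MCQ` class, the `build_prompt` and `accuracy` helpers, and the prompt wording are all assumptions made for illustration, and the call to the model itself is left out.

```python
from dataclasses import dataclass


@dataclass
class MCQ:
    """A multiple-choice question (hypothetical structure, not from the paper)."""
    question: str
    choices: dict[str, str]  # label -> option text, e.g. {"A": "...", "B": "..."}
    answer: str              # gold label, e.g. "B"


def format_question(q: MCQ) -> str:
    """Render the question stem followed by one line per labeled option."""
    lines = [q.question] + [f"{label}. {text}" for label, text in q.choices.items()]
    return "\n".join(lines)


def build_prompt(q: MCQ, mode: str, examples: tuple[MCQ, ...] = ()) -> str:
    """Build a prompt for one of the three scenarios: "ZS", "CoT", or "FS".

    The instruction wording below is an assumption, not the paper's prompt.
    """
    if mode not in {"ZS", "CoT", "FS"}:
        raise ValueError(f"unknown mode: {mode}")
    parts = []
    if mode == "FS":
        # Few-shot: prepend solved examples so the model can imitate the format.
        for ex in examples:
            parts.append(f"{format_question(ex)}\nAnswer: {ex.answer}")
    body = format_question(q)
    if mode == "CoT":
        # Chain-of-thought: ask for reasoning before the final letter.
        body += "\nThink step by step, then give the final answer as a single letter.\nReasoning:"
    else:
        # Zero-shot and few-shot: ask for the answer directly.
        body += "\nAnswer:"
    parts.append(body)
    return "\n\n".join(parts)


def accuracy(predictions: list[str], questions: list[MCQ]) -> float:
    """Fraction of predicted labels that match the gold labels (case-insensitive)."""
    if not questions:
        return 0.0
    correct = sum(p.strip().upper() == q.answer for p, q in zip(predictions, questions))
    return correct / len(questions)
```

Comparing `accuracy` across the three modes on the same question set would mirror the kind of comparison the summary describes, although the paper's actual prompts, answer parsing, and pass-rate estimates may differ.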

Is ChatGPT a Financial Expert? Evaluating Language Models on Financial Natural Language Processing

Financial Statement Analysis with Large Language Models

Revolutionizing Finance with LLMs: An Overview of Applications and Insights

Large language models in cryptocurrency securities cases: can a GPT model meaningfully assist lawyers?

LLM4DS: Evaluating Large Language Models for Data Science Code Generation

Beyond Classification: Financial Reasoning in State-of-the-Art Language Models

How Much Does ChatGPT Know about Finance?

GPT-3 Models are Few-Shot Financial Reasoners

Can ChatGPT Overcome Behavioral Biases in the Financial Sector? Classify-and-Rethink: Multi-Step Zero-Shot Reasoning in the Gold Investment

Efficiently Measuring the Cognitive Ability of LLMs: an Adaptive Testing Perspective

A Scoping Review of ChatGPT Research in Accounting and Finance

The model student: GPT-4 performance on graduate biomedical science exams

Large Language Models and Generative AI in Finance: An Analysis of ChatGPT, Bard, and Bing AI

Assessing Large Language Models in Mechanical Engineering Education: A Study on Mechanics-Focused Conceptual Understanding

GPT-3.5, GPT-4, or BARD? Evaluating LLMs Reasoning Ability in Zero-Shot Setting and Performance Boosting Through Prompts

FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for Large Language Models

Can LLMs be Good Financial Advisors?: An Initial Study in Personal Decision Making for Optimized Outcomes

The Promise and Peril of Generative AI: Evidence from GPT-4 as Sell-Side Analysts