ChatGPT may Pass the Bar Exam soon, but has a Long Way to Go for the LexGLUE benchmark

Ilias Chalkidis
2023-03-10
Abstract: Following the hype around OpenAI's ChatGPT conversational agent, the last straw in the recent development of Large Language Models (LLMs) that demonstrate emergent unprecedented zero-shot capabilities, we audit the latest OpenAI's GPT-3.5 model, `gpt-3.5-turbo`, the first available ChatGPT model, in the LexGLUE benchmark in a zero-shot fashion providing examples in a templated instruction-following format. The results indicate that ChatGPT achieves an average micro-F1 score of 47.6% across LexGLUE tasks, surpassing the baseline guessing rates. Notably, the model performs exceptionally well in some datasets, achieving micro-F1 scores of 62.8% and 70.2% in the ECtHR B and LEDGAR datasets, respectively. The code base and model predictions are available for review at https://github.com/coastalcph/zeroshot_lexglue.
Computation and Language
What problem does this paper attempt to address?
The paper evaluates ChatGPT (specifically `gpt-3.5-turbo`), a recent large language model (LLM), on legal text classification. The researchers test the model on the LexGLUE benchmark in zero-shot and few-shot settings. ChatGPT does comparatively well on some datasets, reaching micro-F1 scores of 62.8% on ECtHR B and 70.1% on LEDGAR. Overall, however, it still scores significantly lower than smaller fine-tuned models, which suggests that although it has some legal knowledge, it should be used with caution in practical applications. The study also finds that improving the instruction template raises performance slightly but leaves a considerable gap to fine-tuned models. In summary, without specialized training, ChatGPT's performance on legal text classification is not yet sufficient for production environments.
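To make the evaluation setup more concrete, here is a minimal Python sketch of the two ingredients the summary mentions: a templated instruction prompt and micro-F1 scoring. The label names, the template wording, and the helper names (`build_prompt`, `parse_answer`, `micro_f1`) are assumptions chosen for illustration, not the prompts or code used in the paper, and the model replies are hard-coded stand-ins rather than real API output.

```python
# Hypothetical label subset and template; NOT the wording used in the paper.
LABELS = ["Governing Law", "Termination", "Confidentiality"]

TEMPLATE = (
    "Given the following contract provision:\n"
    '"{text}"\n'
    "Which category best describes it? Options: {options}.\n"
    "Answer with one option only."
)


def build_prompt(text, labels=LABELS):
    """Fill the instruction template with the input text and label options."""
    return TEMPLATE.format(text=text, options=", ".join(labels))


def parse_answer(reply, labels=LABELS):
    """Map a free-text reply to the first known label it mentions, else None."""
    lowered = reply.lower()
    for label in labels:
        if label.lower() in lowered:
            return label
    return None


def micro_f1(gold, predicted):
    """Micro-F1 over per-example label sets (single- or multi-label)."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        g, p = set(g), set(p)
        tp += len(g & p)   # labels predicted and correct
        fp += len(p - g)   # labels predicted but wrong
        fn += len(g - p)   # gold labels that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Stand-in replies; a real run would send build_prompt(...) to the model.
gold = [{"Termination"}, {"Governing Law"}, {"Confidentiality"}]
replies = ["The category is Termination.", "Confidentiality", "I am not sure."]
predicted = [{parse_answer(r)} - {None} for r in replies]
score = micro_f1(gold, predicted)
```

Because micro-F1 pools true positives, false positives, and false negatives across all examples before computing precision and recall, an unparseable reply (the third one here) counts as a missed label and does not simply drop out of the score.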