LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks
Anna Bavaresco, Raffaella Bernardi, Leonardo Bertolazzi, Desmond Elliott, Raquel Fernández, Albert Gatt, Esam Ghaleb, Mario Giulianelli, Michael Hanna, Alexander Koller, André F. T. Martins, Philipp Mondorf, Vera Neplenbroek, Sandro Pezzelle, Barbara Plank, David Schlangen, Alessandro Suglia, Aditya K Surikuchi, Ece Takmaz, Alberto Testoni
2024-06-26
Abstract: There is an increasing trend towards evaluating NLP models with LLM-generated judgments instead of human judgments. In the absence of a comparison against human data, this raises concerns about the validity of these evaluations; when they are conducted with proprietary models, it also raises concerns over reproducibility. We provide JUDGE-BENCH, a collection of 20 NLP datasets with human annotations, and comprehensively evaluate 11 current LLMs, covering both open-weight and proprietary models, for their ability to replicate the annotations. Our evaluations show that each LLM exhibits a large variance across datasets in its correlation with human judgments. We conclude that LLMs are not yet ready to systematically replace human judges in NLP.
Computation and Language