LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks

LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks. Bavaresco, A., Bernardi, R., Bertolazzi, L., Elliott, D., Fernández, R., Gatt, A., Ghaleb, E., Giulianelli, M., Hanna, M., Koller, A., Martins, A. F. T., Mondorf, P., Neplenbroek, V., Pezzelle, S., Plank, B., Schlangen, D., Suglia, A., Surikuchi, A. K., Takmaz, E., & Testoni, A. In Proceedings of the 63rd Conference of the Association for Computational Linguistics (ACL 2025), 2025.

There is an increasing trend towards evaluating NLP models with LLM-generated judgments instead of human judgments. In the absence of a comparison against human data, this raises concerns about the validity of these evaluations; in case they are conducted with proprietary models, this also raises concerns over reproducibility. We provide JUDGE-BENCH, a collection of 20 NLP datasets with human annotations, and comprehensively evaluate 11 current LLMs, covering both open-weight and proprietary models, for their ability to replicate the annotations. Our evaluations show that each LLM exhibits a large variance across datasets in its correlation to human judgments. We conclude that LLMs are not yet ready to systematically replace human judges in NLP.

@inproceedings{judgebench-acl-2025,
  title={LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks},
  author={Anna Bavaresco and Raffaella Bernardi and Leonardo Bertolazzi and Desmond Elliott and Raquel Fern\'{a}ndez and Albert Gatt and Esam Ghaleb and Mario Giulianelli and Michael Hanna and Alexander Koller and Andr\'{e} F. T. Martins and Philipp Mondorf and Vera Neplenbroek and Sandro Pezzelle and Barbara Plank and David Schlangen and Alessandro Suglia and Aditya K. Surikuchi and Ece Takmaz and Alberto Testoni},
  year={2025},
  booktitle={Proceedings of the 63rd Conference of the Association for Computational Linguistics (ACL 2025)},
  url={https://aclanthology.org/2025.acl-short.20/},
  url_github={https://github.com/dmg-illc/JUDGE-BENCH},
  abstract={There is an increasing trend towards evaluating NLP models with LLM-generated judgments instead of human judgments. 
In the absence of a comparison against human data, this raises concerns about the validity of these evaluations; 
in case they are conducted with proprietary models, this also raises concerns over reproducibility. We provide JUDGE-BENCH, 
a collection of 20 NLP datasets with human annotations, and comprehensively evaluate 11 current LLMs, covering both open-weight 
and proprietary models, for their ability to replicate the annotations. Our evaluations show that each LLM exhibits a large 
variance across datasets in its correlation to human judgments. We conclude that LLMs are not yet ready to systematically 
replace human judges in NLP.}
}
