David M. Howcroft
human evaluation
What happens if you treat ordinal ratings as interval data? Human evaluations in NLP are even more under-powered than you think
Previous work has shown that human evaluations in NLP are notoriously under-powered. Here, we argue that there are two common factors …
David M. Howcroft
,
Verena Rieser
PDF
Code
DOI
Disentangling 20 years of confusion: the need for standards in human evaluation
Human assessment remains the most trusted form of evaluation in natural language generation, among other areas of NLP, but there is huge variation in terms of both what is assessed and how it is assessed.
9 Jul 2021
Disentangling 20 years of confusion: quo vadis, human evaluation?
29 Mar 2021
Twenty Years of Confusion in Human Evaluation: NLG Needs Evaluation Sheets and Standardised Definitions
Human assessment remains the most trusted form of evaluation in NLG, but highly diverse approaches and a proliferation of different …
David M. Howcroft
,
Anya Belz
,
Miruna Clinciu
,
Dimitra Gkatzia
,
Sadid A. Hasan
,
Saad Mahamood
,
Simon Mille
,
Emiel van Miltenburg
,
Sashank Santhanam
,
Verena Rieser
PDF
Dataset
Slides
ACL Anthology
Disentangling the Properties of Human Evaluation Methods: A Classification System to Support Comparability, Meta-Evaluation and Reproducibility Testing
Current standards for designing and reporting human evaluations in NLP mean it is generally unclear which evaluations are comparable …
Anya Belz
,
Simon Mille
,
David M. Howcroft
PDF
ACL Anthology
Cite