David M. Howcroft
human evaluation
What happens if you treat ordinal ratings as interval data? Human evaluations in NLP are even more under-powered than you think
Previous work has shown that human evaluations in NLP are notoriously under-powered. Here, we argue that there are two common factors …
David M. Howcroft
,
Verena Rieser
PDF
Code
DOI
Disentangling 20 years of confusion: the need for standards in human evaluation
Human assessment remains the most trusted form of evaluation in natural language generation, among other areas of NLP, but there is huge variation in terms of both what is assessed and how it is assessed.
9 Jul 2021
Disentangling 20 years of confusion: quo vadis, human evaluation?
29 Mar 2021
Twenty Years of Confusion in Human Evaluation: NLG Needs Evaluation Sheets and Standardised Definitions
Human assessment remains the most trusted form of evaluation in NLG, but highly diverse approaches and a proliferation of different …
David M. Howcroft
,
Anya Belz
,
Miruna Clinciu
,
Dimitra Gkatzia
,
Sadid A. Hasan
,
Saad Mahamood
,
Simon Mille
,
Emiel van Miltenburg
,
Sashank Santhanam
,
Verena Rieser
PDF
Dataset
Slides
ACL Anthology
Disentangling the Properties of Human Evaluation Methods: A Classification System to Support Comparability, Meta-Evaluation and Reproducibility Testing
Current standards for designing and reporting human evaluations in NLP mean it is generally unclear which evaluations are comparable …
Anya Belz
,
Simon Mille
,
David M. Howcroft
PDF
ACL Anthology
Cite