human evaluation

Disentangling 20 years of confusion: the need for standards in human evaluation

Human assessment remains the most trusted form of evaluation in natural language generation, among other areas of NLP, but there is huge variation in both what is assessed and how it is assessed. We recently surveyed 20 years of publications in the NLG community to better understand this variation and conclude that we need to work together to develop clear standards for human evaluations.

Disentangling 20 years of confusion: quo vadis, human evaluation?

Twenty Years of Confusion in Human Evaluation: NLG Needs Evaluation Sheets and Standardised Definitions

Human assessment remains the most trusted form of evaluation in NLG, but highly diverse approaches and a proliferation of different quality criteria used by researchers make it difficult to compare results and draw conclusions across papers, with …

Disentangling the Properties of Human Evaluation Methods: A Classification System to Support Comparability, Meta-Evaluation and Reproducibility Testing

Current standards for designing and reporting human evaluations in NLP mean it is generally unclear which evaluations are comparable and can be expected to yield similar results when applied to the same system outputs. This has serious implications …