Disentangling 20 years of confusion: quo vadis, human evaluation?

Crowdsourcing and evaluating text quality

Over the last decade, crowdsourcing has become a standard method for collecting training data for NLP tasks and for evaluating NLP systems on criteria such as text quality. Many evaluations, however, are still ill-defined. In the practical portion of this …