Workshops, etc., on Evaluating Text Quality
Over the past year (writing in April 2020) I have become increasingly interested in how we conduct evaluations in the NLG community. We generally agree that human evaluation is the best tool we’ve got, but we don’t even agree on what we want to measure (e.g., clarity, readability, fluency, naturalness; adequacy, informativeness, truthfulness). As part of my efforts to dig into this topic with Verena Rieser, I’ve been learning more about the different workshops and tasks that have been run over the years.
This page serves as an archive of these resources, primarily for myself, but it is stored publicly in case it is helpful for others as well. The list is (necessarily) incomplete, since I am adding things as I remember them, so if something important is missing, please let me know!
Unless otherwise specified, these focus on the evaluation of text or speech in a dialect of English.
(Emphasis in quotes added by me.)
Current Events
- CHAVAL is a JSALT summer workshop aimed at both NLU and automated evaluation for open-domain spoken dialogue systems. (NB: I am involved with this.)
- Eval4NLP focuses on “the [design] of adequate metrics for evaluating performance in high-level text generation tasks…; properly evaluating word and sentence embeddings; and rigorously determining whether and under which conditions one system is better than another; etc.”
- eval.gen.chal is a mailing list originating in the SIGGEN community following INLG 2019 which brings together folks interested in organizing Evaluating Generation Challenges (the abbreviation’s a bit wonky…). I am leading this group, and we’d love to have more active participants working to make NLG evaluation more consistent and meaningful.
- EVALITA “promote[s] the development of language and speech technologies for the Italian language, providing a shared framework where different systems and approaches can be evaluated in a consistent manner.”
Past Events
- ConvAI
- WMT tasks
- the Document Understanding Conferences, which organized shared tasks that included an element of text quality assessment:
  - The 2001 evaluation protocol (still requires a password to access)
  - The 2002 human evaluation protocol
  - Task description with definitions and links to the human evaluation protocol from 2003
  - Quality Questions from 2004
  - Quality Questions from 2005
  - Quality Questions from 2006
  - Quality Questions from 2007
Datasets
- NLG system outputs with evaluation data
- WMT & GEC post-edit data
- Data from digital humanities or survey & information delivery design?
Other Resources
- Emiel van Miltenburg sketched a proposal for a shared task on NLG evaluation
- This project on evaluating artificial social agents addresses related issues, in addition to a number of HRI dimensions that are not relevant to text quality.