Workshops, etc., on Evaluating Text Quality
Over the past year (writing in April 2020) I have become increasingly interested in how we conduct evaluations in the NLG community. We generally agree that human evaluation is the best tool we’ve got, but we don’t even agree on what we want to measure (e.g., clarity, readability, fluency, naturalness; adequacy, informativeness, truthfulness). As part of my efforts to dig into this topic with Verena Rieser, I’ve been learning more about the different workshops and tasks that have been run over the years.
This page serves as an archive of these resources, primarily for myself, but it is stored publicly in case it is helpful for others as well. The list is (necessarily) incomplete, since I am adding things as I remember them, so if something important is missing, please let me know!
Unless otherwise specified, these focus on the evaluation of text or speech in a dialect of English.
(Emphasis in quotes added by me.)
Current Events
- CHAVAL is a JSALT summer workshop aimed at both NLU and automated evaluation for open-domain spoken dialogue systems. (NB: I am involved with this.)
- Eval4NLP focuses on “the [design] of adequate metrics for evaluating performance in high-level text generation tasks…; properly evaluating word and sentence embeddings; and rigorously determining whether and under which conditions one system is better than another; etc.”
- eval.gen.chal is a mailing list originating in the SIGGEN community following INLG 2019 which brings together folks interested in organizing Evaluating Generation Challenges (the abbreviation’s a bit wonky…). I am leading this group, and we’d love to have more active participants working to make NLG evaluation more consistent and meaningful.
- EVALITA “promote[s] the development of language and speech technologies for the Italian language, providing a shared framework where different systems and approaches can be evaluated in a consistent manner.”
Past Events
- ConvAI
- WMT tasks
- the Document Understanding Conferences, which organized shared tasks that included an element of text quality assessment:
  - The 2001 evaluation protocol (still requires a password to access)
  - The 2002 human evaluation protocol
  - Task description with definitions and links to the human evaluation protocol from 2003
  - Quality Questions from 2004
  - Quality Questions from 2005
  - Quality Questions from 2006
  - Quality Questions from 2007
Datasets
- NLG system outputs with evaluation data
- WMT & GEC post-edit data
- Data from digital humanities or survey & information delivery design?
Other Resources
- Emiel van Miltenburg sketched a proposal for a shared task on NLG evaluation
- This project on evaluating artificial social agents addresses related issues, in addition to a number of HRI dimensions that are not relevant to text quality.