I primarily work on natural language generation (NLG), developing new machine learning methods for data-to-text generation, collecting corpora to train systems, and conducting human evaluations (usually through crowdsourcing). I’m also interested in psycholinguistics, linguistic representations, and Bayesian statistics. For fun I like to sing 🎤🎶 and play video games 🎮🖥️.
In order for computers to produce natural language texts from non-linguistic information, we need a way of mapping between the two: a Natural Language Generation (NLG) system. We can reduce the difficulty of developing such systems by leveraging machine learning intelligently. While there are many possible approaches to the task, this thesis argues for one in particular, focusing on sentence planning using synchronous grammars and Bayesian nonparametric methods.
We formulate sentence planning rules in terms of Synchronous Tree Substitution Grammars (sTSGs) and implement a series of hierarchical Dirichlet processes along with a Gibbs sampler to learn such rules from appropriate corpora. Because existing corpora do not pair hierarchical, discourse-structured meaning representations with varied texts, we develop a new interface for crowdsourcing training corpora for NLG systems by asking participants to paraphrase pre-existing texts, and we collect a new corpus, which we call the Extended SPaRKy Restaurant Corpus (ESRC).
After training our models on pre-existing, lexically-restricted corpora as well as the ESRC, we conduct a series of human evaluations using a novel evaluation interface. This interface enables the assessment of fluency, semantic fidelity, and the expression of discourse relations in a text within a single crowdsourcing experiment. While we identify several limitations of our approach, the evaluations suggest that our models can outperform existing neural network models with respect to semantic fidelity and, in some cases, maintain similar levels of fluency.
We also present a Dependency Attachment Grammar (DAG) based on Joshi & Rambow (2003) and extend this grammar to the synchronous setting so that future work can build upon its added flexibility relative to sTSGs. Alongside these practically-oriented efforts, we explore how humans vary in adapting their utterances to listeners under cognitive load through a psycholinguistic study.
This thesis opens up several directions for future research into how best to integrate the various challenging tasks involved in natural language generation and how best to evaluate NLG systems.
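As a rough illustration of the Bayesian nonparametric machinery mentioned in the abstract above, the sketch below shows the Chinese restaurant process view of a Dirichlet process prior, the basic building block that the hierarchical models place over sTSG rules. This is a toy sketch in Python: the concentration parameter, the `crp_assignments` helper, and the use of integer "tables" as stand-ins for grammar rules are illustrative assumptions, not the thesis implementation.

```python
# Minimal sketch of the Chinese restaurant process (CRP) view of a Dirichlet
# process prior. Tables play the role of rule clusters; alpha is an
# illustrative concentration parameter, not a value from the thesis.
import random
from collections import Counter

def crp_assignments(n_items, alpha=1.0, seed=0):
    """Sequentially seat items at 'tables' (rule clusters) under a CRP."""
    rng = random.Random(seed)
    assignments = []
    counts = Counter()
    for _ in range(n_items):
        # Existing tables are chosen proportional to their current counts;
        # a new table is opened with probability proportional to alpha.
        tables = list(counts)
        weights = [counts[t] for t in tables] + [alpha]
        choice = rng.choices(tables + [len(tables)], weights=weights)[0]
        assignments.append(choice)
        counts[choice] += 1
    return assignments

print(crp_assignments(20))  # e.g. [0, 0, 1, 0, 2, ...]
```

In the hierarchical setting, a Gibbs sampler repeatedly reassigns observations (here, derivation fragments) to clusters under priors of this kind, which is what allows the model to induce an open-ended inventory of rules rather than fixing its size in advance.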
Many domains and tasks in natural language generation (NLG) are inherently ‘low-resource’, where training data, tools and linguistic analyses are scarce. This poses a particular challenge to researchers and system developers in the era of machine-learning-driven NLG. In this position paper, we first present the challenges researchers and developers often encounter when dealing with low-resource settings in NLG. We then argue that it is unsustainable to collect large aligned datasets or build large language models from scratch for every possible domain due to cost, labour, and time constraints, so researching and developing methods and resources for low-resource settings is vital. We then discuss current approaches to low-resource NLG, followed by proposed solutions and promising avenues for future work in NLG for low-resource settings.
Previous work has shown that human evaluations in NLP are notoriously underpowered. Here, we argue that there are two common factors which make this problem even worse: NLP studies usually (a) treat ordinal data as interval data and (b) operate under high-variance settings while the differences they hope to detect are often subtle. We demonstrate through simulation that ordinal mixed effects models are better able to detect small differences between models, especially in the high-variance settings common in evaluations of generated texts. We release tools for researchers to conduct their own power analysis and test their assumptions. We also make recommendations for improving statistical power.
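To make the ordinal-vs-interval point concrete, here is a minimal simulation-based power analysis sketch in Python. It is not the released toolkit: for simplicity it uses a plain ordinal probit model (statsmodels' `OrderedModel`) with a system indicator rather than a full ordinal mixed effects model, and the cutpoints, effect size, and sample sizes are illustrative assumptions.

```python
# Toy power simulation: 5-point Likert ratings for two systems separated by a
# small latent shift, analysed (a) as interval data with a t-test and (b) with
# an ordinal probit model. All settings here are illustrative placeholders.
import numpy as np
from scipy import stats
from statsmodels.miscmodels.ordinal_model import OrderedModel

def simulate_ratings(n, latent_shift, cutpoints, rng):
    """Draw 5-point ratings by thresholding a latent normal variable."""
    latent = rng.normal(loc=latent_shift, scale=1.0, size=n)
    return np.digitize(latent, cutpoints)  # values 0..4

def run_power_sim(n_per_system=100, shift=0.3, n_sims=500, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    cutpoints = np.array([-1.5, -0.5, 0.5, 1.5])
    hits_ttest, hits_ordinal = 0, 0
    for _ in range(n_sims):
        a = simulate_ratings(n_per_system, 0.0, cutpoints, rng)
        b = simulate_ratings(n_per_system, shift, cutpoints, rng)
        # (a) interval treatment: two-sample t-test on the raw ratings
        if stats.ttest_ind(a, b).pvalue < alpha:
            hits_ttest += 1
        # (b) ordinal treatment: probit regression on a system indicator
        y = np.concatenate([a, b])
        x = np.concatenate([np.zeros(n_per_system), np.ones(n_per_system)])
        res = OrderedModel(y, x[:, None], distr="probit").fit(
            method="bfgs", disp=False)
        if res.pvalues[0] < alpha:
            hits_ordinal += 1
    # Estimated power = proportion of simulations detecting the difference
    return hits_ttest / n_sims, hits_ordinal / n_sims

if __name__ == "__main__":
    print(run_power_sim())
```

Running many such simulations at different sample sizes and effect sizes is the basic recipe behind simulation-based power analysis; a mixed effects variant would additionally include per-rater and per-item random effects.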
Human assessment remains the most trusted form of evaluation in NLG, but highly diverse approaches and a proliferation of different quality criteria used by researchers make it difficult to compare results and draw conclusions across papers, with adverse implications for meta-evaluation and reproducibility. In this paper, we present (i) our dataset of 165 NLG papers with human evaluations, (ii) the annotation scheme we developed to label the papers for different aspects of evaluations, (iii) quantitative analyses of the annotations, and (iv) a set of recommendations for improving standards in evaluation reporting. We use the annotations as a basis for examining information included in evaluation reports, and levels of consistency in approaches, experimental design and terminology, focusing in particular on the 200+ different terms that have been used for evaluated aspects of quality. We conclude that due to a pervasive lack of clarity in reports and extreme diversity in approaches, human evaluation in NLG presents as extremely confused in 2020, and that the field is in urgent need of standard methods and terminology.
Current standards for designing and reporting human evaluations in NLP mean it is generally unclear which evaluations are comparable and can be expected to yield similar results when applied to the same system outputs. This has serious implications for reproducibility testing and meta-evaluation, in particular given that human evaluation is considered the gold standard against which the trustworthiness of automatic metrics is gauged. Using examples from NLG, we propose a classification system for evaluations based on disentangling (i) what is being evaluated (which aspect of quality), and (ii) how it is evaluated, in terms of specific (a) evaluation modes and (b) experimental designs. We show that this approach provides a basis for determining comparability, and hence for comparing evaluations across papers, meta-evaluation experiments, and reproducibility testing.
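A purely hypothetical sketch of how such a classification could be encoded when annotating evaluations is shown below; the field names, example values, and the `comparable_to` helper are my own illustration, not the paper's actual scheme or terminology.

```python
# Hypothetical encoding of an evaluation along the three dimensions discussed
# above: what is evaluated, the evaluation mode, and the experimental design.
from dataclasses import dataclass

@dataclass(frozen=True)
class EvaluationRecord:
    quality_criterion: str    # (i) what is evaluated, e.g. "fluency"
    evaluation_mode: str      # (ii.a) e.g. "intrinsic, absolute"
    experimental_design: str  # (ii.b) e.g. "5-point Likert, 3 raters per item"

    def comparable_to(self, other: "EvaluationRecord") -> bool:
        """Two evaluations are (roughly) comparable if all dimensions match."""
        return (self.quality_criterion == other.quality_criterion
                and self.evaluation_mode == other.evaluation_mode
                and self.experimental_design == other.experimental_design)

a = EvaluationRecord("fluency", "intrinsic, absolute", "5-point Likert, 3 raters")
b = EvaluationRecord("fluency", "intrinsic, absolute", "rank 2 outputs")
print(a.comparable_to(b))  # False: same criterion, different design
```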
Over the last decade, crowdsourcing has become a standard method for collecting training data for NLP tasks and evaluating NLP systems …
Postdoc on the NLG for Low-Resource Domains project.
Postdoc on the Madrigal project.
Student researcher on SFB 1102 Project A4 and in the LSV.