I work on linguistic complexity and natural language generation (NLG). This means I spend a lot of time thinking about adaptive generation, which requires NLG systems that can produce a variety of outputs suited to different levels of linguistic complexity.
MSc in Language Science & Technology, 2015
Universität des Saarlandes, Saarbrücken, Germany
AB in Linguistics; BS in Mathematics, 2010
University of Georgia, Athens, GA, USA
Neural natural language generation (NNLG) systems are known for their pathological outputs, i.e. generating text that is unrelated to the input specification. In this paper, we show the impact of semantic noise on state-of-the-art NNLG models which implement different semantic control mechanisms. We find that cleaned data can improve semantic correctness by up to 97%, while maintaining fluency. We also find that the most common error is omitting information, rather than hallucination.
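For illustration, omission errors of the kind described above can be approximated by checking whether each slot value of a meaning representation (MR) surfaces in the generated text. The Python sketch below is illustrative only and assumes a flat attribute-value MR; real slot-error scripts typically use per-slot patterns rather than exact substring matching.

```python
from typing import Dict, List

def find_omissions(mr: Dict[str, str], text: str) -> List[str]:
    """Return MR slots whose values do not appear in the generated text.

    Exact substring matching is a simplification; it misses paraphrases
    (e.g. "inexpensive" for priceRange=cheap) and delexicalized slots.
    """
    lowered = text.lower()
    return [slot for slot, value in mr.items()
            if value.lower() not in lowered]

mr = {"name": "The Eagle", "eatType": "coffee shop", "priceRange": "cheap"}
print(find_omissions(mr, "The Eagle is a coffee shop."))  # ['priceRange']
```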
Developing conventional natural language generation systems requires extensive effort from human experts to craft complex sets of sentence planning rules. We propose a Bayesian nonparametric approach to learning sentence planning rules by inducing synchronous tree substitution grammars for pairs of text plans and morphosyntactically-specified dependency trees. Our system learns rules that can be used to generate novel texts after training on small datasets.
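As a rough illustration of what an induced synchronous rule might look like, the sketch below pairs an elementary tree over a text plan with an elementary tree over a dependency structure, with linked substitution sites. The class names, node labels, and linking scheme are hypothetical, not the paper's actual rule format.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Tree:
    label: str
    children: Tuple["Tree", ...] = ()

@dataclass(frozen=True)
class STSGRule:
    source: Tree                        # elementary tree over the text plan
    target: Tree                        # elementary tree over the dependency tree
    links: Tuple[Tuple[int, int], ...]  # aligned substitution sites (source, target)

# A hypothetical rule realizing a RECOMMEND proposition as a clause headed
# by "serves", with linked substitution sites (*) for restaurant and food:
rule = STSGRule(
    source=Tree("RECOMMEND", (Tree("RESTAURANT*"), Tree("FOOD*"))),
    target=Tree("serves/VBZ", (Tree("nsubj*"), Tree("obj*"))),
    links=((0, 0), (1, 1)),
)
print(rule.source.children[0].label)  # RESTAURANT*
```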
Natural language generation (NLG) systems rely on corpora both for hand-crafted approaches in the traditional NLG architecture and for statistical end-to-end (learned) generation systems. Limitations in existing resources, however, make it difficult to develop systems which can vary the linguistic properties of an utterance as needed. For example, when users’ attention is split between a linguistic task and a secondary task such as driving, a generation system may need to reduce the information density of an utterance to compensate for the reduction in user attention. We introduce a new corpus in the restaurant recommendation and comparison domain, collected in a paraphrasing paradigm in which subjects wrote texts targeting either a general audience or an elderly family member. This design resulted in a corpus of more than 5,000 texts which exhibit a variety of lexical and syntactic choices and differ with respect to average word and sentence length and surprisal. The corpus includes two levels of meaning representation: flat ‘semantic stacks’ for propositional content and Rhetorical Structure Theory (RST) relations between these propositions.
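The sketch below shows one hypothetical way such a two-level annotation could be represented for a single text; the field names and the stack notation are illustrative, not the corpus's released schema.

```python
# Hypothetical two-level annotation for one corpus text: flat 'semantic
# stacks' for propositional content, plus RST relations linking the
# propositions. Names and notation are illustrative only.
example = {
    "text": "Brown's is an Italian restaurant near the river.",
    "semantic_stacks": {
        "p1": ["inform", "name", "Brown's"],
        "p2": ["inform", "food", "Italian"],
        "p3": ["inform", "area", "riverside"],
    },
    "rst_relations": [
        ("elaboration", "p1", "p2"),  # (relation, nucleus, satellite)
        ("elaboration", "p1", "p3"),
    ],
}
```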
Over the last decade, crowdsourcing has become a standard method for collecting training data for NLP tasks and evaluating NLP systems …
Data-driven natural language generation (NLG) is not a new concept. For decades, researchers have been studying corpora to inform their …