Data-driven natural language generation (NLG) is not a new concept. For decades, researchers have studied corpora to inform the development of NLG systems. More recently, this interest has shifted away from rule-based systems toward fully end-to-end statistical & neural systems.
In this talk I will present an approach which fills the gap between these extremes by leveraging Bayesian non-parametric methods to learn sentence planning rules for use in a conventional NLG pipeline. This system learns sentence planning rules mapping (1) tree-structured text plans encoding propositions & rhetorical relations to (2) tree-structured morphosyntactic representations which serve as input to an existing surface realizer.
We train the system on two corpora: the SPaRKy Restaurant Corpus (Walker et al. 2007) and the Extended SPaRKy Restaurant Corpus (Howcroft et al. 2017). The former provides a convenient testbed, as it contains the output of an earlier rule-based NLG system with limited lexical and syntactic variation. The latter addresses these shortcomings and is moreover designed to contain texts which vary with respect to their linguistic complexity.
As our evaluation is ongoing, I will conclude by presenting examples of the rules the system learns and the texts it is able to generate, before opening the discussion to possible and planned extensions.