Selected Publications


Natural language generation (NLG) systems rely on corpora both for hand-crafted approaches in a traditional NLG architecture and for statistical end-to-end (learned) generation systems. Limitations in existing resources, however, make it difficult to develop systems that can vary the linguistic properties of an utterance as needed. For example, when users' attention is split between a linguistic task and a secondary task such as driving, a generation system may need to reduce the information density of an utterance to compensate for the reduction in user attention.

We introduce a new corpus in the restaurant recommendation and comparison domain, collected in a paraphrasing paradigm in which subjects wrote texts targeting either a general audience or an elderly family member. This design resulted in a corpus of more than 5000 texts that exhibit a variety of lexical and syntactic choices and differ with respect to average word and sentence length and surprisal. The corpus includes two levels of meaning representation: flat 'semantic stacks' for propositional content and Rhetorical Structure Theory (RST) relations between these propositions.
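The surface statistics mentioned above (average word length and average sentence length) are straightforward to compute. The sketch below is a minimal illustration, not the corpus pipeline itself; the tokenization and the example texts are assumptions for demonstration only.

```python
import re

def surface_stats(text):
    """Average word length (in characters) and average sentence length
    (in words) for one text -- rough proxies for the statistics by which
    the audience-targeted paraphrases differ."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    avg_word_len = sum(len(w) for w in words) / len(words)
    avg_sent_len = len(words) / len(sentences)
    return avg_word_len, avg_sent_len

# Hypothetical paraphrases of the same content for two audiences:
general = "The food is good. The service is bad."
print(surface_stats(general))  # (3.5, 4.0)
```

The same comparison run over paraphrase pairs would show how writers shorten sentences and simplify vocabulary when addressing a less attentive or elderly reader.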

While previous research on readability has typically focused on document-level measures, recent work in areas such as natural language generation has pointed out the need for sentence-level readability measures. Psycholinguistic research has long focused on processing measures that provide difficulty estimates on a word-by-word basis. However, these psycholinguistic measures have not yet been tested on sentence readability ranking tasks. In this paper, we use four psycholinguistic measures (idea density, surprisal, integration cost, and embedding depth) to test whether these features are predictive of readability levels. We find that psycholinguistic features significantly improve performance by up to 3 percentage points over a standard document-level readability metric baseline.
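Of the measures above, surprisal is the most directly computable: the surprisal of a word is its negative log probability in context, so rare or unexpected words score higher. The sketch below uses a unigram model as a stand-in for the actual language models used in the paper; the toy corpus and the function name are assumptions for illustration.

```python
import math
from collections import Counter

def unigram_surprisal(sentence, corpus_tokens):
    """Per-word surprisal -log2 P(w) under a unigram model estimated
    from corpus_tokens. Real surprisal models condition on context
    (n-grams, PCFGs, neural LMs); a unigram model is the simplest case."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return [(w, -math.log2(counts[w] / total)) for w in sentence]

corpus = ["the", "cat", "sat", "on", "the", "mat"]
for word, s in unigram_surprisal(["the", "cat"], corpus):
    print(word, round(s, 2))
```

Averaging (or summing) per-word surprisal over a sentence yields a sentence-level difficulty estimate, which is what makes such measures usable as features in a readability ranking model.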

Recent Publications


  • The Extended SPaRKy Restaurant Corpus: designing a corpus with variable information density

    2017. In INTERSPEECH (Conference proceedings)


  • Psycholinguistic Models of Sentence Processing Improve Sentence Readability Ranking

    2017. In EACL (Conference proceedings)


  • German morphosyntactic change is consistent with an optimal encoding hypothesis

    2017. At DGfS (Abstract)


Recent & Upcoming Talks

Recent Posts

OpenCCG is a Java library that can handle both parsing and generation. I’ve mostly used it for surface realization, converting fairly syntactic meaning representations into natural language text, but you can use it for parsing or for generation from higher-level semantic representations if you’d like. This tutorial is intended to help you start exploring OpenCCG with the tccg utility. If you haven’t installed OpenCCG yet, see the first post on Installing OpenCCG.


Installing OpenCCG can seem intimidating, but it’s not so bad, really. Here I’ve tried to reduce the README to the necessary details while providing a little bit of extra explanation when it seemed helpful.


My website was stagnant. I love Markdown. This looks like a good way to motivate myself to update regularly. And I even found a responsive theme!



Adapting generation to users under cognitive load

This is the natural language generation side of SFB 1102 project A4 ‘Language Comprehension and Cognitive Control Demands: Adapting Information Density to Changing Situations and Individual Users’.

GerMorphIT: Exploring German Morphology with Information Theory

Work with Cynthia A. Johnson and Rory Turnbull on understanding changes in the German adjectival system in terms of information theory.