I primarily work on natural language generation (NLG), developing new machine learning methods for data-to-text generation, collecting corpora to train systems, and conducting human evaluations (usually through crowdsourcing). I’m also interested in psycholinguistics, linguistic representations, and Bayesian statistics. For fun I like to sing 🎤🎶 and play video games 🎮🖥️.
In order for computers to produce natural language texts from non-linguistic information, we need a way of mapping between the two: a Natural Language Generation (NLG) system. We can reduce the difficulty of developing such systems by leveraging machine learning intelligently. While there are many possible approaches to the task, this thesis argues for one in particular, focusing on sentence planning using synchronous grammars and Bayesian nonparametric methods.
We formulate sentence planning rules in terms of Synchronous Tree Substitution Grammars (sTSGs) and implement a series of hierarchical Dirichlet processes along with a Gibbs sampler to learn such rules from appropriate corpora. Because existing corpora do not pair hierarchical, discourse-structured meaning representations with varied texts, we develop a new interface for crowdsourcing training corpora for NLG systems by asking participants to paraphrase pre-existing texts, and we collect a new corpus, which we call the Extended SPaRKy Restaurant Corpus (ESRC).
After training our models on pre-existing, lexically-restricted corpora as well as the ESRC, we conduct a series of human evaluations using a novel evaluation interface. This interface enables the assessment of fluency, semantic fidelity, and the expression of discourse relations in a text within a single crowdsourcing experiment. While we identify several limitations of our approach, the evaluations suggest that our models can outperform existing neural network models with respect to semantic fidelity and, in some cases, maintain similar levels of fluency.
We also present a Dependency Attachment Grammar (DAG) based on Joshi & Rambow (2003) and extend this grammar to the synchronous setting so that future work can build upon its added flexibility relative to sTSGs. Alongside these practically-oriented efforts, we explore how humans vary in adapting their utterances to listeners under cognitive load through a psycholinguistic study.
This thesis opens up several directions for future research into how best to integrate the various challenging tasks involved in natural language generation and how best to evaluate NLG systems.
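As a rough illustration of the Bayesian nonparametric machinery mentioned in the abstract above, the sketch below shows the Chinese restaurant process view of a Dirichlet process prior, the basic building block that the hierarchical models place over sTSG rules. This is a toy sketch in Python: the concentration parameter, the `crp_assignments` helper, and the use of integer "tables" as stand-ins for grammar rules are illustrative assumptions, not the thesis implementation.

```python
# Minimal sketch of the Chinese restaurant process (CRP) view of a Dirichlet
# process prior. Tables play the role of rule clusters; alpha is an
# illustrative concentration parameter, not a value from the thesis.
import random
from collections import Counter

def crp_assignments(n_items, alpha=1.0, seed=0):
    """Sequentially seat items at 'tables' (rule clusters) under a CRP."""
    rng = random.Random(seed)
    assignments = []
    counts = Counter()
    for _ in range(n_items):
        # Existing tables are chosen proportional to their current counts;
        # a new table is opened with probability proportional to alpha.
        tables = list(counts)
        weights = [counts[t] for t in tables] + [alpha]
        choice = rng.choices(tables + [len(tables)], weights=weights)[0]
        assignments.append(choice)
        counts[choice] += 1
    return assignments

print(crp_assignments(20))  # e.g. [0, 0, 1, 0, 2, ...]
```

In the hierarchical setting, a Gibbs sampler repeatedly reassigns observations (here, derivation fragments) to clusters under priors of this kind, which is what allows the model to induce an open-ended inventory of rules rather than fixing its size in advance.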
Many domains and tasks in natural language generation (NLG) are inherently ‘low-resource’, where training data, tools and linguistic analyses are scarce. This poses a particular challenge to researchers and system developers in the era of machine-learning-driven NLG. In this position paper, we first present the challenges researchers and developers often encounter when dealing with low-resource settings in NLG. We then argue that it is unsustainable to collect large aligned datasets or build large language models from scratch for every possible domain due to cost, labour, and time constraints, so researching and developing methods and resources for low-resource settings is vital. We then discuss current approaches to low-resource NLG, followed by proposed solutions and promising avenues for future work in NLG for low-resource settings.
Previous work has shown that human evaluations in NLP are notoriously underpowered. Here, we argue that there are two common factors which make this problem even worse: NLP studies usually (a) treat ordinal data as interval data and (b) operate under high-variance settings while the differences they hope to detect are often subtle. We demonstrate through simulation that ordinal mixed effects models are better able to detect small differences between models, especially in the high-variance settings common in evaluations of generated texts. We release tools for researchers to conduct their own power analysis and test their assumptions. We also make recommendations for improving statistical power.
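To make the ordinal-vs-interval point concrete, here is a minimal simulation-based power analysis sketch in Python. It is not the released toolkit: for simplicity it uses a plain ordinal probit model (statsmodels' `OrderedModel`) with a system indicator rather than a full ordinal mixed effects model, and the cutpoints, effect size, and sample sizes are illustrative assumptions.

```python
# Toy power simulation: 5-point Likert ratings for two systems separated by a
# small latent shift, analysed (a) as interval data with a t-test and (b) with
# an ordinal probit model. All settings here are illustrative placeholders.
import numpy as np
from scipy import stats
from statsmodels.miscmodels.ordinal_model import OrderedModel

def simulate_ratings(n, latent_shift, cutpoints, rng):
    """Draw 5-point ratings by thresholding a latent normal variable."""
    latent = rng.normal(loc=latent_shift, scale=1.0, size=n)
    return np.digitize(latent, cutpoints)  # values 0..4

def run_power_sim(n_per_system=100, shift=0.3, n_sims=500, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    cutpoints = np.array([-1.5, -0.5, 0.5, 1.5])
    hits_ttest, hits_ordinal = 0, 0
    for _ in range(n_sims):
        a = simulate_ratings(n_per_system, 0.0, cutpoints, rng)
        b = simulate_ratings(n_per_system, shift, cutpoints, rng)
        # (a) interval treatment: two-sample t-test on the raw ratings
        if stats.ttest_ind(a, b).pvalue < alpha:
            hits_ttest += 1
        # (b) ordinal treatment: probit regression on a system indicator
        y = np.concatenate([a, b])
        x = np.concatenate([np.zeros(n_per_system), np.ones(n_per_system)])
        res = OrderedModel(y, x[:, None], distr="probit").fit(
            method="bfgs", disp=False)
        if res.pvalues[0] < alpha:
            hits_ordinal += 1
    # Estimated power = proportion of simulations detecting the difference
    return hits_ttest / n_sims, hits_ordinal / n_sims

if __name__ == "__main__":
    print(run_power_sim())
```

Running many such simulations at different sample sizes and effect sizes is the basic recipe behind simulation-based power analysis; a mixed effects variant would additionally include per-rater and per-item random effects.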
Human assessment remains the most trusted form of evaluation in NLG, but highly diverse approaches and a proliferation of different quality criteria used by researchers make it difficult to compare results and draw conclusions across papers, with adverse implications for meta-evaluation and reproducibility. In this paper, we present (i) our dataset of 165 NLG papers with human evaluations, (ii) the annotation scheme we developed to label the papers for different aspects of evaluations, (iii) quantitative analyses of the annotations, and (iv) a set of recommendations for improving standards in evaluation reporting. We use the annotations as a basis for examining information included in evaluation reports, and levels of consistency in approaches, experimental design and terminology, focusing in particular on the 200+ different terms that have been used for evaluated aspects of quality. We conclude that due to a pervasive lack of clarity in reports and extreme diversity in approaches, human evaluation in NLG presents as extremely confused in 2020, and that the field is in urgent need of standard methods and terminology.
Current standards for designing and reporting human evaluations in NLP mean it is generally unclear which evaluations are comparable and can be expected to yield similar results when applied to the same system outputs. This has serious implications for reproducibility testing and meta-evaluation, in particular given that human evaluation is considered the gold standard against which the trustworthiness of automatic metrics is gauged. Using examples from NLG, we propose a classification system for evaluations based on disentangling (i) what is being evaluated (which aspect of quality), and (ii) how it is evaluated, in terms of specific (a) evaluation modes and (b) experimental designs. We show that this approach provides a basis for determining comparability, and hence for comparing evaluations across papers, meta-evaluation experiments, and reproducibility testing.
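A purely hypothetical sketch of how such a classification could be encoded when annotating evaluations is shown below; the field names, example values, and the `comparable_to` helper are my own illustration, not the paper's actual scheme or terminology.

```python
# Hypothetical encoding of an evaluation along the three dimensions discussed
# above: what is evaluated, the evaluation mode, and the experimental design.
from dataclasses import dataclass

@dataclass(frozen=True)
class EvaluationRecord:
    quality_criterion: str    # (i) what is evaluated, e.g. "fluency"
    evaluation_mode: str      # (ii.a) e.g. "intrinsic, absolute"
    experimental_design: str  # (ii.b) e.g. "5-point Likert, 3 raters per item"

    def comparable_to(self, other: "EvaluationRecord") -> bool:
        """Two evaluations are (roughly) comparable if all dimensions match."""
        return (self.quality_criterion == other.quality_criterion
                and self.evaluation_mode == other.evaluation_mode
                and self.experimental_design == other.experimental_design)

a = EvaluationRecord("fluency", "intrinsic, absolute", "5-point Likert, 3 raters")
b = EvaluationRecord("fluency", "intrinsic, absolute", "rank 2 outputs")
print(a.comparable_to(b))  # False: same criterion, different design
```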
Over the last decade, crowdsourcing has become a standard method for collecting training data for NLP tasks and evaluating NLP systems …
Postdoc on the NLG for Low-Resource Domains project.
Postdoc on the Madrigal project.
Student researcher on SFB 1102 Project A4 and in the LSV.