Leung, W.-Z. orcid.org/0009-0003-4888-1951, Christensen, H. and Goetze, S. (2025) Text-to-dysarthric-speech generation for dysarthric automatic speech recognition: is purely synthetic data enough? In: Speech and Computer: 27th International Conference, SPECOM 2025, Szeged, Hungary, October 13–15, 2025, Proceedings, Part I. SPECOM 2025, 13-15 Oct 2025, Szeged, Hungary. Lecture Notes in Computer Science (LNAI 16187). Springer Cham, pp. 203-216. ISBN: 9783032079558. ISSN: 0302-9743. EISSN: 1611-3349.
Abstract
Recent advancements in text-to-speech (TTS) technology have revolutionised automatic speech recognition (ASR) data augmentation in low-resource settings. In particular, only a few public datasets are available for dysarthric ASR (DASR), and text-to-dysarthric-speech (TTDS) models have addressed this data sparsity by increasing the number and diversity of training samples. In this context, Grad-TTS (G-TTS) has been shown to synthesise speech with accurate dysarthric characteristics beneficial for DASR data augmentation; likewise, Matcha-TTS (M-TTS) has recently improved on typical speech synthesis baselines. Recent studies commonly focus on data augmentation (i.e. reference data combined with additional synthetic data). This work analyses the adaptation performance of Whisper DASR models using reference data and G-TTS- and M-TTS-generated data, and shows that training on synthesised data alone can achieve performance comparable to training on reference data. Additionally, despite growing work on dysarthric data augmentation, the validation of standard TTS metrics on synthetic dysarthric data and the development of dedicated TTDS metrics require further research. Results of this work show that gold-standard metrics for typical TTS and current dysarthric speech assessment metrics lack the sensitivity to predict DASR performance. A phoneme posteriorgram (PPG) distance based on the Jensen-Shannon (JS) divergence is therefore introduced as a metric for dysarthric speech synthesis, and is shown to correlate with downstream word error rate (WER) scores.
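To illustrate the kind of metric the abstract describes, the sketch below computes a frame-averaged Jensen-Shannon divergence between two phoneme posteriorgrams. This is a minimal, hypothetical implementation, not the authors' code: the function names, the assumption that the two posteriorgrams are already time-aligned with equal frame counts, and the use of the natural logarithm are all choices made here for illustration (in practice, sequences of different lengths would first need alignment, e.g. via dynamic time warping).

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions
    (natural log, so the value lies in [0, ln 2])."""
    p = np.asarray(p, dtype=float) + eps  # eps avoids log(0)
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    m = 0.5 * (p + q)  # mixture distribution
    kl_pm = np.sum(p * np.log(p / m))
    kl_qm = np.sum(q * np.log(q / m))
    return 0.5 * (kl_pm + kl_qm)

def ppg_distance(ppg_a, ppg_b):
    """Average frame-wise JS divergence between two time-aligned
    posteriorgrams of shape (frames, phoneme_classes).
    Assumes equal frame counts; real data would need alignment first."""
    assert ppg_a.shape == ppg_b.shape
    return float(np.mean([js_divergence(a, b) for a, b in zip(ppg_a, ppg_b)]))
```

Identical posteriorgrams give a distance near 0, while frames that place all mass on disjoint phoneme classes approach the maximum of ln 2, so the score grows as the synthetic speech's phonetic content drifts from the reference.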
Metadata
| Item Type: | Proceedings Paper |
|---|---|
| Authors/Creators: | Leung, W.-Z.; Christensen, H.; Goetze, S. |
| Copyright, Publisher and Additional Information: | © 2025 The Author(s). Except as otherwise noted, this author-accepted version of a journal article published in Speech and Computer: 27th International Conference, SPECOM 2025, Szeged, Hungary, October 13–15, 2025, Proceedings, Part I is made available via the University of Sheffield Research Publications and Copyright Policy under the terms of the Creative Commons Attribution 4.0 International License (CC-BY 4.0), which permits unrestricted use, distribution and reproduction in any medium, provided the original work is properly cited. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ |
| Keywords: | Dysarthric speech recognition; Text-to-speech synthesis; Dysarthric TTS metrics |
| Dates: | |
| Institution: | The University of Sheffield |
| Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield) |
| Funding Information: | Engineering and Physical Sciences Research Council, grant number 2738353 |
| Date Deposited: | 15 Aug 2025 14:48 |
| Last Modified: | 14 Oct 2025 10:42 |
| Status: | Published |
| Publisher: | Springer Cham |
| Series Name: | Lecture Notes in Computer Science |
| Refereed: | Yes |
| Identification Number: | 10.1007/978-3-032-07956-5_14 |
| Related URLs: | |
| Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:230109 |
