Leung, W.-Z. orcid.org/0009-0003-4888-1951, Christensen, H. and Goetze, S. (2025) Text-to-dysarthric-speech generation for dysarthric automatic speech recognition: is purely synthetic data enough? In: Speech and Computer: 27th International Conference, SPECOM 2025, Szeged, Hungary, October 13–15, 2025, Proceedings, Part I. SPECOM 2025, 13-15 Oct 2025, Szeged, Hungary. Lecture Notes in Computer Science (LNAI 16187). Springer Cham, pp. 203-216. ISBN: 9783032079558. ISSN: 0302-9743. EISSN: 1611-3349.
Abstract
Recent advancements in text-to-speech (TTS) technology have revolutionised automatic speech recognition (ASR) data augmentation in low-resource settings. In particular, only a few public datasets are available for dysarthric ASR (DASR), and text-to-dysarthric-speech (TTDS) models have addressed this data sparsity by increasing the number and diversity of training samples. In this context, Grad-TTS (G-TTS) has been shown to synthesise speech with accurate dysarthric characteristics beneficial for DASR data augmentation; likewise, Matcha-TTS (M-TTS) has recently improved on typical speech synthesis baselines. Recent studies commonly focus on data augmentation (i.e. reference data combined with additional synthetic data). This work analyses the adaptation performance of Whisper DASR models using reference data and G-TTS- and M-TTS-generated data, and shows that training on synthesised data alone can achieve performance comparable to training on reference data. Additionally, despite growing work on dysarthric data augmentation, the validation of standard TTS metrics on synthetic dysarthric data and the development of dedicated TTDS metrics require further research. Results of this work show that gold-standard metrics for typical TTS and current dysarthric speech assessment metrics lack the sensitivity to predict DASR performance. A phoneme posteriorgram (PPG) distance based on the Jensen-Shannon (JS) divergence is therefore introduced as a metric for dysarthric speech synthesis, and is shown to correlate with downstream word error rate (WER) scores.
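To illustrate the kind of metric the abstract describes, the sketch below computes a frame-averaged Jensen-Shannon divergence between two phoneme posteriorgrams. This is a minimal, hypothetical implementation, not the authors' code: the function names, the assumption that the two posteriorgrams are already time-aligned with equal frame counts, and the use of the natural logarithm are all choices made here for illustration (in practice, sequences of different lengths would first need alignment, e.g. via dynamic time warping).

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions
    (natural log, so the value lies in [0, ln 2])."""
    p = np.asarray(p, dtype=float) + eps  # eps avoids log(0)
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    m = 0.5 * (p + q)  # mixture distribution
    kl_pm = np.sum(p * np.log(p / m))
    kl_qm = np.sum(q * np.log(q / m))
    return 0.5 * (kl_pm + kl_qm)

def ppg_distance(ppg_a, ppg_b):
    """Average frame-wise JS divergence between two time-aligned
    posteriorgrams of shape (frames, phoneme_classes).
    Assumes equal frame counts; real data would need alignment first."""
    assert ppg_a.shape == ppg_b.shape
    return float(np.mean([js_divergence(a, b) for a, b in zip(ppg_a, ppg_b)]))
```

Identical posteriorgrams give a distance near 0, while frames that place all mass on disjoint phoneme classes approach the maximum of ln 2, so the score grows as the synthetic speech's phonetic content drifts from the reference.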
Metadata
| Item Type: | Proceedings Paper |
|---|---|
| Authors/Creators: | Leung, W.-Z.; Christensen, H.; Goetze, S. |
| Copyright, Publisher and Additional Information: | © 2025 The Author(s). Except as otherwise noted, this author-accepted version of a journal article published in Speech and Computer: 27th International Conference, SPECOM 2025, Szeged, Hungary, October 13–15, 2025, Proceedings, Part I is made available via the University of Sheffield Research Publications and Copyright Policy under the terms of the Creative Commons Attribution 4.0 International License (CC-BY 4.0), which permits unrestricted use, distribution and reproduction in any medium, provided the original work is properly cited. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ |
| Keywords: | Dysarthric speech recognition; Text-to-speech synthesis; Dysarthric TTS metrics |
| Dates: | |
| Institution: | The University of Sheffield |
| Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield) |
| Funding Information: | Engineering and Physical Sciences Research Council, grant number 2738353 |
| Date Deposited: | 15 Aug 2025 14:48 |
| Last Modified: | 14 Oct 2025 10:42 |
| Status: | Published |
| Publisher: | Springer Cham |
| Series Name: | Lecture Notes in Computer Science |
| Refereed: | Yes |
| Identification Number: | 10.1007/978-3-032-07956-5_14 |
| Related URLs: | |
| Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:230109 |
