Leung, W.-Z. orcid.org/0009-0003-4888-1951, Christensen, H. and Goetze, S. (Accepted: 2025) Text-to-dysarthric-speech generation for dysarthric automatic speech recognition: is purely synthetic data enough? In: Speech and Computer: 27th International Conference, SPECOM 2025, Szeged, Hungary, October 13-14, 2025, Proceedings. SPECOM 2025, 13-14 Oct 2025, Szeged, Hungary. Lecture Notes in Computer Science. Springer Cham. ISSN: 0302-9743, EISSN: 1611-3349 (In Press)
Abstract
Recent advancements in text-to-speech (TTS) technology have revolutionised automatic speech recognition (ASR) data augmentation in low-resource settings. In particular, only a few public datasets are available for dysarthric ASR (DASR), and text-to-dysarthric-speech (TTDS) models have addressed data sparsity limitations by increasing the number and diversity of training data samples. In this context, Grad-TTS (G-TTS) has been shown to synthesise speech with accurate dysarthric speech characteristics that are beneficial for DASR data augmentation. Likewise, Matcha-TTS (M-TTS) has recently improved on typical speech synthesis baselines.
Recent studies commonly focus on data augmentation (i.e. reference data combined with additional synthetic data). This work analyses Whisper DASR model adaptation performance using reference data and data generated by G-TTS and M-TTS, and shows that adaptation on synthesised data alone can achieve performance comparable to adaptation on reference data. Additionally, despite growing work on dysarthric data augmentation, the validation of typical TTS metrics for synthetic dysarthric data and the development of TTDS metrics require further research. Results of this work show that gold-standard metrics for typical TTS and current dysarthric speech assessment metrics lack the sensitivity to predict DASR performance. Hence, a phoneme posteriorgram (PPG) distance based on the Jensen-Shannon (JS) divergence is introduced as a metric for dysarthric speech synthesis, and it is shown to correlate with downstream word error rate (WER) scores.
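As an illustration of the kind of metric the abstract describes, the following is a minimal sketch of a frame-wise Jensen-Shannon distance between two phoneme posteriorgrams. The abstract does not give the paper's exact definition, so the details here are assumptions: each PPG is taken to be a list of per-frame probability distributions over the same phoneme set, the two PPGs are assumed to be already time-aligned to equal length, and the distance is the mean per-frame divergence. The function names `js_divergence` and `ppg_js_distance` are hypothetical.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions.

    With base-2 logarithms the value lies in [0, 1].
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of classes")
    # Mixture distribution m = (p + q) / 2
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a_i == 0 contribute 0.
        # b_i > 0 whenever a_i > 0 because b is the mixture containing a.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def ppg_js_distance(ppg_a, ppg_b):
    """Mean frame-wise JS divergence between two aligned PPGs.

    Assumption: both PPGs have the same number of frames (for example
    after an alignment step that is not shown here).
    """
    if len(ppg_a) != len(ppg_b):
        raise ValueError("PPGs must be aligned to the same number of frames")
    if not ppg_a:
        raise ValueError("PPGs must contain at least one frame")
    total = sum(js_divergence(fa, fb) for fa, fb in zip(ppg_a, ppg_b))
    return total / len(ppg_a)


# Toy example: two frames over a two-phoneme inventory.
reference = [[1.0, 0.0], [0.5, 0.5]]
synthetic = [[0.0, 1.0], [0.5, 0.5]]
distance = ppg_js_distance(reference, synthetic)
```

In this toy example the first frame pair is maximally different and the second is identical, so the mean distance sits halfway between the two extremes. A real pipeline would obtain the PPGs from a phoneme recogniser and would need to handle utterances of different lengths, neither of which is covered by this sketch.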
Metadata
Item Type: | Proceedings Paper |
---|---|
Authors/Creators: | Leung, W.-Z.; Christensen, H.; Goetze, S. |
Copyright, Publisher and Additional Information: | © 2025 The Author(s). |
Keywords: | Dysarthric speech recognition; Text-to-speech synthesis; Dysarthric TTS metrics |
Dates: | Accepted: 2025 |
Institution: | The University of Sheffield |
Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield) |
Funding Information: | Funder: Engineering and Physical Sciences Research Council; Grant number: 2738353 |
Depositing User: | Symplectic Sheffield |
Date Deposited: | 15 Aug 2025 14:48 |
Last Modified: | 15 Aug 2025 14:48 |
Status: | In Press |
Publisher: | Springer Cham |
Series Name: | Lecture Notes in Computer Science |
Refereed: | Yes |
Related URLs: | |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:230109 |
Download
Filename: _WING__SPECOM_2025.pdf
