Yue, Z., Loweimi, E., Christensen, H. orcid.org/0000-0003-3028-5062 et al. (2 more authors) (2022) Acoustic modelling from raw source and filter components for dysarthric speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30. pp. 2968-2980. ISSN 2329-9290
Abstract
Acoustic modelling for automatic dysarthric speech recognition (ADSR) is a challenging task: data are scarce, and substantial differences between typical and dysarthric speech complicate transfer learning. In this paper, we build acoustic models for ADSR from the raw magnitude spectra of the source and filter components. The proposed multi-stream models consist of convolutional, recurrent and fully-connected layers, allowing the various information streams to be pre-processed separately and fused at an optimal level of abstraction. We demonstrate that such multi-stream processing leverages the information encoded in the vocal tract and excitation components and helps normalise nuisance factors such as speaker attributes and speaking style. This leads to better handling of dysarthric speech, which exhibits large inter- and intra-speaker variability, and results in a notable performance gain. Furthermore, we analyse the learned convolutional filters and visualise the outputs of different layers after dimensionality reduction to demonstrate how speaker-related attributes are normalised along the pipeline. We also compare the proposed multi-stream model with various systems based on MFCC, FBank, raw waveform and i-vector features, and study the training dynamics as well as the usefulness of feature normalisation and data augmentation via speed perturbation. On the widely used TORGO and UASpeech dysarthric speech corpora, the proposed approach achieves competitive performance, with dysarthric-speech WERs of 35.3% and 30.3%, respectively.
Metadata
| Item Type: | Article |
|---|---|
| Authors/Creators: | |
| Copyright, Publisher and Additional Information: | © 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other users, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works. Reproduced in accordance with the publisher's self-archiving policy. |
| Keywords: | Dysarthric automatic speech recognition; multi-stream acoustic modelling; source-filter separation and fusion |
| Dates: | |
| Institution: | The University of Sheffield |
| Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield); The University of Sheffield > Faculty of Medicine, Dentistry and Health (Sheffield) > School of Health and Related Research (Sheffield) > ScHARR - Sheffield Centre for Health and Related Research |
| Funding Information: | EUROPEAN COMMISSION - HORIZON 2020, grant number 766287 - TAPAS |
| Depositing User: | Symplectic Sheffield |
| Date Deposited: | 26 Oct 2022 17:39 |
| Last Modified: | 23 Sep 2023 00:13 |
| Status: | Published |
| Publisher: | Institute of Electrical and Electronics Engineers (IEEE) |
| Refereed: | Yes |
| Identification Number (DOI): | 10.1109/taslp.2022.3205766 |
| Related URLs: | |
| Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:192463 |