Pan, Y., Mirheidari, B., Blackburn, D. et al. (2025) A two-step attention-based feature combination cross-attention system for speech-based dementia detection. IEEE Transactions on Audio, Speech and Language Processing, 33. pp. 896-907. ISSN 1063-6676
Abstract
Dementia poses a significant global challenge, with profound personal, societal, and economic impacts. Although it is incurable, early detection is crucial for ensuring appropriate care and support. Dementia can impair a person's speech and language abilities, and studies have demonstrated promising results in using spoken language for automatic dementia detection. Recently, deep learning-based self-supervised learning (SSL) models, such as wav2vec2.0 (w2v) and BERT, have shown success in extracting acoustic and linguistic information. However, most studies have relied on single datasets and relatively straightforward methods for extracting and combining acoustic and linguistic modalities. This paper presents an in-depth exploration of the application of SSL models in this context by proposing the Two-Step Attention-based Feature Combination Cross-attention system (TSAC-ATT) for speech-based dementia detection. The contributions of this paper are as follows: i) we explore and analyse acoustic and linguistic feature extraction pipelines using SSL models, including the proposed TSAC framework to create high-performing acoustic features from w2v's contextual layers; ii) we demonstrate that these features, when fused using cross-attention, outperform various feature combination approaches; iii) all experimental work is conducted on two publicly available datasets (DementiaBank and ADReSS), as well as the IVA dataset collected by the Royal Hallamshire Hospital, which includes recordings of the standard Cookie Theft task. We present state-of-the-art results, highlighting that acoustic-only features based on the w2v model can achieve very high performance across multiple datasets. Furthermore, we show that the upstream performance of the automatic speech recognition module does not always predict downstream classification performance.
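To make the cross-attention fusion idea concrete, the sketch below shows single-head cross-attention in which acoustic frames (queries) attend to linguistic tokens (keys/values). This is only an illustrative toy with randomly initialised projections standing in for learned weights; the paper's actual TSAC-ATT architecture, layer choices, and dimensions are not reproduced here, and the names `cross_attention`, `d_k`, etc. are assumptions for this sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(acoustic, linguistic, d_k=64, seed=0):
    """Single-head cross-attention: acoustic frames attend to linguistic tokens.

    acoustic:   (T_a, d) array, e.g. w2v contextual-layer features
    linguistic: (T_l, d) array, e.g. BERT token embeddings
    Returns a (T_a, d_k) fused representation.
    """
    rng = np.random.default_rng(seed)
    d = acoustic.shape[1]
    # Random projections stand in for learned weight matrices.
    W_q = rng.standard_normal((d, d_k)) / np.sqrt(d)
    W_k = rng.standard_normal((d, d_k)) / np.sqrt(d)
    W_v = rng.standard_normal((d, d_k)) / np.sqrt(d)
    Q, K, V = acoustic @ W_q, linguistic @ W_k, linguistic @ W_v
    scores = Q @ K.T / np.sqrt(d_k)     # (T_a, T_l) similarity scores
    weights = softmax(scores, axis=-1)  # each frame -> distribution over tokens
    return weights @ V                  # (T_a, d_k) fused features

# Toy inputs: 5 acoustic frames and 3 linguistic tokens, 16-dim each.
rng = np.random.default_rng(1)
fused = cross_attention(rng.standard_normal((5, 16)),
                        rng.standard_normal((3, 16)))
print(fused.shape)  # (5, 64)
```

In a trained system the projections would be learned jointly with the classifier, and the fused representation would be pooled and passed to a dementia/control classification head.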
Metadata
| Item Type: | Article |
| --- | --- |
| Copyright, Publisher and Additional Information: | © 2025 The Author(s). Except as otherwise noted, this author-accepted version of a journal article published in IEEE Transactions on Audio, Speech and Language Processing is made available via the University of Sheffield Research Publications and Copyright Policy under the terms of the Creative Commons Attribution 4.0 International License (CC-BY 4.0), which permits unrestricted use, distribution and reproduction in any medium, provided the original work is properly cited. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ |
| Keywords: | Dementia detection; wav2vec2.0; BERT; feature fusion; cross-attention |
| Institution: | The University of Sheffield |
| Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield) |
| Funding Information: | Department of Health and Social Care, grant NIHR202911 (30003); European Commission – Horizon 2020, grant 766287 (TAPAS) |
| Depositing User: | Symplectic Sheffield |
| Date Deposited: | 23 Jan 2025 15:55 |
| Last Modified: | 12 Mar 2025 16:57 |
| Status: | Published |
| Publisher: | Institute of Electrical and Electronics Engineers |
| Refereed: | Yes |
| Identification Number: | 10.1109/TASLPRO.2025.3533363 |
| Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:222213 |