Semi-supervised learning for automatic speech recognition with word error rate estimation and targeted domain data selection

Park, C. and Hain, T. orcid.org/0000-0003-0939-3464 (2025) Semi-supervised learning for automatic speech recognition with word error rate estimation and targeted domain data selection. In: Scharenborg, O., Oertel, C. and Truong, K. (eds.) Proceedings of Interspeech 2025. Interspeech 2025, 17-21 Aug 2025, Rotterdam, The Netherlands. International Speech Communication Association (ISCA), pp. 3663-3667. ISSN: 2958-1796 EISSN: 2958-1796

Abstract

There is a growing demand for leveraging untranscribed multi-domain data in semi-supervised learning (SSL) for automatic speech recognition (ASR) to broaden its applications. However, domain mismatch between source and target data can limit SSL’s performance gains, even when transcript accuracy for training is high. While word error rate (WER) estimation (WE) methods for automatic transcription have advanced, they remain insufficient for handling multi-domain data. This paper proposes a novel data selection method for SSL in ASR that integrates WE and acoustic domain similarity (ADS). For WE, multi-target regression for error rate prediction (MTR-ER) is introduced, while ADS is incorporated as a selection criterion, measured using noise-contrastive estimation. The effectiveness of this approach is demonstrated through comparisons with a confidence-based method. Results show that combining WE and ADS achieves 26.66% of the expected performance improvement of fully supervised learning.

Metadata

Item Type:	Proceedings Paper
Authors/Creators:	Park, C.; Hain, T. (https://orcid.org/0000-0003-0939-3464)
Editors:	Scharenborg, O.; Oertel, C.; Truong, K.
Copyright, Publisher and Additional Information:	© 2025 The Authors. Except as otherwise noted, this author-accepted version of a paper published in Proceedings of Interspeech 2025 is made available via the University of Sheffield Research Publications and Copyright Policy under the terms of the Creative Commons Attribution 4.0 International License (CC-BY 4.0), which permits unrestricted use, distribution and reproduction in any medium, provided the original work is properly cited. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/
Keywords:	speech recognition; semi-supervised learning; word error rate estimation; acoustic domain similarity
Dates:	Accepted: 3 June 2025; Published (online): 17 August 2025; Published: 17 August 2025
Institution:	The University of Sheffield
Academic Units:	The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield)
Depositing User:	Symplectic Sheffield
Date Deposited:	07 Aug 2025 15:03
Last Modified:	20 Aug 2025 13:21
Published Version:	https://www.isca-archive.org/interspeech_2025/park...
Status:	Published
Publisher:	International Speech Communication Association (ISCA)
Refereed:	Yes
Identification Number:	10.21437/Interspeech.2025-191
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:230076

Download

Accepted Version

Filename: camera-ready.pdf

Licence: CC-BY 4.0
