Close, G., Hong, K., Hain, T. orcid.org/0000-0003-0939-3464 et al. (1 more author) (2025) WhiSQA: Non-intrusive speech quality prediction using whisper encoder features. In: Karpov, A. and Gosztolya, G., (eds.) Speech and Computer. 27th International Conference on Speech and Computer, SPECOM 2025, 13-14 Oct 2025, Szeged, Hungary. Lecture Notes in Computer Science, 16187 (Part 1). Springer Nature Switzerland, pp. 39-51. ISBN: 9783032079558. ISSN: 0302-9743. EISSN: 1611-3349.
Abstract
There has been significant research effort in developing neural-network-based predictors of speech quality (SQ) in recent years. While a primary objective has been to develop non-intrusive, i.e. reference-free, metrics to assess the performance of speech enhancement (SE) systems, recent work has also investigated the direct inference of neural SQ predictors within the loss function of downstream speech tasks. To aid in the training of SQ predictors, several large datasets of audio with corresponding human labels of quality have been created. Recent work in this area has shown that speech representations derived from large unsupervised or semi-supervised foundational speech models are useful input feature representations for neural SQ prediction. In this work, a novel and robust SQ predictor is proposed based on feature representations extracted from an automatic speech recognition (ASR) model, which are found to be a powerful input feature for the SQ prediction task. The proposed system achieves higher correlation with human mean opinion score (MOS) ratings than recent approaches on all NISQA test sets and shows significantly better domain adaptation than the commonly used DNSMOS metric.
Metadata
Item Type: | Proceedings Paper |
---|---|
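Authors/Creators: | Close, G., Hong, K., Hain, T. (orcid.org/0000-0003-0939-3464) et al. (1 more author) |
Editors: | Karpov, A. and Gosztolya, G. |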
Copyright, Publisher and Additional Information: | © 2025 The Authors. Except as otherwise noted, this author-accepted version of a paper published in Speech and Computer is made available via the University of Sheffield Research Publications and Copyright Policy under the terms of the Creative Commons Attribution 4.0 International License (CC-BY 4.0), which permits unrestricted use, distribution and reproduction in any medium, provided the original work is properly cited. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ |
Keywords: | Acoustics; Predictive markers; Speech and Audio Processing; Speech and Audio Signal Processing; Speech Perception; Structure Prediction |
Institution: | The University of Sheffield |
Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield) |
Funding Information: | Funder (grant number): TOSHIBA EUROPE LTD (UNSPECIFIED); TOSHIBA EUROPE LTD (CDT-2021-GC); Engineering and Physical Sciences Research Council (EP/S023062/1); ENGINEERING AND PHYSICAL SCIENCE RESEARCH COUNCIL (EP/S023062/1) |
Date Deposited: | 16 Oct 2025 14:38 |
Last Modified: | 16 Oct 2025 16:06 |
Status: | Published |
Publisher: | Springer Nature Switzerland |
Series Name: | Lecture Notes in Computer Science |
Refereed: | Yes |
Identification Number: | 10.1007/978-3-032-07956-5_3 |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:233118 |