WhiSQA: Non-intrusive speech quality prediction using whisper encoder features

This is the latest version of this eprint.

Close, G., Hong, K., Hain, T. orcid.org/0000-0003-0939-3464 et al. (1 more author) (2025) WhiSQA: Non-intrusive speech quality prediction using whisper encoder features. In: Karpov, A. and Gosztolya, G., (eds.) Speech and Computer. 27th International Conference on Speech and Computer, SPECOM 2025, 13-14 Oct 2025, Szeged, Hungary. Lecture Notes in Computer Science, 16187 (Part 1). Springer Nature Switzerland, pp. 39-51. ISBN: 9783032079558. ISSN: 0302-9743. EISSN: 1611-3349.

Abstract

There has been significant research effort developing neural-network-based predictors of speech quality (SQ) in recent years. While a primary objective has been to develop non-intrusive, i.e. reference-free, metrics to assess the performance of speech enhancement (SE) systems, recent work has also investigated the direct inference of neural SQ predictors within the loss function of downstream speech tasks. To aid in the training of SQ predictors, several large datasets of audio with corresponding human labels of quality have been created. Recent work in this area has shown that speech representations derived from large unsupervised or semi-supervised foundational speech models are useful input feature representations for neural SQ prediction. In this work, a novel and robust SQ predictor is proposed based on feature representations extracted from an automatic speech recognition (ASR) model, found to be a powerful input feature for the SQ prediction task. The proposed system achieves higher correlation with human mean opinion score (MOS) ratings than recent approaches on all NISQA test sets and shows significantly better domain adaptation compared to the commonly used DNSMOS metric.
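As a rough illustration of what "correlation with human MOS ratings" means in practice, the sketch below computes a Pearson correlation coefficient between a list of human MOS labels and a list of predicted scores. The abstract does not state which correlation coefficient is used, so Pearson is an assumption here, and the function name `pearson` and all score values are hypothetical, chosen only for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of centred products and squares (the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values: human MOS labels (1 to 5 scale) and predictor outputs.
mos_labels = [1.5, 2.0, 3.0, 4.0, 4.5]
predicted = [1.8, 2.1, 2.9, 3.7, 4.6]

r = pearson(mos_labels, predicted)
print(f"Pearson correlation: {r:.3f}")
```

A value close to 1 indicates that the predictor ranks and scales utterances much as human listeners do; comparing such values across predictors on the same test set is one common way to read a claim like the one above.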

Metadata

Item Type:	Proceedings Paper
Authors/Creators:	Close, G.; Hong, K.; Hain, T. (https://orcid.org/0000-0003-0939-3464); Goetze, S.
Editors:	Karpov, A. (https://orcid.org/0000-0003-3424-652X); Gosztolya, G. (https://orcid.org/0000-0002-2864-6466)
Copyright, Publisher and Additional Information:	© 2025 The Authors. Except as otherwise noted, this author-accepted version of a paper published in Speech and Computer is made available via the University of Sheffield Research Publications and Copyright Policy under the terms of the Creative Commons Attribution 4.0 International License (CC-BY 4.0), which permits unrestricted use, distribution and reproduction in any medium, provided the original work is properly cited. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/
Keywords:	Acoustics; Predictive markers; Speech and Audio Processing; Speech and Audio Signal Processing; Speech Perception; Structure Prediction
Dates:	Published (online): 13 October 2025; Published: 13 October 2025
Institution:	The University of Sheffield
Academic Units:	The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield)
Funding Information:	Funder (grant number): TOSHIBA EUROPE LTD (UNSPECIFIED); TOSHIBA EUROPE LTD (CDT-2021-GC); Engineering and Physical Sciences Research Council (EP/S023062/1); ENGINEERING AND PHYSICAL SCIENCE RESEARCH COUNCIL (EP/S023062/1)
Date Deposited:	16 Oct 2025 14:38
Last Modified:	04 Nov 2025 14:15
Status:	Published
Publisher:	Springer Nature Switzerland
Series Name:	Lecture Notes in Computer Science
Refereed:	Yes
Identification Number:	10.1007/978-3-032-07956-5_3
Related URLs:	Author; arXiv URL; Conference
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:233118

Available Versions of this Item

WhiSQA: Non-intrusive speech quality prediction using whisper encoder features. (deposited 04 Nov 2025 14:18)
- WhiSQA: Non-intrusive speech quality prediction using whisper encoder features. (deposited 16 Oct 2025 14:38) [Currently Displayed]

Download

Accepted Version

Filename: 2508.02210v1.pdf

Licence: CC-BY 4.0
