Non-Intrusive Speech Intelligibility Prediction for Hearing-Impaired Users Using Intermediate ASR Features and Human Memory Models

Neural networks have been successfully used for non-intrusive speech intelligibility prediction. Recently, the use of feature representations sourced from intermediate layers of pre-trained self-supervised and weakly-supervised models has been found to be particularly useful for this task. This work combines the use of Whisper ASR decoder layer representations as neural network input features with an exemplar-based, psychologically motivated model of human memory to predict human intelligibility ratings for hearing-aid users. Substantial performance improvement over an established intrusive HASPI baseline system is found, including on enhancement systems and listeners unseen in the training data, with a root mean squared error of 25.3 compared with the baseline of 28.7.


INTRODUCTION
Hearing loss is a widespread problem that affects approximately 466 million people worldwide (around 6% of the world population), and it is expected to grow; by 2030 it is predicted to impact 630 million people worldwide [1]. Age correlates with the chance that a person will be affected by hearing impairment [2], and the population is expected to age. From 2015 to 2050, the proportion of the population aged over 60 is expected to almost double, rising from 12% to 22% [3]. As hearing impairment worsens in an individual, their ability to understand the speech that they hear decreases.
Successful development of hearing aid (HA) technology [4,5] requires assessment, which is expensive and time-consuming when conducted by human listeners [6,7]. Automated intelligibility metrics, which are designed to mimic human assessment, are more cost-effective and can also be used as training objectives [8,9]. Developing and improving estimators for such metrics is therefore essential.
The Clarity Prediction Challenge 2 (CPC2) [10] builds on the prior Clarity Prediction Challenge 1 (CPC1) [11]. It provides a comparatively large dataset of noisy audio processed by hearing aid systems, each with an associated intelligibility score obtained from listening tests with hearing-impaired human listeners.
The challenge task is to predict the intelligibility score of hearing-impaired listeners given the audio and some additional information, such as a representation of the hearing loss of the listener. The challenge has both a non-intrusive track, where only the noisy signal processed by the hearing aid can be used, and an intrusive track, where a clean version of the input audio can also be used.
In this work, an intelligibility estimator for the non-intrusive challenge task is proposed. It makes use of a feature transformation of the audio signal, derived from intermediate representations of a pre-trained automatic speech recognition (ASR) model (cf. Section 3.1). This is combined with a model of human memory, which has its origins in the field of human psychology, to predict the intelligibility score of hearing aid output audio (cf. Section 3.2.2).
The remainder of this work is organised as follows: Section 2 briefly describes the CPC2, including the dataset and baseline model. Section 3 covers the proposed features and model architecture. Results are presented in Section 4 and Section 5 concludes the paper.

CLARITY PREDICTION CHALLENGE 2
The Clarity Project runs two challenges in sequence with the aim of improving HA technology: the Clarity Enhancement Challenge (CEC) and the Clarity Prediction Challenge (CPC). The CEC objective is to design systems to enhance noisy signals for hearing-impaired listeners. The CPC objective is to assess the intelligibility of the CEC systems.

Challenge data
The CPC2 data consists of tuples of a speech signal ŝ[n] and its corresponding correctness value i, obtained from listening tests with hearing-impaired listeners.
The signal ŝ[n] is the enhanced output of a hearing aid system given a binaural input x[n], which is an artificially corrupted version of the clean reference audio s[n] with additive noise v[n]. The correctness value i is the percentage of words that a hearing-impaired listener was able to correctly reproduce from the speech signal ŝ[n] they listened to. The challenge data also contains additional information, such as representations of each listener's hearing loss in the left and right ears as audiograms a_l and a_r. All audio signals are stereo, with a left and a right channel.
The resulting data is partitioned into three training sets, each paired with a disjoint evaluation set. Each evaluation set covers listeners and hearing aid enhancement systems which are unseen in its corresponding training set, meaning that prediction models need to generalise to unseen listeners and systems. Set 1 has a training set of 8599 audio samples and an evaluation set of 305. Set 2 has a training set of 8135 and an evaluation set of 294. Set 3 has a training set of 7896 and an evaluation set of 298. There are around 40 hours of audio in total. Audio samples with very low intelligibility (correctness 0) and very high intelligibility (correctness 100) are over-represented, as shown in Figure 1.

Prior approaches
A number of different approaches were taken in the first iteration of the challenge, CPC1 [11]. The best-performing non-intrusive approach [12] uses an uncertainty measure derived from state-of-the-art ASR systems as a proxy for human intelligibility, finding a strong correlation between the two measures. Other successful approaches [13,14] make use of powerful feature representations derived from self-supervised speech representations (SSSRs) as inputs to neural speech intelligibility prediction models, while others use neural network structures which have been shown to be useful in the related task of human speech quality rating prediction [15]. CPC2 differs from CPC1 in that its evaluation sets are disjoint from its training sets in terms of both listener and hearing aid system. Early experiments for this work with the CPC2 data showed that some of the better-performing approaches to CPC1, which operated effectively as predictors of the hearing aid system, were not at all useful in CPC2. As such, our proposed system for CPC2 builds on the best-performing approaches to CPC1, while ensuring that it can generalise to unseen data.

Challenge baseline
The baseline system provided by the challenge organisers [10] makes use of the Hearing-Aid Speech Perception Index (HASPI), version 2 [16]. This is an intrusive system that makes use of both the enhanced noisy signal ŝ[n] and the clean speech signal s[n]. A HASPI score is computed for both the left and right ear signals, and logistic regression is used to predict the correctness scores from the higher HASPI score.
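The mapping from HASPI to correctness can be illustrated with a minimal sketch. It assumes the two per-ear HASPI scores are already available; the function name and the slope and intercept values are hypothetical placeholders, not the coefficients fitted by the challenge organisers.

```python
import math

def baseline_correctness(haspi_left, haspi_right, slope, intercept):
    """Map the better-ear HASPI score to a correctness percentage (0-100)."""
    # Keep only the higher of the two per-ear HASPI scores.
    better_ear = max(haspi_left, haspi_right)
    # Logistic curve; slope and intercept would be fitted on training data.
    return 100.0 / (1.0 + math.exp(-(slope * better_ear + intercept)))

# Placeholder coefficients for illustration only.
print(baseline_correctness(0.35, 0.80, slope=6.0, intercept=-3.0))
```

Because only the higher score is used, swapping the left and right inputs leaves the prediction unchanged.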

SYSTEM ARCHITECTURE
This section describes the proposed approach to the CPC2 task. The approach consists of neural networks which take an ASR-derived representation of the hearing aid output signal ŝ[n] as input and return a prediction î of the correctness value.

Features
The Whisper model [17] is an ASR model pre-trained on 680,000 hours of multi-lingual data for tasks including English and non-English speech transcription and voice activity detection. The model architecture is an encoder-decoder transformer [18], with 12 encoder layers and 12 decoder layers for the small Whisper model used in this work. Given the enhanced speech signal ŝ[n], with corresponding spectrogram representation Ŝ, the input to the proposed intelligibility prediction model is the output of the 12 decoder layers of the Whisper model. The Whisper decoder layer outputs represent word-level features, with dimension W × 768 × 12, where W is the predicted number of words in the utterance, 768 is the feature dimension of each decoder layer for the small Whisper model, and 12 is the number of decoder layers. Note that while W varies from utterance to utterance, for a given utterance it remains fixed through the decoder layers. The parameters of the Whisper model are frozen during training of the metric prediction model described below, i.e., the Whisper model is used as a feature transform.

Model structure
An ensemble of two models for speech intelligibility (SI) prediction is used, as shown in Figure 2, and the results of these two models are combined by averaging.
A model structure following work on the CPC1 in [14] is chosen for the primary SI prediction network (cf. Section 3.2.1), depicted to the right in Figure 2. This model structure has previously been successfully applied to the task of human quality label prediction [19] using pre-trained SSSR representations as its input feature. This approach was also found to be useful for the CPC1 task [14]; however, generalisation to unseen hearing aid systems was poor.
The secondary SI prediction network incorporates an exemplar-informed module based on a simplified theory of human memory [20] (cf. Section 3.2.2), and is shown in the middle-right of Figure 2. Humans are believed to make use of specific examples, or exemplars, for memory-based tasks [21,22,23,24]. Humans are also able to non-intrusively assess speech signals, i.e., without direct reference. Since the challenge objective is to predict human responses, incorporating an exemplar-informed component may provide benefits or insight.
The output of the ensemble î for a given input signal ŝ[n] is the mean of the primary system output îp and the secondary system output îs.

Primary SI prediction model
The model structure uses a learnable weighted sum of the Whisper representations, implemented as a learnable linear layer with 12 parameters, all initialised to 1, followed by a softmax to ensure that the layer weights sum to 1. This representation of dimension W × 768 is then processed by 2 bidirectional long short-term memory (BLSTM) layers with an input size of 768 and a hidden layer size of 384. Finally, an attention pooling feed-forward layer with sigmoid activation outputs to a single neuron which represents the primary predicted correctness value îp, normalised between 0 and 1. The primary model has approximately 8.3 M parameters.
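The layer-weighting step can be sketched in plain Python. This is an illustration only: it uses toy dimensions (2 words, 3 features) in place of W × 768, nested lists in place of tensors, and hypothetical function names. With all 12 raw weights initialised to 1, the softmax gives each layer an equal weight of 1/12.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_layer_sum(layers, raw_weights):
    """Collapse L layer outputs (each W x F) into one W x F representation."""
    weights = softmax(raw_weights)  # layer weights sum to 1
    n_words, n_feats = len(layers[0]), len(layers[0][0])
    return [[sum(w * layer[t][k] for w, layer in zip(weights, layers))
             for k in range(n_feats)]
            for t in range(n_words)]

num_layers = 12
raw = [1.0] * num_layers  # initial values of the 12 learnable parameters
# Toy layer outputs: layer l is filled with the constant value l.
layers = [[[float(l)] * 3 for _ in range(2)] for l in range(num_layers)]
mixed = weighted_layer_sum(layers, raw)
print(mixed[0][0])  # uniform weights give the mean over layers
```

During training, the raw weights would move away from 1, so that layers the model finds more useful receive a larger share of the sum.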

Secondary SI prediction model: exemplar-informed
The secondary model differs from the primary model in that the attention-pooling output feeds into an exemplar-informed module based on a simplified theory of human memory [20]. Within this module, the functions f : R^768 → R^768, g : R^768 → R^768 and h : R → R are all learned affine transformations. The value a is a combination of the exemplar labels, weighted by their similarity to the input. This passes through a single linear neuron to produce r, and then through a sigmoid activation, which yields the secondary model's prediction, îs, normalised to fall between 0 and 1. The exemplar model has approximately 10 M parameters.
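A rough sketch of a similarity-weighted label combination of this kind is given below. The exact similarity function is not reproduced in this text, so the sketch makes explicit assumptions: similarity is taken to be the dot product between f applied to the input and g applied to each exemplar, and the similarities are normalised with a softmax. The helper names and toy 2-dimensional vectors are hypothetical.

```python
import math

def affine(x, weight, bias):
    """Apply an affine map: weight is a list of rows, bias a list."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def exemplar_module(y, exemplar_ys, exemplar_labels, f, g, h):
    """Return a prediction in (0, 1) from the input and D exemplars."""
    query = affine(y, *f)
    # ASSUMPTION: dot-product similarity between f(y) and g(y*_d).
    sims = [sum(q * k for q, k in zip(query, affine(e, *g)))
            for e in exemplar_ys]
    weights = softmax(sims)  # ASSUMPTION: softmax normalisation
    # a: exemplar labels weighted by similarity to the input.
    a = sum(w * label for w, label in zip(weights, exemplar_labels))
    r = h[0] * a + h[1]  # single linear neuron
    return 1.0 / (1.0 + math.exp(-r))  # sigmoid activation

identity = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])  # toy stand-in for f and g
h = (1.0, 0.0)
exemplars = [[1.0, 0.0], [0.0, 1.0]]
labels = [1.0, 0.0]  # normalised correctness of each exemplar
print(exemplar_module([10.0, 0.0], exemplars, labels, identity, identity, h))
```

In this toy setting, an input that resembles the high-correctness exemplar is pulled towards that exemplar's label, which is the intended behaviour of an exemplar-informed prediction.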

Experimental setup
All audio used as input to Whisper is re-sampled to 16 kHz and padded or truncated to be 30 seconds long. In the case of the CPC2 data, all recordings were shorter than 30 seconds, so they were padded, and downsampled from 32 kHz to 16 kHz. From this time-domain signal, the 80-channel log-magnitude Mel spectrogram is computed, using a window of 25 ms and a stride of 10 ms. During training, both left and right ear signals are used as independent samples with the same correctness label. During inference, the signal that produces the higher correctness value is used to account for the better-ear effect [25].
For each of the three splits, two listeners and two systems were randomly selected to form a disjoint validation set. All data with these listeners and systems were removed from the training set. A randomly selected non-disjoint validation set consisting of 10% of the remaining training data was also formed. The majority of model selection and hyperparameter tuning was performed using these validation sets, to test generalisation to unseen listeners and systems. For the final models, the disjoint validation set and all listeners/systems associated with it were merged back into the training data to make the best use of resources. The primary and secondary models are trained separately with mean squared error loss. The primary model is trained for 25 epochs with batch size 8, learning rate 10^−5 and weight decay 10^−4. The secondary model is trained for 50 epochs with learning rate 2×10^−6 and weight decay 10^−4. During training and validation, D = 8 exemplars are chosen randomly from the training data for each minibatch.
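The input preparation and the better-ear rule can be checked with a short worked sketch. The helper names are hypothetical, and the frame count is nominal: the exact number of spectrogram frames depends on the padding convention of the STFT implementation.

```python
SAMPLE_RATE = 16_000   # Hz, after re-sampling
CLIP_SECONDS = 30      # fixed Whisper input length
WINDOW_MS = 25
HOP_MS = 10

def pad_or_truncate(samples, length=SAMPLE_RATE * CLIP_SECONDS):
    """Zero-pad or cut a list of samples to exactly `length` samples."""
    padded = samples + [0.0] * max(0, length - len(samples))
    return padded[:length]

def better_ear(pred_left, pred_right):
    """Inference rule: keep the higher of the two per-ear predictions."""
    return max(pred_left, pred_right)

clip = pad_or_truncate([0.1] * (SAMPLE_RATE * 5))  # a 5 s toy recording
window = SAMPLE_RATE * WINDOW_MS // 1000  # window length in samples
hop = SAMPLE_RATE * HOP_MS // 1000        # stride in samples
frames = len(clip) // hop                 # nominal number of Mel frames
print(len(clip), window, hop, frames)
print(better_ear(0.4, 0.7))
```

Every utterance therefore reaches the Whisper encoder with the same nominal shape, regardless of its original duration.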
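The disjoint validation split can be sketched as follows. The record fields `listener` and `system` are hypothetical names, and one detail is an assumption: here the validation set keeps only records whose listener and system are both held out, while records involving just one of them are dropped from both sets.

```python
import random

def disjoint_split(records, n_listeners=2, n_systems=2, seed=0):
    """Hold out random listeners and systems to form a disjoint validation set."""
    rng = random.Random(seed)
    listeners = sorted({r["listener"] for r in records})
    systems = sorted({r["system"] for r in records})
    held_listeners = set(rng.sample(listeners, n_listeners))
    held_systems = set(rng.sample(systems, n_systems))
    train, val = [], []
    for r in records:
        listener_held = r["listener"] in held_listeners
        system_held = r["system"] in held_systems
        if listener_held and system_held:
            val.append(r)    # fully unseen listener and system
        elif not listener_held and not system_held:
            train.append(r)  # no overlap with the held-out sets
        # ASSUMPTION: records with only one held-out attribute are discarded.
    return train, val

# Toy data: 5 listeners x 5 systems, one record per pair.
records = [{"listener": f"L{i}", "system": f"E{j}"}
           for i in range(5) for j in range(5)]
train, val = disjoint_split(records)
print(len(train), len(val))
```

A split of this form tests generalisation directly, since no validation listener or system contributes any sample to training.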

RESULTS AND DISCUSSION
Table 1 shows the results for the primary, secondary and ensemble models on each of the data splits, as well as the overall result for the CPC2 baseline. The validation sets contain enhancement systems and listeners seen in the corresponding training data. The evaluation sets are disjoint, containing only unseen enhancement systems and listeners. The primary, secondary and ensemble prediction networks all beat the baseline on all evaluation sets.

Primary and secondary models
The primary and secondary systems show similar performance on the validation sets, with the ensemble of the two outperforming either. This is not replicated on the evaluation set, where the primary model outperforms the secondary model, and performs equivalently to the ensemble. Figure 3 shows the performance of the ensemble model for different correctness values. The model performs well for very low intelligibility (0 correctness) and for very high intelligibility (100 correctness) but performs less well between the two extremes. This corresponds with the distribution of true correctness scores in the training data (see Figure 1), in which 0 and 100 correctness are over-represented.

Model performance on unseen enhancement systems
All the models show lower performance on Evaluation Set 1. This appears to stem from the presence of audio enhanced by enhancement system E001 (the baseline in Clarity Enhancement Challenge 2) in this evaluation set. Audio enhanced by this system has an average correctness value of 28.7%, which is significantly lower than the average of the other two enhancement systems in Evaluation Set 1, E022 and E031, which have average correctness values of 73.0% and 84.3%, respectively. Although the proposed model can generalise to unseen enhancement systems, it predicts correctness less accurately on enhancement systems with lower performance. Figure 4 shows the proposed model's performance by enhancement system correctness across all enhancement systems. There is a clear trend, with the proposed system more accurately predicting the correctness for better-performing enhancement systems which produce outputs with high correctness ratings. Conversely, the proposed system less accurately predicts outputs from poorly-performing enhancement systems which produce outputs with low correctness ratings. Figure 5 shows the predicted (left) and true (right) correctness across all enhancement systems, showing that the proposed model overestimates the correctness of the two worst-performing enhancement systems, E001 and E038, while generally slightly underestimating the correctness of the better-performing enhancement systems.

Whisper layer weights
The learned weights for the Whisper decoder layers from the primary model are shown in Figure 6. These show how the model weights each decoder layer feature; the higher the value, the more useful the model finds the layer to be. The weights are shown for each of the three models trained on the different training splits.
The general pattern for the different training splits is similar, with layers 7 and 8 having the highest weights across all splits. This suggests that layers 7 and 8 contain the most relevant information for intelligibility. Interestingly, the model trained on Split 3 learns weights that emphasise layers 7 and 8 more strongly.

Performance comparison
Figure 7 shows the performance of the proposed model (P002) compared to all other challenge entries, as well as the challenge HASPI baseline. Prior is a system which always predicts the average of the intelligibility over the training set, regardless of the input. The proposed system is outperformed by only one other system, P011, which also utilises Whisper-derived features. The difference in performance is very slight, with P011 achieving an RMSE score of 25.1, while our system achieves 25.3.
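For reference, the RMSE metric and the Prior system can be sketched in a few lines. The correctness values below are invented toy numbers used only to exercise the functions; they are not challenge data.

```python
import math

def rmse(predicted, true):
    """Root mean squared error between two equal-length lists."""
    squared = [(p - t) ** 2 for p, t in zip(predicted, true)]
    return math.sqrt(sum(squared) / len(squared))

# Invented toy correctness values (percent), not challenge data.
train_correctness = [0.0, 100.0, 100.0, 50.0]
eval_correctness = [0.0, 100.0, 75.0]

# Prior: always predict the training-set mean, regardless of the input.
prior = sum(train_correctness) / len(train_correctness)
prior_rmse = rmse([prior] * len(eval_correctness), eval_correctness)
print(prior, prior_rmse)
```

Any useful predictor should score a lower RMSE than the Prior system, since the latter uses no information from the audio.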

CONCLUSIONS
We show that Whisper decoder layers are a useful feature representation for speech intelligibility prediction, with layers 7 and 8 appearing to be the most relevant.
Our proposed system performs substantially better than the HASPI regression baseline and all but one of the other challenge approaches, even outperforming intrusive systems which had access to the clean reference signal. It is able to generalise to unseen enhancement systems and listeners.
This work was supported by the Centre for Doctoral Training in Speech and Language Technologies (SLT) and their Applications funded by UK Research and Innovation [grant number EP/S023062/1]. This work was also supported by Toshiba and WS Audiology.

Fig. 1. Distribution of true correctness values in the training data.

Fig. 4. Model performance by mean hearing aid system correctness.

Fig. 5. Performance of proposed prediction system depending on enhancement system.

Fig. 6. Learned weights for the primary model Whisper decoder layers.

Fig. 7. Performance of the proposed model (P002) compared with all other challenge entries and the challenge HASPI baseline.
The exemplar-informed module incorporates a memory set of D exemplars, which are speech signals ŝ*_1[n], ..., ŝ*_D[n] drawn from the training data, with corresponding correctness values i*_1, ..., i*_D. The exemplars can be changed during training and for inference. Let y be the output of the attention pooling for input ŝ[n] (see Figure 2). The exemplars are processed in the same way as the input, producing exemplar outputs from the attention pooling y*_1, ..., y*_D. The output, r, of the exemplar module is given by

Table 1. Validation and evaluation set results.