Multi-CMGAN+/+: Leveraging Multi-Objective Speech Quality Metric Prediction for Speech Enhancement

Neural network based approaches to speech enhancement have proven particularly powerful, leveraging data-driven training to achieve significant performance gains over other approaches. Such approaches rely on artificially created labelled training data, so that the neural model can be trained using intrusive loss functions which compare the output of the model with clean reference speech. The performance of such systems when enhancing real-world audio often suffers relative to their performance on simulated test data. In this work, a non-intrusive multi-metric prediction approach is introduced, wherein a model trained on artificially created labelled data is optimised using inference of an adversarially trained metric prediction neural network. The proposed approach shows improved performance versus state-of-the-art systems on the evaluation sets of the recent CHiME-7 challenge unsupervised domain adaptation speech enhancement (UDASE) task.

Index Terms: speech enhancement, model generalisation, generative adversarial networks, conformer, metric prediction


INTRODUCTION
For training of supervised neural-network based speech enhancement systems, there is often a mismatch between the synthetic data used to train the system and real-world recordings. This can lead to poor performance of such systems in the wild, even if intrusive evaluation metrics on synthetic data are high. A compounding factor is that metrics designed to measure speech quality do not always correlate strongly with actual human assessment of speech audio quality [1, 2], and often require access to clean reference/label audio which may not be readily available for real-life recordings. Recently, several new metrics [3, 4, 5] have been proposed which attempt to directly predict human quality assessment in a non-intrusive way, i.e. without requiring the clean speech reference. These take the form of neural networks which are trained on vast datasets of distorted audio to predict the quality label assigned to the audio by human assessors. Self-Supervised Speech Representations (SSSRs) have also been found to be useful feature representations for the prediction of audio quality [6].

This paper describes a system which builds on the authors' entry [7] to the CHiME-7 challenge UDASE [8] track. It attempts to address the problem of model adaptation to real-world data via a metric prediction generative adversarial network (GAN) based methodology. A non-intrusive GAN discriminator is trained to predict multiple metrics, including a MOS-related metric as well as a traditional intrusive signal quality metric. Historical training data from a conventional generator and an additional pseudo-generator is used to increase the diversity of the training data. Then, during the training of the speech enhancement generator, inference of the multi-metric prediction discriminator is used to optimise the enhanced outputs towards the target metrics. In this way, metrics which cannot be used directly as loss functions, as well as those which require access to a reference signal, can be optimised.

The remainder of this paper is structured as follows. The target metrics are described in Section 2. A description of the proposed Multi-CMGAN+/+ model is given in Section 3. Experimental setup and results are discussed in Sections 4 and 5, respectively. Finally, Section 6 draws some conclusions from the findings of the paper.

SPEECH QUALITY METRICS
Two speech quality metrics, Perceptual Evaluation of Speech Quality (PESQ) and Deep Noise Suppression Mean Opinion Score (DNSMOS), are used as target metrics which the speech enhancement generator in our proposed system is trained to optimise towards.

PESQ
Perceptual Evaluation of Speech Quality (PESQ) [9] is a well-known intrusive speech quality measure. It takes the time-domain signal of the clean reference audio s[n] and the time-domain signal to be evaluated, e.g. the noisy signal x[n], and returns a value Q_PESQ between 1 and 4.5 which represents the quality of the test signal, higher meaning better quality:

Q_PESQ = PESQ(s[n], x[n]).    (1)

The formulation of PESQ is non-differentiable, so direct use of it as a loss function for training enhancement models is not possible.

DNSMOS
Deep Noise Suppression Mean Opinion Score (DNSMOS) [3] is a non-intrusive speech quality metric. It consists of a neural network which was trained to predict human Mean Opinion Score (MOS) ratings for speech signals. As it is non-intrusive, it is particularly useful for assessing the quality of real-world recordings such as the CHiME-7 UDASE challenge test set, and was one of the evaluation metrics used in assessing the entries to the challenge. For an input time-domain speech signal s[n], DNSMOS estimates three values, corresponding to the well-known composite measure [10]:

[Q_SIG, Q_BAK, Q_OVR] = DNSMOS(s[n]),    (2)

where Q_SIG, Q_BAK and Q_OVR are each values between 1 and 5 which represent the estimated speech quality, background noise quality and overall quality, respectively (higher values indicating better quality). In this work, the non-neural implementation of DNSMOS provided in the CHiME-7 baseline system is used.

Non-intrusive Metric Prediction
While DNSMOS is a neural network, meaning it is theoretically possible to backpropagate through it and use it directly in a loss function, it is not publicly available in this form. Similarly, the computation of PESQ is non-differentiable and requires access to a reference signal, meaning it cannot be used in most real-world scenarios. In order to incorporate DNSMOS and PESQ in loss functions for speech enhancement in this work, a non-intrusive metric prediction discriminator [11] is trained to create differentiable 'clones' of the metrics. This has the added benefit of allowing for adversarial training of the metric prediction network in a GAN setting [12]. In the following, Q is used to represent one of the target metrics in (1) and (2), and Q′ is the respective value normalised between 0 and 1.
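The normalisation from Q to Q′ can be sketched as a simple linear rescaling of each metric onto [0, 1] using its value range (1 to 4.5 for PESQ, 1 to 5 for the DNSMOS components). The function below is an illustrative sketch, not the paper's exact implementation:

```python
# Illustrative mapping from raw metric values Q to normalised values Q'.
# The per-metric ranges are taken from the descriptions above.
METRIC_RANGES = {
    "PESQ": (1.0, 4.5),
    "SIG": (1.0, 5.0),
    "BAK": (1.0, 5.0),
    "OVR": (1.0, 5.0),
}

def normalise(metric: str, q: float) -> float:
    """Map a raw metric value onto [0, 1] by linear rescaling."""
    lo, hi = METRIC_RANGES[metric]
    return (q - lo) / (hi - lo)
```

With this convention, the best possible score of every target metric maps to 1, which is what the GAN loss later pushes the generator towards.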

SPEECH ENHANCEMENT SYSTEM
The overall architecture of the proposed system is based on the conformer-based metric GAN (CMGAN) framework proposed in [13], with two extensions based on [14] and [15]. The first extension is to train the discriminator D on a historical set of past generator outputs in every epoch. The second extension is to train D to predict the metric scores of noisy, clean and enhanced audio, as well as of the output of a secondary pseudo-generator network N which is designed to increase the range of metric values observed by D. This work introduces a new structure for D, allowing it to predict multiple metrics at once, as well as a new input feature derived from a pre-trained SSSR model.

The conformer generator G is based on the best performing CMGAN configuration in [13]. The network combines mapping and masking approaches for spectral speech enhancement, utilising a conformer [16] based bottleneck. The model's inputs are the short-time Fourier transform (STFT) components of the complex-valued noisy audio, X_Re and X_Im, with a reasonably high temporal resolution (hop size of 6 ms with a 50% overlap, and a fast Fourier transform (FFT) length of 400 samples). The outputs of the model are the enhanced real and imaginary STFT components Ŝ_Re and Ŝ_Im, from which the enhanced time-domain audio ŝ[n] is obtained by the inverse short-time Fourier transform (ISTFT). Note that the time index n is omitted for clarity in the following.
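The analysis/synthesis pipeline around G can be sketched as an STFT round trip. The snippet below uses scipy with an FFT length of 400 samples and 50% overlap as stated above; the sampling rate and window choice are illustrative assumptions, and the identity mapping stands in for the enhancement network:

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16_000                                        # assumed sampling rate
x = np.random.default_rng(0).standard_normal(fs)   # 1 s of noise as stand-in audio

# Analysis: complex STFT components (the model's inputs X_Re, X_Im).
f, t, X = stft(x, fs=fs, nperseg=400, noverlap=200)
X_re, X_im = X.real, X.imag

# A real model would map (X_re, X_im) to enhanced components here;
# the identity is used so the round trip can be checked.
S_hat = X_re + 1j * X_im

# Synthesis: ISTFT back to a time-domain signal.
_, s_hat = istft(S_hat, fs=fs, nperseg=400, noverlap=200)
```

Because the default Hann window at 50% overlap satisfies the constant-overlap-add condition, the identity round trip reconstructs the input signal.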

Generator Loss Function
The generator model G is trained with a multi-term loss function

L_G = γ₁ L_GGAN + γ₂ L_GTime + γ₃ L_GSISDR,    (3)

where γ₁, γ₂ and γ₃ are weighting hyperparameters. L_GGAN minimises the distance

L_GGAN = E[ ||D(Ŝ_FE) − 1||² ],    (4)

which represents an assessment of the enhanced signal by the metric discriminator D. D(Ŝ_FE) is the inference of the metric prediction discriminator D given the enhanced signal as input, which has an output of dimension N_Q × 1 representing the N_Q predicted normalised Q′ values of the target metrics, i.e. N_Q equals 3 when using (2). The 1 vector in (4), also of length N_Q, represents the highest possible target metric values normalised between 0 and 1. Thus, the net effect of this loss term is to encourage G to maximise the predicted scores assigned to its outputs by D. L_GTime is the mean absolute error between the enhanced and clean time-domain signals:

L_GTime = E[ ||s − ŝ||₁ ].    (5)

Finally, L_GSISDR is the scale-invariant signal-to-distortion ratio (SI-SDR) [17] loss

L_GSISDR = −SI-SDR(s, ŝ).    (6)

With the exception of (4), all terms of L_G require access to the clean label/reference audio s.
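The individual loss terms can be sketched in plain numpy as follows; batch handling and the expectation are omitted, so this is an illustrative sketch rather than the training code:

```python
import numpy as np

def gan_term(d_pred: np.ndarray) -> float:
    """(4): squared distance between D's predicted normalised metrics
    and the all-ones vector of ideal scores."""
    return float(np.sum((d_pred - 1.0) ** 2))

def time_term(s: np.ndarray, s_hat: np.ndarray) -> float:
    """(5): mean absolute error between clean and enhanced signals."""
    return float(np.mean(np.abs(s - s_hat)))

def si_sdr(s: np.ndarray, s_hat: np.ndarray) -> float:
    """SI-SDR in dB; (6) uses its negative as a loss."""
    alpha = np.dot(s_hat, s) / np.dot(s, s)   # optimal scaling of the target
    target = alpha * s
    noise = s_hat - target
    return float(10.0 * np.log10(np.sum(target ** 2) / np.sum(noise ** 2)))
```

Note that si_sdr is unchanged if s_hat is multiplied by any non-zero scalar, which is the scale invariance the metric is named for.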

Block Processing for Longer Inputs
Due to the quadratic time complexity of the transformer layers in the conformer blocks, processing long sequences can be unfeasible due to high memory requirements. Transformers are also typically unsuitable for continuous processing, as the entire sequence is required to compute self-attention. To address these issues, input signals are processed in overlapping blocks of 4 s for evaluation and inference, as this has been shown to be an optimal signal length for attention-based enhancement models [18]. A 50% overlap with a Hann window is used to cross-fade each block with one another. Models are trained with 4 s signal length limits [18].
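The overlap-add block scheme described above can be sketched as follows. A periodic Hann window at 50% overlap sums to one in the interior of the signal, so an identity "enhancer" reconstructs the input there; the block length and padding strategy below are illustrative:

```python
import numpy as np
from scipy.signal import get_window

def process_in_blocks(x: np.ndarray, block_len: int, enhance) -> np.ndarray:
    """Process x in 50%-overlapping blocks, cross-fading with a Hann window."""
    hop = block_len // 2
    win = get_window("hann", block_len)   # periodic Hann: COLA at 50% overlap
    pad = (-(len(x) - block_len)) % hop   # zero-pad so blocks tile the signal
    xp = np.concatenate([x, np.zeros(pad)])
    out = np.zeros_like(xp)
    for start in range(0, len(xp) - block_len + 1, hop):
        out[start:start + block_len] += win * enhance(xp[start:start + block_len])
    return out[:len(x)]
```

In the real system, `enhance` would be the conformer generator applied to a 4 s block (64,000 samples at 16 kHz); here any callable on a block works.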

Metric Estimation Discriminator
The discriminator D part of the GAN structure is trained to predict three normalised speech quality metrics for a given input signal. Inference of D is used in (4) as one of the loss terms of G, and as the sole loss function of N in (10), enforcing an optimisation towards the target metrics.
We experiment with training D to predict each of the outputs of DNSMOS (i.e. Q_SIG, Q_BAK or Q_OVR), as well as PESQ (Q_PESQ).

HuBERT Encoder Feature Representations
Recent work in metric prediction [19, 6] shows that SSSRs are useful as feature extractors for capturing quality-related information about speech audio. As such, the proposed system makes use of the Hidden Unit BERT (HuBERT) [20] SSSR as a feature extractor for the metric prediction component of the proposed framework. HuBERT, like most SSSRs which take time-domain signals as input, consists of two distinct network stages: the first stage, H_FE(•), comprises several 1D convolutional layers which map the input time-domain audio s[n] into a 2D representation S_FE, while the second stage, H_OL(•), consists of a number of transformer layers which take the output of the first stage as input. Recent work in speech enhancement [6, 21, 22] has found that the outputs of the encoder stage H_FE(•) are particularly useful for capturing quality-related information, outperforming the final transformer layer and weighted sums of each transformer output. The outputs of H_FE(•) are 2D representations with dimensions 512 × T, where T depends on the length of the input audio in seconds. The HuBERT model used in this work is trained on 960 hours of audio-book recordings from the LibriSpeech [23] dataset, sourced from the FairSeq GitHub repo 1 . This HuBERT encoder representation is used as a fixed feature extractor; its parameters are not updated during the training of the metric prediction network.

Discriminator Network Structure
The discriminator network structure consists of 2 bi-directional long short-term memory (BLSTM) layers followed by three parallel attention feed-forward layers with sigmoid activations, similar to the network proposed in [19]. Each attention feed-forward layer outputs a single neuron which represents the prediction value of one of the three target metrics.
The input to D is the output of the HuBERT feature encoder H_FE(•). The output of D has dimension B × N_Q, where B is the batch size and each of the N_Q values represents a normalised predicted metric value. Note that inference of D is always non-intrusive, even if one of its target metrics, such as PESQ, is intrusive.
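A minimal PyTorch sketch of this structure is given below. The layer sizes, the exact form of the attention pooling, and the head layout are assumptions for illustration; the paper's configuration may differ:

```python
import torch
import torch.nn as nn

class MultiMetricDiscriminator(nn.Module):
    """Sketch of D: 2 BLSTM layers over HuBERT encoder features (512-dim),
    then one attention-pooling + feed-forward head per target metric,
    each ending in a sigmoid so predictions lie in (0, 1)."""
    def __init__(self, feat_dim=512, hidden=256, n_metrics=3):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.heads = nn.ModuleList(
            nn.ModuleDict({
                "att": nn.Linear(2 * hidden, 1),   # attention scores over time
                "ff": nn.Linear(2 * hidden, 1),    # per-metric output neuron
            })
            for _ in range(n_metrics)
        )

    def forward(self, feats):                      # feats: (B, T, 512)
        h, _ = self.blstm(feats)                   # (B, T, 2*hidden)
        outs = []
        for head in self.heads:
            w = torch.softmax(head["att"](h), dim=1)   # (B, T, 1) weights
            pooled = (w * h).sum(dim=1)                # (B, 2*hidden)
            outs.append(torch.sigmoid(head["ff"](pooled)))
        return torch.cat(outs, dim=-1)             # (B, n_metrics)
```

Each head pools the BLSTM outputs over time with learned attention weights before its feed-forward layer, so D handles variable-length inputs non-intrusively.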

Discriminator Loss Function
Within each epoch, first the discriminator D is trained on the current training elements using the mean squared error loss

L_D = E[ ||D(S_FE) − Q′_s||² + ||D(X_FE) − Q′_x||² + ||D(Ŝ_FE) − Q′_ŝ||² + ||D(Y_FE) − Q′_y||² ],    (9)

where S_FE, X_FE, Ŝ_FE and Y_FE are the HuBERT encoder representations, i.e. after H_FE(•), of the clean signal s, the noisy signal x, the signal enhanced by G, ŝ, and the signal enhanced by N, y, and Q′_s, Q′_x, Q′_ŝ and Q′_y are the true target metric scores of the respective input audio, normalised between 0 and 1. Note that the Q′ vectors in (9) can be shorter than 3 if fewer than N_Q = 3 metrics are considered. This is followed by a historical training stage, where D is trained to predict the metric scores of past outputs of the generative networks G and N.
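The discriminator update can be sketched as a sum of squared errors between predicted and true normalised metric vectors over the four signal types. The helper below is an illustrative sketch, with D standing in as a callable and the dictionary keys being assumed names:

```python
import numpy as np

def discriminator_loss(d, feats: dict, targets: dict) -> float:
    """Sum of squared errors between D's predictions and the true
    normalised metric vectors for each signal type in `feats`.
    Keys such as 'clean', 'noisy', 'enhanced_G', 'enhanced_N' are
    illustrative stand-ins for the four terms of (9)."""
    loss = 0.0
    for key, f in feats.items():
        loss += float(np.sum((d(f) - targets[key]) ** 2))
    return loss
```

When D's predictions match every target vector exactly, the loss is zero; each mismatched metric contributes its squared error.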

Historical Training
The training procedure of D uses historical training data, as first proposed in the MetricGAN+ framework [14]. In this stage, a sample of enhanced audio outputs from past epochs of G and N is used to train D. The aim of this is to prevent D from 'forgetting' how to assess audio which is dissimilar to the current outputs of the enhancement network. In each epoch, D is trained using a randomly selected 10% of the outputs of the generator models from past epochs.

Metric Data Augmentation Pseudo-Generator
As first proposed in [15], a secondary speech enhancement network N is trained, and its outputs y are used to train the metric prediction discriminator D (the last term in (9)). This model is trained solely using a GAN loss similar to (4), following the original MetricGAN framework:

L_N = E[ ||D(Y_FE) − w · 1||² ],    (10)

where w is a hyperparameter value which corresponds to the target normalised DNSMOS score which the output audio of N is trained to obtain. Following on from prior work [7], here we fix the value of w at 1, meaning that N is trained to enhance relative to the target metrics, rather than to 'de-enhance' with a lower value of w. The network structure of N is based on the original MetricGAN enhancement model, consisting of a BLSTM which operates on a magnitude spectrogram representation of the input, followed by 3 linear layers. Its output is a magnitude mask which is multiplied with the noisy input spectrogram to produce an enhanced spectrogram Y_SPEC, from which a time-domain signal y[n] is constructed by the overlap-add method using the original noisy phase.

Training Setup
The framework is trained on simulated labelled data from LibriMix [24] for 200 epochs, following a similar data-loading system as in [8], generating mixtures of a single speaker with noise. The labelled LibriMix training set consists of 33,900 clean/noisy audio pairs, with the clean speech sourced from the LibriSpeech [23] dataset and the added noise from the WHAM! [25] dataset. In each epoch, 300 samples from the training set are randomly selected. These are first used to train the metric prediction discriminator D using (9). This is followed by the training of D on the historical set. Then the 300 random samples are used to train N using inference of D with (10), followed finally by the training of G using (3), which also uses inference of D. Different combinations of the DNSMOS terms and PESQ are experimented with as the target metrics for D.

The proposed models are evaluated on the CHiME-7 UDASE task [8] evaluation sets. These are a real-world unlabelled set consisting of CHiME-5 recordings, which is evaluated using DNSMOS, and a simulated labelled set consisting of reverberant LibriMix audio, which is evaluated using SI-SDR.
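The per-epoch schedule described above can be sketched as follows. The stage names, signatures, and sample counts are illustrative stand-ins for the real update steps, not the paper's code:

```python
import random

def train_one_epoch(train_set, history, steps, n_samples=300, hist_frac=0.1):
    """One epoch of the alternating schedule: D on current samples (9),
    D on a 10% sample of historical outputs, then N (10), then G (3).
    `steps` maps assumed stage names to user-supplied callables."""
    batch = random.sample(train_set, k=min(n_samples, len(train_set)))
    steps["train_D"](batch)                      # D on current samples, (9)
    if history:
        k = max(1, int(hist_frac * len(history)))
        steps["train_D_hist"](random.sample(history, k=k))  # historical stage
    steps["train_N"](batch)                      # pseudo-generator via D, (10)
    steps["train_G"](batch)                      # generator via D, (3)
```

The ordering matters: D is updated before N and G so that the generator losses in (3) and (10) use the freshly trained metric predictor.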
The proposed system is compared to our prior entry to the CHiME-7 UDASE challenge [7], as well as to the challenge baselines [8]. Source code will be available at².

RESULTS
Table 1 shows the results of the proposed framework in terms of DNSMOS on the CHiME-7 UDASE task real evaluation set. The proposed systems significantly outperform the baseline systems in all measures, while also outperforming the authors' prior work CMGAN+/+ in terms of OVR and BAK. However, CMGAN+/+ still outperforms the proposed system in terms of SIG, which is the only metric it is optimised towards.

Table 1. DNSMOS results on the CHiME-5 eval set.

Table 2 shows the results of the proposed framework in terms of SI-SDR on the CHiME-7 UDASE task simulated evaluation set. Here, the weakness of the proposed system relative to the CHiME-7 baseline systems is apparent, with our proposed framework significantly degrading the input, with the exception of the model which does not optimise the SIG component of DNSMOS.

CONCLUSION
In this work, a GAN framework utilising a multi-metric prediction discriminator is introduced. A number of combinations of target metrics for this prediction network are experimented with, and improved performance on a test set consisting of real data is shown. However, a degradation in performance on a simulated test set is also shown, suggesting a significant distortion in the enhanced outputs of the proposed system.


Table 2. SI-SDR results on the reverberant LibriCHiME eval set.