Combining Conformer and Dual-Path-Transformer Networks for Single Channel Noisy Reverberant Speech Separation

Separation of overlapping speakers remains an active area of speech technology research. Many deep neural network (DNN) separation models propose modelling local and global temporal context separately using alternating DNN layers. Two such models are SepFormer and TD-Conformer. The largest configurations of each have comparable computational cost and similar performance, with SepFormer performing better on anechoic data and TD-Conformer yielding better results on noisy reverberant data. This work combines these two model types to gain insights into how their computational characteristics affect their performance. The generalization benefits of the larger model size of the conformer layers are demonstrated on both WHAMR and the out-of-domain far-field evaluation set MC-WSJ-AV across a number of evaluation metrics. The proposed model achieves 22.1 dB and 14.7 dB average scale-invariant signal-to-distortion ratio (SISDR) improvement when trained and evaluated on WSJ0-2Mix and WHAMR, respectively. The model trained using WHAMR achieves 4.3 dB average SISDR improvement on the out-of-domain MC-WSJ-AV dataset.


INTRODUCTION
Speech separation and related technologies such as speaker extraction are important for many real-world applications [1] such as digital assistants [2,3], automatic meeting transcription [4,5] and assistive hearing [6]. Recent speech separation research has heavily focused on time-domain audio separation network (TasNet), dual-path (DP) modelling and attention (or transformer) networks [7][8][9][10]. TasNet models, first proposed in [11], are typically composed of an encoder, a mask estimation or mapping network, and a decoder, where the encoder encodes signals from the time domain using a neural network layer and the decoder decodes this neural representation back into the time domain [12]. DP separation models, proposed in [13], use alternating neural network layers to process local and global contexts separately. One such DP transformer model, known as SepFormer [14], is one of the most performant models on separation benchmarks such as WSJ0-Mix and WHAMR [15,16]. In this model, the input sequence is first split into fixed-size chunks which are input to a transformer layer for processing the local context. The chunk size and number of chunks axes are swapped and the tensor is then processed by another transformer layer to model the global context. An analogue of the DP structure is the conformer model [9,17,18], where the local context is processed by a convolution module instead of a transformer. This approach generally has lower computational complexity for processing the local context for a fixed feature dimension but comes at the cost of increased model size. In the DP layers, the swapping of the axes, as opposed to processing the reconstructed feature sequence, reduces the computational complexity of the attention function in the global context layer. In the time domain conformer (TD-Conformer) model, proposed in [9], a subsampling layer is similarly used to reduce the temporal resolution of the input sequence and thus the computational complexity of the transformer layer used to process the global context.
In this work the convolutional separation transformer (ConSepT) model is proposed. ConSepT is a mixed conformer and DP transformer model. The motivation for combining the two layer types is that, controlling for model complexity, DP transformer models have been shown to be more performant on anechoic speech mixtures whereas conformer models have been shown to be more performant on noisy reverberant speech mixtures [9,14]. This contrast is explored by mixing the two layer types to analyse whether a higher overall performance can be obtained by combining them. The model is structured so that conformer layers process the earlier features in the network, under the assumption that they contain more noise, and the DP transformer layers process the cleaner features that follow. There are two key contrasts between the two model types: firstly, the conformer layers result in a larger model size, primarily due to the convolutional local context layer in the conformer block; secondly, as implemented in this paper and in [9,14], the DP transformer layers typically devote significantly more computational complexity to processing local context, whereas conformer layers devote most of theirs to processing global context. The consequences of these characteristics of each layer type are explored with respect to model performance as well as model generalisation by varying the number of each type of layer while keeping the overall number of layers in the network constant. To do this, we contrast the generalisation benefits gained from using dynamic mixing (DM), where new training data is simulated for each epoch, and evaluate models trained on the simulated WHAMR dataset with the real recorded MC-WSJ-AV corpus. A preprocessing script for aligning the MC-WSJ-AV recordings is also provided as part of this work.
In Section 2 the signal model is discussed. Section 3 introduces the ConSepT model. In Section 4 the training configurations and experimental setup are discussed. Results are given in Section 5 and conclusions in Section 6.

SIGNAL MODEL
A discrete-time single-channel noisy reverberant speech mixture signal $x[i]$ of length $L_x$ samples, composed of $C$ speaker signals $s_c[i]$, $c \in \{1, \ldots, C\}$, is defined as

$$x[i] = \sum_{c=1}^{C} h_c[i] * s_c[i] + \nu[i], \qquad (1)$$

where $*$ denotes convolution, $h_c[i]$ the room impulse response (RIR) corresponding to speaker $c$ and $\nu[i]$ additive noise. The goal of this paper is to estimate the $C$ clean speech signals $s_c[i]$; these estimates are denoted by $\hat{s}_c[i]$.
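As an illustration of the signal model in (1), the following minimal Python sketch builds a toy mixture from speaker signals, RIRs and noise. The function names (`convolve`, `mix`) and the toy values are assumptions made for this example only; they are not taken from the original work, and the mixture is simply truncated to the length of the noise signal.

```python
def convolve(signal, rir):
    """Full linear convolution of a signal with a room impulse response."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out


def mix(speakers, rirs, noise):
    """Return x[i] = sum_c (h_c * s_c)[i] + nu[i], truncated to len(noise)."""
    x = [float(v) for v in noise]
    for s, h in zip(speakers, rirs):
        for i, v in enumerate(convolve(s, h)[:len(x)]):
            x[i] += v
    return x


# Toy example (hypothetical values): two speakers, the first with one echo.
speakers = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
rirs = [[1.0, 0.5], [1.0]]
noise = [0.0, 0.0, 0.0]
mixture = mix(speakers, rirs, noise)
```

In practice the convolution would be carried out with an optimised library routine; the nested loops are used here only to keep the sketch self-contained.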

THE CONSEPT SPEECH SEPARATION MODEL
The proposed ConSepT model is described in the following. The model uses a TasNet architecture, composed of an encoder, mask estimation network and decoder, see Fig. 1. The mixture signal $x[i]$ in (1) is first chunked into $L_x$ blocks $x_\ell$ of length $L_{BL}$ with 50% overlap. Each block $x_\ell$ with block index $\ell$ is then encoded into a feature vector $w_\ell$ which is passed to the mask estimation network to produce masks $m_{c,\ell}$ for each speaker. The encoded feature vectors are then masked for each speaker before being decoded back into the time domain.

Encoder
The encoder is composed of a single 1D convolutional layer that encodes time-domain blocks of the mixture signal $x_\ell \in \mathbb{R}^{1 \times L_{BL}}$ using a weight matrix $B \in \mathbb{R}^{L_{BL} \times N}$ with feature dimension $N$, and a rectified linear unit (ReLU) activation function $H(\cdot)$ to give $L_x$ encoded feature vectors

$$w_\ell = H(x_\ell B) \in \mathbb{R}^{1 \times N}. \qquad (2)$$
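The blocking and encoding steps can be sketched as follows. This is a minimal illustration that assumes trailing samples which do not fill a complete block are dropped (the original text does not state how the signal end is handled); the function names and the toy weight matrix are hypothetical.

```python
def frame(x, block_len):
    """Split x into blocks of length block_len with 50% overlap.

    Assumption: samples at the end that do not fill a block are dropped.
    """
    hop = block_len // 2
    return [x[i:i + block_len] for i in range(0, len(x) - block_len + 1, hop)]


def encode(blocks, B):
    """Apply w = ReLU(x_block B) to each block; B has shape block_len x N."""
    n_features = len(B[0])
    return [
        [max(0.0, sum(xk * B[k][j] for k, xk in enumerate(block)))
         for j in range(n_features)]
        for block in blocks
    ]


blocks = frame([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], block_len=4)
B = [[1.0, -1.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # hypothetical, N = 2
features = encode(blocks, B)
```

Expressing the 1D convolution as a per-block matrix product is equivalent when the stride equals the hop size, which is why the encoder can be described by a single weight matrix.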

Mask Estimation Network
The mask estimation network comprises two sub-networks processed sequentially. The first is a conformer network with subsampling layers, based on [9], and the second is a dual-path transformer network, based on [14].
The conformer sub-network uses subsampling and supersampling layers to reduce the computational complexity of the subsequent transformer layers in the conformer blocks. The subsampling is performed using a projection layer followed by a 1D convolutional layer with a kernel size of 4 and a stride of 2, thus reducing the temporal resolution by a factor of 2. The effect of this subsampling on performance is explored in [9], where using a single subsampling layer is found to give a good trade-off between performance and efficiency. A set of $R_{\mathrm{conf}}$ conformer layers follows the subsampling layer. Each conformer layer is composed of four modules: a feed-forward module with internal feature dimension $B_{\mathrm{cffn}}$, a convolution module with kernel size $P_{\mathrm{conv}}$ and dimension $B_{\mathrm{conv}}$, a multihead self-attention (MHSA) module with positional encoding (PE) and $d_{\mathrm{conf}}$ attention heads, and another feed-forward module with dimension $B_{\mathrm{cffn}}$ [9]. A supersampling layer, composed of a transposed 1D convolutional layer that reverses the subsampling layer, follows the conformer layers. A final projection layer in the sub-network transforms the feature dimension back to $N$.
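To make the effect of the subsampling layer concrete, the sketch below computes the output length of a strided 1D convolution and a rough count of the MACs spent on the attention score and weighting products. The padding of 1 and the feature dimension of 256 are assumptions for illustration (neither is stated above), and the MAC estimate deliberately ignores the linear projections inside the attention module.

```python
def conv1d_out_len(length, kernel=4, stride=2, padding=1):
    """Standard output-length formula for a strided 1D convolution.

    padding=1 is an assumption; the original text does not specify it.
    """
    return (length + 2 * padding - kernel) // stride + 1


def attention_score_macs(seq_len, dim):
    """MACs for Q K^T plus the attention-weighted sum of V (projections excluded)."""
    return 2 * seq_len * seq_len * dim


full_len = 1000                      # hypothetical number of encoded frames
sub_len = conv1d_out_len(full_len)   # frames after subsampling
saving = attention_score_macs(full_len, 256) / attention_score_macs(sub_len, 256)
```

Because the attention cost is quadratic in sequence length, halving the temporal resolution reduces this part of the computation by roughly a factor of four.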
The DP transformer network is composed of a series of alternating local and global transformer layers, with each combined local and global transformer layer being referred to as a single DP transformer layer. The output of the supersampling layer of the conformer sub-network is first reorganised into overlapping chunks of length $P_{\mathrm{DPT}}$. The chunks are then processed by the local transformer of dimension $B_{\mathrm{intra}}$ with $d_{\mathrm{intra}}$ attention heads. Following this, the axes for the chunk size and the number of chunks are swapped and the sequence is processed by the global context transformer of dimension $B_{\mathrm{inter}}$ with $d_{\mathrm{inter}}$ heads. The axes are then swapped back and the result is passed through a further $X_{\mathrm{DPT}}$ layers; the entire network is repeated $R_{\mathrm{DPT}}$ times.
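The chunking and axis-swapping operations of the dual-path structure can be sketched as below. The 50% chunk overlap and the zero-padding of the final chunk are assumptions made for this example (the text above only states that the chunks overlap), and scalars stand in for feature vectors to keep the example small.

```python
import math


def chunk(seq, chunk_len):
    """Split seq into chunks of length chunk_len with an assumed 50% overlap.

    The sequence is zero-padded so that the last chunk is complete.
    """
    hop = chunk_len // 2
    n_chunks = 1 + max(0, math.ceil((len(seq) - chunk_len) / hop))
    pad = (n_chunks - 1) * hop + chunk_len - len(seq)
    padded = list(seq) + [0] * pad
    return [padded[i * hop:i * hop + chunk_len] for i in range(n_chunks)]


def swap_axes(chunks):
    """Swap the chunk-index and within-chunk axes (a 2D transpose)."""
    return [list(column) for column in zip(*chunks)]


local_view = chunk(list(range(10)), chunk_len=4)   # rows: one chunk each
global_view = swap_axes(local_view)                # rows: one position across chunks
```

The local transformer attends within each row of `local_view`, while the global transformer attends within each row of `global_view`, so neither attention operates over the full-length sequence.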
The final part of the network is a linear layer followed by a ReLU activation function that takes the output of the DP transformer network to produce a series of masks, $m_\ell$.

Decoder
The decoder is a transposed 1D convolutional layer with weights $U \in \mathbb{R}^{N \times L_{BL}}$ which transforms the masked encoded features $w_\ell \odot m_\ell$ back into overlapping time-domain blocks

$$\hat{s}_\ell = (w_\ell \odot m_\ell) U. \qquad (3)$$

The estimated time-domain speech signal $\hat{s}[i]$ is then reconstructed from the signal blocks $\hat{s}_\ell$ using the overlap-add method.
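A minimal sketch of the masking, decoding and overlap-add steps is given below. The function names and toy values are hypothetical, and no synthesis window or overlap normalisation is applied, which is an assumption of this example rather than a statement about the original implementation.

```python
def decode(w, m, U):
    """Decode one masked feature vector: (w * m) U, with U of shape N x block_len."""
    masked = [wi * mi for wi, mi in zip(w, m)]   # Hadamard product
    block_len = len(U[0])
    return [sum(masked[n] * U[n][k] for n in range(len(masked)))
            for k in range(block_len)]


def overlap_add(blocks, hop):
    """Reconstruct a signal by summing blocks placed hop samples apart."""
    block_len = len(blocks[0])
    out = [0.0] * ((len(blocks) - 1) * hop + block_len)
    for index, block in enumerate(blocks):
        for k, value in enumerate(block):
            out[index * hop + k] += value
    return out


# Toy example (hypothetical values): N = 2 features, block length 2.
block = decode([2.0, 3.0], [1.0, 0.0], [[1.0, 1.0], [5.0, 5.0]])
signal = overlap_add([[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]], hop=2)
```

With 50% overlap the hop equals half the block length, so every interior sample receives contributions from two consecutive blocks.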

Data
The WSJ0-2Mix [15] and WHAMR [16] datasets are used for training and evaluating models. WSJ0-2Mix is a simulated 2-speaker dataset of anechoic mixtures. WHAMR is a simulated noisy reverberant extension of WSJ0-2Mix. The 8 kHz min configuration is used. The min configuration means mixtures are truncated to the shortest utterance as opposed to padding the shorter utterance to the longer one. The MC-WSJ-AV dataset [20] is also used for evaluating models on out-of-domain unseen data. The olap part of this dataset contains recorded far-field multi-channel recordings of 2-speaker mixtures. The 20k subset of the dataset is used. The 1st channel of array 1 is used as the input mixture and headset microphones are used as reference signals. Preprocessing steps were performed to make the data suitable for evaluation. First, the audio is resampled from 16 kHz to 8 kHz as MC-WSJ-AV was recorded at 16 kHz. Both headset recordings were then aligned to the array signal using a cross-correlation method for computing time delays [21].
The loudness of the array channel was adjusted to match that of the sum of the headset channels using the pyloudnorm toolkit [22]. This minimizes the possibility of signal energy having an impact on the evaluation, as the WHAMR mixture loudness is more similar to the targets than the array channels are to the headsets in MC-WSJ-AV. The preprocessing script is available on GitHub to allow reproducibility.
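The alignment step can be illustrated with a plain time-domain cross-correlation search. This is only a sketch under stated assumptions: the exact method of [21] and the released preprocessing script may differ (for example by using a weighted or frequency-domain correlation), and the names `estimate_lag` and `shift` are hypothetical.

```python
def estimate_lag(reference, other, max_lag):
    """Return the lag (in samples) of `other` relative to `reference`
    that maximises their cross-correlation within +/- max_lag."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, r in enumerate(reference):
            j = i + lag
            if 0 <= j < len(other):
                score += r * other[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def shift(signal, lag):
    """Advance signal by lag samples, zero-padding the vacated positions."""
    n = len(signal)
    return [signal[i + lag] if 0 <= i + lag < n else 0.0 for i in range(n)]


# Toy example: `late` is `ref` delayed by two samples.
ref = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0]
late = [0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 1.0, 0.0]
lag = estimate_lag(ref, late, max_lag=3)
aligned = shift(late, lag)
```

The brute-force search costs on the order of the signal length times the number of lags; for full-length recordings an FFT-based correlation would normally be preferred.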

Training Configuration
The models use a training configuration similar to that of the TD-Conformer [9], with a learning rate of $10^{-5}$ that is fixed for 90 epochs and then reduced if there is no performance improvement after 3 epochs.
Training signal lengths (TSLs) are limited to 4 s and randomly sampled from the original training example [23]. The feature dimensions of the conformer layers are the same as in the TD-Conformer-XL model in [9], i.e. $B_{\mathrm{cffn}} = B_{\mathrm{conv}} = 1024$. The feature dimensions of the DP transformer layers are the same as in [14], i.e. $B_{\mathrm{inter}} = B_{\mathrm{intra}} = 1024$. For the DP transformer layers, $X_{\mathrm{DPT}} = 2$.
For the conformer layers, the number of attention heads is $d_{\mathrm{conf}} = 4$ as in [9]. For the DP transformer layers, $d_{\mathrm{inter}} = d_{\mathrm{intra}} = 8$ as in [14]. The constraint $R_{\mathrm{DPT}} + R_{\mathrm{conf}} = 8$ is used, but the specific values of $R_{\mathrm{conf}}$ and $R_{\mathrm{DPT}}$ are varied in the results section. The value 8 is used as it corresponds to the number of conformer layers in [9] and the number of DP transformer layers in [14].
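Two elements of the training configuration, the learning rate schedule and the random sampling of 4 s training segments, can be sketched as follows. The reduction factor of 0.5, the choice to start counting non-improving epochs only after the fixed phase, and all names used here are assumptions for illustration; none of these details are specified above.

```python
import random


class HoldThenPlateau:
    """Keep the learning rate fixed for hold_epochs, then reduce on plateau."""

    def __init__(self, lr=1e-5, hold_epochs=90, patience=3, factor=0.5):
        self.lr = lr
        self.hold_epochs = hold_epochs
        self.patience = patience
        self.factor = factor          # assumed value, not given in the text
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, epoch, val_score):
        """Update after a 0-indexed epoch; a higher val_score is better."""
        if val_score > self.best:
            self.best = val_score
            self.bad_epochs = 0
        elif epoch >= self.hold_epochs:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


def sample_segment(mixture, sources, seg_len, rng=random):
    """Crop the same random window of seg_len samples from mixture and sources."""
    if len(mixture) <= seg_len:
        return mixture, sources
    start = rng.randrange(len(mixture) - seg_len + 1)
    stop = start + seg_len
    return mixture[start:stop], [s[start:stop] for s in sources]


SEGMENT_SAMPLES = 4 * 8000  # 4 s at the 8 kHz sampling rate used in this work
```

Cropping the mixture and all sources with the same window keeps the training targets time-aligned with the input.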

Evaluation metrics
The main evaluation metric used to assess separation performance is the SISDR improvement over the original mixture signal, denoted ΔSISDR. Improvement in extended short-time objective intelligibility (ESTOI), a speech intelligibility metric [24], and perceptual evaluation of speech quality (PESQ), a speech quality metric [25], are also reported for some results. Improvement in speech-to-reverberation modulation energy ratio (SRMR) [26] is used to assess the residual energy of reverberant effects in the estimated signals. The computational complexity of models is assessed using multiply-accumulate operations (MACs). MACs are computed on a signal length of 5.79 s, equal to the mean signal length in the WHAMR and WSJ0-2Mix corpora [23]. Model size is reported in number of parameters.
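The ΔSISDR metric can be sketched as below using the standard SISDR definition. Selecting the speaker permutation with the highest mean SISDR before computing the improvement is an assumption of this example, as are the function names and the small `eps` stabiliser.

```python
import math
from itertools import permutations


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB."""
    dot = sum(e * t for e, t in zip(estimate, target))
    energy = sum(t * t for t in target) + eps
    scaled = [dot / energy * t for t in target]          # projection onto target
    residual = [e - p for e, p in zip(estimate, scaled)]
    num = sum(p * p for p in scaled)
    den = sum(r * r for r in residual) + eps
    return 10 * math.log10(num / den + eps)


def si_sdr_improvement(estimates, targets, mixture):
    """Mean SISDR of the best speaker permutation minus that of the mixture."""
    n = len(targets)
    best = max(
        sum(si_sdr(estimates[p], t) for p, t in zip(perm, targets)) / n
        for perm in permutations(range(len(estimates)))
    )
    baseline = sum(si_sdr(mixture, t) for t in targets) / n
    return best - baseline


# Toy example (hypothetical values) with the estimates in swapped order.
s1, s2 = [1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, -1.0]
mixture = [a + b for a, b in zip(s1, s2)]
est1, est2 = [1.0, 0.1, 1.0, 0.0], [0.0, 1.0, 0.1, -1.0]
delta = si_sdr_improvement([est2, est1], [s1, s2], mixture)
```

Searching over all permutations is factorial in the number of speakers, which is acceptable for the 2-speaker mixtures considered here.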

Evaluations on in-domain data
The first evaluation analyzes performance for different ratios of conformer layer repeats $R_{\mathrm{conf}}$ to DP transformer repeats $R_{\mathrm{DPT}}$ for the standard configuration with $R_{\mathrm{DPT}} + R_{\mathrm{conf}} = 8$ on the WSJ0-2Mix and WHAMR datasets. $R_{\mathrm{conf}}$ is varied from 0 to 8. The results are shown both with and without DM in Fig. 2. For both the WSJ0-2Mix evaluation and the WHAMR evaluation with DM, the SISDR performance improves as the number of conformer layers increases towards 6, at which point it plateaus. This corresponds to an increase in the number of parameters and a relatively minor decrease in computational complexity. For the WHAMR evaluation without DM, SISDR performance remains fairly consistent for all $R_{\mathrm{conf}}$. This possibly suggests that without DM there is no benefit to having a larger model size, as the model is already as generalized as is possible without providing the network with new training examples.
The biggest performance gains with DM are seen on the more challenging WHAMR dataset, which demonstrates the benefit of larger model sizes for noisy and reverberant data.

Evaluations on out-of-domain data
The models trained on WHAMR are re-evaluated using the out-of-domain MC-WSJ-AV corpus, something seldom done in pure speech separation research due to the lack of properly aligned data, a problem we strove to solve in this work. The results are shown in Fig. 3. A similar trend as in the previous section is observed with the increase in model size (i.e. more conformer layers than DP transformer layers) for SISDR, PESQ and ESTOI. Thus, the models are not just generalizing better towards the specific noisy reverberant acoustic conditions in WHAMR but towards noise and reverberation in general. Note that the ΔSISDR values between the WHAMR and MC-WSJ-AV evaluations differ by ≈ 10 dB; see Table 1 for more detailed numbers on the best performing DM models. This is partly explained by the fact that MC-WSJ-AV is real-world data and out-of-domain. Still, it should also be noted that the headset references of the MC-WSJ-AV evaluation set are not as "clean" as the WHAMR references, due to imperfect alignment and often small audio bleed from the other speaker in the room, along with some minimal noise interference. This can be seen in Fig. 4, where the estimated speech signal $\hat{s}_c$ in the middle panel appears more denoised than the "clean" reference $s_c$ in the lower panel. Interestingly, there is no similar trend in SRMR improvement as $R_{\mathrm{conf}}$ increases (cf. lower panel in Fig. 3). SRMR results show good dereverberation performance. This was subjectively confirmed by listening through evaluation outputs. All models exhibited good dereverberation and noise suppression for both WHAMR and MC-WSJ-AV. The output speech, however, contained notable distortions and intelligibility was lacking; this is reflected in Fig. 3 and across all metrics in Table 1.

CONCLUSIONS
In this paper, a novel architecture combining DP transformer and conformer layers was proposed for modelling local and global contexts differently in speech separation networks. It was shown that, for the purpose of generalisation, the larger model size of the conformer layers was beneficial, particularly when DM was used for training. It was shown that this generalisation finding extends to out-of-domain realistic evaluation data using an aligned version of the MC-WSJ-AV corpus. A new mixing script to allow the use of MC-WSJ-AV in other research was developed and provided on GitHub.
This work was supported by the Centre for Doctoral Training in Speech and Language Technologies (SLT) and their Applications funded by UK Research and Innovation [grant number EP/S023062/1]. This work was also funded in part by 3M Health Information Systems, Inc.

Objective function
SISDR is used as the objective function for training the networks. A permutation invariant training (PIT) wrapper around the function is used to resolve the speaker label permutation problem [19]. The SISDR function is defined as

$$\mathcal{L}(\hat{s}_c, s_c) = 10 \log_{10} \frac{\left\| \frac{\langle \hat{s}_c, s_c \rangle}{\| s_c \|^2} s_c \right\|^2}{\left\| \hat{s}_c - \frac{\langle \hat{s}_c, s_c \rangle}{\| s_c \|^2} s_c \right\|^2}. \qquad (4)$$

Fig. 1. The ConSepT network composed of encoder, mask estimation network and decoder. The ⊙ symbol denotes the Hadamard product.

Fig. 2. Top and middle: separation performance against model configuration for WHAMR (top) and WSJ0-2Mix (middle). Bottom: corresponding computational complexity (in MACs) and model size for each configuration.

Table 1. Full results for best performing ConSepT model trained on WHAMR using DM in terms of SISDR. Recoverable column headers: Eval. set, $R_{\mathrm{conf}}$, $R_{\mathrm{DPT}}$, Params. (M), PESQ, ESTOI.

Fig. 3. Re-evaluation on MC-WSJ-AV of models trained using WHAMR with and without DM for ΔSISDR, ΔPESQ and ΔESTOI