SCORE: Self-Supervised Correspondence Fine-Tuning for Improved Content Representations

There is a growing interest in cost-effective self-supervised fine-tuning (SSFT) of self-supervised learning (SSL)-based speech models to obtain task-specific representations. These task-specific representations are used for robust performance on various downstream tasks by fine-tuning on the labelled data. This work presents a cost-effective SSFT method named Self-supervised Correspondence (SCORE) fine-tuning to adapt the SSL speech representations for content-related tasks. The proposed method uses a correspondence training strategy, aiming to learn similar representations from perturbed speech and original speech. Commonly used data augmentation techniques for content-related tasks (ASR) are applied to obtain perturbed speech. SCORE fine-tuned HuBERT outperforms the vanilla HuBERT on SUPERB benchmark with only a few hours of fine-tuning (< 5 hrs) on a single GPU for automatic speech recognition, phoneme recognition, and query-by-example tasks, with relative improvements of 1.09%, 3.58%, and 12.65%, respectively. SCORE provides competitive results with the recently proposed SSFT method SPIN, using only 1/3 of the processed speech compared to SPIN.


INTRODUCTION
Self-supervised learning (SSL) based pre-trained speech models such as HuBERT [1] and WavLM [2] are becoming popular for their state-of-the-art performance on almost all speech applications. These models extract latent features that capture underlying factors of speech, such as acoustic-phonetic information, speaker information, semantic information, and more [3]. These pre-trained representations are then fine-tuned for downstream applications with labelled data. However, pre-trained SSL speech models may not be ideal for downstream tasks that do not align with the pre-training objective (for example, handling overlapping speech [2]). One way to overcome this issue is to introduce a pre-training objective that relates to the downstream task, such as training with overlapping speech in WavLM [2]. However, this approach incurs a substantial compute cost, as the model is pre-trained from scratch. An alternative is unsupervised or self-supervised fine-tuning (SSFT) [4]. SSFT is applied on top of pre-trained models to learn task-specific representations. The SSL models are then fine-tuned with labelled data on the downstream tasks for robust performance. For example, ContentVec [5] employs content-preserving strategies (by disentangling speakers) on top of the pre-trained HuBERT model to learn content-specific representations. However, ContentVec is not very cost-effective, as it requires 19 hrs on 36 GPUs on top of the pre-trained HuBERT [1] model. Another recent SSFT approach for content-related downstream tasks is speaker-invariant clustering (SPIN) [4], which requires a compute cost less than 1% of ContentVec. SPIN employs speaker-invariant clustering to improve content representations. The term SSFT was proposed in [4] to distinguish fine-tuning methods using only audio [5,6] from supervised fine-tuning using labelled data [7].
In this work, a simple and cost-effective SSFT method named Self-supervised Correspondence (SCORE) fine-tuning is proposed to preserve content. Correspondence training [8] is the task of learning similar representations from two different instances of the same spoken content. This technique has been successfully applied to extract high-quality acoustic word embeddings (AWEs), where an autoencoder takes a spoken word as input and the same word spoken by a different speaker as the target output [8,9]. This ensures that the encoder learns only the content and discards other unnecessary information such as speaker, duration, and prosody. Taking inspiration from this, for SCORE fine-tuning, perturbed speech is generated from the original speech in such a way that the spoken content is preserved.

Fig. 1. SCORE fine-tuning method. SCORE takes a pair of original speech and perturbed speech as input. It then matches the output sequence X from the learnable model M_θ against the output sequence X′ from the frozen model M_ϕ using the soft-DTW loss.
The perturbations are data augmentation techniques commonly used in automatic speech recognition, such as speed perturbation [10] and pitch shifting. After obtaining the perturbed speech, the objective is to learn similar speech representations from both the original speech and the perturbed speech, making the representations pitch- and duration-invariant. Shifting the pitch (fundamental frequency) alters the speaker information while keeping the content the same. To match the representations from perturbed and original speech, soft-DTW [11,12] is used as a loss function. Soft-DTW is a popular loss function for time-series data, and it has also been successfully used for the multi-pitch estimation task in music information retrieval [13].
The proposed method is tested on three content-related downstream tasks of the SUPERB benchmark [14]: automatic speech recognition (ASR), phoneme recognition (PR), and query-by-example spoken term detection (QbE). The results are compared against the vanilla models (HuBERT and WavLM) and against recently proposed content-preserving SSFT methods, ContentVec and SPIN, on the SUPERB benchmark.
The main contributions of this work are as follows:
• A novel cost-effective self-supervised fine-tuning method named SCORE is proposed to improve the content representations.
• With less than 5 hours of SCORE fine-tuning on a single V100 GPU, SCORE fine-tuned models outperform vanilla HuBERT and WavLM on the SUPERB benchmark for content-related tasks.

METHODOLOGY
Fig. 1 illustrates the proposed SCORE fine-tuning method. SCORE involves two instances of the pre-trained model, one with frozen parameters (M_ϕ) and the other with learnable top layers (M_θ), both initialized with the same model weights. The top layers are chosen for fine-tuning, as they encode phonetic content in most SSL models [15,16]. More implementation details are described in Sec. 3. The input to SCORE is a pair of perturbed speech and original speech, each randomly fed to either M_θ or M_ϕ, as shown in Fig. 1. Randomizing the input ensures that the model M_θ does not focus exclusively on the characteristics of perturbed speech, and was found to be crucial for observing the benefits of the proposed method.
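The random routing of the input pair described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `route_pair` and `forward_pair` are hypothetical helper names, and the two models are stood in for by arbitrary callables.

```python
import random


def route_pair(original, perturbed, rng=random):
    """Return (input_for_learnable, input_for_frozen), chosen at random."""
    k = rng.randint(0, 1)  # k is 0 or 1 with equal probability
    if k == 0:
        return original, perturbed
    return perturbed, original


def forward_pair(original, perturbed, learnable_model, frozen_model, rng=random):
    """Run both models on a randomly routed pair and return (X, X_prime).

    learnable_model plays the role of M_theta and frozen_model the role of
    M_phi; both are assumed to be plain callables in this sketch.
    """
    to_learnable, to_frozen = route_pair(original, perturbed, rng)
    return learnable_model(to_learnable), frozen_model(to_frozen)
```

Because the assignment is redrawn for every pair, the learnable model sees original and perturbed speech about equally often over an epoch.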
To obtain the perturbed speech, data augmentations used in ASR [10] are employed, namely speed perturbation and pitch shift. Torchaudio [17] is used for these perturbations, with the SpeedPerturbation and PitchShift functions under torchaudio.transforms. The representations obtained from both models M_θ and M_ϕ are projected to a lower dimension with linear feedforward layers and L2-normalized. The resulting sequences from the two models (X and X′) differ in length due to the perturbations. Therefore, a differentiable loss function based on dynamic time warping, soft-Dynamic Time Warping (soft-DTW) [11,18,19], is used to match sequences of unequal lengths (Eq. 1). This learning framework ensures that the model learns speed- and pitch-invariant representations for the same spoken content. Soft-DTW replaces the "min" operation in DTW with a "soft-min" operation, i.e., it computes the soft-minimum of all alignment costs. The soft-DTW for two sequences X = x_1, x_2, ..., x_m and X′ = x′_1, x′_2, ..., x′_n is defined as:

\[ \text{soft-DTW}^{\gamma}(X, X') = {\min}^{\gamma}_{\pi \in A(X, X')} \sum_{(i,j) \in \pi} d(x_i, x'_j) \tag{1} \]

where A(X, X′) is the set of all possible alignment paths, min^γ is the soft-min operator with a smoothing factor γ, and d is the distance function. The soft-min operator min^γ is defined as:

\[ {\min}^{\gamma}(a_1, \dots, a_N) = -\gamma \log \sum_{k=1}^{N} e^{-a_k / \gamma} \tag{2} \]

In this work, soft-DTW^γ is used as a loss function for SCORE fine-tuning as described in Eq. 3.

\[ L(X, X') = \text{soft-DTW}^{\gamma}(X, X') \tag{3} \]

In all the experiments, we use a smoothing factor γ of 0.1. However, to address potential negative values in the soft-DTW^γ loss, a normalized version described in Eq. 4 is employed.

\[ L_{norm}(X, X') = \text{soft-DTW}^{\gamma}(X, X') - \frac{1}{2} \left( \text{soft-DTW}^{\gamma}(X, X) + \text{soft-DTW}^{\gamma}(X', X') \right) \tag{4} \]

This normalization guarantees a minimum loss value of zero for identical sequences, i.e., L_norm(X, X) = 0, and ensures L_norm(X, X′) ≥ 0 for any pair of sequences, so the loss is never negative [20,21]. Further, the loss from Eq. 4 is normalized by dividing it by the total sequence length m + n. Algorithm 1 describes the entire SCORE fine-tuning method.

Algorithm 1. SCORE fine-tuning (loop over training samples)
for i = 1 to N_samp do
    S_i^p = SpeedPerturbation(S_i)
    S_i^p = PitchShift(S_i^p)
    k = random(0, 1)
    if k == 0 then
        feed one of (S_i, S_i^p) to M_θ and the other to M_ϕ
    else
        feed the pair in the opposite order
    compute the loss L_norm(X, X′) from the outputs X (from M_θ) and X′ (from M_ϕ)
    update θ and µ to minimize L_norm
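The soft-min operator and the normalized soft-DTW loss described above can be illustrated with a small pure-Python sketch. It is not the implementation used in this work: the function names are hypothetical, the squared Euclidean distance is assumed for the local cost d, and the normalization is assumed to take the divergence form soft-DTW(X, X′) − ½(soft-DTW(X, X) + soft-DTW(X′, X′)), followed by division by m + n.

```python
import math


def soft_min(values, gamma):
    """Soft-minimum: -gamma * log(sum(exp(-a / gamma))), computed stably."""
    m = min(values)
    if m == float("inf"):
        return m
    return m - gamma * math.log(sum(math.exp(-(a - m) / gamma) for a in values))


def sq_dist(u, v):
    """Squared Euclidean distance, assumed here as the local cost d."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def soft_dtw(X, Y, gamma=0.1):
    """Soft-DTW value between two sequences of vectors (dynamic programming)."""
    m, n = len(X), len(Y)
    inf = float("inf")
    R = [[inf] * (n + 1) for _ in range(m + 1)]
    R[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            prev = [R[i - 1][j], R[i][j - 1], R[i - 1][j - 1]]
            R[i][j] = sq_dist(X[i - 1], Y[j - 1]) + soft_min(prev, gamma)
    return R[m][n]


def score_loss(X, Y, gamma=0.1):
    """Normalized soft-DTW loss divided by the total sequence length."""
    div = soft_dtw(X, Y, gamma) - 0.5 * (soft_dtw(X, X, gamma) + soft_dtw(Y, Y, gamma))
    return div / (len(X) + len(Y))


# Toy unit-norm "frames": a time-stretched copy should score far lower
# than a sequence with different content.
a, b = [1.0, 0.0], [0.0, 1.0]
original = [a, b]
stretched = [a, a, b, b]
different = [b, a, a]
loss_stretched = score_loss(original, stretched)
loss_different = score_loss(original, different)
```

The quadratic dynamic program is only meant to make the definitions concrete; a batched, differentiable implementation would be needed for actual fine-tuning.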

EXPERIMENTS
Experiments are conducted on two SSL speech models: HuBERT and WavLM (BASE versions). These SSL models are fine-tuned with the SCORE method. After SCORE fine-tuning, the resulting models are used for supervised training on the content-related downstream tasks of the SUPERB benchmark. Similar to SPIN [4], the top 2 layers (11th and 12th) of the SSL models are fine-tuned, as this is cost-effective (≈ 14M trainable parameters) and most SSL models encode phonetic content in the top layers [15,16]. In this study, Wav2vec2 [7] is omitted because linguistic content is less well represented in its final few layers [16], which is crucial for content-related tasks. Fine-tuning the entire model, from bottom layers to top layers, would increase computational expense, contradicting the study's objective. Furthermore, there is a concern that when the entire model is fine-tuned, the fine-tuning objective could lead to a collapse of the original representations [22] learned during pre-training. The details of the data, SCORE fine-tuning, and evaluation on the SUPERB benchmark are as follows:

Data: In line with prior research [4] and to ensure a fair comparison, SCORE fine-tuning is performed on the train-clean-100 subset (100 hours) of LibriSpeech [23]. Consistent with earlier findings [4], training more layers or using additional data does not improve results.

SCORE Fine-tuning Details: The representations obtained from the final Transformer layer (12th) of the models M_θ and M_ϕ are sequences of 768-dimensional vectors. These vectors are projected into 256-dimensional vectors with linear projection layers and then L2-normalized. SCORE fine-tuning is run for 3.6k updates (≈ 1 epoch with an effective batch size of 8). The model converged in just one epoch, and additional training did not yield any improvements. The AdamW [24] optimizer is used with a learning rate of 2.0e-5 and 1k warm-up updates. One epoch takes less than 5 hours on a V100 GPU. More details are available on GitHub.

SUPERB Benchmark: The S3PRL toolkit is used for all the SUPERB benchmark tasks. For ASR and PR, features from all the layers are aggregated with learnable weights. These aggregated features are then fed to the prediction head for each downstream task and fine-tuned with labelled data. For ASR, the prediction head consists of a 2-layer 1024-unit Bi-LSTM network with CTC loss on characters [14]. The ASR model is evaluated without any external language model. For PR, the prediction head is a frame-wise linear transformation with CTC loss. More details can be found in the SUPERB benchmark [14]. The Adam optimizer is used for both ASR and PR, with learning rates of 1.0e-4 and 5.0e-4, respectively. Each ASR and PR experiment is run five times, and the means and standard deviations are reported for both the vanilla models (HuBERT and WavLM) and their SCORE fine-tuned versions. For QbE, conventional supervised phoneme posteriorgrams are replaced with SSL representations [14]. QbE requires no training; the evaluation is performed by running DTW on each layer separately to obtain a score for each query-document pair. For the evaluation on the test set, the best layer is selected based on performance on the dev set of the QUESST 2014 [25] data. In our case, we found that 12th ...

✩ The reported numbers are from their respective papers and the SUPERB benchmark leaderboard [14] as of 13/09/2023 (https://superbbenchmark.org/leaderboard).
✵ Our results when we run the SUPERB [14] baseline scripts for HuBERT and WavLM for fair comparison.
Table 1. Results of the proposed SCORE fine-tuning of HuBERT and WavLM models along with baseline methods on the SUPERB benchmark. The baseline methods include the BASE versions of the HuBERT and WavLM models, along with the SSFT-based ContentVec 500 and SPIN models. The downstream tasks include ASR, PR, and QbE, which are evaluated on word error rate (WER in %), phoneme error rate (PER in %), and maximum term weighted value (MTWV in %), respectively.
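The aggregation of layer features with learnable weights, used for ASR and PR, can be sketched as a weighted sum over layers. This is an illustrative sketch only: the function names are hypothetical, features are plain nested lists, and normalizing the raw weights with a softmax is an assumption rather than something stated in this section.

```python
import math


def softmax(weights):
    """Normalize raw per-layer weights so they are positive and sum to one."""
    m = max(weights)
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_layers(layer_features, weights):
    """Weighted sum of per-layer features.

    layer_features: list over layers, each a list of frames, each frame a
    list of floats. weights: one raw (learnable) scalar per layer.
    """
    norm = softmax(weights)
    n_frames = len(layer_features[0])
    n_dims = len(layer_features[0][0])
    return [
        [sum(w * layer[t][d] for w, layer in zip(norm, layer_features))
         for d in range(n_dims)]
        for t in range(n_frames)
    ]


# Two toy "layers", one frame each; equal raw weights give the layer average.
toy_layers = [[[1.0, 2.0]], [[3.0, 4.0]]]
averaged = aggregate_layers(toy_layers, [0.0, 0.0])
```

In a real downstream setup the raw weights would be trained jointly with the prediction head, so the head can favour the layers that carry the most useful information for the task.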

RESULTS AND DISCUSSIONS
Table 1 shows the speech processed during training in the "pre-training" stage and in the "SSFT" stage. Processed speech is defined as "training steps × effective batch duration" to quantify machine-independent training costs [4]. HuBERT + SCORE improves the HuBERT model on all three tasks, with relative improvements of 1.09%, 3.58%, and 12.65% for ASR, PR, and QbE, respectively. WavLM + SCORE improves the WavLM model on ASR, PR, and QbE, with relative improvements of 0.32%, 2.68%, and 0.76%, respectively. The results are also compared with a stronger baseline, ContentVec 500 [5], which uses 76K hours of processed speech, whereas SCORE uses only 100 hrs in the SSFT stage. ContentVec 500 provides better results in ASR and PR than SPIN and SCORE, at a compute cost of 76K hrs. However, both HuBERT + SCORE and WavLM + SCORE outperform ContentVec 500 on the QbE task. Like SPIN, the goal of this work is to strike a balance between improving the downstream task and the additional training (i.e., SSFT) required. SCORE needs less than 0.5% of the processed speech of ContentVec 500 in the SSFT stage. SCORE provides competitive results with the SPIN models. WavLM + SCORE outperforms WavLM + SPIN 256 on the QbE task. The performance of HuBERT + SCORE is close to that of HuBERT + SPIN 256 on ASR. Among all the SSFT methods, SCORE uses the least amount of processed speech (≈ 100 hrs) in the SSFT stage.
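The two quantities used in this comparison, processed speech and relative improvement, reduce to simple arithmetic. The sketch below is illustrative: the function names are hypothetical, the average utterance duration of about 12.7 s for train-clean-100 is an assumption, and the error rates in the last line are made-up values, not results from Table 1.

```python
def processed_speech_hours(steps, batch_size, avg_utt_seconds):
    """Training steps x effective batch duration, expressed in hours."""
    return steps * batch_size * avg_utt_seconds / 3600.0


def relative_improvement(baseline, new, lower_is_better=True):
    """Relative improvement in percent over a baseline score."""
    change = baseline - new if lower_is_better else new - baseline
    return 100.0 * change / baseline


# SCORE: 3.6k updates with an effective batch size of 8.
# avg_utt_seconds = 12.7 is an assumed average, giving roughly 100 hours.
score_hours = processed_speech_hours(3600, 8, 12.7)

# Share of ContentVec 500's 76K processed hours used by SCORE (100 hrs), in percent.
share_of_contentvec = 100.0 * 100 / 76000

# Hypothetical error rates (not from Table 1): 10.0% -> 9.0% is a 10% relative gain.
example_gain = relative_improvement(10.0, 9.0)
```

For metrics where higher is better, such as MTWV, `lower_is_better=False` gives the corresponding relative gain.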

Layerwise Analysis for Speaker Identification (SID)
One of the data augmentation techniques used in this work is pitch shift, which alters the speaker information. To assess the degradation of speaker information, experiments are conducted on the SID task of the SUPERB benchmark using VoxCeleb1 [26]. The same configurations are used as provided in SUPERB [14]. Only the fine-tuned layers (i.e., the 11th and 12th) are used for training and evaluating the SID system. The results are presented in Table 2. From Table 2, we can observe a drop in SID accuracy for both HuBERT and WavLM, in both layers. This suggests that the SCORE fine-tuned models (HuBERT + SCORE and WavLM + SCORE) have representations that are relatively more speaker-invariant than those of the original models, benefiting content-related tasks.

Table 2. Layerwise SID accuracy (in %) on the SUPERB benchmark for original and SCORE fine-tuned SSL models.

CONCLUSION AND FUTURE WORKS
A simple and cost-effective SSFT method named SCORE is proposed to improve the content representations of pre-trained SSL speech models. For both the HuBERT and WavLM models, the respective SCORE fine-tuned models outperformed the original models on the SUPERB benchmark for ASR, PR, and QbE. Compared to other existing SSFT approaches, SCORE requires the least amount of processed speech (less than 0.5% of the processed speech of ContentVec 500). SCORE provides competitive results with SPIN using 1/3 of the processed speech used by SPIN. While we observed smaller improvements in ASR than in PR and QbE, we speculate that a stronger data augmentation technique directly applicable to speech waveforms could provide better gains. We consider this research direction for our future work.
This work was supported by the Centre for Doctoral Training in Speech and Language Technologies (SLT) and their Applications funded by UK Research and Innovation [grant number EP/S023062/1]. This work was also funded in part by LivePerson, Inc.