Learning Cross-lingual Mappings for Data Augmentation to Improve Low-Resource Speech Recognition

Exploiting cross-lingual resources is an effective way to compensate for the data scarcity of low-resource languages. Recently, a novel multilingual model fusion technique has been proposed in which a model is trained to learn cross-lingual acoustic-phonetic similarities as a mapping function. However, handcrafted lexicons have been used to train the underlying hybrid DNN-HMM ASR systems. To remove this dependency, we extend the concept of learnable cross-lingual mappings to end-to-end speech recognition. Furthermore, the mapping models are employed to transliterate the source languages to the target language without using parallel data. Finally, the source audio and its transliteration are used for data augmentation to retrain the target language ASR. The results show that any source language ASR model can be used to recognise a low-resource target language when followed by the proposed mapping model. Furthermore, data augmentation yields a relative gain of up to 5% over the baseline monolingual model.


Introduction
End-to-end (e2e) acoustic modelling techniques require a lot of training data for reliable parameter estimation. However, more than half of the world's population speaks only 23 of the more than 7000 languages spoken across the globe [1]. Thus only a few languages have sufficient data resources, and many languages remain too under-resourced to build an ASR system. For such languages, multilingual speech recognition systems have gained prominence over the past decade [2,3,4,5,6,7]; they have been used for feature extraction [8,9,10] or directly for transfer learning [11,12].
Data augmentation is another approach to increase the training data of a low-resource language. Commonly used data augmentation techniques extend the training data with perturbed copies, either by adding noise [13,14], varying the speed and tempo of the original speech [15], vocal tract length perturbation (VTLP) [16,17], SpecAugment [18], or combinations of these methods [19]. All these techniques are based on audio data augmentation.
In the recent past, a few studies have augmented data by processing text rather than speech [20,21,22]. Transcripts from different languages have been transliterated to Latin script to train a multilingual system [21]. However, this requires paired data (a word in the original script and its transliteration in Latin) for each language. Thomas et al. [22] have proposed transliterating source language data to the target language without using parallel data. Source language audio is decoded using the target language ASR to transliterate it into the target language, and the result is then used as augmented data to retrain the target language ASR. Though this is a novel idea, an out-of-domain ASR has no knowledge of the input language and thus cannot be expected to generate a good transliteration.
Recently, we have proposed a technique to learn cross-lingual acoustic-phonetic similarities at the phoneme level [23], which has been used for multilingual and cross-lingual acoustic model fusion [24]. A model is trained to learn mappings from the output posterior distributions of a source language ASR to those of the target language ASR. The study is based on the underlying assumption that these mapping models can learn language-related relations between phonemic posterior distributions. Though that study proves the concept, the work was done at the phoneme level using hybrid DNN-HMM systems and handcrafted lexicons for each language. In this work, we extend the previous work to cross-lingual e2e speech recognition systems. The ASR systems of the source languages, followed by a source-target mapping model for each source-target pair, are then used to transliterate source data into the target language script. Though both components are trained on task-specific data and are expected to generate better output labels, the transliteration of source language audio into the target language is still unintelligible, especially for unrelated languages, and is thus called ciphered data. The key contribution of this work is therefore two-fold:
• it extends the concept of learning cross-lingual mappings to e2e speech recognition systems, and
• it generates ciphered text for target language data augmentation using source language ASRs and <source-target> mapping models.
Exploiting mapping models for cross-lingual speech recognition shows that using a source language ASR for a target language gives comparable results. These mapping models are trained on limited data, and using a source language ASR followed by a mapping model enables us to exploit cross-lingual ASR to recognise target language speech. Furthermore, the proposed data transliteration and augmentation techniques yield up to 5% and 28.5% relative improvement in character error rate (CER) compared with monolingual and multilingual ASR systems respectively.

Mapping models
Let $M_A$ and $M_{S_i}$ be the monolingual acoustic models of the target and the $i$-th source language respectively. A mapping model $N_{S_iA}$ is trained to translate posteriors $P_{S_i}$ of dimension $d_{S_i}$ from $M_{S_i}$ to posteriors $P_{S_iA}$ of dimension $d_A$, where $d_A$ is the dimension of the posteriors from $M_A$. The mapping model is trained with a KL divergence loss to map posteriors from the $i$-th source acoustic model ($P_{S_i}$) to the target language posteriors ($P_{S_iA}$). The loss function is given as

$$\mathcal{L}_{S_iA} = \frac{1}{B}\sum_{t=1}^{B} D_{KL}\!\left(P_A^{(t)} \,\big\|\, P_{S_iA}^{(t)}\right) \quad (1)$$

where $B$ is the number of frames in one batch for training a mapping model $N_{S_iA}$ that maps posteriors from the $i$-th source language to the target language, and $P_A^{(t)}$ is the target acoustic model posterior for frame $t$.
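The frame-averaged KL objective in Equation 1 can be sketched as follows; this is a minimal numpy illustration with hypothetical function and variable names, not the authors' implementation (which is neural-network based):

```python
import numpy as np

def mapping_loss(target_posteriors, mapped_posteriors, eps=1e-12):
    """Frame-averaged KL divergence D_KL(P_A || P_{S_i A}) as in Equation 1.

    target_posteriors: (B, d_A) posteriors from the target acoustic model M_A
    mapped_posteriors: (B, d_A) posteriors produced by the mapping model N_{S_i A}
    """
    p = target_posteriors + eps   # eps avoids log(0) for zero-probability classes
    q = mapped_posteriors + eps
    # sum over the d_A classes, then mean over the B frames in the batch
    return float(np.mean(np.sum(p * np.log(p / q), axis=1)))

p = np.full((4, 5), 0.2)          # uniform target posteriors over 5 classes
zero_loss = mapping_loss(p, p)    # identical distributions give a near-zero loss
```

In training, the gradient of this loss pushes the mapped distribution towards the target acoustic model's distribution frame by frame.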
Mapping models in the previous work [24] were trained at the frame level without considering contextual information, but connected speech is a continuous signal exhibiting co-articulation and temporal smearing. Furthermore, a separate model was trained for each source-target language pair, requiring $N(N-1)$ mapping models. The architecture of the mapping model is therefore modified in this work to a sequence-to-sequence model with a Multi Encoder Single Decoder (MESD) architecture. It thus incorporates contextual information and reduces the required number of mapping models to just $N$. The architecture of MESD is shown in Figure 1.
During the training of the MESD model, outputs from all the source acoustic models for a given utterance $u$ are fed to source-language dependent encoders successively. Embeddings from the final layer of the encoders are then passed to a single target-language dependent decoder. The loss is calculated as the mean of the losses over all encoder-decoder pairs:
$$\mathcal{L}_A = \sum_{k=1}^{K} w_k \, \mathcal{L}_{S_kA} \quad (2)$$

where $K$ is the total number of source languages ($N-1$), $w_k = 1/K$ in the case of a plain mean, and $\mathcal{L}_{S_kA}$ is given in Equation 1, which is still frame based. This allows mapping model training to converge in a low-resource setting, as a small amount of data provides millions of examples. However, it causes unbalanced training across languages: the mean can keep decreasing while the loss for one language decreases monotonically and the loss for another increases in the same fashion. This can cause the model to learn the mappings for one language far better than for the other. To cope with this issue, a dynamic weighting scheme is applied to weight each encoder-decoder loss. For the experimentation here, rank sum weighting [25] is used, where weights are assigned based on normalised ranks. So $w_k$ in Equation 2 now becomes

$$w_k = \frac{2\,(K + 1 - r_k)}{K\,(K+1)} \quad (3)$$

where $r_k$ is the rank of language $k$ when the languages are sorted in decreasing order of their losses. This restricts the model from biasing towards a specific language or a group of languages. Though a mapping model contains multiple encoders, any single encoder can be used with the decoder during decoding; MESD does not require data streams from all encoders for a given utterance. This implies that mappings can be obtained with input from only one source language at a time. Training these mapping models thus allows any source language ASR to decode the data of a target language, followed by the source-target mapping model.
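The rank sum weighting step can be sketched as below. This is an illustrative reconstruction of the scheme (the exact normalisation follows the standard rank sum formula, an assumption on our part); the language with the highest loss gets rank 1 and therefore the largest weight:

```python
def rank_sum_weights(losses):
    """Rank sum weights for per-language mapping losses (cf. Equation 3).

    Languages are ranked in decreasing order of loss (rank 1 = highest loss);
    language k gets weight 2*(K + 1 - r_k) / (K * (K + 1)). Weights sum to 1,
    so the worst-mapped language dominates the weighted loss.
    """
    K = len(losses)
    # indices sorted so the largest loss comes first
    order = sorted(range(K), key=lambda i: losses[i], reverse=True)
    ranks = [0] * K
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return [2 * (K + 1 - r) / (K * (K + 1)) for r in ranks]

per_lang_losses = [0.9, 0.4, 0.7]                 # illustrative encoder-decoder losses
weights = rank_sum_weights(per_lang_losses)
weighted_loss = sum(w * l for w, l in zip(weights, per_lang_losses))
```

Recomputing the weights each step makes the weighting dynamic: as one language's loss falls relative to the others, its weight shrinks.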

Ciphering text
In the previous work [22], the target language ASR was used to transliterate source language audio data for data augmentation and retraining of the target language ASR. However, such an ASR has no information about the source language and cannot be expected to generate reasonable transliterated transcriptions.
In this work, on the contrary, source language audio data is decoded using the in-domain ASR ($M_{S_i}$) and the output posterior distributions ($P_{S_i}$) are then transformed to target language posterior distributions ($P_{S_iA}$) using the source-target mapping model ($N_{S_iA}$). The mapped posteriors are then used to generate transliterated transcriptions (alternatively referred to as ciphered text or transcriptions) via greedy decoding. Though these may still not be exact transliterations (hence the term ciphered text), both components involved in the process are trained on task-specific data and are expected to perform better.
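The greedy decoding step over mapped posteriors can be sketched as follows. The paper does not detail the collapse rule, so this sketch assumes a CTC-style best path (merge repeats, drop blanks); the vocabulary and function names are illustrative:

```python
import numpy as np

def greedy_cipher(mapped_posteriors, id_to_token, blank_id=0):
    """Greedy (best-path) decoding of mapped posteriors into ciphered text.

    mapped_posteriors: (T, d_A) frame-level posteriors P_{S_i A}
    id_to_token:       maps target-language token ids to token strings
    """
    ids = mapped_posteriors.argmax(axis=1)            # most probable token per frame
    # merge consecutive repeats, then remove blank tokens
    collapsed = [i for n, i in enumerate(ids) if n == 0 or i != ids[n - 1]]
    return "".join(id_to_token[i] for i in collapsed if i != blank_id)

# toy vocabulary: 0 = blank, 1 = "k", 2 = "a"
vocab = {1: "k", 2: "a"}
post = np.eye(3)[[1, 1, 0, 2, 2, 2]]                  # one-hot frames: k k _ a a a
ciphered = greedy_cipher(post, vocab)                 # → "ka"
```

The resulting string is the ciphered transcription paired with the original source audio for augmentation.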
Source language audios and their ciphered transcriptions are then used as augmented data for retraining the target language ASR. The flow is shown in Figure 2.

Experimental setup

Data set
As this work extends the previous work, the experiments here use the same data set as [24]. Full Language Packs (FLP) of four low-resource languages from the IARPA BABEL speech corpus [26] (Tamil (tam), Telugu (tel), Cebuano (ceb) and Javanese (jav)) are used for baseline ASR training and evaluation. The BABEL data set mostly consists of conversational telephone speech with real background noise and is quite challenging because of the conversational style, limited bandwidth, environment conditions and channel. All utterances without any speech are discarded. The details of the data sets are tabulated in Table 1.
For training the mapping models, a subset of 30 hours is randomly selected from each language pack. This data is further split into a 29-hour train set and a 1-hour dev set.

Speech recognition systems
The hybrid CTC/attention architecture [27] is used to train all speech recognition models. It consists of three modules: a shared encoder, an attention decoder and a CTC module. The training process jointly optimises the weighted sum of the CTC and attention losses.
The input to the model is 40-dimensional filterbank features and the output is byte-pair encoded (BPE) tokens. Monolingual ASRs are trained with 100 BPE tokens per language, while the multilingual ASR has 400 output tokens. The SentencePiece library [28] is used for tokenisation. During decoding, the final prediction is made from a weighted sum of the log probabilities of the CTC and attention components. Given a speech input $X$, the final prediction $\hat{Y}$ is given by

$$\hat{Y} = \arg\max_{Y} \left\{ \lambda \log p_{ctc}(Y|X) + (1-\lambda) \log p_{att}(Y|X) \right\} \quad (5)$$

where $\lambda$ is a hyper-parameter. The values of $\alpha$ (the training loss weight) and $\lambda$ are kept the same for all ASR systems. The SpeechBrain toolkit [29] is used to train all ASR systems.
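The joint rescoring in Equation 5 amounts to combining two log probabilities per hypothesis. A minimal sketch (hypothesis list and $\lambda$ value are illustrative, not the paper's settings):

```python
import math

def joint_score(ctc_logp, att_logp, lam):
    """Weighted sum of CTC and attention log probabilities for one
    hypothesis, as in Equation 5: lam*log p_ctc + (1-lam)*log p_att."""
    return lam * ctc_logp + (1 - lam) * att_logp

def rescore(hyps, lam):
    """Return the hypothesis text with the best joint score.
    hyps: list of (text, ctc_logp, att_logp) tuples."""
    return max(hyps, key=lambda h: joint_score(h[1], h[2], lam))[0]

hyps = [("hello", math.log(0.5), math.log(0.2)),
        ("hallo", math.log(0.3), math.log(0.4))]
best = rescore(hyps, lam=0.3)   # with lam=0.3 the attention score dominates
```

In practice the combination is applied inside beam search rather than as a final rescoring pass, but the scoring rule is the same.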

Mapping models
A multi encoder single decoder model is trained for each target language. An MESD model has three encoders and a single attention decoder. Each encoder and the decoder consist of one bidirectional RNN layer. For each target language, the mapping model has only 2.59 million parameters.

Performance metric
The accuracy of a mapping model is measured as the ratio of the number of correctly mapped frames to the total number of frames, as given in Equation 6. Correctly mapped frames are those where $\arg\max_k$ of the mapped posteriors and $\arg\max_k$ of the target AM posteriors are the same:

$$\text{Accuracy} = \frac{CMF}{T} \quad (6)$$

where $k$ is the index over classes in the output vector $p_t$, $CMF$ is the number of correctly mapped frames and $T$ is the total number of frames.
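This metric, generalised to the top-$n$ variant used in Table 2, can be sketched as follows (a numpy illustration with invented toy posteriors; $n=1$ recovers Equation 6 exactly):

```python
import numpy as np

def mapping_accuracy(mapped, target, n=1):
    """Fraction of frames where the target AM's argmax class is among the
    top-n classes of the mapped posteriors (n=1 is the exact-match case).

    mapped, target: (T, d_A) posterior matrices
    """
    top_n = np.argsort(mapped, axis=1)[:, -n:]          # n most probable mapped classes
    ref = target.argmax(axis=1)                         # target AM's most probable class
    cmf = sum(r in row for r, row in zip(ref, top_n))   # correctly mapped frames
    return cmf / len(ref)

mapped = np.array([[0.6, 0.3, 0.1],
                   [0.2, 0.5, 0.3],
                   [0.1, 0.3, 0.6]])
target = np.array([[0.8, 0.1, 0.1],
                   [0.1, 0.2, 0.7],   # mismatch at n=1, but a match at n=2
                   [0.1, 0.1, 0.8]])
acc_top1 = mapping_accuracy(mapped, target, n=1)   # 2 of 3 frames match
acc_top2 = mapping_accuracy(mapped, target, n=2)   # all 3 frames match
```

Increasing $n$ can only increase the accuracy, which is the behaviour analysed in Section 5.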
For the downstream speech recognition task, results are reported in terms of percent character error rate.

Mapping models
The accuracies of the mapping models, trained to map the posterior distribution of a source language ASR to that of the target language ASR, are tabulated in Table 2. Analysis shows that the correct target class is often among the top $n$ mapped classes even when it is not the most probable one. The mapping model accuracy is therefore calculated for different values of $n$, where $n$ is the number of most probable classes considered. Though the accuracy increases with $n$, the rate of change is not as large as observed for phonemes in [24], which implies that the mapping models performed better for phoneme-based hybrid DNN-HMM systems than for e2e systems. Since the mapping models are trained on ASR output posterior distributions, one potential reason could be a detrimental effect of the speech recognition systems on mapping model training. However, a joint analysis of the amount of training data (Table 1), the performance of the monolingual speech recognition systems (Table 3) and the performance of the mapping models (Table 2) rules out this reason.
The amount of mapping model training data is the same for all languages, but the mappings to the ceb and jav target languages are better than those to tam and tel. Even for the ceb and jav target languages, the accuracy of mappings from the tel source language is very low in comparison to the other source languages. Investigation reveals that, as the number of BPE tokens is restricted to 100 for all languages, ceb and jav, having only 19 and 26 characters respectively, achieve good context coverage within 100 BPE tokens. The BPE tokens extracted for tel, which has more than 52 characters, do not cover context well. Furthermore, both ceb and jav are written in Latin script, thus have a full overlap of characters, and are even acoustically close. On the other hand, though tam and tel both belong to the Dravidian family, their writing scripts differ, which makes it difficult for the model to learn mappings with a limited number of BPE tokens.

Ciphering text
For a given target language, the audio data of all the source languages is decoded using the language dependent ASR systems, and the output posterior distributions are then mapped to target language distributions using the mapping models $N_{S_iA}$. Greedy decoding is carried out on these output posterior distributions to generate the ciphered transcriptions.

ASR
Monolingual systems (mono) are language dependent acoustic and language models trained on the target language specific data sets. The train sets of all the languages are then mixed to train a multilingual system (multi). The language model for the multilingual system is also trained on the mixed corpora of the individual languages. The results of the speech recognition systems are shown in Table 3. The first row contains the monolingual ASR results without an LM, for later comparison, while the remaining results use ASR decoding with an LM.
For a given target language test set, speech recognition results are also computed on top of the mapping models after decoding the target language data using the source language acoustic models. Greedy decoding is applied on the mapped posteriors and the results are shown in Table 4. The CER on the diagonal is the same as the first row of Table 3. Though these results come from a source language ASR followed by a source-target mapping model, without using a language dependent ASR, this setup performs better than the monolingual ASR in the case of jav. The results are comparable for the other languages but depend strongly on mapping model performance. It is evident from these results that a source language acoustic model, followed by a mapping model trained on a limited amount of data, can be used to decode a target language.

Data augmentation
Ciphered transcriptions are generated from all the source languages for a target language using the mapping models, as described in Section 3. The audio data of the source languages and the ciphered transcriptions are then used together as augmented data for retraining the target language ASR (augAll). As described earlier, the quality of the ciphered transcriptions depends on the performance of the mapping models; augmenting with ciphered transcriptions from all the source languages includes very low quality transcriptions and has a detrimental effect on retraining the target language ASR. The augmentation is therefore restricted to the ciphered data from only the closest language (augTwo). For a target language, the source language with the highest mapping model accuracy is chosen as the closest language. Augmenting this data for retraining the target language ASR achieves a relative gain of up to 5% in terms of CER (augTwo).
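The closest-language selection for augTwo reduces to an argmax over mapping model accuracies. A minimal sketch (the accuracy values below are purely illustrative, not the paper's Table 2 numbers):

```python
def closest_source(mapping_acc, target):
    """Pick the single augmentation source for `target` (augTwo): the source
    language whose mapping model has the highest accuracy toward the target.

    mapping_acc: dict {(source, target): accuracy}
    """
    candidates = {s: a for (s, t), a in mapping_acc.items()
                  if t == target and s != target}
    return max(candidates, key=candidates.get)

# illustrative accuracies only
acc = {("ceb", "jav"): 0.62, ("tam", "jav"): 0.48, ("tel", "jav"): 0.31}
best_src = closest_source(acc, "jav")   # → "ceb"
```

Only the audio of `best_src` and its ciphered transcriptions are then pooled with the target language data for retraining.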

Conclusion
In this work, the technique of mapping models is extended to e2e speech recognition systems. For a given target language, a mapping model is trained on a limited amount of data to transform the output posterior distributions of a source language ASR model to those of the target language. A source language ASR followed by a mapping model is then used for cross-lingual speech recognition in a low-resource setting. The mapping models are further exploited to transliterate source language data into the target language for data augmentation. Retraining the target language ASR after data augmentation results in a relative CER reduction of up to 5% and 28.5% in comparison to the monolingual and multilingual ASR systems respectively.

Figure 1: Architecture of the MESD mapping model

Figure 2: Flow of generating data for augmentation and retraining of the target language ASR

Table 1: Details of the BABEL data sets used for the experimentation

Table 2: Accuracy of the mapping models considering the top n mapped classes

Table 3: ASR performance in terms of %CER

Table 4: Cross-lingual ASR performance in terms of %CER