Empirical Evaluation of Sequence-to-Sequence Models for Word Discovery in Low-resource Settings

Since Bahdanau et al. [1] first introduced attention for neural machine translation, most sequence-to-sequence models have made use of attention mechanisms [2, 3, 4]. While they produce soft-alignment matrices that could be interpreted as alignment between target and source languages, we lack metrics to quantify their quality, so it remains unclear which approach produces the best alignments. This paper presents an empirical evaluation of three main sequence-to-sequence models (CNN, RNN and Transformer-based) for word discovery from unsegmented phoneme sequences. This task consists of aligning word sequences in a source language with phoneme sequences in a target language, and inferring from this alignment a word segmentation on the target side [5]. Evaluating word segmentation quality can be seen as an extrinsic evaluation of the soft-alignment matrices produced during training. Our experiments in a low-resource scenario on Mboshi and English (both aligned to French) show that RNNs surprisingly outperform CNNs and Transformers for this task. Our results are confirmed by an intrinsic evaluation of alignment quality through the use of Average Normalized Entropy (ANE). Lastly, we improve our best word discovery model by using an alignment entropy confidence measure that accumulates ANE over all the occurrences of a given alignment pair in the collection.


Introduction
Sequence-to-Sequence (S2S) models can solve many tasks where source and target sequences have different lengths. To learn to focus on specific parts of the input at decoding time, most of these models are equipped with attention mechanisms [1,2,3,4,6]. By-products of the attention are soft-alignment probability matrices that can be interpreted as alignment between target and source. However, we lack metrics to quantify their quality. Moreover, while these models perform very well in a typical use case, it is not clear how they would be affected by low-resource scenarios.
This paper proposes an empirical evaluation of well-known S2S models for a particular S2S modeling task. This task consists of aligning word sequences in a source language with phoneme sequences in a target language, inferring from it word segmentation on the target side [5]. We concentrate on three models: Convolutional Neural Networks (CNN) [2], Recurrent Neural Networks (RNN) [1] and Transformer-based models [3]. While this word segmentation task can be used for the extrinsic evaluation of the soft-alignment probability matrices produced during S2S learning, we also introduce Average Normalized Entropy (ANE), a task-agnostic confidence metric to quantify the quality of the source-to-target alignments obtained. Experiments performed on a low-resource scenario for two languages (Mboshi and English) using equivalently sized corpora aligned to French, are, to our knowledge, the first empirical evaluation of these well-known S2S models for a word segmentation task. We also illustrate how our entropy-based metric can be used in a language documentation scenario, helping a linguist to efficiently discover types, in an unknown language, from an unsegmented sequence of phonemes. This work is thus also a contribution to the emerging computational language documentation domain [7,8,9,10,11], whose main goal is the creation of automatic approaches able to help the documentation of the many languages soon to be extinct [12].
Lastly, studies focused on comprehensive attention mechanisms for NMT [13,14,15] lack evaluation of the resulting alignments, and the exceptions [16] do so for the task of word-to-word alignment in well-resourced languages. In contrast, our work is not only an empirical evaluation of NMT models focused on alignment quality, but it also tackles the data scarcity of low-resource scenarios.

Unsupervised Word Segmentation from Speech
Because the corpora available in language documentation scenarios usually contain speech in the language being documented, aligned with translations in a well-resourced language, Godard et al. [5] introduced a pipeline for performing Unsupervised Word Segmentation (UWS) from speech. The system outputs timestamps delimiting stretches of speech, associated with class labels, corresponding to real words in the language. The pipeline first transforms speech into a sequence of phonemes, either through Automatic Unit Discovery (e.g. [17]) or manual transcription. The phoneme sequences, together with their translations, are then fed to an attention-based S2S system that produces soft-alignment probability matrices between target and source languages. The alignment probability distributions between the phonemes and the translation words (as in Figure 1) are used to cluster (segment) together neighboring phonemes whose alignment distribution peaks at the same word. The final speech segmentation is evaluated using the Zero Resource Challenge (ZRC) 2017 evaluation suite (track 2).
Figure 1: Soft-alignment probability matrices from the UWS task. ANE values (from left to right) are 0.11, 0.64 and 0.83. The gold segmentation is "BAH1T MAA1MAH0 PAA1PAH0 IH0Z AW1T", which corresponds to the English sentence "But mama, papa is out".
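The clustering step of the pipeline can be sketched as follows. This is a minimal illustration, not the implementation of [5]: the function name `segment_by_alignment`, the matrix layout (one row per phoneme, one column per translation word) and the toy probabilities are assumptions made for the example.

```python
from typing import List, Sequence


def segment_by_alignment(phonemes: Sequence[str],
                         align: Sequence[Sequence[float]]) -> List[List[str]]:
    """Group neighboring phonemes whose alignment row peaks at the same word.

    Assumption: align[i][j] is the soft-alignment probability between
    phoneme i and translation word j (rows: phonemes, columns: words).
    """
    if len(phonemes) != len(align):
        raise ValueError("one alignment row per phoneme is required")
    segments: List[List[str]] = []
    previous_peak = None
    for phoneme, row in zip(phonemes, align):
        # Index of the translation word with the highest probability.
        peak = max(range(len(row)), key=row.__getitem__)
        if peak != previous_peak:
            segments.append([])  # the peak moved: open a new segment
            previous_peak = peak
        segments[-1].append(phoneme)
    return segments


# Toy input: phonemes taken from the Figure 1 example, invented probabilities
# over two hypothetical translation words.
phonemes = ["B", "AH1", "T", "M", "AA1", "M", "AH0"]
align = [
    [0.9, 0.1], [0.8, 0.2], [0.7, 0.3],
    [0.2, 0.8], [0.1, 0.9], [0.3, 0.7], [0.4, 0.6],
]
segments = segment_by_alignment(phonemes, align)
```

With this toy matrix the first three phonemes peak at the first word and the remaining four at the second, so two segments are produced.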

Parallel Speech Corpora
The parallel speech corpora used in this work are the English-French (EN-FR) [18] and the Mboshi-French (MB-FR) [19] parallel corpora. The EN-FR corpus is a 33,192-sentence multilingual extension of LibriSpeech [20], with English audio books automatically aligned to French translations. The MB-FR corpus is a 5,130-sentence corpus from the language documentation process of Mboshi (Bantu C25), an endangered language spoken in Congo-Brazzaville. Thus, while the former corpus has a larger vocabulary and longer sentences, the latter offers a more tailored environment, with short sentences and a simpler vocabulary. In order to provide a fair comparison, as well as to study the impact of corpus size, the EN-FR corpus was also down-sampled to 5K utterances (to exactly the same size as the MB-FR corpus). Sub-sampling was conducted preserving the average number of tokens per sentence, shown in Table 1.

Introducing Average Normalized Entropy (ANE)
In this paper, we focus on studying the soft-alignment probability matrices resulting from the learning of S2S models for the UWS task. To assess the overall quality of these matrices without having gold alignment information, we introduce Average Normalized Entropy (ANE). Definition: Given the source and target pair $(s, t)$ of lengths $|s|$ and $|t|$ respectively, for every phone $t_i$, the normalized entropy (NE) is computed considering all possible words in $s$ (Equation 1), where $P(t_i, s_j)$ is the alignment probability between the phone $t_i$ and the word $s_j$ (a cell in the matrix). The ANE for a sentence is then defined by the arithmetic mean over the resulting NE for every phone from the sequence $t$ (Equation 2).
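The two equations referenced in the definition can be written as below. This is a reconstruction from the textual definition rather than a copy of the original typesetting. In particular, using $|s|$ as the logarithm base is an assumption, chosen because it normalizes the entropy to the interval $[0, 1]$.

```latex
\begin{align}
NE(t_i, s) &= -\sum_{j=1}^{|s|} P(t_i, s_j)\,\log_{|s|} P(t_i, s_j) \tag{1}\\
ANE(t, s) &= \frac{1}{|t|} \sum_{i=1}^{|t|} NE(t_i, s) \tag{2}
\end{align}
```

Under this form, a phone whose probability mass sits on a single word has $NE = 0$, and a phone aligned uniformly to all $|s|$ words has $NE = 1$.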
From this definition, we can derive ANE for different granularities (sub or supra-sentential) by accumulating its value for the full corpus, for a single type or for a single token. Corpus ANE will be used to summarize the overall performance of an S2S model on a specific corpus. Token ANE extends ANE to tokens by averaging NE for all phonemes from a single (discovered) token. Type ANE results from averaging the ANE for every token instance of a discovered type. Finally, Alignment ANE is the result of averaging the ANE for every discovered (type, translation word) alignment pair. The intuition that lower ANEs correspond to better alignments is exemplified in Figure 1.
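A minimal sketch of sentence-level and token-level ANE is given below. It assumes that the entropy is normalized by taking the logarithm in base $|s|$ and that each matrix row holds the alignment distribution of one phone; the function names are illustrative only.

```python
import math
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def normalized_entropy(row: Sequence[float]) -> float:
    """NE of one phone's alignment distribution over the |s| source words."""
    n = len(row)
    if n < 2:
        return 0.0  # assumption: a single source word leaves no uncertainty
    # Zero-probability cells contribute nothing to the entropy.
    return -sum(p * math.log(p, n) for p in row if p > 0.0)


def sentence_ane(matrix: Matrix) -> float:
    """Arithmetic mean of NE over every phone (row) of the sentence."""
    return sum(normalized_entropy(row) for row in matrix) / len(matrix)


def token_ane(matrix: Matrix, start: int, end: int) -> float:
    """Mean NE over the phones of one discovered token, rows start..end-1."""
    return sentence_ane(matrix[start:end])


# Toy matrix: a sharply aligned phone followed by an ambiguous one.
toy = [[1.0, 0.0], [0.5, 0.5]]
toy_ane = sentence_ane(toy)
```

Type ANE and Alignment ANE then follow by averaging `token_ane` over the relevant token instances.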

Empirical Comparison of S2S Models
We compare three NMT models (§3.1, §3.2, §3.3) for UWS, focusing on their ability to align words (French) with phonemes (English or Mboshi) in medium-low resource settings. The results, an analysis of the impact of data size and quality, and the correlation between intrinsic (ANE) and extrinsic (boundary F-score) metrics are presented in §3.4. The application of ANE for type discovery in low-resource settings is presented in §3.5.

RNN: Attention-based Encoder-Decoder
The classic RNN encoder-decoder model [1] connects a bidirectional encoder with a unidirectional decoder through an alignment module. The RNN encoder learns annotations for every source token, and these are weighted by the alignment module for the generation of every target token. Weights are defined as context vectors, since they capture the importance of every source token for the generation of each target token.
Attention mechanism: a context vector for a decoder step $t$ is computed using the set of source annotations $H$ and the last state of the decoder network (translation context). The attention is the result of the weighted sum of the source annotations $H$ (with $H = \{h_1, \dots, h_A\}$) and their $\alpha$ probabilities (Equation 3), obtained through a feed-forward network align (Equation 4).
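The computation described above can be sketched as a softmax over alignment scores followed by a weighted sum of the annotations. In this sketch the scoring function is passed in as a parameter, and the `dot_align` scorer is only a stand-in for illustration: the model of [1] uses a feed-forward network for that role.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(x - top) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(annotations: Sequence[Vector],
                      decoder_state: Vector,
                      align: Callable[[Vector, Vector], float]
                      ) -> Tuple[List[float], List[float]]:
    """Return (context vector, alpha weights) for one decoder step."""
    # One score per source annotation, given the previous decoder state.
    scores = [align(h, decoder_state) for h in annotations]
    alphas = softmax(scores)
    dim = len(annotations[0])
    # Weighted sum of the source annotations.
    context = [sum(a * h[d] for a, h in zip(alphas, annotations))
               for d in range(dim)]
    return context, alphas


def dot_align(h: Vector, s: Vector) -> float:
    """Hypothetical scorer used only for this example (a dot product)."""
    return sum(x * y for x, y in zip(h, s))


context, alphas = attention_context([[1.0, 0.0], [0.0, 1.0]],
                                    [1.0, 0.0], dot_align)
```

The `alphas` returned for every decoder step form one row of the soft-alignment probability matrix used for segmentation.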

Transformer
Transformer [3] is a fully attentional S2S architecture, which has obtained state-of-the-art results for several NMT shared tasks. It replaces the use of sequential cell units (such as LSTM) by Multi-Head Attention (MHA) operations, which make the architecture considerably faster. Both encoder and decoder networks are stacks of layers that receive source and target sequences, embedded and combined with positional encodings. An encoder layer is made of two sub-layers: a Self-Attention MHA and a feed-forward sub-layer. A decoder layer is made of three sub-layers: a masked Self-Attention MHA (no access to subsequent positions); an Encoder-Decoder MHA (operation over the encoder stack's final output and the decoder self-attention output); and a feed-forward sub-layer. Dropout and residual connections are applied between all sub-layers. Final output probabilities are generated by applying a linear projection over the decoder stack's output, followed by a softmax operation.
Multi-head attention mechanism: attention is seen as a mapping problem. Given a set of key-value pairs and a query vector, the task is to compute the weighted sum of the given values (the output). In this setup, weights are obtained through compatibility functions between key-query pairs (of dimension $d_k$). For a given set of queries ($Q$), keys ($K$) and values ($V$), the Scaled Dot-Product (SDP) Attention function is computed as $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^\top / \sqrt{d_k})\,V$ (5). In practice, several attentions are computed for a given QKV set. The QKV set is first projected into $h$ different spaces (multiple heads), where the SDP attention is computed in parallel.
Resulting values for all heads are then concatenated and once again projected, yielding the layer's output. Equations (6) and (7) illustrate the process, in which $H$ is the set of $h$ heads ($H = \{h_1, \dots, h_h\}$) and $f$ is a linear projection. Self-Attention defines the case where queries, keys and values come from the same source (learning compatibility functions within the same sequence of elements).
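The scaled dot-product and multi-head computations can be sketched with plain Python lists as below. This is an illustrative sketch under stated assumptions: matrices are lists of rows, the per-head projections are supplied as explicit weight matrices, and biases, masking and dropout are left out.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def matmul(a: Sequence[Sequence[float]], b: Sequence[Sequence[float]]) -> Matrix:
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax_rows(m: Sequence[Sequence[float]]) -> Matrix:
    """Row-wise, numerically stable softmax."""
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(x - top) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def sdp_attention(q, k, v) -> Tuple[Matrix, Matrix]:
    """softmax(Q K^T / sqrt(d_k)) V, returning (output, attention weights)."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[x / math.sqrt(d_k) for x in row] for row in matmul(q, k_t)]
    weights = softmax_rows(scores)
    return matmul(weights, v), weights


def multi_head_attention(q, k, v, heads, w_out) -> Matrix:
    """heads: list of (w_q, w_k, w_v) projections; w_out: final projection f."""
    outputs = []
    for w_q, w_k, w_v in heads:
        out, _ = sdp_attention(matmul(q, w_q), matmul(k, w_k), matmul(v, w_v))
        outputs.append(out)
    # Concatenate the head outputs position by position, then project.
    concat = [sum((o[i] for o in outputs), []) for i in range(len(q))]
    return matmul(concat, w_out)


identity = [[1.0, 0.0], [0.0, 1.0]]
q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
single, weights = sdp_attention(q, k, v)
# Two identical identity-projected heads, averaged by the output projection.
w_out = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.0], [0.0, 0.5]]
multi = multi_head_attention(q, k, v, [(identity,) * 3, (identity,) * 3], w_out)
```

Each head yields its own `weights` matrix, which is why a single head (or an average of heads) has to be chosen before the matrices can be used for segmentation.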

CNN: Pervasive Attention
Unlike the previous models, which are based on encoder-decoder structures interfaced by attention mechanisms, this approach relies on a single 2D CNN across both sequences (no separate coding stages) [2]. Using masked convolutions, an auto-regressive model predicts the next output symbol based on a joint representation of both input and partial output sequences. Given a source-target pair $(s, t)$ of lengths $|s|$ and $|t|$ respectively, tokens are first embedded in $d_s$- and $d_t$-dimensional spaces via look-up tables. Token embeddings $\{x_1, \dots, x_{|s|}\}$ and $\{y_1, \dots, y_{|t|}\}$ are then concatenated to form a 3D tensor $X \in \mathbb{R}^{|t| \times |s| \times f_0}$, with $f_0 = d_t + d_s$, where $X_{ij} = [y_i \; x_j]$. Each convolutional layer $l \in \{1, \dots, L\}$ of the model produces a tensor $H_l$ of size $|t| \times |s| \times f_l$, where $f_l$ is the number of output channels for that layer. To compute a distribution over the tokens in the output vocabulary, the second dimension of the tensor is used. This dimension is of variable length (given by the input sequence) and it is collapsed by max or average pooling to obtain the tensor $H_L^{\mathrm{Pool}}$ of size $|t| \times f_L$. Finally, a 1×1 convolution followed by a softmax operation is applied, resulting in the distribution over the target vocabulary for the next output token.
Attention mechanism: joint encoding acts as an attention-like mechanism, since individual source elements are re-encoded as the output is generated. The self-attention approach of [21] is applied. It computes the attention weight tensor $\alpha$, of size $|t| \times |s|$, from the last activation tensor $H_L$, to pool the elements of the same tensor along the source dimension, as follows: $\alpha = \mathrm{softmax}(W_1 \tanh(H_L W_2))$, where $W_1 \in \mathbb{R}^{f_a}$ and $W_2 \in \mathbb{R}^{f_a \times f_L}$ are weight tensors that map the $f_L$-dimensional features in $H_L$ to the attention weights via an $f_a$-dimensional intermediate representation.
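The max or average pooling over the source dimension mentioned above can be sketched as follows. The nested-list layout (target position, then source position, then channel) and the function name are assumptions for illustration; an actual implementation would operate on framework tensors.

```python
from typing import List, Sequence

Tensor3D = Sequence[Sequence[Sequence[float]]]


def pool_source_dimension(h: Tensor3D, mode: str = "max") -> List[List[float]]:
    """Collapse a |t| x |s| x f_L tensor to |t| x f_L along the source axis.

    Assumption: h[i][j] is the f_L-dimensional feature vector for target
    position i and source position j.
    """
    if mode == "max":
        reduce = max
    elif mode == "avg":
        reduce = lambda values: sum(values) / len(values)
    else:
        raise ValueError("mode must be 'max' or 'avg'")
    # zip(*row) regroups the |s| feature vectors of one target position
    # into f_L tuples, one per channel, which are then reduced.
    return [[reduce(channel) for channel in zip(*row)] for row in h]


# One target position, two source positions, two channels.
h = [[[1.0, 4.0], [3.0, 2.0]]]
pooled_max = pool_source_dimension(h, "max")
pooled_avg = pool_source_dimension(h, "avg")
```

Because the pooled tensor no longer depends on $|s|$, the same 1×1 convolution can be applied regardless of the source sentence length.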

Comparing S2S Architectures
For each S2S architecture, and each of the three corpora, we train five models (runs) with different initialization seeds. Before segmenting, we average the produced matrices from the five different runs as in [5]. Evaluation is done in a bilingual segmentation condition that corresponds to the real UWS task.
In addition, we also perform segmentation in a monolingual condition, where a phoneme sequence is segmented with regard to the corresponding word sequence (transcription) in the same language (hence monolingual). Our networks are optimized for the monolingual task. Across all architectures, we use embeddings of size 64 and batch size of 32 (5K data set), or embeddings of size 128 and batch size of 64 (33K data set). Dropout of 0.5 and an early-stopping procedure are applied in all cases. RNN models have only one layer, a bi-directional encoder, and cell size equal to the embedding size, as in [5]. CNN models use the hyper-parameters from [2] with only 3 layers (5K data set), or 6 (33K data set), and kernel size of 3. Transformer models were optimized starting from the original hyper-parameters of [3]. Best results (among 50 setups) were achieved using 2 heads, 3 layers (encoder and decoder), warm-up of 5K steps, and using cross-entropy loss without label-smoothing. Finally, for selecting which head to use for UWS, we experimented with either averaging the last layer's heads or selecting the head with minimum corpus ANE. While the results were not significantly different, we kept the ANE selection.
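Averaging the matrices of several runs and selecting a Transformer head by minimum corpus ANE can be sketched as below. The sketch assumes that corpus ANE is the mean of the sentence-level ANE values and that entropy is normalized with a base-$|s|$ logarithm; the data layout and function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Matrix = Sequence[Sequence[float]]


def normalized_entropy(row: Sequence[float]) -> float:
    """Entropy of one alignment row, normalized by log base |s|."""
    n = len(row)
    if n < 2:
        return 0.0
    return -sum(p * math.log(p, n) for p in row if p > 0.0)


def sentence_ane(matrix: Matrix) -> float:
    return sum(normalized_entropy(row) for row in matrix) / len(matrix)


def corpus_ane(matrices: Sequence[Matrix]) -> float:
    """Assumption: corpus ANE is the mean of the sentence-level ANE values."""
    return sum(sentence_ane(m) for m in matrices) / len(matrices)


def average_runs(run_matrices: Sequence[Matrix]) -> List[List[float]]:
    """Element-wise mean of same-shaped matrices produced by different runs."""
    n = len(run_matrices)
    return [[sum(cells) / n for cells in zip(*rows)]
            for rows in zip(*run_matrices)]


def select_head(heads: Dict[str, Sequence[Matrix]]) -> str:
    """Return the identifier of the head with the lowest corpus ANE."""
    return min(heads, key=lambda name: corpus_ane(heads[name]))


averaged = average_runs([[[1.0, 0.0]], [[0.0, 1.0]]])
best = select_head({"head_0": [[[0.5, 0.5]]], "head_1": [[[0.9, 0.1]]]})
```

In the toy call, `head_1` is chosen because its single row is sharper than the uniform row of `head_0`.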

Unsupervised Word Segmentation Results
The word boundary F-scores for the task of UWS from phoneme sequences (in Mboshi or English) are presented in Table 2, with monolingual results shown for information only (topline). Surprisingly, RNN models outperform the more recent (CNN and Transformer) approaches. One possible explanation is the lower number of parameters: for a 5K setup, RNN models train about 700K parameters on average, while CNN models need an additional 30.79% and Transformer models an additional 5.31%. However, for 33K setups, CNNs actually need 30% fewer parameters than RNNs, but still perform worse.
Transformer's low performance could be due to the use of several heads "distributing" alignment information across different matrices. Nonetheless, we evaluated averaged heads and single-head models, and these resulted in significant decreases in performance. This suggests that this architecture may not need to learn explicit alignment to translate, but instead it could be capturing different kinds of linguistic information, as discussed in the original paper and in its examples [3]. Also, on the decoder side, the behavior of the self-attention mechanism on phoneme units is unclear and under-studied. The work in [15] performed after-training encoder head removal based on head confidence, showing that after initial training, most heads were not necessary for maintaining translation performance. Hence, we find the interpretation of the Multi-head mechanism challenging, and maybe not suitable for a direct word segmentation application such as our method.
As in [24], our best UWS method (RNN) for the bilingual task does not reach the performance level of a strong Bayesian baseline [25], which obtains F-scores of 89.80 (EN33K), 87.93 (EN5K), and 77.00 (MB5K). However, even if we only evaluate word segmentation performance, our neural approaches learn to segment and align, whereas this baseline only learns to segment. Section 3.5 will leverage those alignments for a type discovery task useful in language documentation.
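For intuition, a boundary F-score can be computed on phoneme sequences as in the sketch below. This is a simplified, symbol-level illustration and not the ZRC evaluation suite, which works on timestamps; the function names and the toy segmentations are assumptions.

```python
from typing import List, Sequence, Set, Tuple


def boundaries(segments: Sequence[Sequence[str]]) -> Set[int]:
    """Internal boundary positions, counted in phonemes from the start."""
    positions, offset = set(), 0
    for segment in segments[:-1]:  # the utterance end is not a boundary
        offset += len(segment)
        positions.add(offset)
    return positions


def boundary_prf(hypothesis: Sequence[Sequence[str]],
                 gold: Sequence[Sequence[str]]) -> Tuple[float, float, float]:
    """Boundary precision, recall and F-score of a hypothesis segmentation."""
    hyp, ref = boundaries(hypothesis), boundaries(gold)
    correct = len(hyp & ref)
    precision = correct / len(hyp) if hyp else 0.0
    recall = correct / len(ref) if ref else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


gold = [["B", "AH1", "T"], ["M", "AA1", "M", "AH0"], ["P", "AA1", "P", "AH0"]]
hyp = [["B", "AH1", "T"], ["M", "AA1"], ["M", "AH0", "P", "AA1", "P", "AH0"]]
precision, recall, f_score = boundary_prf(hyp, gold)
```

Here the hypothesis finds one of the two gold boundaries and proposes one spurious boundary, so precision, recall and F-score are all 0.5.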
The Pearson's ρ correlation coefficients between ANE and boundary F-scores for all mono and bilingual runs of all corpora (N = 30) are −0.98 (RNN), −0.97 (CNN), and −0.66 (Transformer), with p-values smaller than $10^{-5}$. These strong negative correlations confirm our hypothesis that lower ANEs correspond to sharper and better alignments.
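The correlation analysis can be reproduced in outline with a plain Pearson coefficient, as in this sketch. The ANE and F-score values used here are invented toy numbers, not results from Table 2, and the p-value computation is omitted.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return covariance / math.sqrt(var_x * var_y)


# Invented toy values: one (corpus ANE, boundary F-score) pair per run.
toy_ane = [0.10, 0.20, 0.30, 0.40]
toy_f_score = [0.90, 0.80, 0.70, 0.60]
rho = pearson(toy_ane, toy_f_score)
```

A coefficient close to −1, as in this toy case, means that runs with lower ANE tend to have higher boundary F-scores.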

Impact of Data Size and Quality
The EN33K and EN5K results of Table 2 allow us to analyze the impact of data size on the S2S models. For the bilingual task, when moving from EN33K to EN5K, RNN performance drops by 7% on average, whereas the performance drop is larger for CNN (14-15%). Transformer performs poorly in both cases, and increasing data size from 5K to 33K seems to help only for a trivial task (see monolingual results).
The EN5K and MB5K results of Table 2 reflect the impact of language pairs on the S2S models. We know from [26,27] that English should be easier to segment than Mboshi, and this was confirmed by both dpseg and monolingual results. However, this trend is not confirmed in the bilingual task, where the quality of the (sentence aligned) parallel corpus seems to have more impact (higher boundary F-scores for MB5K than for EN5K for all S2S models). As shown in Table 1, the MB-FR corpus has shorter sentences and smaller lexicon diversity, while the EN-FR corpus is made of automatically aligned books (noisy alignments), which may explain our experimental results.

Type Discovery in Low-Resource Settings
We investigate the use of Alignment ANE as a confidence measure. From the RNN models, we extract and rank the discovered types by their ANE, and examine whether it can be used to separate true words in the discovered vocabulary from the rest. The results for low-resource scenarios (only 5K) in Table 3 suggest that low ANE corresponds to the portion of the discovered vocabulary the network is confident about, and these are, in most of the cases, true discovered lexical items (first row, P ≥ 70%). As we keep higher Alignment ANE values, we increase recall but lose precision. This suggests that, in a documentation scenario, ANE could be used as a confidence measure by a linguist to extract a list of types with higher precision, without having to pass through all the discovered vocabulary. Moreover, as exemplified for EN5K in Table 4, we also retrieve aligned information (translation candidates) for the generated lexicon.
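The filtering idea can be sketched as follows: Alignment ANE is accumulated per (type, translation word) pair, and only the pairs under a threshold are kept. The token list, the aligned French words, the ANE values and the threshold are all invented for illustration and do not come from Table 3 or Table 4.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Pair = Tuple[str, str]


def alignment_ane(tokens: Iterable[Tuple[str, str, float]]) -> Dict[Pair, float]:
    """Average token ANE for every (discovered type, translation word) pair."""
    sums = defaultdict(lambda: [0.0, 0])
    for discovered_type, word, ane in tokens:
        entry = sums[(discovered_type, word)]
        entry[0] += ane
        entry[1] += 1
    return {pair: total / count for pair, (total, count) in sums.items()}


def filter_types(pair_ane: Dict[Pair, float], threshold: float) -> Set[str]:
    """Keep the types that have at least one pair at or under the threshold."""
    return {t for (t, _), ane in pair_ane.items() if ane <= threshold}


def precision_recall(selected: Set[str], gold: Set[str]) -> Tuple[float, float]:
    correct = len(selected & gold)
    precision = correct / len(selected) if selected else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Invented toy tokens: (discovered type, aligned word, token ANE).
tokens = [
    ("MAA1MAH0", "maman", 0.10),
    ("MAA1MAH0", "maman", 0.20),
    ("PAA1PAH0", "papa", 0.15),
    ("AH1TM", "mais", 0.80),  # a wrong segment with a diffuse alignment
]
gold_vocabulary = {"BAH1T", "MAA1MAH0", "PAA1PAH0", "IH0Z", "AW1T"}
pairs = alignment_ane(tokens)
kept = filter_types(pairs, threshold=0.5)
precision, recall = precision_recall(kept, gold_vocabulary)
```

Raising the threshold in this toy setup would admit the wrong segment as well, illustrating the precision and recall trade-off described above.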

Conclusions
We presented an empirical evaluation of different architectures (RNN, CNN and Transformer) with respect to their capacity to align word sequences in a source language with phoneme sequences in a target language, inferring from it word segmentation on the target side (UWS task). Although RNNs have been outperformed by CNN and Transformer-based models for machine translation, for UWS they remain more robust in low-resource scenarios, and present the best segmentation results. We also introduced ANE, an intrinsic measure of alignment quality of S2S models. Accumulating it over the discovered alignments, we showed it can be used as a confidence measure to select true words, increasing Type F-scores.