Babych, B and Sharoff, S (2016) Ukrainian part-of-speech tagger for hybrid MT: Rapid induction of morphological disambiguation resources from a closely related language. In: Fifth Workshop on Hybrid Approaches to Translation (HyTra). European Association for Machine Translation (EAMT) Annual Conference 2016, 01 Jun 2016, Riga, Latvia. Universitat Pompeu Fabra
Abstract
This paper presents a methodology for rapid development of Ukrainian morphological disambiguation resources for a Ukrainian part-of-speech (PoS) tagger and lemmatiser now used in our hybrid MT system. The work is motivated by the need to disambiguate morphological features that result in different translations in rule-based MT and to address the out-of-vocabulary (OOV) problem in statistical MT by training factored models. Without morphological disambiguation, a larger training or development corpus would be needed to achieve acceptable coverage. Ukrainian, like many other under-resourced languages, does not have publicly released wide-coverage morphological annotation resources in a standardised form. However, it has a smaller-scale non-disambiguating tagger with a lexicon of 15k frequent lemmas, which covers 200k unique word forms and generates on average 1.5 ambiguous tags per token (Kotsyba et al., 2009). It is based on a systematic linguistic description and a rich tagset for Ukrainian morphology developed within the MULTEXT-East project (Erjavec, 2012; Kotsyba et al., 2010). On the other hand, for a better-resourced language, such as Russian, there exist open morphological disambiguation resources, e.g., parameter files for the language-independent TnT tagger trained on a large manually annotated Russian corpus, with estimated tag emission and transition probabilities (Sharoff and Nivre, 2011). Our methodology is based on the assumption that the syntax and morphology of historically related languages change more slowly than the lexicon, so sentences in them should normally have similar sequences of corresponding morphological features, even when large parts of the lexicon are no longer cognate. Under this assumption, the transition probabilities for the Ukrainian tags are estimated by systematically mapping the tags in the Russian transition parameter file into the Ukrainian tagset.
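The tag-mapping idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tag names, the mapping table, the counts and the in-memory dictionary representation are all hypothetical, and the plain maximum-likelihood estimate stands in for whatever smoothing the tagger itself applies. It only shows how tag n-gram counts collected for one tagset might be re-keyed into another, with counts summed where several source tags collapse onto one target tag.

```python
from collections import Counter

# Hypothetical source-to-target tag mapping (illustrative tags only; a real
# mapping would be built by hand with linguistic expertise in both languages).
RU_TO_UK = {
    "N": "N",
    "V": "V",
    "A-full": "A",   # assumed: two source tags collapse onto one target tag
    "A-short": "A",
}

def map_ngram_counts(source_counts, tag_map):
    """Re-key tag n-gram counts from the source tagset to the target tagset.

    N-grams containing an unmapped tag are dropped; n-grams that collapse
    onto the same target n-gram have their counts summed.
    """
    target_counts = Counter()
    for ngram, count in source_counts.items():
        if all(tag in tag_map for tag in ngram):
            target_counts[tuple(tag_map[tag] for tag in ngram)] += count
    return target_counts

def transition_probs(counts):
    """Maximum-likelihood P(last tag | preceding tags) from n-gram counts."""
    context_totals = Counter()
    for ngram, count in counts.items():
        context_totals[ngram[:-1]] += count
    return {ng: c / context_totals[ng[:-1]] for ng, c in counts.items()}

# Made-up bigram counts standing in for a source-language parameter file.
ru_bigrams = {
    ("A-full", "N"): 30,
    ("A-short", "N"): 10,
    ("A-full", "V"): 10,
    ("X", "N"): 5,       # "X" has no mapping, so this bigram is dropped
}

uk_bigrams = map_ngram_counts(ru_bigrams, RU_TO_UK)
uk_probs = transition_probs(uk_bigrams)
```

In this toy example the two adjective tags merge, so the mapped counts for the bigram ("A", "N") are the sum of the two source counts.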
This mapping is not straightforward and requires linguistic expertise in both languages, as even closely related languages have many unique category/value combinations, resulting in different tagsets. Nevertheless, the development time is much shorter than would be required for manually annotating the Ukrainian corpus needed for training the TnT tagger from scratch. Our baseline system described in this paper gives only an unsupervised approximation of the tag sequences in the Ukrainian corpus. It also uses tag emissions that are trivially derived from the seed lexicon, with equal probability settings for tags emitted by ambiguous word forms, and only lemmas mapped or disambiguated from the sample lexicon. However, this baseline is relatively strong, as it gives acceptable accuracy and coverage for morphological annotation tasks. We report evaluation results for the Ukrainian news corpus and outline techniques for improving the baseline system, which include iterative re-estimation of emission and transition probabilities and iterative learning of rewriting operations for lemmatisation of previously unseen word forms. Resources are made freely available in the public domain at http://corpus.leeds.ac.uk/svitlana/tnt/ua/.
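The "equal probability settings" for ambiguous word forms can be sketched in the same hedged way. The function name, the coarse tags and the tiny lexicon below are assumptions made for illustration, not the format of the released resources: each word form simply receives a uniform weight over the tags the seed lexicon lists for it.

```python
from collections import defaultdict

def uniform_emissions(lexicon):
    """Give every candidate tag of a word form the same weight.

    lexicon: iterable of (word_form, tag) pairs, as a non-disambiguating
    lexicon might list them. Returns {word_form: {tag: weight}}.
    """
    tags_by_form = defaultdict(set)
    for form, tag in lexicon:
        tags_by_form[form].add(tag)
    return {
        form: {tag: 1.0 / len(tags) for tag in sorted(tags)}
        for form, tags in tags_by_form.items()
    }

# Toy seed lexicon with hypothetical coarse tags. The form "мати" is
# ambiguous between a noun ("mother") and a verb ("to have").
seed_lexicon = [
    ("мати", "N"),
    ("мати", "V"),
    ("книга", "N"),
]

emissions = uniform_emissions(seed_lexicon)
```

Uniform weights of this kind carry no corpus evidence, so in such a setup the choice between competing tags rests on the transition model until the emissions are re-estimated.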
Metadata
Item Type: | Proceedings Paper |
---|---|
Authors/Creators: | Babych, B; Sharoff, S |
Keywords: | PoS tagging; lemmatization; morphological disambiguation; closely-related languages; under-resourced languages; Ukrainian; Russian; Hybrid MT; rapid development |
Dates: |
|
Institution: | The University of Leeds |
Academic Units: | The University of Leeds > Faculty of Arts, Humanities and Cultures (Leeds) > School of Languages Cultures & Societies (Leeds) > Translation Studies (Leeds) |
Depositing User: | Symplectic Publications |
Date Deposited: | 15 Jun 2016 15:01 |
Last Modified: | 11 Apr 2017 21:19 |
Published Version: | http://glicom.upf.edu/hytra2016/program.html |
Status: | Published online |
Publisher: | Universitat Pompeu Fabra |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:100896 |