Sharov, S (orcid.org/0000-0002-4877-0210) (2020) Finding next of kin: Cross-lingual embedding spaces for related languages. Natural Language Engineering, 26 (2). pp. 163-182. ISSN 1351-3249
Abstract
Some languages have very few NLP resources, while many of them are closely related to better-resourced languages. This paper explores how the similarity between the languages can be utilised by porting resources from better- to lesser-resourced languages. The paper introduces a way of building a representation shared across related languages by combining cross-lingual embedding methods with a lexical similarity measure which is based on the weighted Levenshtein distance. One of the outcomes of the experiments is a Panslavonic embedding space for nine Balto-Slavonic languages. The paper demonstrates that the resulting embedding space helps in such applications as morphological prediction, named-entity recognition and genre classification.
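The abstract mentions a lexical similarity measure based on the weighted Levenshtein distance. As a rough illustration only, the sketch below computes an edit distance in which selected character substitutions cost less than the default, then normalises it into a similarity score. The function names, the normalisation by the longer word's length, and the example substitution weight are assumptions made for illustration; they are not taken from the paper, which may define its weights and normalisation differently.

```python
def weighted_levenshtein(a, b, sub_cost=None, indel_cost=1.0):
    """Edit distance where some character substitutions are cheaper.

    sub_cost maps (char, char) pairs to a substitution cost; pairs not
    listed cost 1.0. The weights are hypothetical, not the paper's.
    """
    sub_cost = sub_cost or {}
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                sub = 0.0
            else:
                # Treat the weight table as symmetric.
                sub = sub_cost.get((ca, cb), sub_cost.get((cb, ca), 1.0))
            curr.append(min(prev[j] + indel_cost,      # delete ca
                            curr[j - 1] + indel_cost,  # insert cb
                            prev[j - 1] + sub))        # substitute or match
        prev = curr
    return prev[-1]


def lexical_similarity(a, b, sub_cost=None):
    """Map the weighted distance to a score where 1.0 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - weighted_levenshtein(a, b, sub_cost) / longest


# Hypothetical weight: make the o/a alternation cheap.
weights = {("o", "a"): 0.5}
print(lexical_similarity("moloko", "malako", weights))
```

With a cheap o/a substitution, the transliterated pair "moloko" and "malako" scores as more similar than it would under the unweighted distance, which is the kind of signal a cognate-aware measure is meant to capture.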
Metadata
Item Type: | Article |
---|---|
Authors/Creators: | Sharov, S |
Copyright, Publisher and Additional Information: | © Cambridge University Press 2019. This article has been published in a revised form in Natural Language Engineering https://doi.org/10.1017/S1351324919000354. This version is free to view and download for private research and study only. Not for re-distribution, re-sale or use in derivative works. |
Keywords: | Multilinguality; Text classification; Cross-lingual embeddings |
Dates: | |
Institution: | The University of Leeds |
Academic Units: | The University of Leeds > Faculty of Arts, Humanities and Cultures (Leeds) > School of Languages Cultures & Societies (Leeds) > Translation Studies (Leeds) |
Depositing User: | Symplectic Publications |
Date Deposited: | 25 Jun 2019 13:01 |
Last Modified: | 27 Mar 2020 13:08 |
Status: | Published |
Publisher: | Cambridge University Press |
Identification Number: | 10.1017/S1351324919000354 |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:147706 |