Peng, X. orcid.org/0000-0001-5787-9982, Lin, C. and Stevenson, R. orcid.org/0000-0002-9483-6006 (2021) Cross-lingual word embedding refinement by ℓ1 norm optimisation. In: Toutanova, K., Rumshisky, A., Zettlemoyer, L., Hakkani-Tur, D., Beltagy, I., Bethard, S., Cotterell, R., Chakraborty, T. and Zhou, Y., (eds.) Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. The 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 06-11 Jun 2021, Virtual conference. The Association for Computational Linguistics, pp. 2690-2701. ISBN 9781954085466
Abstract
Cross-Lingual Word Embeddings (CLWEs) encode words from two or more languages in a shared high-dimensional space in which vectors representing words with similar meaning (regardless of language) are closely located. Existing methods for building high-quality CLWEs learn mappings that minimise the L2 norm loss function. However, this optimisation objective has been demonstrated to be sensitive to outliers. Based on the more robust Manhattan norm (a.k.a. ℓ1 norm) goodness-of-fit criterion, this paper proposes a simple post-processing step to improve CLWEs. An advantage of this approach is that it is fully agnostic to the training process of the original CLWEs and can therefore be applied widely. Extensive experiments are performed involving ten diverse languages and embeddings trained on different corpora. Evaluation results based on bilingual lexicon induction and cross-lingual transfer for natural language inference tasks show that the L1 refinement substantially outperforms four state-of-the-art baselines in both supervised and unsupervised settings. It is therefore recommended that this strategy be adopted as a standard for CLWE methods.
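The abstract's point about outlier sensitivity can be illustrated with a toy calculation. The sketch below is not the paper's method or data: the residual values are invented, the L2 objective is assumed to be a sum of squared errors, and the helper names `l1_loss` and `l2_loss` are hypothetical. It only shows that a single badly aligned word pair accounts for a larger share of a squared L2 loss than of an ℓ1 loss.

```python
import numpy as np

def l1_loss(residuals):
    # Manhattan (l1) loss: sum of absolute residuals.
    return np.abs(residuals).sum()

def l2_loss(residuals):
    # Squared L2 loss (assumed form): sum of squared residuals.
    return np.square(residuals).sum()

# Invented residuals between mapped source vectors and their target
# translations: four well-aligned pairs and one outlier, dimension 2.
residuals = np.array([
    [0.1, -0.1],
    [-0.1, 0.1],
    [0.1, 0.1],
    [-0.1, -0.1],
    [2.0, -2.0],   # outlier pair
])
outlier = residuals[-1:]

# Fraction of each total loss that comes from the outlier alone.
l1_share = l1_loss(outlier) / l1_loss(residuals)
l2_share = l2_loss(outlier) / l2_loss(residuals)

print(f"outlier share of l1 loss: {l1_share:.3f}")
print(f"outlier share of squared L2 loss: {l2_share:.3f}")
```

Because squaring amplifies large residuals, the outlier dominates the squared L2 total more heavily, so a mapping fitted under that objective is pulled further towards it than one fitted under the ℓ1 criterion.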
Metadata
Item Type: | Proceedings Paper |
---|---|
Copyright, Publisher and Additional Information: | © 2021 Association for Computational Linguistics. Available under a Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/). |
Institution: | The University of Sheffield |
Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield) |
Depositing User: | Symplectic Sheffield |
Date Deposited: | 14 Jun 2021 10:35 |
Last Modified: | 14 Jun 2021 10:35 |
Status: | Published |
Publisher: | The Association for Computational Linguistics |
Refereed: | Yes |
Identification Number: | 10.18653/v1/2021.naacl-main.214 |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:175136 |