Aker, A., Petrak, J. and Sabbah, F. (2017) An extensible multilingual open source lemmatizer. In: Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017. International Conference Recent Advances in Natural Language Processing, 02-08 Sep 2017, Varna, Bulgaria. ACL , pp. 40-45.
Abstract
We present GATE DictLemmatizer, a multilingual open source lemmatizer for the GATE NLP framework that currently supports English, German, Italian, French, Dutch, and Spanish, and is easily extensible to other languages. The software is freely available under the LGPL license. The lemmatization is based on the Helsinki Finite-State Transducer Technology (HFST) and lemma dictionaries automatically created from Wiktionary. We evaluate the performance of the lemmatizers against TreeTagger, which is only freely available for research purposes. Our evaluation shows that DictLemmatizer achieves similar or even better results than TreeTagger for languages where there is support from HFST. The performance drops when there is no support from HFST and the entire lemmatization process is based on lemma dictionaries. However, the results are still satisfactory given the fact that DictLemmatizer isopen-source and can be easily extended to other languages. The software for extending the lemmatizer by creating word lists from Wiktionary dictionaries is also freely available as open-source software.
Metadata
Item Type: | Proceedings Paper |
---|---|
Authors/Creators: |
|
Copyright, Publisher and Additional Information: | © 2017 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
Keywords: | Natural Language Processing |
Dates: |
|
Institution: | The University of Sheffield |
Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield) |
Funding Information: | Funder Grant number EUROPEAN COMMISSION - FP6/FP7 PHEME - 611233 EUROPEAN COMMISSION - HORIZON 2020 COMRADES - 687847 |
Depositing User: | Symplectic Sheffield |
Date Deposited: | 05 Jan 2018 16:40 |
Last Modified: | 05 Jan 2018 16:40 |
Published Version: | https://doi.org/10.26615/978-954-452-049-6_006 |
Status: | Published |
Publisher: | ACL |
Refereed: | Yes |
Identification Number: | 10.26615/978-954-452-049-6_006 |
Related URLs: | |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:124254 |