A Basic Language Resource Kit Implementation for the IgboNLP Project

Abstract

Igbo, an African language with around 32 million speakers worldwide, is one of the many languages that have few or none of the language processing resources needed for advanced language technology applications. In this article, we describe our approach to creating an initial set of resources for Igbo, including an electronic text corpus, a part-of-speech (POS) tagset, and a POS-tagged subcorpus. We discuss how the texts were gathered and preprocessed and how the POS-tagged corpus was developed. We also discuss some of the problems encountered during corpus and tagset development and the solutions we arrived at.

Metadata

Item Type:	Article
Authors/Creators:	Onyenwe, I.E.; Hepple, M. (https://orcid.org/0000-0003-1488-257X); Chinedu, U.; Ezeani, I.
Copyright, Publisher and Additional Information:	© 2018 ACM
Keywords:	Natural language processing (NLP); language technology; corpus annotation; part-of-speech (POS) tagging; tokenization; text processing; segmentation; normalization; African language; Igbo; corpora; morphology; interannotation agreement; human annotator; tagset
Dates:	Accepted: 1 September 2017; Published (online): 11 January 2018; Published: February 2018
Institution:	The University of Sheffield
Academic Units:	The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield)
Depositing User:	Symplectic Sheffield
Date Deposited:	15 May 2018 08:41
Last Modified:	30 Apr 2020 13:39
Published Version:	https://doi.org/10.1145/3146387
Status:	Published
Publisher:	ACM
Refereed:	Yes
Identification Number:	10.1145/3146387
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:130753
