Alshutayri, A orcid.org/0000-0001-8550-0597 and Atwell, E orcid.org/0000-0001-9395-3764 (2018) Arabic dialects annotation using an online game. In: ICNLSP 2018: 2nd International Conference on Natural Language and Speech Processing. 2nd International Conference on Natural Language and Speech Processing (ICNLSP 2018), 25-26 Apr 2018, Algiers, Algeria. IEEE ISBN 978-1-5386-4543-7
Abstract
Modern Standard Arabic is the written standard across the Arab world, but Arabic dialects are increasingly used in social media, which makes social media an appropriate source of a corpus for research on classifying Arabic dialect texts using machine learning algorithms. An important first step is annotating the text corpus with correct dialect tags. We collected tweets from Twitter and comments from Facebook and online newspapers, aiming for representative samples of five groups of Arabic dialects: Gulf, Iraqi, Egyptian, Levantine, and North African. We then explored a crowdsourcing approach to corpus annotation. The annotation task was developed as an online game in which players can test their dialect classification skills and receive a score reflecting their knowledge. This approach has so far produced 24K annotated documents containing 587K tokens: 16,179 documents tagged as dialect and 7,821 as Modern Standard Arabic.
Metadata
Item Type: | Proceedings Paper |
---|---|
Authors/Creators: | Alshutayri, A; Atwell, E |
Copyright, Publisher and Additional Information: | © 2018 IEEE. This is an author produced version of a paper published in ICNLSP 2018: 2nd International Conference on Natural Language and Speech Processing. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. Uploaded in accordance with the publisher's self-archiving policy. |
Dates: |
|
Institution: | The University of Leeds |
Academic Units: | The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds) |
Depositing User: | Symplectic Publications |
Date Deposited: | 11 Jun 2018 11:19 |
Last Modified: | 23 Jan 2019 10:16 |
Status: | Published |
Publisher: | IEEE |
Identification Number: | 10.1109/ICNLSP.2018.8374371 |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:131819 |