Alshutayri, AOO orcid.org/0000-0001-8550-0597 and Atwell, E (2017) Exploring Twitter as a Source of an Arabic Dialect Corpus. International Journal of Computational Linguistics (IJCL), 8 (2). pp. 37-44. ISSN 2180-1266
Abstract
Given the lack of Arabic dialect text corpora in comparison with what is available for dialects of English and other languages, there is a need to create dialect text corpora for use in Arabic natural language processing. Moreover, Arabic dialects are increasingly used in social media, which makes this text an appropriate source for a corpus. We collected 210,915K tweets from five groups of Arabic dialects: Gulf, Iraqi, Egyptian, Levantine, and North African. This paper explores Twitter as a source and describes the methods that we used to extract tweets and classify them according to the geographic location of the sender. We classified Arabic dialects using the Waikato Environment for Knowledge Analysis (WEKA) data analytics tool, which contains many alternative filters and classifiers for machine learning. Our approach to classifying tweets achieved an accuracy of 79%.
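The idea of labelling tweets by the sender's geographic location can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the country-to-dialect mapping, the tweet schema (`text` and `country_code` keys), and the function names `label_dialect` and `label_tweets` are all assumptions made for illustration, and the abstract does not state which countries were assigned to which dialect group.

```python
from typing import Iterable, List, Optional, Tuple

# Assumed mapping from ISO country codes to the five dialect groups named
# in the abstract. The assignment of countries is illustrative only and is
# not taken from the paper.
DIALECT_BY_COUNTRY = {
    "SA": "Gulf", "AE": "Gulf", "KW": "Gulf",
    "QA": "Gulf", "BH": "Gulf", "OM": "Gulf",
    "IQ": "Iraqi",
    "EG": "Egyptian",
    "SY": "Levantine", "LB": "Levantine", "JO": "Levantine", "PS": "Levantine",
    "MA": "North African", "DZ": "North African",
    "TN": "North African", "LY": "North African",
}


def label_dialect(country_code: Optional[str]) -> Optional[str]:
    """Return the dialect group for a country code, or None if unknown."""
    if not country_code:
        return None
    return DIALECT_BY_COUNTRY.get(country_code.upper())


def label_tweets(tweets: Iterable[dict]) -> List[Tuple[str, str]]:
    """Pair each tweet's text with a dialect label based on sender location.

    Each tweet is a dict with 'text' and 'country_code' keys (a hypothetical
    schema). Tweets whose location is missing or unmapped are skipped.
    """
    labelled = []
    for tweet in tweets:
        dialect = label_dialect(tweet.get("country_code"))
        if dialect is not None:
            labelled.append((tweet["text"], dialect))
    return labelled
```

Labelled pairs of this kind could then be exported to a format that a toolkit such as WEKA can read and used as training data for a classifier; the export and classification steps are outside the scope of this sketch.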
Metadata
Item Type: | Article |
---|---|
Authors/Creators: | Alshutayri, AOO; Atwell, E |
Copyright, Publisher and Additional Information: | This is an open access article under the terms of the Creative Commons Attribution License (CC-BY). |
Keywords: | Dialectal Arabic; Phonological Variations; Social Media; Multi Dialect; Twitter; Tweet |
Dates: |
|
Institution: | The University of Leeds |
Academic Units: | The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds) |
Funding Information: | Funder: EPSRC; Grant number: EP/K015206/1 |
Depositing User: | Symplectic Publications |
Date Deposited: | 15 Jun 2017 09:23 |
Last Modified: | 05 Oct 2017 15:38 |
Published Version: | http://www.cscjournals.org/library/manuscriptinfo.... |
Status: | Published |
Publisher: | CSC Journals |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:117781 |