Compression versus traditional machine learning classifiers to detect code-switching in varieties and dialects: Arabic as a case study

Abstract

Code-switching in online communication, where a writer switches between multiple languages, presents a challenge for natural language processing tools, since they are designed for texts written in a single language. To address this challenge, this paper presents detailed research on ways to detect code-switching in Arabic text automatically. We compare the prediction by partial matching (PPM) compression-based classifier, implemented in Tawa, with a traditional machine learning classifier, sequential minimal optimization (SMO), implemented in the Waikato Environment for Knowledge Analysis (WEKA), working specifically on Arabic text taken from Facebook. Three experiments were conducted in order to: (1) detect code-switching between the Egyptian dialect and English; (2) detect code-switching among the Egyptian dialect, the Saudi dialect, and English; and (3) detect code-switching among the Egyptian dialect, the Saudi dialect, Modern Standard Arabic (MSA), and English. Our experiments showed that PPM achieved a higher accuracy rate than SMO, with 99.8% versus 97.5% in the first experiment and 97.8% versus 80.7% in the second. In the third experiment, PPM achieved a lower accuracy rate than SMO, with 53.2% versus 60.2%. Code-switching between Egyptian Arabic and English text is easiest to detect because Arabic and English are generally written in different character sets. It is more difficult to distinguish between Arabic dialects and MSA, as these use the same character set and most users of Arabic, especially Saudis and Egyptians, frequently mix MSA with their dialects. We also note that the MSA corpus used for training the MSA model may not represent MSA Facebook text well, as it was built from news websites. This paper also describes in detail the new Arabic corpora created for this research and our experiments.

Metadata

Item Type:	Article
Authors/Creators:	Tarmom, T (https://orcid.org/0000-0002-2834-461X); Teahan, W; Atwell, E (https://orcid.org/0000-0001-9395-3764); Alsalka, MA (https://orcid.org/0000-0003-3335-1918)
Copyright, Publisher and Additional Information:	© Cambridge University Press 2020. This article has been published in a revised form in Natural Language Engineering [https://doi.org/10.1017/S135132492000011X]. This version is free to view and download for private research and study only. Not for re-distribution, re-sale or use in derivative works. Uploaded in accordance with the publisher's self-archiving policy.
Keywords:	Arabic; Corpus linguistics; Language resources; Machine learning; Sublanguages and controlled languages; Text segmentation
Dates:	Accepted: 22 July 2019; Published (online): 5 May 2020; Published: November 2020
Institution:	The University of Leeds
Academic Units:	The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds)
Date Deposited:	22 Jan 2020 13:58
Last Modified:	27 Apr 2021 10:03
Status:	Published
Publisher:	Cambridge University Press
Identification Number:	10.1017/S135132492000011X
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:155881
