Alosaimy, A and Atwell, E orcid.org/0000-0001-9395-3764 (2018) Diacritization of a Highly Cited Text: A Classical Arabic Book as a Case. In: Proceedings of ASAR'2018 Arabic Script Analysis and Recognition. ASAR'2018 Arabic Script Analysis and Recognition, 12-14 Mar 2018, Alan Turing Institute, The British Library, London UK. IEEE, pp. 72-77.
Abstract
We present a robust and accurate method for diacritizing highly cited texts by automatically “borrowing” diacritization from similar contexts. The method has been tested on one book, “Riyad As-Salheen”, for the purpose of morphological annotation of the Sunnah Arabic Corpus. The original source of Riyad is about 48.66% diacritized; after borrowing diacritization, the percentage rises to 76.41% with a low diacritic error rate (DER=0.004), compared with 61.73% (DER=0.214) using the MADAMIRA toolkit and 67.68% (DER=0.006) using the Farasa toolkit. More importantly, the method reduces word ambiguity from 4.83 diacritized forms per word to 1.91.
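The idea of borrowing diacritization from similar contexts can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the one-word context window, the rule of borrowing only when a context yields a single diacritized form, and the token-level coverage measure are all assumptions made for illustration. The abstract does not state how its percentages or the DER are computed.

```python
import re

# Assumed diacritic inventory: U+064B (fathatan) through U+0652 (sukun).
DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_diacritics(text):
    """Return text with the assumed diacritic marks removed."""
    return DIACRITICS.sub("", text)


def context_key(bare, i, window):
    """Key a token by its bare form plus up to `window` bare neighbours per side."""
    left = tuple(bare[max(0, i - window):i])
    right = tuple(bare[i + 1:i + 1 + window])
    return left, bare[i], right


def build_context_index(source_tokens, window=1):
    """Map each context key to the set of diacritized forms seen in the source."""
    bare = [strip_diacritics(t) for t in source_tokens]
    index = {}
    for i, token in enumerate(source_tokens):
        if token != bare[i]:  # only tokens that carry diacritics can lend them
            index.setdefault(context_key(bare, i, window), set()).add(token)
    return index


def borrow(target_tokens, index, window=1):
    """Fill undiacritized tokens whose context maps to exactly one known form."""
    bare = [strip_diacritics(t) for t in target_tokens]
    result = []
    for i, token in enumerate(target_tokens):
        forms = index.get(context_key(bare, i, window), set())
        if token == bare[i] and len(forms) == 1:
            result.append(next(iter(forms)))  # unambiguous: borrow the form
        else:
            result.append(token)  # already diacritized, unseen, or ambiguous
    return result


def diacritized_share(tokens):
    """Hypothetical coverage measure: share of tokens with at least one diacritic."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t != strip_diacritics(t)) / len(tokens)


if __name__ == "__main__":
    source = "قَالَ رَسُولُ اللَّهِ".split()   # diacritized occurrence of a phrase
    target = "قال رسول الله".split()          # same phrase without diacritics
    filled = borrow(target, build_context_index(source))
    print(diacritized_share(target), diacritized_share(filled))
```

Leaving a token untouched when its context maps to more than one diacritized form is a conservative choice in this sketch: it gives up some coverage in order to avoid introducing diacritic errors.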
Metadata
Item Type: | Proceedings Paper |
---|---|
Authors/Creators: | Alosaimy, A; Atwell, E (orcid.org/0000-0001-9395-3764) |
Copyright, Publisher and Additional Information: | © 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. |
Keywords: | diacritization; Arabic; NLP; Sunnah; Riyad As-Salheen |
Dates: | |
Institution: | The University of Leeds |
Academic Units: | The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds) |
Depositing User: | Symplectic Publications |
Date Deposited: | 16 Mar 2018 10:43 |
Last Modified: | 28 Mar 2018 20:13 |
Status: | Published |
Publisher: | IEEE |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:128591 |