Alrabiah, M, Al-Salman, A and Atwell, ES (2013) The design and construction of the 50 million words KSUCCA. In: Proceedings of WACL’2 Second Workshop on Arabic Corpus Linguistics. The WACL’2 Second Workshop on Arabic Corpus Linguistics, 22 Jul 2013, Lancaster University, UK. The University of Leeds, 5-8.
Abstract
In this paper, we report the design and construction of the King Saud University Corpus of Classical Arabic (KSUCCA), which is part of ongoing research that attempts to study the meanings of words used in the holy Quran through analysis of their distributional semantics in contemporaneous texts. The holy Quranic text was revealed in pure Classical Arabic, which forms the basis of Arabic linguistic theory and which is well understood by the educated Arabic reader. Therefore, it is necessary to investigate the distributional lexical semantics of the Quran's words in the light of similar texts (a corpus) written in pure Classical Arabic. To the best of our knowledge, there exist only two corpora of Classical Arabic: one is part of the King Abdulaziz City for Science and Technology Arabic Corpus (KACST Arabic Corpus) and the other is the Classical Arabic Corpus (CAC) (Elewa, 2009). However, neither of the two corpora is adequate for our research. The former does not cover many genres, such as Linguistics, Literature, Science, Sociology and Biography, and with more than 17 million words it is not very large. The latter is even smaller, with only 5 million words. Therefore, we made an effort to carefully design and compose our own corpus, bearing in mind that it should be large enough, balanced, and representative so that any result obtained from it can be generalized to Classical Arabic. In addition, we tried to make the design general enough that the corpus is also appropriate for other research.
Metadata
Item Type: | Proceedings Paper |
---|---|
Authors/Creators: | Alrabiah, M; Al-Salman, A; Atwell, ES |
Copyright, Publisher and Additional Information: | Alrabiah, M, Al-Salman, A and Atwell, ES (c) 2013, University of Leeds. Reproduced with permission from the copyright holders. |
Dates: | |
Institution: | The University of Leeds |
Academic Units: | The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds) > Artificial Intelligence & Biological Systems (Leeds) |
Depositing User: | Symplectic Publications |
Date Deposited: | 10 Dec 2014 11:20 |
Last Modified: | 16 Jan 2018 19:35 |
Published Version: | http://www.comp.leeds.ac.uk/eric/wacl/wacl2proceed... |
Status: | Published |
Publisher: | The University of Leeds |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:81860 |