Altammami, S orcid.org/0000-0002-3801-8236, Atwell, E and Alsalka, A (2020) Constructing a Bilingual Hadith Corpus Using a Segmentation Tool. In: Proceedings of The 12th Language Resources and Evaluation Conference. LREC 2020, 11-16 May 2020, Marseille, France. The European Language Resources Association (ELRA), pp. 3390-3398. ISBN 979-10-95546-34-4
Abstract
This article describes the process of gathering and constructing a bilingual parallel corpus of Islamic Hadith, the set of narratives reporting different aspects of the Prophet Muhammad’s life. The corpus data is gathered from the six canonical Hadith collections using a custom segmentation tool that automatically segments and annotates the two Hadith components with 92% accuracy. This Hadith segmenter minimises the cost of language resource creation and produces consistent results, independent of the prior knowledge and experience that usually influence human annotators. The corpus includes more than 10M tokens and will be freely available via the LREC repository.
Metadata
Item Type: | Proceedings Paper |
---|---|
Authors/Creators: | Altammami, S; Atwell, E; Alsalka, A |
Copyright, Publisher and Additional Information: | © The European Language Resources Association (ELRA), 2020. This is an open access article under the terms of the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/) |
Dates: |
|
Institution: | The University of Leeds |
Academic Units: | The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds) |
Depositing User: | Symplectic Publications |
Date Deposited: | 09 Jul 2020 14:06 |
Last Modified: | 25 Jun 2023 22:20 |
Published Version: | https://www.aclweb.org/anthology/2020.lrec-1.0/ |
Status: | Published |
Publisher: | The European Language Resources Association (ELRA) |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:163052 |