Alahmari, S. orcid.org/0009-0002-6490-3295 (2025) SADSLyC: A Corpus for Saudi Arabian Multi-dialect Identification through Song Lyrics. In: Proceedings of the 4th Workshop on Arabic Corpus Linguistics (WACL-4). The 31st International Conference on Computational Linguistics (COLING 2025), 19-24 Jan 2025, Abu Dhabi, UAE. The Association for Computational Linguistics, pp. 38-43. ISBN 979-8-89176-220-6
Abstract
This paper presents the Saudi Arabian Dialects Song Lyrics Corpus (SADSLyC), the first dataset featuring song lyrics from the five major Saudi dialects: Najdi (Central Region), Hijazi (Western Region), Shamali (Northern Region), Janoubi (Southern Region), and Shargawi (Eastern Region). The dataset consists of 31,358 sentences, with each sentence representing a self-contained verse in a song, totaling 151,841 words. Additionally, we present a baseline experiment using the SaudiBERT model to classify the fine-grained dialects in the SADSLyC Corpus. The model achieved an overall accuracy of 73% on the test dataset.
Metadata
Item Type: | Proceedings Paper |
---|---|
Authors/Creators: | Alahmari, S. orcid.org/0009-0002-6490-3295 |
Copyright, Publisher and Additional Information: | ACL materials are Copyright © 1963–2025 ACL; other materials are copyrighted by their respective copyright holders. Materials prior to 2016 here are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 International License. Permission is granted to make copies for the purposes of teaching and research. Materials published in or after 2016 are licensed on a Creative Commons Attribution 4.0 International License. |
Institution: | The University of Leeds |
Academic Units: | The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds) |
Depositing User: | Symplectic Publications |
Date Deposited: | 06 Feb 2025 11:29 |
Last Modified: | 06 Feb 2025 11:30 |
Published Version: | https://aclanthology.org/2025.wacl-1.4/ |
Status: | Published |
Publisher: | The Association for Computational Linguistics |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:222902 |
Download
Filename: SADSLyC A Corpus for Saudi Arabian Multi-dialect Identification through Song Lyrics.pdf
Licence: CC-BY 4.0