Choudhry, O.S. (ORCID: 0000-0003-4434-3550), Odida, P.O., Reiner, J. et al. (3 more authors) (2022) Data Collection and Analysis of French Dialects. [Preprint - arXiv]
Abstract
This paper describes the creation and analysis of a new dataset for data mining and text analytics research, contributing to a joint Leeds University research project for the Corpus of National Dialects. It investigates machine learning classifiers for classifying samples of French dialect text from various French-speaking countries. Following the steps of the CRISP-DM methodology, it explores the data collection process, data quality issues and data conversion for text analysis. Finally, after suitable data mining techniques are applied, it discusses the evaluation methods, the best overall features and classifiers, and the conclusions.
Metadata
Item Type: | Preprint |
---|---|
Copyright, Publisher and Additional Information: | This is an open access preprint under the terms of the Creative Commons Attribution License (CC-BY 4.0), which permits unrestricted use, distribution and reproduction in any medium, provided the original work is properly cited. |
Institution: | The University of Leeds |
Academic Units: | The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds) |
Depositing User: | Symplectic Publications |
Date Deposited: | 11 Mar 2025 11:34 |
Last Modified: | 11 Mar 2025 11:34 |
Published Version: | https://arxiv.org/abs/2208.00752 |
Identification Number: | 10.48550/arXiv.2208.00752 |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:224286 |