Lam, C., Lau, C.M. and Lee, J.L. (2024) Multi-Tiered Cantonese Word Segmentation. In: Calzolari, N., Kan, M-Y., Hoste, V., Lenci, A., Sakti, S. and Xue, N. (eds.) Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 20-25 May 2024, Torino, Italia. ELRA and ICCL, pp. 11993-12002. ISBN: 978-2-493814-10-4.
Abstract
Word segmentation for Chinese text data is essential for compiling corpora and any other tasks where the notion of “word” is assumed, since Chinese orthography does not have conventional word boundaries as languages such as English do. A perennial issue, however, is that there is no consensus about the definition of “word” in Chinese, which makes word segmentation challenging. Recent work in Chinese word segmentation has begun to embrace the idea of multiple word segmentation possibilities. In a similar spirit, this paper focuses on Cantonese, another major Chinese variety. We propose a linguistically motivated, multi-tiered word segmentation system for Cantonese, and release a Cantonese corpus of 150,000 characters word-segmented by this proposal. Our work will be of interest to researchers whose work involves Cantonese corpus data.
Metadata
| Item Type: | Proceedings Paper |
|---|---|
| Authors/Creators: | Lam, C., Lau, C.M. and Lee, J.L. |
| Editors: | Calzolari, N., Kan, M-Y., Hoste, V., Lenci, A., Sakti, S. and Xue, N. |
| Copyright, Publisher and Additional Information: | © 2024 ELRA Language Resource Association: CC BY-NC 4.0 |
| Keywords: | Word Segmentation, Cantonese |
| Dates: | |
| Institution: | The University of Leeds |
| Academic Units: | The University of Leeds > Faculty of Arts, Humanities and Cultures (Leeds) > School of Languages Cultures & Societies (Leeds) |
| Date Deposited: | 19 Jan 2026 13:00 |
| Last Modified: | 19 Jan 2026 13:08 |
| Published Version: | https://aclanthology.org/2024.lrec-main.1047/ |
| Status: | Published |
| Publisher: | ELRA and ICCL |
| Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:236125 |
Download
Filename: Multi-Tiered Cantonese Word Segmentation.pdf
Licence: CC-BY-NC 4.0