Lee, J.L., Chen, L., Lam, C. et al. (2 more authors) (2022) PyCantonese: Cantonese Linguistics and NLP in Python. In: Calzolari, N., Béchet, F., Blache, P., Choukri, K., Cieri, C., Declerck, T., Goggi, S., Isahara, H., Maegaard, B., Mariani, J., Mazo, H., Odijk, J. and Piperidis, S., (eds.) Proceedings of the Thirteenth Language Resources and Evaluation Conference. Thirteenth Language Resources and Evaluation Conference (LREC 2022), 20-25 Jun 2022, Marseille, France. European Language Resources Association, pp. 6607-6611. ISBN: 979-10-95546-72-6.
Abstract
This paper introduces PyCantonese, an open-source Python library for Cantonese linguistics and natural language processing. After introducing the library's design, implementation, corpus data format, and key included datasets, the paper gives an overview of the currently implemented functionality: stop words, handling of Jyutping romanization, word segmentation, part-of-speech tagging, and parsing of Cantonese text.
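As an illustration of what "handling Jyutping romanization" can involve, the following is a minimal standard-library sketch that splits a Jyutping string into syllables and decomposes each into onset, nucleus, coda, and tone. It is not PyCantonese's implementation or API: the names `Syllable`, `parse_syllable`, and `parse_jyutping_string` are hypothetical, and the onset, nucleus, and coda inventories are a simplified assumption based on the standard Jyutping scheme.

```python
import re
from typing import List, NamedTuple


class Syllable(NamedTuple):
    """Hypothetical container for one parsed Jyutping syllable."""
    onset: str
    nucleus: str
    coda: str
    tone: str


# Regular syllable: optional onset, vowel nucleus, optional coda, tone 1-6.
# The inventories below are an assumed, simplified Jyutping inventory.
_REGULAR = re.compile(
    r"(?P<onset>ng|gw|kw|[bpmfdtnlgkhwzcsj])?"
    r"(?P<nucleus>aa|oe|eo|yu|[aeiou])"
    r"(?P<coda>ng|[iumnptk])?"
    r"(?P<tone>[1-6])"
)

# Syllabic nasals such as "m4" and "ng5", optionally preceded by "h".
_SYLLABIC_NASAL = re.compile(r"(?P<onset>h)?(?P<nucleus>ng|m)(?P<tone>[1-6])")


def parse_syllable(syllable: str) -> Syllable:
    """Decompose one Jyutping syllable; raise ValueError if it is invalid."""
    match = _REGULAR.fullmatch(syllable) or _SYLLABIC_NASAL.fullmatch(syllable)
    if match is None:
        raise ValueError(f"not a valid Jyutping syllable: {syllable!r}")
    groups = match.groupdict()
    return Syllable(
        onset=groups.get("onset") or "",
        nucleus=groups["nucleus"],
        coda=groups.get("coda") or "",
        tone=groups["tone"],
    )


def parse_jyutping_string(text: str) -> List[Syllable]:
    """Split an unspaced Jyutping string at tone digits and parse each part."""
    parts = re.findall(r"[a-z]+[1-6]", text)
    # Reject input that contains anything other than well-formed chunks.
    if "".join(parts) != text:
        raise ValueError(f"cannot split into syllables: {text!r}")
    return [parse_syllable(part) for part in parts]


if __name__ == "__main__":
    for syl in parse_jyutping_string("gwong2dung1waa2"):
        print(syl)
```

Splitting at tone digits works because every Jyutping syllable ends in exactly one tone number, so no dictionary lookup is needed for this step. Tasks such as word segmentation and part-of-speech tagging, by contrast, need trained models or corpus data and are not covered by this sketch.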
Metadata
| Item Type: | Proceedings Paper |
|---|---|
| Authors/Creators: | |
| Editors: | |
| Copyright, Publisher and Additional Information: | © European Language Resources Association (ELRA), licensed under CC-BY-NC-4.0 |
| Keywords: | Cantonese, Jyutping, word segmentation, part-of-speech tagging, stop words |
| Dates: | |
| Institution: | The University of Leeds |
| Academic Units: | The University of Leeds > Faculty of Arts, Humanities and Cultures (Leeds) > School of Languages Cultures & Societies (Leeds) |
| Date Deposited: | 19 Jan 2026 14:14 |
| Last Modified: | 19 Jan 2026 14:14 |
| Published Version: | https://aclanthology.org/2022.lrec-1.711/ |
| Status: | Published |
| Publisher: | European Language Resources Association |
| Related URLs: | |
| Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:236126 |
Download
Filename: PyCantonese Cantonese Linguistics and NLP in Python.pdf
Licence: CC-BY-NC 4.0
CORE (COnnecting REpositories)