Lam, C., Lau, C.M. and Lee, J. L. (Cover date: 2026) Chinese Language Corpora. UNSPECIFIED. Elsevier ISBN 9780443157851
Abstract
This entry outlines common features and characteristics of corpora in Chinese languages. It discusses the diversity within the Sinitic language family, emphasizing the differences between Mandarin and other major varieties such as Cantonese, Hokkien, and Hakka. The role of orthographic variations, particularly the distinction between Traditional and Simplified characters, is explored in the context of corpus construction and annotation. This entry highlights the availability of contemporary and historical corpora and addresses both theoretical and technical challenges in developing Chinese corpora, including word segmentation, diglossia, and encoding standards. Practical aspects such as transcription, text normalization, and data collection are also examined. The discussion underscores the specific challenges posed by linguistic and orthographical features in Chinese corpora and how they necessitate unique computational solutions. This overview serves as a guide for readers seeking an introduction to corpus studies of Chinese languages.
Metadata
| Item Type: | Monograph |
|---|---|
| Authors/Creators: | Lam, C., Lau, C.M. and Lee, J. L. |
| Dates: | |
| Institution: | The University of Leeds |
| Academic Units: | The University of Leeds > Faculty of Arts, Humanities and Cultures (Leeds) > School of Languages Cultures & Societies (Leeds) |
| Date Deposited: | 02 Jun 2026 16:04 |
| Last Modified: | 02 Jun 2026 16:05 |
| Published Version: | https://www.sciencedirect.com/science/chapter/refe... |
| Status: | Published |
| Publisher: | Elsevier |
| Identification Number: | 10.1016/B978-0-323-95504-1.01495-2 |
| Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:241502 |
CORE (COnnecting REpositories)