Atwell, ES (2004) Clustering of word types and unification of word tokens into grammatical word-classes. In: Bel, B and Marlien, I, (eds.) Proceedings of TALN04: XI Conference sur le Traitement Automatique des Langues Naturelles. TALN04: XI Conference sur le Traitement Automatique des Langues Naturelles, 19-22 Apr 2004, Fez, Morocco. ATALA, pp. 27-32. ISBN 2-9518233-4-7
Abstract
This paper discusses Neoposy: unsupervised inference of grammatical word-classes in Natural Language. Grammatical inference can be divided into inference of grammatical word-classes and inference of structure. We review the background of supervised learning of Part-of-Speech tagging, and discuss the adaptation of the three main types of Part-of-Speech tagger to unsupervised inference of grammatical word-classes. Statistical N-gram taggers suggest a statistical clustering approach, but clustering does not help with low-frequency word-types, or with the many word-types which can appear in more than one grammatical category. The alternative Transformation-Based Learning tagger suggests a constraint-based approach of unification of word-token contexts. This offers a way to group together low-frequency word-types, and allows different tokens of one word-type to belong to different categories according to the grammatical contexts they appear in. However, simple unification of word-token contexts yields an implausibly large number of Part-of-Speech categories; we have attempted to merge more categories by "relaxing" the matching of contexts to allow unification of word-categories as well as word-tokens, but this results in spurious unifications. We conclude that the way ahead may be a hybrid involving clustering of frequent word-types, unification of word-token contexts, and "seeding" with limited linguistic knowledge. We call for a programme of further research to develop a Language Discovery Toolkit.
Metadata
Item Type: | Proceedings Paper |
---|---|
Authors/Creators: | Atwell, ES |
Editors: | Bel, B; Marlien, I |
Keywords: | Corpus; Part-of-Speech tagging; clustering; unification; word classes; type/token; evaluation |
Dates: | 2004 |
Institution: | The University of Leeds |
Academic Units: | The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds) > Artificial Intelligence & Biological Systems (Leeds) |
Depositing User: | Symplectic Publications |
Date Deposited: | 15 Jan 2015 12:54 |
Last Modified: | 19 Dec 2022 13:29 |
Status: | Published |
Publisher: | ATALA |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:82297 |