Hughes, J and Atwell, ES (1994) A methodical approach to word class formation using automatic evaluation. In: Evett, L and Rose, T, (eds.) Proceedings of the 1994 AISB Workshop on Computational Linguistics for Speech and Handwriting Recognition. 1994 AISB Workshop on Computational Linguistics for Speech and Handwriting Recognition, 12 April 1994, University of Leeds, UK. AISB, 41-48.
Abstract
Automatic inference of a classification of words has been carried out by several researchers recently. Although they use a variety of methods, they all exploit the statistical redundancy inherent in the structure of language to differentiate words, the assumption being that words with similar roles occur in measurably similar contexts. This paper describes a general method by which clustering schemes can be qualitatively compared. This allows a systematic approach to finding the best word class formation scheme to be adopted. The process by which words are automatically grouped into classes involves a number of decision points. These include: the contextual pattern in the language being measured; the metric by which words are compared according to the pattern; and the mechanism by which items judged to be similar are merged. Alternatives are presented for each of these factors. The experiments rated each combination so that the most successful approach can be found. Previously, researchers relied on a looks-good-to-me method of self-evaluation to judge the quality of their derived word classifications. This paper directly compares some of their adopted approaches with alternative clustering schemes not previously attempted. This allows us to formally demonstrate when our approach to clustering is more successful. The evaluation method is also shown to be a valuable aid to highlighting approaches that are inefficient. Amongst the patterns investigated was the morphological context supplied by the previous words. Bigram counts of the collocation of the words to be clustered with the last three letters of the word immediately before were found to be a remarkably good differentiation criterion. The evaluation method demonstrated that the context of the last three letters (which on average contain a lot of morphological information in English) is even better than the context supplied by using the whole of the previous word in collocation counts.
Results such as this should prove useful to handwriting recognition research. The authors believe this method provides a sensible first step for handwriting recognition researchers who wish to use statistical models of language to aid the disambiguation process; proposed contextual models can be evaluated relative to previously investigated models to indicate the likely success rate of employing them. This allows poor proposed disambiguation methods to be ruled out early on, saving valuable time and resources. We end by considering some further applications of automatic word class formation techniques. Although our experiments are exclusively with English corpus text, the general clustering and word-classifying algorithms should be applicable to text in other languages. This is likely to be particularly useful in the development of linguistic engineering technologies for emerging nations and their mother tongues, which have little or no computational linguistics resources or computational linguists to "hand-craft" them.
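The three decision points named in the abstract (contextual pattern, comparison metric, merging mechanism) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The contextual pattern is the one the abstract reports as effective: counts of the last three letters of the immediately preceding word. The cosine metric, the choice of the single most similar pair as a merge candidate, the function names and the toy sentence are all assumptions made for illustration; the abstract does not specify which metrics or merging mechanisms were compared.

```python
from collections import Counter, defaultdict
from math import sqrt


def suffix_context_vectors(tokens, targets, n=3):
    """For each target word, count the last-n-letter suffixes of the word before it."""
    vectors = defaultdict(Counter)
    for prev, word in zip(tokens, tokens[1:]):
        if word in targets:
            vectors[word][prev[-n:]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (an assumed metric)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def most_similar_pair(vectors):
    """Return the two words with the most similar contexts (an assumed merge step)."""
    words = sorted(vectors)
    pairs = [(a, b) for i, a in enumerate(words) for b in words[i + 1:]]
    return max(pairs, key=lambda p: cosine(vectors[p[0]], vectors[p[1]]))


# Toy corpus, invented for this example only.
tokens = ("she was walking home and he was running home "
          "while they were walking away").split()
vectors = suffix_context_vectors(tokens, {"walking", "running", "home", "away"})
pair = most_similar_pair(vectors)
print(pair)
```

On this toy input, "home" and "away" are both preceded only by words ending in "ing", so they form the first merge candidate. Swapping `cosine` or `most_similar_pair` for an alternative is the kind of variation the evaluation method described in the abstract is meant to compare.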
Metadata
Item Type: | Proceedings Paper |
---|---|
Authors/Creators: | Hughes, J; Atwell, ES |
Editors: | Evett, L; Rose, T |
Copyright, Publisher and Additional Information: | Hughes, J and Atwell, ES (c) 1994, University of Leeds. Reproduced with permission from the copyright holders. |
Dates: | 1994 |
Institution: | The University of Leeds |
Academic Units: | The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds) > Artificial Intelligence & Biological Systems (Leeds) |
Depositing User: | Symplectic Publications |
Date Deposited: | 12 Jan 2015 11:04 |
Last Modified: | 26 Apr 2015 18:21 |
Published Version: | http://www.aisb.org.uk/ |
Status: | Published |
Publisher: | AISB |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:82271 |