AMALGAM: automatic mapping among lexicogrammatical annotation models

Atwell, ES, Hughes, J and Souter, DC (1994) AMALGAM: automatic mapping among lexicogrammatical annotation models. In: Klavans, J, (ed.) The Balancing Act: Combining Symbolic and Statistical Approaches to Language - Proceedings of the ACL Workshop. The Balancing Act: Combining Symbolic and Statistical Approaches to Language, 01 Jul 1994, New Mexico State University Las Cruces, New Mexico, USA. Association for Computational Linguistics, 21 - 20.

Abstract

Several Corpus Linguistics research groups have gone beyond collation of 'raw' text, to syntactic annotation of the text. However, linguists developing these linguistic resources have used quite different wordtagging and parse-tree labelling schemes in each of these annotated corpora. This restricts the accessibility of each corpus, making it impossible for speech and handwriting researchers to collate them into a single very large training set. This is particularly problematic as there is evidence that one of these parsed corpora on its own is too small for a general statistical model of grammatical structure, but the combined size of all the above annotated corpora should deliver a much more reliable model. We are developing a set of mapping algorithms to map between the main tagsets and phrase structure grammar schemes used in the above corpora. We plan to develop a Multi-tagged Corpus and a MultiTreebank, a single text-set annotated with all the above tagging and parsing schemes. The text-set is the Spoken English Corpus: this is a half-way house between formal written text and colloquial conversational speech. However, the main deliverable to the computational linguistics research community is not the SEC-based MultiTreebank, but the mapping suite used to produce it - this can be used to combine currently-incompatible syntactic training sets into a large unified multicorpus. Our architecture combines standard statistical language modelling and a rule-base derived from linguists' analyses of tagset-mappings, in a novel yet intuitive way. Our development of the mapping algorithms aims to distinguish notational from substantive differences in the annotation schemes, and we will be able to evaluate tagging schemes in terms of how well they fit standard statistical language models such as n-pos (Markov) models.
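As an illustration only, the idea of pairing a rule-base of tagset mappings with an n-pos (Markov) model can be sketched as below. This is a minimal sketch and not the AMALGAM implementation: the tag names, the RULES table, the BIGRAM probabilities, the FLOOR smoothing value and the map_tags function are all hypothetical and invented for this example. The rule-base proposes one or more candidate target tags for each source tag, and a bigram model over target tags picks the most probable sequence with a Viterbi search.

```python
from math import log

# Hypothetical rule-base: each source-scheme tag maps to one or more
# candidate tags in the target scheme. These names are invented and do
# not come from any real tagset.
RULES = {
    "NN": ["N"],
    "VB": ["V"],
    "DT": ["DET"],
    "TO": ["PREP", "INF"],  # ambiguous: the rule-base alone cannot decide
}

# Hypothetical bigram (2-pos Markov) probabilities over target tags.
BIGRAM = {
    ("<s>", "V"): 0.3,
    ("V", "PREP"): 0.3,
    ("V", "INF"): 0.4,
    ("INF", "V"): 0.9,
    ("PREP", "V"): 0.01,
    ("PREP", "DET"): 0.6,
    ("DET", "N"): 0.8,
}
FLOOR = 1e-6  # assumed probability for tag pairs missing from BIGRAM


def map_tags(source_tags):
    """Return the most probable target-tag sequence for source_tags.

    Candidates come from RULES; the choice among them is made by a
    Viterbi search under the bigram model.
    """
    # best maps the last target tag to (log-probability, path so far).
    best = {"<s>": (0.0, [])}
    for src in source_tags:
        nxt = {}
        for cand in RULES[src]:
            # Pick the best predecessor for this candidate tag.
            score, path = max(
                (
                    (prev_score + log(BIGRAM.get((prev, cand), FLOOR)), prev_path)
                    for prev, (prev_score, prev_path) in best.items()
                ),
                key=lambda item: item[0],
            )
            nxt[cand] = (score, path + [cand])
        best = nxt
    return max(best.values(), key=lambda item: item[0])[1]


if __name__ == "__main__":
    print(map_tags(["VB", "TO", "VB"]))
    print(map_tags(["VB", "TO", "DT", "NN"]))
```

In this toy setting the ambiguous source tag "TO" resolves differently depending on its neighbours, which is the kind of decision a rule-base alone leaves open and a statistical model can settle.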

Metadata

Item Type:	Proceedings Paper
Authors/Creators:	Atwell, ES; Hughes, J; Souter, DC
Editors:	Klavans, J
Copyright, Publisher and Additional Information:	Atwell, ES, Hughes, J and Souter, DC (c) 1994, University of Leeds. Reproduced with permission from the copyright holders.
Dates:	Published: 1994
Institution:	The University of Leeds
Academic Units:	The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds) > Artificial Intelligence & Biological Systems (Leeds)
Depositing User:	Symplectic Publications
Date Deposited:	25 Nov 2014 11:28
Last Modified:	26 Jan 2018 02:48
Published Version:	https://www.aclweb.org/anthology/W/W94/
Status:	Published
Publisher:	Association for Computational Linguistics
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:81160
