Atwell, ES, Hughes, J and Souter, DC (1994) A unified multicorpus for training syntactic constraint models. In: Evett, L and Rose, T (eds.) Workshop on Computational Linguistics for Speech and Handwriting Recognition. 1994 AISB Workshop on Computational Linguistics for Speech and Handwriting Recognition, 12 April 1994, University of Leeds, UK. AISB, 111-118.
Abstract
Tagged and parsed corpora (LOB, Brown, London-Lund, ICE, Lancaster-IBM, PoW, Nijmegen, UPenn, BNC, etc.) are used as training data for statistical syntactic constraint models, which improve recognition accuracy in speech and handwriting recognisers. However, the linguists who developed these resources used quite different wordtagging and parse-tree labelling schemes in each annotated corpus. This restricts the accessibility of each corpus and makes it impossible for speech and handwriting researchers to collate them into a single very large training set. This is particularly problematic because there is evidence that any one of these parsed corpora is too small on its own for a general statistical model of higher-level syntactic structure, whereas the combined size of all the above annotated corpora should deliver a much more reliable model. We are developing a set of mapping algorithms to map between the main tagsets and phrase structure grammar schemes used in the above corpora. We will develop a Multi-tagged Corpus and a MultiTreebank: a single text-set annotated with all the above tagging and parsing schemes. The text-set is the Spoken English Corpus (SEC); this is already annotated with two syntax schemes, and we plan to have added at least one more by the AISB Workshop. However, the main deliverable to the speech and handwriting research community is not the SEC-based MultiTreebank but the mapping suite used to produce it, which can be used to combine currently incompatible syntactic training sets into a large unified multicorpus. Our development of the mapping algorithms aims to distinguish notational from substantive differences in the annotation schemes, and we will be able to evaluate tagging schemes in terms of how well they fit standard statistical language models such as n-pos (Markov) models.
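As a rough illustration of the two ideas in the abstract (mapping one tagset onto another, then fitting an n-pos model to the merged tags), the following minimal Python sketch may help. It is not the authors' mapping suite: the table `TAG_MAP`, the tag labels in it, the fallback tag `UNK`, and the function names are all hypothetical, and the bigram model with add-one smoothing is just one simple choice of n-pos model.

```python
from collections import Counter

# Hypothetical many-to-one lookup from a source tagset to a target tagset.
# The entries are illustrative only and are not taken from the paper.
TAG_MAP = {"AT": "DT", "ATI": "DT", "NN": "NN", "NNS": "NNS", "VB": "VB"}


def map_tags(tagged_words, tag_map, unknown="UNK"):
    """Relabel (word, tag) pairs; tags missing from the table become `unknown`."""
    return [(word, tag_map.get(tag, unknown)) for word, tag in tagged_words]


def bigram_counts(tag_sequences):
    """Count tag bigrams and their left-hand contexts over tag sequences."""
    contexts, bigrams = Counter(), Counter()
    for tags in tag_sequences:
        padded = ["<s>"] + list(tags)  # "<s>" marks the sentence start
        for prev, cur in zip(padded, padded[1:]):
            contexts[prev] += 1
            bigrams[(prev, cur)] += 1
    return contexts, bigrams


def bigram_prob(prev, cur, contexts, bigrams, vocab_size):
    """Estimate P(cur | prev) with add-one smoothing."""
    return (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)


if __name__ == "__main__":
    mapped = map_tags([("the", "ATI"), ("dog", "NN")], TAG_MAP)
    contexts, bigrams = bigram_counts([[tag for _, tag in mapped]])
    print(mapped)
    print(bigram_prob("DT", "NN", contexts, bigrams, vocab_size=3))
```

A flat lookup table only covers differences that are purely notational. Where two schemes draw substantively different distinctions, a mapping would need context-sensitive rules, which this sketch does not attempt.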
Metadata
Item Type: | Proceedings Paper |
---|---|
Authors/Creators: | Atwell, ES; Hughes, J; Souter, DC |
Editors: | Evett, L; Rose, T |
Copyright, Publisher and Additional Information: | Atwell, ES, Hughes, J and Souter, DC (c) 1994, University of Leeds. Reproduced with permission from the copyright holders. |
Dates: | 1994 |
Institution: | The University of Leeds |
Academic Units: | The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds) > Artificial Intelligence & Biological Systems (Leeds) |
Depositing User: | Symplectic Publications |
Date Deposited: | 12 Jan 2015 12:12 |
Last Modified: | 20 Jan 2018 20:39 |
Published Version: | http://www.aisb.org.uk/ |
Status: | Published |
Publisher: | AISB |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:82273 |