Brierley, C and Atwell, ES (2010) ProPOSEC: A Prosody and PoS Annotated Spoken English Corpus. In: Calzolari, N, Choukri, K, Maegaard, B, Mariani, J, Piperidis, S, Rosner, M and Tapias, D (eds.) Proceedings of LREC'2010 Language Resources and Evaluation Conference. LREC'2010 Language Resources and Evaluation Conference, 17-23 May 2010, Malta, pp. 1266-1270. ISBN 2-9517408-6-7
Abstract
We have previously reported on ProPOSEL, a purpose-built Prosody and PoS English Lexicon compatible with the Python Natural Language ToolKit. ProPOSEC is a new corpus research resource built using this lexicon, intended for distribution with the Aix-MARSEC dataset. ProPOSEC comprises multi-level parallel annotations, juxtaposing prosodic and syntactic information from different versions of the Spoken English Corpus, with canonical dictionary forms, in a query format optimized for Perl, Python, and text processing programs. The order and content of fields in the text file are as follows: (1) Aix-MARSEC file number; (2) word; (3) LOB PoS-tag; (4) C5 PoS-tag; (5) Aix SAM-PA phonetic transcription; (6) SAM-PA phonetic transcription from ProPOSEL; (7) syllable count; (8) lexical stress pattern; (9) default content or function word tag; (10) DISC stressed and syllabified phonetic transcription; (11) alternative DISC representation, incorporating lexical stress pattern; (12) nested arrays of phonemes and tonic stress marks from Aix. As an experimental dataset, ProPOSEC can be used to study correlations between these annotation tiers, where significant findings are then expressed as additional features for phrasing models integral to Text-to-Speech and Speech Recognition. As a training set, ProPOSEC can be used for machine learning tasks in Information Retrieval and Speech Understanding systems.
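The twelve-field layout listed in the abstract could be read into named records along the following lines. This is a minimal sketch, not the corpus's own tooling: the abstract does not state the field delimiter, so the `|` separator is an assumption, and the field identifiers and the `parse_record` helper are hypothetical names chosen here for illustration.

```python
from collections import namedtuple

# Field order follows the abstract; the identifiers are hypothetical labels,
# not names defined by the corpus itself.
FIELD_NAMES = (
    "file_number",            # (1) Aix-MARSEC file number
    "word",                   # (2) word
    "lob_tag",                # (3) LOB PoS-tag
    "c5_tag",                 # (4) C5 PoS-tag
    "aix_sampa",              # (5) Aix SAM-PA transcription
    "proposel_sampa",         # (6) SAM-PA transcription from ProPOSEL
    "syllable_count",         # (7) syllable count
    "stress_pattern",         # (8) lexical stress pattern
    "content_function_tag",   # (9) default content or function word tag
    "disc",                   # (10) DISC stressed, syllabified transcription
    "disc_with_stress",       # (11) alternative DISC representation
    "aix_phonemes_and_tonic", # (12) nested arrays from Aix
)

Record = namedtuple("Record", FIELD_NAMES)


def parse_record(line, delimiter="|"):
    """Split one line into the twelve fields (the delimiter is an assumption)."""
    # Limit the number of splits so that the final field, which holds nested
    # arrays, is kept whole even if it contains the delimiter character.
    parts = line.rstrip("\n").split(delimiter, len(FIELD_NAMES) - 1)
    if len(parts) != len(FIELD_NAMES):
        raise ValueError(
            f"expected {len(FIELD_NAMES)} fields, found {len(parts)}"
        )
    return Record(*parts)
```

With records in this form, a query such as "all words with a given C5 tag" reduces to filtering on `record.c5_tag`. All values are kept as strings; converting the syllable count to an integer is left to the caller, since the exact encoding of that field is not given here.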
Metadata
Item Type: | Proceedings Paper |
---|---|
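Authors/Creators: | Brierley, C and Atwell, ES |
Editors: | Calzolari, N, Choukri, K, Maegaard, B, Mariani, J, Piperidis, S, Rosner, M and Tapias, D |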
Copyright, Publisher and Additional Information: | Published with the permission of ELRA. This paper was published within the proceedings of the LREC 2010 Conference. © 2010 ELRA - European Language Resources Association. All rights reserved. |
Institution: | The University of Leeds |
Academic Units: | The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds) > Artificial Intelligence & Biological Systems (Leeds) |
Depositing User: | Symplectic Publications |
Date Deposited: | 15 Dec 2014 16:08 |
Last Modified: | 19 Dec 2022 13:29 |
Published Version: | http://www.lrec-conf.org/proceedings/lrec2010/pdf/... |
Status: | Published |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:81706 |