Gales, M.J.F., Knill, K.M. and Ragni, A. orcid.org/0000-0003-0634-4456 (2015) Unicode-based graphemic systems for limited resource languages. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). ICASSP 2015 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 19-24 Apr 2015, Brisbane, QLD, Australia. IEEE ISBN 9781467369978
Abstract
Large vocabulary continuous speech recognition systems require a mapping from words, or tokens, into sub-word units to enable robust estimation of acoustic model parameters, and to model words not seen in the training data. The standard approach to achieve this is to manually generate a lexicon where words are mapped into phones, often with attributes associated with each of these phones. Context-dependent acoustic models are then constructed using decision trees where questions are asked based on the phones and phone attributes. For low-resource languages, it may not be practical to manually generate a lexicon. An alternative approach is to use a graphemic lexicon, where the “pronunciation” for a word is defined by the letters forming that word. This paper proposes a simple approach for building graphemic systems for any language written in Unicode. The attributes for graphemes are automatically derived using features from the Unicode character descriptions. These attributes are then used in decision tree construction. This approach is examined on the IARPA Babel Option Period 2 languages, and a Levantine Arabic CTS task. The described approach achieves comparable, and complementary, performance to phonetic lexicon-based approaches.
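As a rough illustration of the idea in the abstract, the sketch below builds a graphemic "pronunciation" from a word's letters and derives candidate attributes from each letter's Unicode character name. It is not the authors' implementation: the function names `graphemic_pronunciation` and `grapheme_attributes` are hypothetical, and the choices to lowercase the word, keep only characters in the Unicode letter categories, and split the character name on spaces are assumptions made for this example.

```python
import unicodedata


def graphemic_pronunciation(word):
    """Return the letters of `word` as its graphemic 'pronunciation'.

    Assumption for this sketch: the word is lowercased and only characters
    whose Unicode general category starts with 'L' (letters) are kept.
    """
    return [ch for ch in word.lower() if unicodedata.category(ch).startswith("L")]


def grapheme_attributes(ch):
    """Split the Unicode character name into words usable as attributes.

    Returns an empty list if the character has no name in the database.
    """
    return unicodedata.name(ch, "").split()


if __name__ == "__main__":
    for grapheme in graphemic_pronunciation("Café"):
        print(grapheme, grapheme_attributes(grapheme))
```

Attributes obtained this way, such as the script name or a diacritic description, could then serve as the basis for decision tree questions in place of manually specified phone attributes.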
Metadata
Item Type: | Proceedings Paper |
---|---|
Copyright, Publisher and Additional Information: | © 2015 IEEE. |
Keywords: | Low resource speech recognition; graphemic acoustic models |
Institution: | The University of Sheffield |
Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield) |
Funding Information: | Funder: Department of Defense U.S. Army Research Laboratory (DoD/ARL); Grant number: W911NF-12-C-0012 |
Depositing User: | Symplectic Sheffield |
Date Deposited: | 12 Nov 2019 13:01 |
Last Modified: | 12 Nov 2019 13:01 |
Published Version: | https://ieeexplore.ieee.org/document/7178960 |
Status: | Published |
Publisher: | IEEE |
Refereed: | Yes |
Identification Number: | 10.1109/icassp.2015.7178960 |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:152834 |