Mendels, G., Cooper, E., Soto, V. et al. (5 more authors) (2015) Improving speech recognition and keyword search for low resource languages using web data. In: INTERSPEECH 2015: 16th Annual Conference of the International Speech Communication Association, 06-10 Sep 2015, Dresden, Germany. International Speech Communication Association (ISCA), pp. 829-833.
Abstract
We describe the use of text data scraped from the web to augment language models for Automatic Speech Recognition and Keyword Search for Low Resource Languages. We scrape text from multiple genres including blogs, online news, translated TED talks, and subtitles. Using linearly interpolated language models, we find that blogs and movie subtitles are more relevant for language modeling of conversational telephone speech and obtain large reductions in out-of-vocabulary keywords. Furthermore, we show that the web data can improve Term Error Rate Performance by 3.8% absolute and Maximum Term-Weighted Value in Keyword Search by 0.0076-0.1059 absolute points. Much of the gain comes from the reduction of out-of-vocabulary items.
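As a rough illustration of the two ideas named in the abstract, linear interpolation of language models and out-of-vocabulary (OOV) keyword reduction, the following is a minimal sketch and not the authors' implementation. It assumes toy add-alpha unigram models in place of whatever n-gram models the paper uses; the token lists, the keyword list, and every function name are hypothetical.

```python
import math
from collections import Counter


def unigram_probs(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(t for t in tokens if t in vocab)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def interpolate(p_base, p_web, lam):
    """Linear interpolation: lam * P_base(w) + (1 - lam) * P_web(w)."""
    return {w: lam * p_base[w] + (1.0 - lam) * p_web[w] for w in p_base}


def perplexity(probs, tokens):
    """Perplexity over the in-vocabulary tokens (assumes at least one)."""
    in_vocab = [t for t in tokens if t in probs]
    log_sum = sum(math.log(probs[t]) for t in in_vocab)
    return math.exp(-log_sum / len(in_vocab))


def oov_rate(keywords, vocab):
    """Fraction of keywords that are missing from the vocabulary."""
    return sum(1 for k in keywords if k not in vocab) / len(keywords)


# Hypothetical toy data standing in for transcripts, web text, and a dev set.
base_tokens = "hello how are you hello yes".split()
web_tokens = "hello movie tonight yes movie".split()
dev_tokens = "hello movie yes you".split()
keywords = ["movie", "tonight", "hello", "goodbye"]

base_vocab = set(base_tokens)
vocab = base_vocab | set(web_tokens)

p_base = unigram_probs(base_tokens, vocab)
p_web = unigram_probs(web_tokens, vocab)

# Choose the interpolation weight by grid search on dev-set perplexity.
grid = [i / 10 for i in range(11)]
best_lam = min(grid, key=lambda lam: perplexity(interpolate(p_base, p_web, lam), dev_tokens))
best_ppl = perplexity(interpolate(p_base, p_web, best_lam), dev_tokens)

oov_before = oov_rate(keywords, base_vocab)
oov_after = oov_rate(keywords, vocab)
```

In this toy setting, adding the web vocabulary lowers the keyword OOV rate because "movie" and "tonight" occur only in the web text. In practice the interpolation weight would be tuned on held-out conversational data rather than a four-word dev set.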
Metadata
Item Type: | Proceedings Paper |
---|---|
Authors/Creators: | |
Copyright, Publisher and Additional Information: | © 2015 International Speech Communication Association (ISCA). Reproduced in accordance with the publisher's self-archiving policy. |
Keywords: | web resources; web scraping; keyword search; low-resource languages |
Dates: | |
Institution: | The University of Sheffield |
Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield) |
Depositing User: | Symplectic Sheffield |
Date Deposited: | 13 Nov 2019 09:45 |
Last Modified: | 13 Nov 2019 10:46 |
Published Version: | https://www.isca-speech.org/archive/interspeech_20... |
Status: | Published |
Publisher: | International Speech Communication Association (ISCA) |
Refereed: | Yes |
Related URLs: | |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:152838 |