Boito, M.Z., Bérard, A., Villavicencio, A. orcid.org/0000-0002-3731-9168 et al. (1 more author) (2018) Unwritten languages demand attention too! Word discovery with encoder-decoder models. In: 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 16-20 Dec 2017, Okinawa, Japan. IEEE, pp. 458-465. ISBN 9781509047895
Abstract
Word discovery is the task of extracting words from unsegmented text. In this paper we examine to what extent neural networks can be applied to this task in a realistic unwritten language scenario, where only small corpora and limited annotations are available. We investigate two scenarios: one with no supervision and another with limited supervision with access to the most frequent words. The results show that it is possible to retrieve at least 27% of the gold standard vocabulary by training an encoder-decoder neural machine translation system with only 5,157 sentences. This result is close to those obtained with a task-specific Bayesian nonparametric model. Moreover, our approach has the advantage of generating translation alignments, which could be used to create a bilingual lexicon. As a future perspective, this approach is also well suited to work directly from speech.
Metadata
| Item Type: | Proceedings Paper | 
|---|---|
| Authors/Creators: | Boito, M.Z.; Bérard, A.; Villavicencio, A.; et al. (1 more author) | 
| Copyright, Publisher and Additional Information: | © 2017 IEEE. | 
| Keywords: | Word Discovery; Computational Language Documentation; Neural Machine Translation; Attention models | 
| Dates: | | 
| Institution: | The University of Sheffield | 
| Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield) | 
| Depositing User: | Symplectic Sheffield | 
| Date Deposited: | 25 Nov 2019 11:51 | 
| Last Modified: | 25 Nov 2019 11:51 | 
| Status: | Published | 
| Publisher: | IEEE | 
| Refereed: | Yes | 
| Identification Number: | 10.1109/ASRU.2017.8268972 | 
| Related URLs: | |
| Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:153555 | 
