Boito, M.Z., Bérard, A., Villavicencio, A. orcid.org/0000-0002-3731-9168 et al. (1 more author) (2018) Unwritten languages demand attention too! Word discovery with encoder-decoder models. In: 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 16-20 Dec 2017, Okinawa, Japan. IEEE, pp. 458-465. ISBN 9781509047895
Abstract
Word discovery is the task of extracting words from unsegmented text. In this paper, we examine to what extent neural networks can be applied to this task in a realistic unwritten language scenario, where only small corpora and limited annotations are available. We investigate two scenarios: one with no supervision, and another with limited supervision in which the most frequent words are known. The results show that it is possible to retrieve at least 27% of the gold standard vocabulary by training an encoder-decoder neural machine translation system on only 5,157 sentences. This result is close to those obtained with a task-specific Bayesian nonparametric model. Moreover, our approach has the advantage of generating translation alignments, which could be used to create a bilingual lexicon. As a future perspective, this approach is also well suited to working directly from speech.
Metadata
| Item Type: | Proceedings Paper |
|---|---|
| Authors/Creators: | Boito, M.Z.; Bérard, A.; Villavicencio, A. (orcid.org/0000-0002-3731-9168); et al. (1 more author) |
| Copyright, Publisher and Additional Information: | © 2017 IEEE. |
| Keywords: | Word Discovery; Computational Language Documentation; Neural Machine Translation; Attention models |
| Dates: | |
| Institution: | The University of Sheffield |
| Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield) |
| Depositing User: | Symplectic Sheffield |
| Date Deposited: | 25 Nov 2019 11:51 |
| Last Modified: | 25 Nov 2019 11:51 |
| Status: | Published |
| Publisher: | IEEE |
| Refereed: | Yes |
| Identification Number: | 10.1109/ASRU.2017.8268972 |
| Related URLs: | |
| Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:153555 |

CORE (COnnecting REpositories)