Meutzner, H., Ma, N. orcid.org/0000-0002-4112-3109, Nickel, R. et al. (2 more authors) (2017) Improving audio-visual speech recognition using deep neural networks with dynamic stream reliability estimates. In: Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017). 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017), 05-09 Mar 2017, New Orleans, Louisiana, USA. IEEE, pp. 5320-5324. ISBN 9781509041176
Abstract
Audio-visual speech recognition is a promising approach to tackling the problem of reduced recognition rates under adverse acoustic conditions. However, finding an optimal mechanism for combining multi-modal information remains a challenging task. Various methods are applicable for integrating acoustic and visual information in Gaussian-mixture-model-based speech recognition, e.g., via dynamic stream weighting. The recent advances of deep neural network (DNN)-based speech recognition promise improved performance when using audio-visual information. However, the question of how to optimally integrate acoustic and visual information remains. In this paper, we propose a state-based integration scheme that uses dynamic stream weights in DNN-based audio-visual speech recognition. The dynamic weights are obtained from a time-variant reliability estimate that is derived from the audio signal. We show that this state-based integration is superior to early integration of multi-modal features, even if early integration also includes the proposed reliability estimate. Furthermore, the proposed adaptive mechanism is able to outperform a fixed weighting approach that exploits oracle knowledge of the true signal-to-noise ratio.
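The state-based integration described above combines per-frame audio and video state scores with a time-varying stream weight derived from an audio-based reliability estimate. The sketch below is a minimal illustration of that general fusion scheme, not the paper's implementation; the function names and the logistic SNR-to-weight mapping are assumptions for illustration only.

```python
import numpy as np

def fuse_stream_loglikes(loglik_audio, loglik_video, stream_weight):
    """State-based fusion of audio and video DNN state log-likelihoods.

    loglik_audio, loglik_video: arrays of shape (T, S)
        per-frame log-likelihoods for T frames and S HMM states.
    stream_weight: array of shape (T,)
        dynamic stream weight lambda_t in [0, 1]; 1.0 trusts audio only.
    Returns the fused (T, S) log-likelihood matrix:
        lambda_t * log p_audio + (1 - lambda_t) * log p_video.
    """
    lam = np.clip(np.asarray(stream_weight, dtype=float), 0.0, 1.0)[:, None]
    return lam * np.asarray(loglik_audio) + (1.0 - lam) * np.asarray(loglik_video)

def snr_to_weight(snr_db, midpoint=5.0, slope=0.3):
    """Hypothetical reliability estimate: map a per-frame SNR estimate (dB)
    to a stream weight via a logistic curve. Low SNR -> small weight,
    i.e. the recogniser leans more on the visual stream."""
    return 1.0 / (1.0 + np.exp(-slope * (np.asarray(snr_db, dtype=float) - midpoint)))
```

With a weight of 1.0 the fused scores reduce to the acoustic stream alone; as the estimated SNR drops, the logistic mapping shifts the decision toward the visual stream, which is the behaviour the dynamic weighting scheme is designed to exploit.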
Metadata
| Field | Value |
|---|---|
| Item Type: | Proceedings Paper |
| Authors/Creators: | |
| Copyright, Publisher and Additional Information: | © 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other users, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works. Reproduced in accordance with the publisher's self-archiving policy. |
| Keywords: | audio-visual speech recognition; deep neural networks; feature fusion; dynamic stream weighting |
| Dates: | |
| Institution: | The University of Sheffield |
| Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield) |
| Depositing User: | Symplectic Sheffield |
| Date Deposited: | 25 Sep 2017 08:29 |
| Last Modified: | 19 Dec 2022 13:36 |
| Published Version: | https://doi.org/10.1109/ICASSP.2017.7953172 |
| Status: | Published |
| Publisher: | IEEE |
| Refereed: | Yes |
| Identification Number: | 10.1109/ICASSP.2017.7953172 |
| Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:121254 |