Alghamdi, N., Maddock, S., Barker, J. and Brown, G.J. (2017) The impact of automatic exaggeration of the visual articulatory features of a talker on the intelligibility of spectrally distorted speech. Speech Communication, 95, pp. 127-136. ISSN 0167-6393
Abstract
Visual speech information plays a key role in supporting speech perception, especially when acoustic features are distorted or inaccessible. Recent research suggests that for spectrally distorted speech, the use of visual speech in auditory training improves not only subjects’ audiovisual speech recognition, but also their subsequent auditory-only speech recognition. Visual speech cues, however, can be affected by a number of facial visual signals that vary across talkers, such as lip emphasis and speaking style. In a previous study, we enhanced the visual speech videos used in perception training by automatically tracking and colouring a talker’s lips. Subjects trained with these enhanced videos achieved better audiovisual and subsequent auditory-only speech recognition than those trained with unmodified videos or audio alone. In this paper, we report on two issues related to automatic exaggeration of the movement of the lips/mouth area. First, we investigate subjects’ ability to adapt to the conflict between the articulation energy in the visual signals and the vocal effort in the acoustic signals (since the acoustic signals remained unexaggerated). Second, we examined whether this visual exaggeration, when used in perception training, can improve subjects’ auditory and audiovisual speech recognition. To test this, we trained groups of listeners on spectrally distorted speech using four training regimes: (1) audio only, (2) audiovisual, (3) audiovisual visually exaggerated, and (4) audiovisual visually exaggerated and lip-coloured. We used cochlear-implant-simulated speech because the longer-term aim of our work is to employ these concepts in a training system for cochlear-implant (CI) users. The results suggest that after exposure to visually exaggerated speech, listeners were able to adapt to the conflicting audiovisual signals. In addition, subjects trained with enhanced visual cues (regimes 3 and 4) achieved better audiovisual recognition for a number of phoneme classes than those trained with unmodified visual speech (regime 2). However, there was no evidence of an improvement in subsequent audio-only listening skills. The subjects’ adaptation to the conflicting audiovisual signals may have slowed down auditory perceptual learning and impeded the ability of the visual speech to improve the training gains.
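The paper itself does not provide code for generating the cochlear-implant-simulated stimuli. As an illustration only, the sketch below shows a standard noise-vocoder of the kind commonly used for CI simulation (log-spaced analysis bands, Hilbert envelopes, envelope-modulated band-limited noise), assuming NumPy/SciPy; the function name, channel count, and band edges are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def noise_vocode(speech, fs, n_channels=8, lo=100.0, hi=7000.0):
    """Minimal noise-vocoder sketch (CI-style spectral distortion).

    Splits `speech` (1-D float array) into log-spaced bands, extracts each
    band's envelope, and uses it to modulate noise filtered to the same band.
    """
    edges = np.logspace(np.log10(lo), np.log10(hi), n_channels + 1)
    noise = np.random.randn(len(speech))
    out = np.zeros(len(speech))
    for low, high in zip(edges[:-1], edges[1:]):
        b, a = butter(4, [low, high], btype="band", fs=fs)
        band = filtfilt(b, a, speech)        # analysis band of the speech
        env = np.abs(hilbert(band))          # Hilbert envelope of that band
        carrier = filtfilt(b, a, noise)      # noise restricted to the same band
        out += env * carrier                 # envelope-modulated noise channel
    # Match overall level to the input signal
    out *= np.sqrt(np.sum(speech ** 2) / np.sum(out ** 2))
    return out
```

Usage would be along the lines of `vocoded = noise_vocode(x, 16000)`, where `x` is a mono speech waveform at 16 kHz; fewer channels produce more severe spectral distortion.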
Metadata
| Field | Value |
|---|---|
| Item Type | Article |
| Authors/Creators | Alghamdi, N.; Maddock, S.; Barker, J.; Brown, G.J. |
| Copyright, Publisher and Additional Information | © 2017 Published by Elsevier B.V. This is an author-produced version of a paper subsequently published in Speech Communication. Uploaded in accordance with the publisher's self-archiving policy. Article available under the terms of the CC-BY-NC-ND licence (https://creativecommons.org/licenses/by-nc-nd/4.0/) |
| Keywords | Audiovisual training; Cochlear-implant simulation; Visual-speech enhancement; Lombard speech |
| Institution | The University of Sheffield |
| Academic Units | The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield) |
| Depositing User | Symplectic Sheffield |
| Date Deposited | 01 Sep 2017 09:45 |
| Last Modified | 07 Nov 2023 08:28 |
| Status | Published |
| Publisher | Elsevier |
| Refereed | Yes |
| Identification Number | 10.1016/j.specom.2017.08.010 |
| Open Archives Initiative ID (OAI ID) | oai:eprints.whiterose.ac.uk:120730 |
Download
Filename: alghamdi_maddock_barker_brown_2017.pdf
Licence: CC-BY-NC-ND 4.0