VisualSpeech: Enhancing Prosody Modeling in TTS Using Video

This is the latest version of this eprint.

Que, S. and Ragni, A. orcid.org/0000-0003-0634-4456 (2025) VisualSpeech: Enhancing Prosody Modeling in TTS Using Video. In: Scharenborg, O., Oertel, C. and Truong, K., (eds.) Proceedings of Interspeech 2025. Interspeech 2025, 17-21 Aug 2025, Rotterdam, The Netherlands. International Speech Communication Association (ISCA), pp. 3778-3782. ISSN: 2958-1796. EISSN: 2958-1796.

Abstract

Text-to-Speech (TTS) synthesis faces the inherent challenge of producing multiple speech outputs with varying prosody given a single text input. While previous research has addressed this by predicting prosodic information from both text and speech, additional contextual information, such as video, remains under-utilized despite being available in many applications. This paper investigates the potential of integrating visual context to enhance prosody prediction. We propose a novel model, VisualSpeech, which incorporates visual and textual information for improving prosody generation in TTS. Empirical results indicate that incorporating visual features improves prosodic modeling, enhancing the expressiveness of the synthesized speech.

Metadata

Item Type:	Proceedings Paper
Authors/Creators:	Que, S.; Ragni, A. (https://orcid.org/0000-0003-0634-4456)
Editors:	Scharenborg, O.; Oertel, C.; Truong, K.
Copyright, Publisher and Additional Information:	© 2025 The Authors. Except as otherwise noted, this author-accepted version of a paper published in Proceedings of Interspeech 2025 is made available via the University of Sheffield Research Publications and Copyright Policy under the terms of the Creative Commons Attribution 4.0 International License (CC-BY 4.0), which permits unrestricted use, distribution and reproduction in any medium, provided the original work is properly cited. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. © 2025 ISCA. Reproduced in accordance with the publisher's self-archiving policy.
Keywords:	Text-to-speech Synthesis; Video; Visual Features; Prosody
Dates:	Published (online): 17 August 2025; Published: 17 August 2025
Institution:	The University of Sheffield
Academic Units:	The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield)
Date Deposited:	24 Sep 2025 14:08
Last Modified:	24 Sep 2025 14:08
Published Version:	https://www.isca-archive.org/interspeech_2025/que2...
Status:	Published
Publisher:	International Speech Communication Association (ISCA)
Refereed:	Yes
Identification Number:	10.21437/Interspeech.2025-1494
Related URLs:	arXiv URL; Conference
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:232120

Available Versions of this Item

VisualSpeech: enhance prosody with visual context in TTS. (deposited 21 May 2025 12:48)
- VisualSpeech: Enhancing Prosody Modeling in TTS Using Video. (deposited 24 Sep 2025 14:08) [Currently Displayed]
