Talking Head from Speech Audio using a Pre-trained Image Generator

Alghamdi, MM, Wang, H orcid.org/0000-0002-2281-5679, Bulpitt, AJ orcid.org/0000-0002-7905-4540 et al. (1 more author) (2022) Talking Head from Speech Audio using a Pre-trained Image Generator. In: Proceedings of the 30th ACM International Conference on Multimedia. MM '22: The 30th ACM International Conference on Multimedia, 10-14 Oct 2022, Lisboa, Portugal. ACM, pp. 5228-5236. ISBN 978-1-4503-9203-7

Abstract

We propose a novel method for generating high-resolution videos of talking heads from speech audio and a single 'identity' image. Our method is based on a convolutional neural network model that incorporates a pre-trained StyleGAN generator. We model each frame as a point in the latent space of StyleGAN so that a video corresponds to a trajectory through the latent space. The network is trained in two stages. The first stage models trajectories in the latent space conditioned on speech utterances. To do this, we use an existing encoder to invert the generator, mapping from each video frame into the latent space. We train a recurrent neural network to map from speech utterances to displacements in the latent space of the image generator. These displacements are relative to the back-projection into the latent space of an identity image chosen from the individuals depicted in the training dataset. In the second stage, we improve the visual quality of the generated videos by tuning the image generator on a single image or a short video of any chosen identity. We evaluate our model on standard measures (PSNR, SSIM, FID and LMD) and show that it significantly outperforms recent state-of-the-art methods on one of two commonly used datasets and gives comparable performance on the other. Finally, we report on ablation experiments that validate the components of the model. The code and videos from experiments can be found at https://mohammedalghamdi.github.io/talking-heads-acm-mm/
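
The displacement-based formulation in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the latent size of 512, the function name latent_trajectory, and the random arrays standing in for the encoder output and the recurrent network's predictions are all assumptions made for illustration. The sketch only shows how a video becomes a trajectory of latent codes, each formed by adding a per-frame displacement to the latent code of the identity image.

```python
import numpy as np

LATENT_DIM = 512  # assumed latent size; the abstract does not state it


def latent_trajectory(identity_latent, displacements):
    """Return one latent code per frame: identity latent plus that frame's displacement.

    identity_latent: shape (D,), the back-projection of the identity image.
    displacements:   shape (T, D), one displacement per video frame.
    """
    identity_latent = np.asarray(identity_latent, dtype=np.float64)
    displacements = np.asarray(displacements, dtype=np.float64)
    if identity_latent.ndim != 1:
        raise ValueError("identity_latent must be a 1-D vector")
    if displacements.ndim != 2 or displacements.shape[1] != identity_latent.shape[0]:
        raise ValueError("displacements must have shape (frames, latent_dim)")
    # Broadcasting adds the same identity latent to every frame's displacement.
    return identity_latent[None, :] + displacements


# Hypothetical stand-ins: in the described method these would come from the
# encoder (identity latent) and the speech-driven recurrent network (displacements).
rng = np.random.default_rng(0)
w_id = rng.standard_normal(LATENT_DIM)
deltas = 0.1 * rng.standard_normal((25, LATENT_DIM))

trajectory = latent_trajectory(w_id, deltas)
```

Each row of trajectory would then be passed to the image generator to render one frame. A displacement of zero reproduces the identity latent, which is why the identity image anchors the whole trajectory.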

Metadata

Item Type:	Proceedings Paper
Authors/Creators:	Alghamdi, MM; Wang, H https://orcid.org/0000-0002-2281-5679; Bulpitt, AJ https://orcid.org/0000-0002-7905-4540; Hogg, DC https://orcid.org/0000-0002-6125-9564
Copyright, Publisher and Additional Information:	© 2022 Copyright held by the owner/author(s). Publication rights licensed to ACM. This is an author produced version of an article published in Proceedings of the 30th ACM International Conference on Multimedia. Uploaded in accordance with the publisher's self-archiving policy.
Dates:	Published (online): 10 October 2022; Published: 10 October 2022
Institution:	The University of Leeds
Academic Units:	The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds)
Depositing User:	Symplectic Publications
Date Deposited:	17 Apr 2023 12:25
Last Modified:	17 Apr 2023 12:25
Published Version:	http://dx.doi.org/10.1145/3503161.3548101
Status:	Published
Publisher:	ACM
Identification Number:	10.1145/3503161.3548101
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:198157
