Discrete-time diffusion-like models for speech synthesis

This is a preprint and may not have undergone formal peer review

Abstract

Diffusion models have attracted a lot of attention in recent years. These models view speech generation as a continuous-time process. For efficient training, this process is typically restricted to additive Gaussian noising, which is limiting. For inference, time is typically discretized, leading to a mismatch between continuous training and discrete sampling conditions. Recently proposed discrete-time processes, on the other hand, usually do not have these limitations, may require substantially fewer inference steps, and are fully consistent between training and inference conditions. This paper explores some diffusion-like discrete-time processes and proposes some new variants. These include processes applying additive Gaussian noise, multiplicative Gaussian noise, blurring noise, and a mixture of blurring and Gaussian noise. The experimental results suggest that discrete-time processes offer subjective and objective speech quality comparable to that of their widely popular continuous counterpart, with more efficient and consistent training and inference schemas.

Metadata

Item Type:	Preprint
Authors/Creators:	Tan, X.; Zhao, M.; Ragni, A.
Copyright, Publisher and Additional Information:	© 2025 The Author(s). This preprint is made available under a Creative Commons Attribution 4.0 International License. (https://creativecommons.org/licenses/by/4.0/)
Keywords:	diffusion models; flow matching; iterative process; speech synthesis
Dates:	Submitted: 13 October 2025
Institution:	The University of Sheffield
Academic Units:	The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield)
Date Deposited:	06 Jan 2026 15:34
Last Modified:	06 Jan 2026 15:34
Status:	Submitted
Identification Number:	10.48550/arXiv.2509.18470
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:236035

Download

Preprint

Filename: 2509.18470v2.pdf

Licence: CC-BY 4.0
