Cardiac Ultrasound Video Generation Using a Diffusion Model with Temporal Transformer

Wang, W. and Lu, P. orcid.org/0000-0002-0199-3783 (2026) Cardiac Ultrasound Video Generation Using a Diffusion Model with Temporal Transformer. In: Ali, S., Hogg, D.C. and Peckham, M., (eds.) Medical Image Understanding and Analysis. 29th UK Conference on Medical Image Understanding and Analysis (MIUA), 15-17 Jul 2025, Leeds, UK. Lecture Notes in Computer Science, 15916. Springer Nature, Cham, Switzerland, pp. 174-186. ISBN: 978-3-031-98687-1. ISSN: 0302-9743. EISSN: 1611-3349.

Abstract

Cardiac ultrasound is widely used for the diagnosis and monitoring of cardiovascular diseases due to its noninvasive nature, real-time imaging capability, and low cost. However, its clinical utility is often limited by noise sensitivity and acquisition variability, which adversely affect automated interpretation and sequence consistency. To overcome these limitations, this paper presents a multimodal deep learning framework that combines a denoising diffusion model with a Temporal Transformer to generate high-quality cardiac ultrasound videos. A unified preprocessing pipeline with intensity normalisation and standardisation is employed to reduce intersample variation and enhance anatomical structures. Spatial features are first extracted from individual frames, followed by temporal modelling across sequences using the Temporal Transformer. These features guide the latent-space denoising process, optionally augmented by ControlNet for structure-aware generation. The experimental results demonstrate that the proposed method achieves robust performance, with an FID of 43.50, an FVD of 274.52, and an inception score of 8.62. Ablation studies further verify the critical contributions of ControlNet and composite loss design, highlighting the effectiveness of the framework in ensuring both spatial fidelity and temporal coherence.
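The abstract names intensity normalisation and standardisation as the preprocessing steps but gives no formulae. As a minimal sketch, assuming per-clip min-max scaling to [0, 1] followed by z-score standardisation (both are assumptions, and the function name `preprocess_clip` is hypothetical rather than taken from the paper), such a step might look like this:

```python
import numpy as np


def preprocess_clip(clip: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Hypothetical preprocessing for one ultrasound clip.

    Assumes `clip` has shape (frames, height, width). The exact scheme
    used in the paper is not given in the abstract; this is one common choice.
    """
    clip = clip.astype(np.float32)

    # Intensity normalisation: rescale the whole clip to the range [0, 1].
    lo, hi = clip.min(), clip.max()
    normalised = (clip - lo) / (hi - lo + eps)

    # Standardisation: shift to zero mean and scale to unit variance.
    return (normalised - normalised.mean()) / (normalised.std() + eps)


# Synthetic 16-frame, 64x64 greyscale clip standing in for real data.
rng = np.random.default_rng(0)
clip = rng.integers(0, 256, size=(16, 64, 64)).astype(np.uint8)
out = preprocess_clip(clip)
```

Computing the statistics over the whole clip rather than per frame keeps relative brightness consistent across frames, which seems to fit the stated aim of reducing intersample variation without disturbing temporal coherence.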

Metadata

Item Type:	Proceedings Paper
Authors/Creators:	Wang, W.; Lu, P. (https://orcid.org/0000-0002-0199-3783)
Editors:	Ali, S.; Hogg, D.C.; Peckham, M.
Copyright, Publisher and Additional Information:	This is an author produced version of a conference paper published in Medical Image Understanding and Analysis made available under the terms of the Creative Commons Attribution License (CC-BY), which permits unrestricted use, distribution and reproduction in any medium, provided the original work is properly cited.
Keywords:	Cardiac Ultrasound, Diffusion Model, Temporal Transformer, ControlNet, Multimodal Generation
Dates:	Accepted: 13 July 2025; Published (online): 17 July 2025; Published: 2026
Institution:	The University of Leeds
Academic Units:	The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds)
Date Deposited:	04 Jul 2025 14:10
Last Modified:	14 Jan 2026 11:30
Published Version:	https://link.springer.com/book/10.1007/978-3-031-9...
Status:	Published
Publisher:	Springer Nature
Series Name:	Lecture Notes in Computer Science
Identification Number:	10.1007/978-3-031-98688-8_13
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:228681
