Wang, W. and Lu, P. orcid.org/0000-0002-0199-3783 (2026) Cardiac Ultrasound Video Generation Using a Diffusion Model with Temporal Transformer. In: Ali, S., Hogg, D.C. and Peckham, M., (eds.) Medical Image Understanding and Analysis. 29th UK Conference on Medical Image Understanding and Analysis (MIUA), 15-17 Jul 2025, Leeds, UK. Lecture Notes in Computer Science, 15916. Springer Nature, Cham, Switzerland, pp. 174-186. ISBN: 978-3-031-98687-1 ISSN: 0302-9743 EISSN: 1611-3349
Abstract
Cardiac ultrasound is widely used for the diagnosis and monitoring of cardiovascular diseases due to its noninvasive nature, real-time imaging capability, and low cost. However, its clinical utility is often limited by noise sensitivity and acquisition variability, which adversely affect automated interpretation and sequence consistency. To overcome these limitations, this paper presents a multimodal deep learning framework that combines a denoising diffusion model with a Temporal Transformer to generate high-quality cardiac ultrasound videos. A unified preprocessing pipeline with intensity normalisation and standardisation is employed to reduce intersample variation and enhance anatomical structures. Spatial features are first extracted from individual frames, followed by temporal modelling across sequences using the Temporal Transformer. These features guide the latent-space denoising process, optionally augmented by ControlNet for structure-aware generation. The experimental results demonstrate that the proposed method achieves robust performance, with an FID of 43.50, an FVD of 274.52, and an inception score of 8.62. Ablation studies further verify the critical contributions of ControlNet and composite loss design, highlighting the effectiveness of the framework in ensuring both spatial fidelity and temporal coherence.
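The abstract mentions a preprocessing pipeline with intensity normalisation and standardisation but gives no implementation details. The following is only a minimal sketch of what such a step could look like, assuming per-frame min-max normalisation followed by clip-level zero-mean, unit-variance standardisation. The function name `preprocess_clip`, the `eps` parameter, and the 16 × 112 × 112 clip shape are hypothetical choices for illustration, not taken from the paper.

```python
import numpy as np


def preprocess_clip(clip, eps=1e-8):
    """Min-max normalise each frame to [0, 1], then standardise the whole clip.

    clip: array of shape (frames, height, width) holding raw pixel intensities.
    Returns a floating-point array of the same shape with approximately
    zero mean and unit variance. (Assumed scheme, not the authors' method.)
    """
    clip = np.asarray(clip, dtype=np.float32)
    # Per-frame intensity normalisation to the range [0, 1].
    lo = clip.min(axis=(1, 2), keepdims=True)
    hi = clip.max(axis=(1, 2), keepdims=True)
    normalised = (clip - lo) / (hi - lo + eps)
    # Clip-level standardisation; eps guards against constant clips.
    return (normalised - normalised.mean()) / (normalised.std() + eps)


# Hypothetical 16-frame greyscale clip with 8-bit intensities.
rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(16, 112, 112))
out = preprocess_clip(raw)
print(out.shape)
```

Normalising per frame reduces brightness differences between acquisitions, while standardising over the whole clip keeps relative intensity changes between frames; whether the paper applies these steps per frame, per clip, or per dataset is not stated in the abstract.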
Metadata
Item Type: | Proceedings Paper |
---|---|
Authors/Creators: | Wang, W. and Lu, P. |
Editors: | Ali, S., Hogg, D.C. and Peckham, M. |
Copyright, Publisher and Additional Information: | This is an author produced version of a conference paper published in Medical Image Understanding and Analysis made available under the terms of the Creative Commons Attribution License (CC-BY), which permits unrestricted use, distribution and reproduction in any medium, provided the original work is properly cited. |
Keywords: | Cardiac Ultrasound, Diffusion Model, Temporal Transformer, ControlNet, Multimodal Generation |
Dates: |
|
Institution: | The University of Leeds |
Academic Units: | The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds) |
Depositing User: | Symplectic Publications |
Date Deposited: | 04 Jul 2025 14:10 |
Last Modified: | 20 Aug 2025 14:58 |
Published Version: | https://link.springer.com/book/10.1007/978-3-031-9... |
Status: | Published |
Publisher: | Springer Nature |
Series Name: | Lecture Notes in Computer Science |
Identification Number: | 10.1007/978-3-031-98688-8_13 |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:228681 |