Zhu, J. and Lu, P. orcid.org/0000-0002-0199-3783 (2026) KCLVA: Knowledge-enhanced Contrastive Learning and View-specific Attention for Chest X-ray Report Generation. In: Ali, S., Hogg, D.C. and Peckham, M., (eds.) Medical Image Understanding and Analysis. 29th UK Conference on Medical Image Understanding and Analysis (MIUA), 15-17 Jul 2025, Leeds, UK. Lecture Notes in Computer Science, 15916. Springer Nature, Cham, Switzerland, pp. 187-204. ISBN: 978-3-031-98687-1 ISSN: 0302-9743 EISSN: 1611-3349
Abstract
In clinical scenarios, radiologists analyse multiple chest X-ray (CXR) images from various view positions to identify diseases and abnormalities. To replicate the diagnostic approach of experienced radiologists, we propose an encoder-decoder-based CXR report generation architecture, KCLVA, which leverages the Unified Medical Language System (UMLS) to extract view-specific information from diagnostic reports, focusing on posteroanterior, anteroposterior, and lateral views. This extracted information facilitates view-specific attention (VA) mechanisms and is subsequently used to construct a similarity matrix that enables many-to-many contrastive learning. In the encoder, we employ a knowledge distillation architecture to guide the learning of the student model by freezing the teacher model. Within the student text encoder, the VA mechanism is utilised to automatically assign higher weights to tokens corresponding to a specific view in diagnostic reports based on the view position of the CXR, while assigning lower weights to other tokens. The image and text features are then integrated using contrastive learning. In the decoder, a transformer-based backbone architecture is employed to decode the encoder output and generate a medical diagnosis report. This strategy leverages UMLS to extract view-specific information, employs VA to adjust token weights, and utilises many-to-many contrastive learning through a weighted contrastive loss. Together, these components enable our model to closely simulate the diagnostic process of professional radiologists. Consequently, our method achieves significant improvements of 0.185 on METEOR and 0.078 on ROUGE compared to previous approaches.
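The view-specific weighting and the many-to-many weighted contrastive loss described in the abstract can be sketched as follows. This is a minimal NumPy sketch, not the authors' implementation: the function names, the weight values, and the use of a row-normalised soft-target matrix are all assumptions made for illustration.

```python
import numpy as np

def view_token_weights(token_views, image_view, w_match=1.0, w_other=0.2):
    """Assign a higher weight to report tokens whose (UMLS-derived) view tag
    matches the CXR's view position, and a lower weight to all other tokens.
    The weight values 1.0 / 0.2 are illustrative, not from the paper."""
    return np.array([w_match if v == image_view else w_other
                     for v in token_views])

def weighted_contrastive_loss(img_emb, txt_emb, target, tau=0.07):
    """Many-to-many contrastive loss over a batch: `target` is a
    row-normalised similarity matrix (soft labels) rather than the
    one-hot identity used in standard one-to-one contrastive learning."""
    # L2-normalise both modalities so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = (img @ txt.T) / tau  # (B, B) image-report similarity matrix
    # Numerically stable log-softmax over each image's candidate reports.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Soft targets let one image match several reports (many-to-many).
    return float(-(target * log_p).sum(axis=1).mean())
```

With `target = np.eye(B)` the loss degenerates to the standard one-to-one InfoNCE objective; a similarity matrix built from the extracted view-specific information, as the abstract describes, would replace it with soft, many-to-many targets.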
Metadata
Item Type: | Proceedings Paper |
---|---|
Authors/Creators: | Zhu, J. and Lu, P. |
Editors: | Ali, S., Hogg, D.C. and Peckham, M. |
Copyright, Publisher and Additional Information: | This is an author-produced version of a conference paper published in Medical Image Understanding and Analysis, made available under the terms of the Creative Commons Attribution License (CC-BY), which permits unrestricted use, distribution and reproduction in any medium, provided the original work is properly cited. |
Dates: | Published: 2026 |
Institution: | The University of Leeds |
Academic Units: | The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds) |
Depositing User: | Symplectic Publications |
Date Deposited: | 04 Jul 2025 15:23 |
Last Modified: | 21 Aug 2025 08:41 |
Published Version: | https://link.springer.com/chapter/10.1007/978-3-03... |
Status: | Published |
Publisher: | Springer Nature |
Series Name: | Lecture Notes in Computer Science |
Identification Number: | 10.1007/978-3-031-98688-8_14 |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:228682 |
Download
Filename: 1046-Paper_camera_ready_Jinlong Zhu and Ping Lu (1).pdf
Licence: CC-BY 4.0