Zhu, J. and Lu, P. orcid.org/0000-0002-0199-3783 (2026) KCLVA: Knowledge-enhanced Contrastive Learning and View-specific Attention for Chest X-ray Report Generation. In: Ali, S., Hogg, D.C. and Peckham, M., (eds.) Medical Image Understanding and Analysis. 29th UK Conference on Medical Image Understanding and Analysis (MIUA), 15-17 Jul 2025, Leeds, UK. Lecture Notes in Computer Science, 15916. Springer Nature, Cham, Switzerland, pp. 187-204. ISBN: 978-3-031-98687-1 ISSN: 0302-9743 EISSN: 1611-3349
Abstract
In clinical scenarios, radiologists analyse multiple chest X-ray (CXR) images from various view positions to identify diseases and abnormalities. To replicate the diagnostic approach of experienced radiologists, we propose an encoder-decoder-based CXR report generation architecture, KCLVA, which leverages the Unified Medical Language System (UMLS) to extract view-specific information from diagnostic reports, focusing on posteroanterior, anteroposterior, and lateral views. This extracted information facilitates view-specific attention (VA) mechanisms and is subsequently used to construct a similarity matrix that enables many-to-many contrastive learning. In the encoder, we employ a knowledge distillation architecture to guide the learning of the student model by freezing the teacher model. Within the student text encoder, the VA mechanism is utilised to automatically assign higher weights to tokens corresponding to a specific view in diagnostic reports based on the view position of the CXR, while assigning lower weights to other tokens. The image and text features are then integrated using contrastive learning. In the decoder, a transformer-based backbone architecture is employed to decode the encoder output and generate a medical diagnosis report. This strategy leverages UMLS to extract view-specific information, employs VA to adjust token weights, and utilises many-to-many contrastive learning through a weighted contrastive loss. Together, these components enable our model to closely simulate the diagnostic process of professional radiologists. Consequently, our method achieves significant improvements of 0.185 on METEOR and 0.078 on ROUGE compared to previous approaches.
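The view-specific weighting and the many-to-many weighted contrastive loss described in the abstract can be sketched as follows. This is a minimal NumPy sketch, not the authors' implementation: the function names, the weight values, and the use of a row-normalised soft-target matrix are all assumptions made for illustration.

```python
import numpy as np

def view_token_weights(token_views, image_view, w_match=1.0, w_other=0.2):
    """Assign a higher weight to report tokens whose (UMLS-derived) view tag
    matches the CXR's view position, and a lower weight to all other tokens.
    The weight values 1.0 / 0.2 are illustrative, not from the paper."""
    return np.array([w_match if v == image_view else w_other
                     for v in token_views])

def weighted_contrastive_loss(img_emb, txt_emb, target, tau=0.07):
    """Many-to-many contrastive loss over a batch: `target` is a
    row-normalised similarity matrix (soft labels) rather than the
    one-hot identity used in standard one-to-one contrastive learning."""
    # L2-normalise both modalities so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = (img @ txt.T) / tau  # (B, B) image-report similarity matrix
    # Numerically stable log-softmax over each image's candidate reports.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Soft targets let one image match several reports (many-to-many).
    return float(-(target * log_p).sum(axis=1).mean())
```

With `target = np.eye(B)` the loss degenerates to the standard one-to-one InfoNCE objective; a similarity matrix built from the extracted view-specific information, as the abstract describes, would replace it with soft, many-to-many targets.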
Metadata
Item Type: | Proceedings Paper |
---|---|
Authors/Creators: | Zhu, J. and Lu, P. |
Editors: | Ali, S., Hogg, D.C. and Peckham, M. |
Copyright, Publisher and Additional Information: | This is an author-produced version of a conference paper published in Medical Image Understanding and Analysis, made available under the terms of the Creative Commons Attribution License (CC-BY), which permits unrestricted use, distribution and reproduction in any medium, provided the original work is properly cited. |
Dates: | Published: 2026 |
Institution: | The University of Leeds |
Academic Units: | The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds) |
Depositing User: | Symplectic Publications |
Date Deposited: | 04 Jul 2025 15:23 |
Last Modified: | 21 Aug 2025 08:41 |
Published Version: | https://link.springer.com/chapter/10.1007/978-3-03... |
Status: | Published |
Publisher: | Springer Nature |
Series Name: | Lecture Notes in Computer Science |
Identification Number: | 10.1007/978-3-031-98688-8_14 |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:228682 |
Download
Filename: 1046-Paper_camera_ready_Jinlong Zhu and Ping Lu (1).pdf
Licence: CC-BY 4.0