Cao, J., Chen, J., Wang, X. et al. (5 more authors) (2026) UrbanMMCL: Urban region representations via multi-modal and multi-graph self-supervised contrastive learning. ISPRS Journal of Photogrammetry and Remote Sensing, 232. pp. 75-93. ISSN: 0924-2716
Abstract
Urban region representation learning has emerged as a fundamental approach for diverse urban analytics tasks, where each neighborhood is encoded as a dense embedding vector for effective downstream applications. However, existing approaches suffer from insufficient multi-modal alignment and inadequate spatial relationship modeling, limiting their representation quality and generalizability. To address these challenges, we propose UrbanMMCL, a novel self-supervised framework that integrates multi-modal multi-view contrastive pre-training with unified fine-tuning for comprehensive urban representation learning. UrbanMMCL employs a dual-stage architecture. First, cross-modal contrastive learning aligns diverse data modalities including remote sensing imagery, street view imagery, location encodings, and Vision–Language Model (VLM)-generated textual descriptions. Second, multi-view adaptive graph contrastive learning captures complex spatial relationships across human mobility, functional similarity, and geographic distance perspectives. The framework then fine-tunes all parameters with the learned representations for effective adaptation to downstream tasks. Comprehensive experiments demonstrate that UrbanMMCL consistently outperforms state-of-the-art methods across pollutant emission prediction, population density estimation, and land use classification with minimal fine-tuning requirements, thereby advancing foundation model development for diverse Geo-AI applications.
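The cross-modal alignment step mentioned in the abstract can be illustrated with a generic symmetric InfoNCE objective. This is a minimal sketch and not the authors' implementation: the abstract does not state the loss used, so the choice of InfoNCE, the temperature of 0.1, the function names, and the toy two-dimensional embeddings are all assumptions made for illustration. Row `i` of each list is assumed to describe the same urban region, and all vectors are assumed to be non-zero.

```python
import math

def l2_normalize(v):
    # Scale a non-zero vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(anchors, positives, temperature=0.1):
    # Mean cross-entropy of matching anchors[i] to positives[i],
    # with every other row of `positives` acting as a negative.
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)

def symmetric_info_nce(view_a, view_b, temperature=0.1):
    # Average the loss over both matching directions (a -> b and b -> a).
    return 0.5 * (info_nce(view_a, view_b, temperature)
                  + info_nce(view_b, view_a, temperature))

# Hypothetical embeddings for three regions from two modalities.
satellite = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
street_aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, -0.1]]
# Same street-view embeddings, but paired with the wrong regions.
street_shuffled = [street_aligned[1], street_aligned[2], street_aligned[0]]

loss_aligned = symmetric_info_nce(satellite, street_aligned)
loss_shuffled = symmetric_info_nce(satellite, street_shuffled)
print(loss_aligned < loss_shuffled)
```

In this toy setting the loss is lower when embeddings of the same region agree across modalities than when the pairing is shuffled, which is the behaviour a contrastive alignment objective is meant to reward.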
Metadata
| Item Type: | Article |
|---|---|
| Authors/Creators: | Cao, J., Chen, J., Wang, X. et al. (5 more authors) |
| Copyright, Publisher and Additional Information: | This is an author produced version of an article published in ISPRS Journal of Photogrammetry and Remote Sensing, made available via the University of Leeds Research Outputs Policy under the terms of the Creative Commons Attribution License (CC-BY), which permits unrestricted use, distribution and reproduction in any medium, provided the original work is properly cited. |
| Keywords: | Urban region representation learning; Contrastive learning; Pretrain-finetune; Multimodal fusion; Urban foundation model |
| Dates: | |
| Institution: | The University of Leeds |
| Academic Units: | The University of Leeds > Faculty of Environment (Leeds) > School of Geography (Leeds) |
| Date Deposited: | 15 Jan 2026 15:17 |
| Last Modified: | 15 Jan 2026 15:17 |
| Status: | Published |
| Publisher: | Elsevier |
| Identification Number: | 10.1016/j.isprsjprs.2025.11.012 |
| Related URLs: | |
| Sustainable Development Goals: | |
| Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:236100 |
Download
Filename: UrbanMMCL.pdf
Licence: CC-BY 4.0

