Huang, W., Wang, J. and Cong, G. (2024) Zero-shot urban function inference with street view images through prompting a pretrained vision-language model. International Journal of Geographical Information Science, 38 (7). pp. 1414-1442. ISSN 1365-8816
Abstract
Inferring urban functions using street view images (SVIs) has gained tremendous momentum. The recent prosperity of large-scale vision-language pretrained models sheds light on addressing some long-standing challenges in this regard, for example, heavy reliance on labeled samples and computing resources. In this paper, we present a novel prompting framework for enabling the pretrained vision-language model CLIP to effectively infer fine-grained urban functions with SVIs in a zero-shot manner, that is, without labeled samples and model training. The prompting framework UrbanCLIP comprises an urban taxonomy and several urban function prompt templates, in order to (1) bridge the abstract urban function categories and concrete urban object types that can be readily understood by CLIP, and (2) mitigate the interference in SVIs, for example, street-side trees and vehicles. We conduct extensive experiments to verify the effectiveness of UrbanCLIP. The results indicate that the zero-shot UrbanCLIP largely surpasses several competitive supervised baselines, e.g. a fine-tuned ResNet, and its advantages become more prominent in cross-city transfer tests. In addition, UrbanCLIP’s zero-shot performance is considerably better than the vanilla CLIP. Overall, UrbanCLIP is a simple yet effective framework for urban function inference, and showcases the potential of foundation models for geospatial applications.
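The prompting idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the taxonomy entries, the template wording, the max-over-prompts aggregation, and the use of an "interference" category to absorb trees and vehicles are all assumptions made for illustration. The toy vectors stand in for the embeddings a CLIP image and text encoder would produce.

```python
import math

# Hypothetical taxonomy mapping abstract urban functions to concrete object types.
# The paper's actual categories and object types are not listed in this record.
TAXONOMY = {
    "residential": ["an apartment building", "a detached house"],
    "commercial": ["a shop front", "a restaurant"],
    # Assumed catch-all for content the abstract calls interference.
    "interference": ["street-side trees", "vehicles"],
}

# Assumed prompt template; the paper's templates may differ.
TEMPLATE = "a street view image showing {}"


def build_prompts(taxonomy, template):
    """Return one (function, prompt) pair per object type in the taxonomy."""
    return [
        (function, template.format(obj))
        for function, objects in taxonomy.items()
        for obj in objects
    ]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def infer_function(image_vec, prompts, prompt_vecs):
    """Score each function by its best-matching prompt and return the top one."""
    best = {}
    for (function, _), vec in zip(prompts, prompt_vecs):
        best[function] = max(best.get(function, -1.0), cosine(image_vec, vec))
    return max(best, key=best.get), best


prompts = build_prompts(TAXONOMY, TEMPLATE)

# Toy embeddings, one per prompt, in the same order as `prompts`.
# A real pipeline would obtain these from a pretrained text encoder.
prompt_vecs = [
    [1.0, 0.0, 0.0], [0.9, 0.1, 0.0],   # residential prompts
    [0.0, 1.0, 0.0], [0.1, 0.9, 0.0],   # commercial prompts
    [0.0, 0.0, 1.0], [0.0, 0.1, 0.9],   # interference prompts
]

# Toy image embedding that lies closest to the commercial prompts.
image_vec = [0.2, 0.9, 0.1]

label, scores = infer_function(image_vec, prompts, prompt_vecs)
print(label)
```

No labeled samples or training steps appear anywhere in the sketch: the only inputs are the taxonomy, the template, and pretrained embeddings, which is what makes the setting zero-shot.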
Metadata
Item Type: | Article |
---|---|
Authors/Creators: | Huang, W., Wang, J. and Cong, G. |
Copyright, Publisher and Additional Information: | © 2024 Informa UK Limited, trading as Taylor & Francis Group. This is an author produced version of an article published in International Journal of Geographical Information Science. Uploaded in accordance with the publisher's self-archiving policy. |
Keywords: | Urban land use, prompt engineering, CLIP, foundation model, street view image |
Institution: | The University of Leeds |
Academic Units: | The University of Leeds > Faculty of Environment (Leeds) > School of Geography (Leeds) |
Depositing User: | Symplectic Publications |
Date Deposited: | 13 Mar 2025 13:26 |
Last Modified: | 13 Mar 2025 13:26 |
Status: | Published |
Publisher: | Taylor & Francis |
Identification Number: | 10.1080/13658816.2024.2347322 |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:224347 |
Download
Filename: UrbanCLIP_IJGIS.pdf
