Zero-shot urban function inference with street view images through prompting a pretrained vision-language model

Abstract

Inferring urban functions from street view images (SVIs) has attracted rapidly growing interest. The recent rise of large-scale pretrained vision-language models offers a way to address some long-standing challenges in this area, such as heavy reliance on labeled samples and computing resources. In this paper, we present a novel prompting framework that enables the pretrained vision-language model CLIP to infer fine-grained urban functions from SVIs in a zero-shot manner, that is, without labeled samples or model training. The prompting framework, UrbanCLIP, comprises an urban taxonomy and several urban function prompt templates, designed to (1) bridge abstract urban function categories and the concrete urban object types that CLIP can readily understand, and (2) mitigate interference in SVIs, for example from street-side trees and vehicles. We conduct extensive experiments to verify the effectiveness of UrbanCLIP. The results indicate that zero-shot UrbanCLIP largely surpasses several competitive supervised baselines, e.g. a fine-tuned ResNet, and that its advantages become more prominent in cross-city transfer tests. In addition, UrbanCLIP's zero-shot performance is considerably better than that of vanilla CLIP. Overall, UrbanCLIP is a simple yet effective framework for urban function inference, and it showcases the potential of foundation models for geospatial applications.
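
The general idea of pairing a taxonomy with prompt templates can be illustrated with a minimal Python sketch. It is not the UrbanCLIP implementation: the taxonomy entries, the template wording, the "interference" category, and the sum-of-probabilities aggregation rule are all assumptions made for illustration, and the per-prompt scores stand in for image-text similarities that a model such as CLIP would produce.

```python
from math import exp

# Hypothetical taxonomy: urban function -> concrete object types.
# These entries are illustrative only and are not taken from the paper.
TAXONOMY = {
    "residential": ["an apartment building", "a house"],
    "commercial": ["a shop", "a restaurant"],
    "interference": ["street-side trees", "vehicles"],
}

# Hypothetical prompt templates (wording is an assumption).
TEMPLATES = [
    "a street view photo of {}.",
    "a photo taken along a street, showing {}.",
]


def build_prompts(taxonomy, templates):
    """Return (category, prompt) pairs, one per object type and template."""
    return [
        (category, template.format(obj))
        for category, objects in taxonomy.items()
        for obj in objects
        for template in templates
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def infer_function(prompt_scores, prompts, ignore=("interference",)):
    """Sum prompt probabilities per category, then pick the best
    category that is not in `ignore` (an assumed aggregation rule)."""
    totals = {}
    for (category, _), prob in zip(prompts, softmax(prompt_scores)):
        totals[category] = totals.get(category, 0.0) + prob
    candidates = {c: v for c, v in totals.items() if c not in ignore}
    return max(candidates, key=candidates.get), totals


prompts = build_prompts(TAXONOMY, TEMPLATES)
# Placeholder similarity scores; in practice these would come from
# comparing one image embedding against each prompt embedding.
scores = [0.0] * len(prompts)
scores[4] = 3.0  # first "a shop" prompt
scores[8] = 5.0  # first "street-side trees" prompt
label, totals = infer_function(scores, prompts)
print(label)
```

In this sketch, prompts for interfering objects absorb probability mass that would otherwise be spread over the function categories, and are then excluded from the final decision. Whether UrbanCLIP handles interference this way is not stated in the abstract.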

Metadata

Item Type:	Article
Authors/Creators:	Huang, W., Wang, J. and Cong, G.
Copyright, Publisher and Additional Information:	© 2024 Informa UK Limited, trading as Taylor & Francis Group. This is an author produced version of an article published in International Journal of Geographical Information Science. Uploaded in accordance with the publisher's self-archiving policy.
Keywords:	Urban land use, prompt engineering, CLIP, foundation model, street view image
Dates:	Accepted: 21 April 2024; Published (online): 22 May 2024; Published: 2 July 2024
Institution:	The University of Leeds
Academic Units:	The University of Leeds > Faculty of Environment (Leeds) > School of Geography (Leeds)
Depositing User:	Symplectic Publications
Date Deposited:	13 Mar 2025 13:26
Last Modified:	22 May 2025 00:30
Status:	Published
Publisher:	Taylor & Francis
Identification Number:	10.1080/13658816.2024.2347322
Related URLs:	Dataset
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:224347
