Peng, B., Wu, C., He, W. orcid.org/0009-0000-8005-062X et al. (4 more authors) (2023) FLYPE: Multitask prompt tuning for multimodal human understanding of social media. In: Cheema, G.S., Hakimov, S., Kastner, M.A. and Garcia, N., (eds.) MUWS 2023: Multimodal Human Understanding for the Web and Social Media 2023: Proceedings of the 2nd International Workshop on Multimodal Human Understanding for the Web and Social Media co-located with the 32nd ACM International Conference on Information and Knowledge Management. 2nd International Workshop on Multimodal Human Understanding for the Web and Social Media co-located with the 32nd ACM International Conference on Information and Knowledge Management (CIKM 2023), 22 Oct 2023, Birmingham, United Kingdom. CEUR Workshop Proceedings, 3566. CEUR Workshop Proceedings, pp. 18-33.
Abstract
Large-scale pretraining and instruction tuning have facilitated general-purpose visual language understanding with broad competence. Social media processing stands to benefit greatly from large visual language models, because messages are conveyed through joint reasoning over text and images. Although vision-language pretraining has been widely studied, meta vision-language tuning remains under-explored. Given the ubiquity of visual content in social media, adapting pretrained Visual Language Models (VLMs) to meta social science is essential to avoid extra computational expense on hyper-parameter search. This paper takes inspiration from cognitive studies to intrinsically and efficiently integrate cross-modal reasoning into a method named FLYPE, which was the runner-up in CheckThat! 2023 Task 1A. FLYPE integrates the visual and text components of multiple tasks with cross-task shared prompts that guide a frozen VLM to act as a meta classifier for unseen tasks. We evaluate our model across six social visual language understanding tasks and perform an ablation study on several modifications to the architecture. Our empirical study shows the competitive performance and training efficiency of the method. Soft prompts can redirect biased pretrained attention to focus on more task-related visual content. We release improved benchmarks with our model at https://github.com/pengbohua/Flype.
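To make the core idea of the abstract concrete, the sketch below illustrates cross-task shared soft-prompt tuning with a frozen vision-language backbone: only a small prompt matrix and a lightweight classification head are trained. This is a minimal illustration under assumed names (SharedPromptTuner, the stand-in backbone, and all tensor shapes are hypothetical), not the released FLYPE implementation from the linked repository.

```python
# Minimal sketch of cross-task shared soft-prompt tuning with a frozen
# vision-language backbone. All module and parameter names are illustrative
# assumptions, not the authors' released code.
import torch
import torch.nn as nn


class SharedPromptTuner(nn.Module):
    def __init__(self, frozen_vlm: nn.Module, embed_dim: int = 512,
                 prompt_len: int = 16, num_labels: int = 2):
        super().__init__()
        self.vlm = frozen_vlm
        for p in self.vlm.parameters():          # keep the backbone frozen
            p.requires_grad = False
        # Soft prompts shared across tasks: the only trainable parameters
        # besides the small classification head.
        self.shared_prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)
        self.classifier = nn.Linear(embed_dim, num_labels)

    def forward(self, text_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
        # text_emb: (batch, text_len, dim); image_emb: (batch, img_len, dim)
        batch = text_emb.size(0)
        prompt = self.shared_prompt.unsqueeze(0).expand(batch, -1, -1)
        # Prepend the shared prompt so the frozen encoder attends jointly
        # over prompt, text, and image tokens.
        fused = torch.cat([prompt, text_emb, image_emb], dim=1)
        pooled = self.vlm(fused).mean(dim=1)     # assumed pooled representation
        return self.classifier(pooled)


# Stand-in for a frozen multimodal encoder: any module that maps
# (batch, seq, dim) -> (batch, seq, dim). Purely illustrative.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=2,
)
model = SharedPromptTuner(backbone, embed_dim=512, prompt_len=16, num_labels=2)
logits = model(torch.randn(4, 32, 512), torch.randn(4, 50, 512))  # shape (4, 2)
```

In this setup an optimizer would only receive the parameters with requires_grad=True (the shared prompt and the classifier), which is what makes the approach parameter-efficient relative to full fine-tuning.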
Metadata
| Item Type: | Proceedings Paper |
| --- | --- |
| Authors/Creators: | Peng, B., Wu, C., He, W. et al. (4 more authors) |
| Editors: | Cheema, G.S., Hakimov, S., Kastner, M.A. and Garcia, N. |
| Copyright, Publisher and Additional Information: | © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). (https://creativecommons.org/licenses/by/4.0) |
| Keywords: | Large visual language model; parameter efficient training; matrix decomposition; bias analysis |
| Dates: | Published: 2023 |
| Institution: | The University of Sheffield |
| Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield) |
| Depositing User: | Symplectic Sheffield |
| Date Deposited: | 23 May 2025 10:09 |
| Last Modified: | 25 May 2025 08:09 |
| Status: | Published |
| Publisher: | CEUR Workshop Proceedings |
| Series Name: | CEUR Workshop Proceedings |
| Refereed: | Yes |
| Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:227046 |