Peng, B., Wu, C., He, W. orcid.org/0009-0000-8005-062X et al. (4 more authors) (2023) FLYPE: Multitask prompt tuning for multimodal human understanding of social media. In: Cheema, G.S., Hakimov, S., Kastner, M.A. and Garcia, N., (eds.) MUWS 2023: Multimodal Human Understanding for the Web and Social Media 2023: Proceedings of the 2nd International Workshop on Multimodal Human Understanding for the Web and Social Media co-located with the 32nd ACM International Conference on Information and Knowledge Management. 2nd International Workshop on Multimodal Human Understanding for the Web and Social Media co-located with the 32nd ACM International Conference on Information and Knowledge Management (CIKM 2023), 22 Oct 2023, Birmingham, United Kingdom. CEUR Workshop Proceedings, 3566. CEUR Workshop Proceedings, pp. 18-33.
Abstract
Large-scale pretraining and instruction tuning have facilitated general-purpose visual language understanding with broad competence. Social media processing stands to benefit greatly from large visual language models, because messages are conveyed through joint reasoning over text and images. Although vision-language pretraining has been widely studied, meta vision-language tuning remains under-explored. Given the ubiquity of visual content in social media, adapting pretrained Visual Language Models (VLMs) to meta social science is essential to avoid extra computational expense on hyper-parameter search. This paper takes inspiration from cognitive studies to intrinsically and efficiently integrate cross-modal reasoning into a method named FLYPE, which was the runner-up in CheckThat! 2023 Task 1A. FLYPE integrates the visual and text components of multiple tasks with cross-task shared prompts that guide a frozen VLM to act as a meta classifier for unseen tasks. We evaluate our model across six social visual language understanding tasks and perform an ablation study on several modifications to the architecture. Our empirical study shows the competitive performance and training efficiency of the method. Soft prompts can redirect biased pretrained attention to focus on more task-related visual content. We release improved benchmarks with our model at https://github.com/pengbohua/Flype.
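To make the core idea of the abstract concrete, the sketch below illustrates cross-task shared soft-prompt tuning with a frozen vision-language backbone: only a small prompt matrix and a lightweight classification head are trained. This is a minimal illustration under assumed names (SharedPromptTuner, the stand-in backbone, and all tensor shapes are hypothetical), not the released FLYPE implementation from the linked repository.

```python
# Minimal sketch of cross-task shared soft-prompt tuning with a frozen
# vision-language backbone. All module and parameter names are illustrative
# assumptions, not the authors' released code.
import torch
import torch.nn as nn


class SharedPromptTuner(nn.Module):
    def __init__(self, frozen_vlm: nn.Module, embed_dim: int = 512,
                 prompt_len: int = 16, num_labels: int = 2):
        super().__init__()
        self.vlm = frozen_vlm
        for p in self.vlm.parameters():          # keep the backbone frozen
            p.requires_grad = False
        # Soft prompts shared across tasks: the only trainable parameters
        # besides the small classification head.
        self.shared_prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)
        self.classifier = nn.Linear(embed_dim, num_labels)

    def forward(self, text_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
        # text_emb: (batch, text_len, dim); image_emb: (batch, img_len, dim)
        batch = text_emb.size(0)
        prompt = self.shared_prompt.unsqueeze(0).expand(batch, -1, -1)
        # Prepend the shared prompt so the frozen encoder attends jointly
        # over prompt, text, and image tokens.
        fused = torch.cat([prompt, text_emb, image_emb], dim=1)
        pooled = self.vlm(fused).mean(dim=1)     # assumed pooled representation
        return self.classifier(pooled)


# Stand-in for a frozen multimodal encoder: any module that maps
# (batch, seq, dim) -> (batch, seq, dim). Purely illustrative.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=2,
)
model = SharedPromptTuner(backbone, embed_dim=512, prompt_len=16, num_labels=2)
logits = model(torch.randn(4, 32, 512), torch.randn(4, 50, 512))  # shape (4, 2)
```

In this setup an optimizer would only receive the parameters with requires_grad=True (the shared prompt and the classifier), which is what makes the approach parameter-efficient relative to full fine-tuning.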
Metadata
| Item Type: | Proceedings Paper |
| --- | --- |
| Authors/Creators: | Peng, B., Wu, C., He, W. et al. (4 more authors) |
| Editors: | Cheema, G.S., Hakimov, S., Kastner, M.A. and Garcia, N. |
| Copyright, Publisher and Additional Information: | © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). (https://creativecommons.org/licenses/by/4.0) |
| Keywords: | Large visual language model; parameter efficient training; matrix decomposition; bias analysis |
| Dates: | Published: 2023 |
| Institution: | The University of Sheffield |
| Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield) |
| Depositing User: | Symplectic Sheffield |
| Date Deposited: | 23 May 2025 10:09 |
| Last Modified: | 25 May 2025 08:09 |
| Status: | Published |
| Publisher: | CEUR Workshop Proceedings |
| Series Name: | CEUR Workshop Proceedings |
| Refereed: | Yes |
| Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:227046 |