Exploring vision language models for multimodal and multilingual stance detection.

This is a preprint and may not have undergone formal peer review.

Abstract

Social media's global reach amplifies the spread of information, highlighting the need for robust Natural Language Processing tasks like stance detection across languages and modalities. Prior research predominantly focuses on text-only inputs, leaving multimodal scenarios, such as those involving both images and text, relatively underexplored. Meanwhile, the prevalence of multimodal posts has increased significantly in recent years. Although state-of-the-art Vision-Language Models (VLMs) show promise, their performance on multimodal and multilingual stance detection tasks remains largely unexamined. This paper evaluates state-of-the-art VLMs on a newly extended dataset covering seven languages and multimodal inputs, investigating their use of visual cues, language-specific performance, and cross-modality interactions. Our results show that VLMs generally rely more on text than on images for stance detection, and this trend persists across languages. Additionally, VLMs rely significantly more on text contained within the images than on other visual content. Regarding multilinguality, the models studied tend to generate consistent predictions across languages whether they are explicitly multilingual or not, although there are outliers that are incongruous with macro F1, language support, and model size.

Metadata

Item Type:	Preprint
Authors/Creators:	Vasilakes, J.; Scarton, C.; Zhao, Z. https://orcid.org/0000-0002-3060-269X
Copyright, Publisher and Additional Information:	© 2025 The Author(s). This preprint is made available under a Creative Commons Attribution-ShareAlike 4.0 International License. (http://creativecommons.org/licenses/by-sa/4.0/)
Dates:	Submitted: 29 January 2025
Institution:	The University of Sheffield
Academic Units:	The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield)
Funding Information:	Funder: European Media and Information Fund; Grant number: Unspecified
Depositing User:	Symplectic Sheffield
Date Deposited:	14 Mar 2025 11:57
Last Modified:	14 Mar 2025 11:57
Status:	Submitted
Identification Number:	10.48550/arXiv.2501.17654
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:224408

Download

Preprint

Filename: 2501.17654v1.pdf

Licence: CC-BY-SA 4.0
