Beyond text: leveraging vision-language models for misinformation detection

Grewal, P.K., Ernst, M. and Hopfgartner, F. orcid.org/0000-0003-0380-6088 (2025) Beyond text: leveraging vision-language models for misinformation detection. In: DHOW '25: Proceedings of the 2nd International Workshop on Diffusion of Harmful Content on Online Web. 2nd International Workshop on Diffusion of Harmful Content on Online Web, 27-28 Oct 2025, Dublin, Ireland. ACM, pp. 75-83. ISBN: 9798400720574.

Abstract

The swift expansion of social media and digital platforms has fueled the spread of misinformation, including disinformation and propaganda intentionally designed to deceive the public and shape public opinion on critical issues. Like any other information today, malicious content is not limited to text but is enriched with other media, including images and videos. This diversity poses significant challenges for detection, since standard techniques that treat each data type separately are usually ineffective in multimodal scenarios. Addressing this challenge requires advanced detection methods capable of analyzing multimodal content, such as text and images. This study explores the effectiveness of advanced multimodal frameworks, including CLIP, ViLT, and FLAVA, in detecting misinformation using the Fakeddit dataset, a widely used benchmark for multimodal research. Leveraging pre-trained models, this research evaluates the performance of these methods. In addition to textual and visual inputs, we incorporate metadata features to enhance the models’ performance. The results demonstrate these models’ potential to enhance the robustness and accuracy of misinformation detection, thereby countering the increasing sophistication of multimodal disinformation campaigns.

Metadata

Item Type:	Proceedings Paper
Authors/Creators:	Grewal, P.K.; Ernst, M.; Hopfgartner, F. (https://orcid.org/0000-0003-0380-6088)
Copyright, Publisher and Additional Information:	© 2025 Copyright held by the owner/author(s). This work is licensed under a Creative Commons Attribution 4.0 International License. https://creativecommons.org/licenses/by/4.0
Keywords:	Multimodal misinformation detection; multimodal LLMs; Disinformation; Misinformation; Vision-Language models; Metadata; Fake news
Dates:	Accepted: 13 August 2025; Published (online): 26 October 2025; Published: 26 October 2025
Institution:	The University of Sheffield
Academic Units:	The University of Sheffield > Faculty of Social Sciences (Sheffield) > Information School (Sheffield)
Date Deposited:	05 Sep 2025 13:59
Last Modified:	27 Oct 2025 10:05
Status:	Published
Publisher:	ACM
Refereed:	Yes
Identification Number:	10.1145/3746275.3762205
Related URLs:	Conference
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:230820

Download

Published Version

Filename: main.pdf

Licence: CC-BY 4.0
