Mitigating content effects on reasoning in language models through fine-grained activation steering

Valentino, M. orcid.org/0000-0002-9959-8385, Kim, G., Dalal, D. et al. (2 more authors) (2026) Mitigating content effects on reasoning in language models through fine-grained activation steering. In: Proceedings of the AAAI Conference on Artificial Intelligence. 40th AAAI Conference on Artificial Intelligence, 20-27 Jan 2026, Singapore. Vol. 40 (39). Association for the Advancement of Artificial Intelligence (AAAI), pp. 33314-33322. ISSN: 2159-5399. EISSN: 2374-3468.

Abstract

Large language models (LLMs) exhibit reasoning biases, often conflating content plausibility with formal logical validity. This can lead to wrong inferences in critical domains, where plausible arguments are incorrectly deemed logically valid or vice versa. This paper investigates how content biases on reasoning can be mitigated through activation steering, an inference-time technique that modulates internal activations. Specifically, after localising the layers responsible for formal and plausible inference, we investigate activation steering on a controlled syllogistic reasoning task, designed to disentangle formal validity from content plausibility. An extensive empirical analysis reveals that contrastive steering methods consistently support linear control over content biases. However, a static approach is insufficient to debias all the tested models. We then investigate how to control content effects by dynamically determining the steering parameters through fine-grained conditional methods. By introducing a novel kNN-based conditional approach (K-CAST), we demonstrate that conditional steering can effectively reduce biases on unresponsive models, achieving up to 15% absolute improvement in formal reasoning accuracy. Finally, we find that steering for content effects is robust to prompt variations, incurs minimal side effects on multilingual language modeling capabilities, and can partially generalize to different reasoning tasks. In practice, we demonstrate that activation-level interventions offer a scalable inference-time strategy for enhancing the robustness of LLMs, contributing towards more systematic and unbiased reasoning capabilities.
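
As a rough illustration of the ideas named in the abstract, the following is a minimal sketch of contrastive steering (a difference-of-means vector added to a hidden state) combined with a simple kNN gate that decides whether to steer at all. It is not the paper's K-CAST implementation: the function names, the toy two-dimensional activations, the majority-vote rule, and the choice of `k` and `alpha` are all assumptions made for illustration.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_vector(target_acts, biased_acts):
    """Contrastive direction: mean(target) - mean(biased). Hypothetical helper."""
    t, b = mean_vector(target_acts), mean_vector(biased_acts)
    return [ti - bi for ti, bi in zip(t, b)]

def steer(hidden, vector, alpha):
    """Shift a hidden state along the steering direction with strength alpha."""
    return [h + alpha * v for h, v in zip(hidden, vector)]

def knn_alpha(hidden, ref_acts, ref_needs_steering, k=3, alpha=1.0):
    """Assumed gating rule: steer only if most of the k nearest
    reference activations are flagged as needing steering."""
    order = sorted(range(len(ref_acts)),
                   key=lambda i: math.dist(hidden, ref_acts[i]))
    votes = [ref_needs_steering[i] for i in order[:k]]
    return alpha if sum(votes) * 2 > len(votes) else 0.0

# Toy activations (illustrative values only, not real model states).
target = [[1.0, 0.0], [3.0, 0.0]]   # e.g. states on formally correct answers
biased = [[0.0, 1.0], [0.0, 3.0]]   # e.g. states on content-biased answers
v = steering_vector(target, biased)

refs = target + biased
flags = [0, 0, 1, 1]                # 1 = reference example needed steering

h = [0.0, 2.0]                      # a new hidden state near the biased cluster
a = knn_alpha(h, refs, flags, k=3, alpha=0.5)
steered = steer(h, v, a)
```

In a real model the vectors would be residual-stream activations collected at a chosen layer, and the steering strength would be tuned per model; the static variant corresponds to using a fixed `alpha` for every input, while the conditional variant chooses it per input as `knn_alpha` does here.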

Metadata

Item Type:	Proceedings Paper
Authors/Creators:	Valentino, M. (https://orcid.org/0000-0002-9959-8385); Kim, G.; Dalal, D.; Zhao, Z.; Freitas, A.
Copyright, Publisher and Additional Information:	© 2026 The Authors. Except as otherwise noted, this author-accepted version of a conference paper published in Proceedings of the AAAI Conference on Artificial Intelligence is made available via the University of Sheffield Research Publications and Copyright Policy under the terms of the Creative Commons Attribution 4.0 International License (CC-BY 4.0), which permits unrestricted use, distribution and reproduction in any medium, provided the original work is properly cited. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/
Keywords:	Information and Computing Sciences; Artificial Intelligence
Dates:	Published (online): 14 March 2026; Published: 14 March 2026
Institution:	The University of Sheffield
Academic Units:	The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield)
Date Deposited:	17 Apr 2026 10:19
Last Modified:	17 Apr 2026 13:16
Status:	Published
Publisher:	Association for the Advancement of Artificial Intelligence (AAAI)
Refereed:	Yes
Identification Number:	10.1609/aaai.v40i39.40617
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:240161
