Unified policy value decomposition for rapid adaptation

This is a preprint and may not have undergone formal peer review

Abstract

Rapid adaptation in complex control systems remains a central challenge in reinforcement learning. We introduce a framework in which policy and value functions share a low-dimensional coefficient vector, a goal embedding, that captures task identity and enables immediate adaptation to novel tasks without retraining representations. During pretraining, we jointly learn structured value bases and compatible policy bases through a bilinear actor-critic decomposition. The critic factorizes as Q(s,a,g) = sum_k G_k(g) y_k(s,a), where the G_k(g) are the components of a goal-conditioned coefficient vector and the y_k(s,a) are learned value basis functions. This multiplicative gating, in which a context signal scales a set of state-dependent bases, is reminiscent of gain modulation observed in Layer 5 pyramidal neurons, where top-down inputs modulate the gain of sensory-driven responses without altering their tuning. Building on Successor Features, we extend the decomposition to the actor, which composes a set of primitive policies weighted by the same coefficients G_k(g). At test time the bases are frozen and G_k(g) is estimated zero-shot via a single forward pass, so no gradient update is needed to adapt to a novel task. We train a Soft Actor-Critic agent on the MuJoCo Ant environment under a multi-directional locomotion objective, requiring the agent to walk in eight directions specified as continuous goal vectors. The bilinear structure allows each policy head to specialize to a subset of directions, while the shared coefficient layer generalizes across them, accommodating novel directions by interpolating in goal embedding space. Our results suggest that shared low-dimensional goal embeddings offer a general mechanism for rapid, structured adaptation in high-dimensional control, and highlight a potentially biologically plausible principle for efficient transfer in complex reinforcement learning systems.
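
The bilinear decomposition described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the random linear maps stand in for trained networks, all names and sizes (K, W_goal, W_value, W_policy, the toy dimensions) are hypothetical, and combining primitive policies as a weighted sum of their mean actions is an assumption made here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes chosen for illustration; these are not the Ant dimensions.
K = 4           # number of value / policy bases (hypothetical)
GOAL_DIM = 2    # e.g. a 2-D heading vector
STATE_DIM = 6
ACTION_DIM = 3

# Random linear maps stand in for the pretrained (then frozen) networks.
W_goal = rng.normal(size=(K, GOAL_DIM))
W_value = rng.normal(size=(K, STATE_DIM + ACTION_DIM))
W_policy = rng.normal(size=(K, ACTION_DIM, STATE_DIM))


def goal_coefficients(g):
    """G(g): map a goal vector to K coefficients in one forward pass."""
    return np.tanh(W_goal @ g)


def value_bases(s, a):
    """y(s, a): K goal-independent value basis functions."""
    return W_value @ np.concatenate([s, a])


def q_value(s, a, g):
    """Q(s, a, g) = sum_k G_k(g) * y_k(s, a)."""
    return float(goal_coefficients(g) @ value_bases(s, a))


def policy_mean(s, g):
    """Mean action: primitive outputs weighted by the same G_k(g).

    Assumption: primitives are mixed as a weighted sum of their means.
    """
    primitives = np.tanh(W_policy @ s)         # shape (K, ACTION_DIM)
    return goal_coefficients(g) @ primitives   # shape (ACTION_DIM,)


# A new heading only changes the coefficients; the bases stay fixed.
s = rng.normal(size=STATE_DIM)
a = rng.normal(size=ACTION_DIM)
g_new = np.array([np.cos(0.3), np.sin(0.3)])
print(q_value(s, a, g_new), policy_mean(s, g_new))
```

In this sketch, adapting to a new goal amounts to re-evaluating goal_coefficients, which mirrors the abstract's claim that no gradient update is needed at test time.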

Metadata

Item Type:	Preprint
Authors/Creators:	Capone, C.; Falorsi, L.; Ciardiello, A.; Manneschi, L.
Copyright, Publisher and Additional Information:	© 2026 The Author(s). For reuse permissions, please contact the Author(s).
Dates:	Submitted: 18 March 2026
Institution:	The University of Sheffield
Academic Units:	The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield)
Date Deposited:	11 May 2026 09:29
Last Modified:	11 May 2026 09:29
Status:	Submitted
Identification Number:	10.48550/arXiv.2603.17947
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:240963
