Capone, C., Falorsi, L., Ciardiello, A. et al. (1 more author) (Submitted: 2026) Unified policy value decomposition for rapid adaptation. [Preprint - arXiv] (Submitted)
Abstract
Rapid adaptation in complex control systems remains a central challenge in reinforcement learning. We introduce a framework in which policy and value functions share a low-dimensional coefficient vector - a goal embedding - that captures task identity and enables immediate adaptation to novel tasks without retraining representations. During pretraining, we jointly learn structured value bases and compatible policy bases through a bilinear actor-critic decomposition. The critic factorizes as Q = sum_k G_k(g) y_k(s,a), where G_k(g) is a goal-conditioned coefficient vector and y_k(s,a) are learned value basis functions. This multiplicative gating - where a context signal scales a set of state-dependent bases - is reminiscent of gain modulation observed in Layer 5 pyramidal neurons, where top-down inputs modulate the gain of sensory-driven responses without altering their tuning. Building on Successor Features, we extend the decomposition to the actor, which composes a set of primitive policies weighted by the same coefficients G_k(g). At test time the bases are frozen and G_k(g) is estimated zero-shot via a single forward pass, enabling immediate adaptation to novel tasks without any gradient update. We train a Soft Actor-Critic agent on the MuJoCo Ant environment under a multi-directional locomotion objective, requiring the agent to walk in eight directions specified as continuous goal vectors. The bilinear structure allows each policy head to specialize to a subset of directions, while the shared coefficient layer generalizes across them, accommodating novel directions by interpolating in goal embedding space. Our results suggest that shared low-dimensional goal embeddings offer a general mechanism for rapid, structured adaptation in high-dimensional control, and highlight a potentially biologically plausible principle for efficient transfer in complex reinforcement learning systems.
Metadata
| Item Type: | Preprint |
|---|---|
| Authors/Creators: |
|
| Copyright, Publisher and Additional Information: | © 2026 The Author(s). For reuse permissions, please contact the Author(s). |
| Dates: |
|
| Institution: | The University of Sheffield |
| Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield) |
| Date Deposited: | 11 May 2026 09:29 |
| Last Modified: | 11 May 2026 09:29 |
| Status: | Submitted |
| Identification Number: | 10.48550/arXiv.2603.17947 |
| Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:240963 |
Download
Filename: 2603.17947v1.pdf
CORE (COnnecting REpositories)
CORE (COnnecting REpositories)