Dishonesty in Helpful and Harmless Alignment

This is a preprint and may not have undergone formal peer review

Abstract

People tell lies when seeking rewards. Large language models (LLMs) are aligned to human values with reinforcement learning, in which they receive rewards when they satisfy human preferences. We find that this also induces dishonesty in helpful and harmless alignment, where LLMs tell lies when generating harmless responses. Using the latest interpretability tools, we detect dishonesty, show how LLMs can become harmful if their honesty is increased, and analyze such conflicts at the parameter level. Given these preliminaries and the hypothesis that reward-seeking stimulates dishonesty, we show theoretically that dishonesty can in turn decrease alignment performance, and we augment reward-seeking alignment with representation regularization. Extensive results, including GPT-4-annotated win rates, perplexities, and case studies, demonstrate that we can train more honest, helpful, and harmless LLMs. We will open-source all our code and results upon this paper's acceptance.

Metadata

Item Type:	Preprint
Authors/Creators:	Huang, Y., Tang, J., Feng, D., Zhang, Z., Lei, W., Lv, J. and Cohn, A.G. (https://orcid.org/0000-0002-7652-8907)
Copyright, Publisher and Additional Information:	This item is protected by copyright. This is an open access preprint under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike Licence (CC BY-NC-SA 4.0).
Dates:	Published: 4 June 2024
Institution:	The University of Leeds
Academic Units:	The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds) > Artificial Intelligence; The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds)
Funding Information:	Funder: Alan Turing Institute; Grant number: Not Known
Depositing User:	Symplectic Publications
Date Deposited:	14 Aug 2024 11:04
Last Modified:	14 Aug 2024 11:04
Identification Number:	10.48550/arXiv.2406.01931
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:216121

Download

Preprint

Filename: 2406.01931v2.pdf

Licence: CC-BY-SA 4.0

External copy

https://doi.org/10.48550/arXiv.2406.01931
