Dishonesty in Helpful and Harmless Alignment

This is a preprint and may not have undergone formal peer review.

Huang, Y., Tang, J., Feng, D. et al. (4 more authors) (2024) Dishonesty in Helpful and Harmless Alignment. [Preprint - arXiv CoRR]

Metadata

Item Type: Preprint
Copyright, Publisher and Additional Information: This item is protected by copyright. This is an open access preprint under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike Licence (CC BY-NC-SA 4.0).

Dates:
  • Published: 4 June 2024
Institution: The University of Leeds
Academic Units:
  • The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds) > Artificial Intelligence
  • The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds)
Funding Information:
  • Funder: Alan Turing Institute
  • Grant number: Not Known
Depositing User: Symplectic Publications
Date Deposited: 14 Aug 2024 11:04
Last Modified: 14 Aug 2024 11:04
Identification Number: 10.48550/arXiv.2406.01931
