Adapting chat language models using only target unlabeled language data

This is the latest version of this eprint.

Abstract

Vocabulary expansion (VE) is the de facto approach to language adaptation of large language models (LLMs) by adding new tokens and continuing pre-training on target data. While this is effective for base models trained on unlabeled data, it poses challenges for chat models trained to follow instructions through labeled conversation data. Directly adapting the latter with VE on target unlabeled data may result in forgetting chat abilities. While ideal, target chat data is often unavailable or costly to create for low-resource languages, and machine-translated alternatives are not always effective. To address this issue, previous work proposed using a base and chat model from the same family. This method first adapts the base LLM with VE on target unlabeled data and then converts it to a chat model by adding a chat vector (CV) derived from the weight difference between the source base and chat models. We propose ElChat, a new language adaptation method for chat LLMs that adapts a chat model directly on target unlabeled data, without a base model. It elicits chat abilities by injecting information from the source chat model. ElChat offers more robust and competitive target language and safety performance while achieving superior English, chat, and instruction-following abilities compared to CV.
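
The chat vector (CV) baseline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and it does not cover ElChat itself, whose mechanism the abstract does not spell out. Model weights are represented here as plain dictionaries of flattened float lists, and the function names `chat_vector` and `apply_chat_vector` are hypothetical. One further assumption is that any parameter whose size changed during vocabulary expansion (for example, an embedding matrix) is left untouched, since the weight difference is not defined for it.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flattened values.
Weights = Dict[str, List[float]]


def chat_vector(source_base: Weights, source_chat: Weights) -> Weights:
    """Per-parameter difference: source chat weights minus source base weights."""
    return {
        name: [c - b for c, b in zip(source_chat[name], source_base[name])]
        for name in source_base
    }


def apply_chat_vector(adapted_base: Weights, cv: Weights) -> Weights:
    """Add the chat vector to a base model already adapted to the target language.

    Assumption (not from the source): parameters missing from the chat vector,
    or whose size differs after vocabulary expansion, are copied unchanged.
    """
    merged: Weights = {}
    for name, values in adapted_base.items():
        delta = cv.get(name)
        if delta is None or len(delta) != len(values):
            merged[name] = list(values)
        else:
            merged[name] = [v + d for v, d in zip(values, delta)]
    return merged


# Toy example with made-up numbers.
source_base = {"layer.weight": [1.0, 2.0], "embed.weight": [0.0, 0.0]}
source_chat = {"layer.weight": [1.5, 2.5], "embed.weight": [1.0, 1.0]}
# After VE, the embedding has one extra entry for a new target-language token.
adapted_base = {"layer.weight": [3.0, 4.0], "embed.weight": [0.0, 0.0, 0.0]}

cv = chat_vector(source_base, source_chat)
adapted_chat = apply_chat_vector(adapted_base, cv)
```

In this toy run, `layer.weight` is shifted by the chat vector, while the resized `embed.weight` is carried over as-is. How resized parameters should really be handled is a design decision that the abstract does not address.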

Metadata

Item Type:	Article
Authors/Creators:	Yamaguchi, A. (https://orcid.org/0000-0001-8327-7598); Morishita, T.; Villavicencio, A.; Aletras, N.
Copyright, Publisher and Additional Information:	© 2025 The Authors. This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Dates:	Accepted: 29 September 2025; Published (online): 12 October 2025; Published: 12 October 2025
Institution:	The University of Sheffield
Academic Units:	The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield)
Funding Information:	Engineering and Physical Sciences Research Council (grant number 2894795)
Date Deposited:	04 Nov 2025 14:44
Last Modified:	04 Nov 2025 15:09
Published Version:	https://openreview.net/forum?id=6IdoIKowfe
Status:	Published
Publisher:	Journal of Machine Learning Research Inc.
Refereed:	Yes
Related URLs:	Dataset
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:233968

Available Versions of this Item

Adapting chat language models using only target unlabeled language data. (deposited 04 Nov 2025 14:34)
- Adapting chat language models using only target unlabeled language data. (deposited 04 Nov 2025 14:44) [Currently Displayed]

Download

Published Version

Filename: 4876_Adapting_Chat_Language_Mo.pdf

Licence: CC-BY 4.0

