How can we effectively expand the vocabulary of LLMs with 0.01GB of target language text?

This is a preprint and may not have undergone formal peer review.

Yamaguchi, A. orcid.org/0000-0001-8327-7598, Villavicencio, A. orcid.org/0000-0002-3731-9168 and Aletras, N. orcid.org/0000-0003-4285-1965 (2024) How can we effectively expand the vocabulary of LLMs with 0.01GB of target language text? [Preprint - arXiv] (Submitted)

Metadata

Item Type: Preprint
Authors/Creators: Yamaguchi, A., Villavicencio, A. and Aletras, N.
Copyright, Publisher and Additional Information: © 2024 The Author(s). For reuse permissions, please contact the Author(s).
Keywords: Information and Computing Sciences; Language, Communication and Culture; Language Studies; Linguistics
Dates:
  • Submitted: 17 June 2024
Institution: The University of Sheffield
Academic Units: The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield)
Funding Information:
  • Funder: Engineering and Physical Sciences Research Council; Grant number: 2894795
Date Deposited: 04 Nov 2025 15:07
Last Modified: 04 Nov 2025 15:07
Status: Submitted
Identification Number (DOI): 10.48550/arXiv.2406.11477
