How can we effectively expand the vocabulary of LLMs with 0.01GB of target language text?

This is a preprint and may not have undergone formal peer review.

Yamaguchi, A. orcid.org/0000-0001-8327-7598, Villavicencio, A. orcid.org/0000-0002-3731-9168 and Aletras, N. orcid.org/0000-0003-4285-1965 (2024) How can we effectively expand the vocabulary of LLMs with 0.01GB of target language text? [Preprint - arXiv] (Submitted)

Metadata

Item Type: Preprint
Authors/Creators: Yamaguchi, A., Villavicencio, A. and Aletras, N.
Copyright, Publisher and Additional Information: © 2024 The Author(s). For reuse permissions, please contact the Author(s).
Keywords: Information and Computing Sciences; Language, Communication and Culture; Language Studies; Linguistics
Dates:
  • Submitted: 17 June 2024
Institution: The University of Sheffield
Academic Units: The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield)
Funding Information:
  • Funder: Engineering and Physical Sciences Research Council; Grant number: 2894795
Date Deposited: 04 Nov 2025 15:07
Last Modified: 04 Nov 2025 15:07
Status: Submitted
Identification Number (DOI): 10.48550/arXiv.2406.11477
