How can we effectively expand the vocabulary of LLMs with 0.01GB of target language text?


Yamaguchi, A. (orcid.org/0000-0001-8327-7598), Villavicencio, A. and Aletras, N. (2025) How can we effectively expand the vocabulary of LLMs with 0.01GB of target language text? Computational Linguistics. ISSN 0891-2017

Abstract

Metadata

Item Type: Article
Authors/Creators: Yamaguchi, A., Villavicencio, A. and Aletras, N.
Copyright, Publisher and Additional Information:

© 2025 The Authors. Except as otherwise noted, this author-accepted version of a journal article published in Computational Linguistics is made available via the University of Sheffield Research Publications and Copyright Policy under the terms of the Creative Commons Attribution 4.0 International License (CC-BY 4.0), which permits unrestricted use, distribution and reproduction in any medium, provided the original work is properly cited. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/

© 2025 Association for Computational Linguistics Published under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (https://creativecommons.org/licenses/by-nc-nd/4.0/) license

Dates:
  • Accepted: 24 October 2025
  • Published (online): 30 November 2025
  • Published: 30 November 2025
Institution: The University of Sheffield
Academic Units: The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield)
Funding Information:
  • Funder: Engineering and Physical Sciences Research Council
  • Grant number: 2894795
Date Deposited: 07 Nov 2025 10:05
Last Modified: 02 Dec 2025 17:00
Status: Published online
Publisher: The MIT Press
Refereed: Yes
Identification Number (DOI): 10.1162/COLI.a.581
