The number of pretrained Large Language Models (LLMs) is increasing steadily, though the majority are designed predominantly for the English language. While state-of-the-art LLMs can handle other languages, due to language contamination or some degree of multilingual pretraining data, they are not optimized for non-English languages, which leads to inefficient encoding (high token "fertility") and slower inference speed. In this work, we thoroughly compare a variety of vocabulary adaptation techniques for optimizing English LLMs for the Italian language, and put forward Semantic Alignment Vocabulary Adaptation (SAVA), a novel method that leverages neural mapping for vocabulary substitution. SAVA achieves competitive performance across multiple downstream tasks, enhancing grounded alignment strategies. We adapt two LLMs: Mistral-7B-v0.1, reducing token fertility by 25%, and Llama-3.1-8B, optimizing the vocabulary and reducing the number of parameters by 1 billion. We show that, following vocabulary adaptation, these models can recover their performance with a relatively limited stage of continual training on the target language. Finally, we test the capabilities of the adapted models on a variety of multiple-choice and generative tasks.
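The abstract only names the core idea behind SAVA, a "neural mapping for vocabulary substitution". Purely as an illustration of what such a mapping could look like, the sketch below (Python/NumPy, with hypothetical function and variable names not taken from the paper) fits a least-squares linear map between two embedding spaces on tokens shared by both vocabularies and uses it to initialize embeddings for new target-language tokens; the actual SAVA procedure is specified in the paper and may differ.

```python
# Illustrative sketch only (hypothetical names, not the paper's implementation):
# align two embedding spaces on shared tokens with a least-squares linear map,
# then use that map to initialize embeddings for tokens of the new vocabulary
# that are missing from the source LLM's vocabulary.
import numpy as np

def learn_alignment(helper_emb, source_emb, shared_helper_ids, shared_source_ids):
    """Fit a linear map W from the helper embedding space to the source LLM
    embedding space, using tokens that both vocabularies share."""
    X = helper_emb[shared_helper_ids]          # (n_shared, d_helper)
    Y = source_emb[shared_source_ids]          # (n_shared, d_source)
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)  # (d_helper, d_source)
    return W

def init_target_embeddings(helper_emb, source_emb, W,
                           target_vocab, source_vocab, helper_vocab):
    """Build an embedding matrix for the target-language vocabulary:
    reuse rows for tokens already in the source vocabulary, otherwise
    project the helper embedding of the token into the source space."""
    d_source = source_emb.shape[1]
    new_emb = np.zeros((len(target_vocab), d_source), dtype=source_emb.dtype)
    for i, tok in enumerate(target_vocab):
        if tok in source_vocab:                   # token kept from the source vocab
            new_emb[i] = source_emb[source_vocab[tok]]
        elif tok in helper_vocab:                 # new token: map from helper space
            new_emb[i] = helper_emb[helper_vocab[tok]] @ W
        else:                                     # fallback: mean source embedding
            new_emb[i] = source_emb.mean(axis=0)
    return new_emb
```

After such an initialization step, the abstract notes that a relatively limited stage of continual training on the target language is enough for the adapted models to recover their performance.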
Publication details
2025, Findings of the Association for Computational Linguistics: NAACL 2025, Albuquerque, New Mexico, USA, April 29 - May 4, 2025, Pages 6646-6660
Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation (04b Conference paper in proceedings volume)
Moroni Luca, Puccetti Giovanni, Huguet Cabot Pere-Lluis, Bejgu Andrei Stefan, Miaschi Alessio, Barba Edoardo, Dell'Orletta Felice, Esuli Andrea, Navigli Roberto