Llama-eus-8B
11 September 2024

New neural model for artificial intelligence in Basque

  • Orai has developed Llama-eus-8B: a new neural model for artificial intelligence systems that require the comprehension and generation of written Basque.
  • The model will be used to develop applications such as chatbots, machine translators, grammar checkers, search engines and content generation systems.

The research aiming to drive forward AI is forging ahead at great speed. Natural language processing poses unique obstacles for resource-poor languages, which have neither the volume of text and data nor the computational resources needed to advance at the pace of mainstream languages. Strategies therefore need to be found to adapt the AI tools used in mainstream languages to Basque, and in this quest the Basque research community is making great strides.

Orai NLP Teknologiak, Elhuyar's AI centre, has developed Llama-eus-8B, a new neural language model (LLM, Large Language Model) designed to facilitate the development of AI systems that require the comprehension and generation of written Basque. This is a foundational model, i.e. the type of model used as the basis for generative AI systems such as the well-known chatbots. Llama-eus-8B is the most advanced model for Basque among the foundational models regarded as light (fewer than 10 billion parameters).

Llama-eus-8B is distributed freely, which will facilitate the development and research of technologies in Basque in both academic and industrial environments. This model has been developed within the BasqueLLM research project, partly funded by the Chartered Provincial Council of Gipuzkoa through the Gipuzkoa Network Programme for Science, Technology and Innovation.

The model, together with a technical description of its development and evaluation, can be downloaded here: https://huggingface.co/orai-nlp/Llama-eus-8B
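The model card on that page documents the intended usage; purely as a hedged sketch (assuming the standard Hugging Face `transformers` API, which is the usual toolchain for models published there; exact parameters may differ from the model card), loading the model and generating text could look like this:

```python
# Sketch: loading Llama-eus-8B with the Hugging Face transformers library.
# The model identifier comes from the article; everything else is a generic
# transformers usage pattern, not Orai's documented recipe.

model_id = "orai-nlp/Llama-eus-8B"

def generate(prompt: str, max_new_tokens: int = 64) -> str:
    # Import deferred so the sketch can be read without transformers installed;
    # an 8B model also needs substantial memory (roughly 16 GB in fp16).
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

As a base (foundational) model rather than an instruction-tuned chatbot, it continues a given text rather than answering questions directly.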

Orai will use Llama-eus-8B as a base to develop applications for tasks such as the grammar correction of texts, content generation, creation of educational materials, search engines, chatbots and machine translation; all of them are tasks that require deep linguistic knowledge of the Basque language.

According to Xabier Saralegi, senior researcher in the BasqueLLM project, “we are currently experimenting with alternative training strategies in order to improve results without requiring larger collections of texts in Basque: strategies to improve the transfer of skills learnt in English to Basque”.

Transferring the skills learnt from millions of texts in English to the Basque language

To develop Llama-eus-8B, Meta's most recently published model, Llama3.1-8B, an open-source model with 8 billion parameters, was used as the base. This neural language model was generated by means of machine learning algorithms using a huge collection of texts (15 trillion tokens) mostly in English; it proved to be very efficient in English (and in other mainstream languages) when automating tasks requiring linguistic capabilities (machine translation, automatic summarisation, creative writing, dialogue systems, etc.). However, its performance in Basque is very limited.

Given the dearth of large collections of texts in Basque and the heavy computational requirements of training a model of these characteristics from scratch, “we started from a solid base such as Llama3.1-8B. By using machine learning algorithms and a corpus of texts in Basque, the strategy involved transferring to Basque the skills learned from millions of texts in English”, explained Xabier Saralegi, Orai’s head of language technologies.

The ZelaiHandi corpus was used for this purpose; compiled by Orai a few months ago, it contains freely licensed, high-quality content exclusively in Basque, and is the largest freely licensed dataset in Basque currently available. To improve the transfer of skills between English and Basque, the ZelaiHandi texts were combined with texts in English. That way “we managed to ensure that the model maintained its knowledge in English while improving its comprehension of Basque by efficiently reusing what it had initially learnt in English”, added Orai researcher Ander Corral. The model was trained on the Hyperion system at the supercomputing centre of the Donostia International Physics Center (DIPC).
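The technical report on the model page details the actual training recipe; purely as an illustration of the idea of interleaving Basque and English text during continued pretraining (the function, the corpus lists, and the 80/20 sampling ratio below are hypothetical, not the ratio actually used for Llama-eus-8B), batch construction could be sketched as:

```python
import random

def mix_corpora(basque_docs, english_docs, basque_ratio=0.8, n=1000, seed=0):
    """Sample a training stream that interleaves two corpora.

    Simplified illustration of combining a Basque corpus (e.g. ZelaiHandi)
    with English text so the model retains its English knowledge while
    learning Basque. Real pipelines mix at the token level with carefully
    tuned ratios; this sketch samples whole documents with replacement.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    stream = []
    for _ in range(n):
        if rng.random() < basque_ratio:
            stream.append(rng.choice(basque_docs))
        else:
            stream.append(rng.choice(english_docs))
    return stream
```

The design intuition, as described in the article, is that keeping some English in the mixture prevents the model from forgetting the skills it is meant to transfer to Basque.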

The model was evaluated on an extensive test set covering 11 tasks requiring not only formal language skills (correct use of grammar and vocabulary) but also functional ones (ability to understand and use the language in real contexts): school exams, problem solving, questionnaires on different subjects, analysis of opinions, etc.

The results of the evaluation indicate that Llama-eus-8B achieves the best performance among the currently available light foundational models (fewer than 10 billion parameters), which makes it a valuable resource for the development of AI systems requiring language skills in Basque. In some tasks it offers competitive results compared with much larger models. Even so, although the results are getting ever closer to those achieved in English, performance in Basque remains significantly below that in English.

(Image: Wes Cockx & Google DeepMind / Better Images of AI / AI large language models / CC-BY 4.0)
