Training AI to handle knowledge about the Basque Country
Large language models (LLMs) have radically changed language processing, and they can now produce and understand text with great fluency. However, integrating low-resource languages and their cultures remains a major challenge, as in the case of the Basque language (Euskara) and Basque culture. When asked, for example, about the impact that the dictator Franco had on Basque, the best-known generative language models assert that he made the language official, when in fact Basque was severely persecuted. They also assert that the Basque writer Itxaro Borda was born in Pau rather than Bayonne.
In her Master's dissertation submitted at the University of the Basque Country (UPV/EHU), Oihane Cantero, a researcher at Orai, analysed various methodologies for integrating knowledge about the Basque Country (Euskal Herria) into language models.
Cantero's work aimed to provide language models with knowledge about the Basque Country and to evaluate that knowledge. To that end, she created EHQA, a dataset of multiple-choice questions for evaluating factual knowledge about the Basque Country, and proposed a semi-automatic methodology for creating datasets of this type. Knowledge about the Basque Country was then incorporated into the LLMs using various techniques, such as continual pre-training, knowledge editing, and retrieval-augmented generation (RAG).
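Evaluation on a multiple-choice dataset of this kind usually comes down to counting how often the model picks the correct option. The following Python sketch illustrates that idea only. The `MCQuestion` fields, the two sample items (built from the examples mentioned earlier in this article, with made-up distractors) and the `predict` callback are assumptions for illustration; they are not taken from EHQA or from the dissertation's code.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class MCQuestion:
    """One multiple-choice item (hypothetical schema, not EHQA's)."""
    question: str
    choices: Sequence[str]
    answer_index: int  # position of the correct choice


# Illustrative items only; the distractors are invented for this sketch.
SAMPLE = [
    MCQuestion(
        "What happened to the Basque language under Franco?",
        ["It was made official", "It was severely persecuted", "It was unaffected"],
        1,
    ),
    MCQuestion(
        "Where was the writer Itxaro Borda born?",
        ["Pau", "Bayonne", "Bilbao"],
        1,
    ),
]


def accuracy(
    questions: Sequence[MCQuestion],
    predict: Callable[[MCQuestion], int],
) -> float:
    """Fraction of questions for which predict() returns the correct index."""
    if not questions:
        raise ValueError("need at least one question")
    correct = sum(1 for q in questions if predict(q) == q.answer_index)
    return correct / len(questions)


def always_first(q: MCQuestion) -> int:
    """Trivial baseline standing in for a model: always pick the first choice."""
    return 0


if __name__ == "__main__":
    print(f"baseline accuracy: {accuracy(SAMPLE, always_first):.2f}")
```

In practice `predict` would wrap a call to a language model and map its output to one of the choice indices; the scoring loop itself stays the same.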
The results revealed a remarkable improvement in the model's ability to generate and understand knowledge in Basque: accuracy rose from 33% to 88% with the editing techniques, and to 71% with RAG. Cantero thus succeeded in providing the language models with factual knowledge about the Basque Country, albeit with certain limitations: “Editing techniques produce some side effects that can negatively affect other capabilities of the model. Additionally, with RAG, the knowledge is not integrated into the model itself, limiting its use to a few tasks, such as answering questions,” the Orai researcher pointed out.
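The point about RAG keeping knowledge outside the model can be made concrete with a toy Python sketch: relevant text is looked up at question time and pasted into the prompt, while the model's weights stay unchanged. Everything here is an assumption for illustration. The word-overlap retriever, the two passages and the prompt layout are stand-ins, and the article does not describe the retriever or prompt format actually used in the dissertation.

```python
import re
from typing import List, Sequence, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query: str, passages: Sequence[str], k: int = 1) -> List[str]:
    """Return the k passages sharing the most words with the query.

    Plain lexical overlap is a stand-in for a real retriever.
    """
    query_tokens = tokenize(query)
    ranked = sorted(
        passages,
        key=lambda p: len(query_tokens & tokenize(p)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, passages: Sequence[str], k: int = 1) -> str:
    """Prepend the retrieved passages to the question (hypothetical layout)."""
    context = "\n".join(retrieve(query, passages, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Illustrative knowledge base built from the facts stated in this article.
PASSAGES = [
    "Under the dictator Franco, the Basque language was severely persecuted.",
    "Itxaro Borda is a Basque writer born in Bayonne.",
]

if __name__ == "__main__":
    print(build_prompt("Where was Itxaro Borda born?", PASSAGES))
```

Because the facts live in `PASSAGES` rather than in the model, they help only when a prompt is built this way, which is consistent with the limitation described in the quote above.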
Orai has made the dataset produced in Cantero's Master's dissertation available to the research community. Designed to test knowledge about the Basque Country, it can be used by the scientific community to advance the integration of low-resource cultures into large language models: https://huggingface.co/datasets/orai-nlp/EHQA
Oihane Cantero was awarded the top mark for her Master's dissertation. Her supervisors were Zuhaitz Beloki (Orai), Xabier Saralegi (Orai), and Gorka Azkune (UPV/EHU).