
First prize in the EvalLLM text-based information extraction challenge

Figure 1: A sample of text from the challenge, submitted by the French Ministerial Agency for AI and Defense (Source: https://evalllm2024.sciencesconf.org, from https://cf2r.org/wp-content/uploads/2021/10/Renseignor1203.pdf)
The CEA won the EvalLLM 2024 challenge, organized by the French Ministerial Agency for AI and Defense (AMIAD) in May 2024. The challenge was to extract information from texts written in French using limited training data. Our researchers demonstrated that it is possible for encoder-type language models of limited size to significantly outperform very large generative language models.

The focus of the EvalLLM 2024 challenge was to evaluate neural-network-based models for named-entity recognition (NER), a cornerstone task of information extraction. This evaluation was carried out in a few-shot setting, in French, where far less annotated data is available than in English. In real-world information extraction, it is very rare to have enough annotated data to adapt traditional supervised learning models. For this reason, participants had only four newsletters and a blog post to work with.

The most pressing question was whether large generative language models are better than much smaller encoder-type language models in few-shot settings of this kind. To find out, the CEA researchers used two models: GoLLIE (Sainz et al., 2024) and GLiNER (Zaratiana et al., 2024). GoLLIE, based on the Code-LLaMA 13B generative model, turns named-entity recognition into a code-generation task via prompts similar to the one in Figure 2:

Figure 2: Prompt for the GoLLIE model
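To make the code-generation framing concrete, here is a minimal sketch of a GoLLIE-style prompt: entity guidelines are expressed as Python dataclasses with docstrings, and the generative model is asked to complete a `result` list of instantiated annotations. The class names, docstrings, and example sentence are illustrative assumptions, not the challenge's actual schema.

```python
# Hedged sketch of a GoLLIE-style prompt (illustrative schema, not the
# challenge's actual guidelines).
from dataclasses import dataclass

@dataclass
class Organization:
    """A company, institution, or agency mentioned in the text."""
    span: str  # the exact text mention of the organization

@dataclass
class Location:
    """A geographical or political entity such as a city or country."""
    span: str  # the exact text mention of the location

text = "Le CEA a remporté le challenge EvalLLM 2024 à Toulouse."

# The model receives the guidelines and the text as a code prompt, and is
# expected to generate the completion of the following line:
result = [Organization(span="CEA"), Location(span="Toulouse")]
```

The appeal of this framing is that the guidelines double as a type system: the generated output can be parsed and validated as Python code.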


GLiNER, on the other hand, is based on BERT-type encoder models and learns to map representations of candidate entity mentions to representations of possible entity types, which serve as prompts, as seen in Figure 3.

Figure 3: GLiNER model
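The matching mechanism behind GLiNER can be sketched as follows: candidate spans and entity-type prompts are encoded into the same vector space, and each (span, type) pair is scored by the compatibility of their representations. In this toy version, random vectors stand in for the representations a trained BERT-type encoder would produce; the spans, types, and scoring function are illustrative assumptions.

```python
# Hedged sketch of GLiNER's span/type matching (random vectors stand in for
# learned encoder representations, so the predicted labels are arbitrary).
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Stand-ins for encoder outputs: one vector per candidate span and per
# entity-type prompt.
span_reprs = {"CEA": rng.normal(size=dim), "Toulouse": rng.normal(size=dim)}
type_reprs = {"organization": rng.normal(size=dim), "location": rng.normal(size=dim)}

def score(span_vec, type_vec):
    """Dot-product compatibility between a span and an entity type."""
    return float(span_vec @ type_vec)

# Each candidate span is assigned its best-matching entity type (the real
# model also applies a score threshold to reject non-entities).
predictions = {
    span: max(type_reprs, key=lambda t: score(vec, type_reprs[t]))
    for span, vec in span_reprs.items()
}
print(predictions)
```

Because the entity types are given as prompts rather than fixed output classes, the same trained model can be pointed at new type inventories without retraining, which is what makes the few-shot setting tractable.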

Both models are pre-trained at scale on English-language datasets unrelated to the target domain of the evaluation. GoLLIE's training data was manually annotated; GLiNER's was automatically generated using ChatGPT.

GLiNER clearly outperformed GoLLIE on the evaluation task, with relative gains of 62% in macro-F1 and 118.6% in micro-F1.
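The two scores measure different things: macro-F1 averages the per-entity-type F1 scores equally, while micro-F1 pools all entity mentions, so frequent types dominate it. The following sketch illustrates the difference with invented counts (not the challenge's results).

```python
# Hedged illustration of macro- vs micro-averaged F1 for NER, with
# invented per-type counts (not the challenge's actual results).
def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Per-entity-type counts: (true positives, false positives, false negatives).
counts = {"organization": (90, 10, 10), "location": (2, 3, 8)}

# Macro: average the per-type F1 scores (each type weighs the same).
macro_f1 = sum(f1(*c) for c in counts.values()) / len(counts)

# Micro: pool the counts across types, then compute a single F1.
micro_f1 = f1(*(sum(v) for v in zip(*counts.values())))

print(round(macro_f1, 3), round(micro_f1, 3))  # → 0.583 0.856
```

Here the rare "location" type drags macro-F1 down while barely affecting micro-F1, which is why reporting both gives a fuller picture of few-shot performance.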

It also outperformed the much larger GPT-4o model used by the second-place team in the challenge (Figure 4).


Figure 4: Evaluation results (higher is better for both indicators)

As part of ongoing EU-funded projects, we are investigating how to apply the results of this competition to safety data. And as part of the Sharp project on frugal AI, under the French national AI initiative, we are looking for ways to improve the few-shot performance of the GLiNER model.

TIAM (Text-Image Alignment Metric): measuring visual generative AI performance

When it comes to AI image generation, impressive results can be achieved with text-conditional models like Stable Diffusion. However, their capacity to take into account text-based instructions is not always evaluated precisely. We developed a new metric (the Text-Image Alignment Metric, or TIAM) that is based on two key elements: the controlled generation of text-based prompts and image analysis.

The goal is to determine how well a model adheres to the number of objects or the color specified in a prompt. We studied the six most commonly used existing models to demonstrate their still-limited ability to follow a prompt that involves more than one object; this ability is even more limited when the prompt includes a color.

TIAM is a useful tool for studying the influence of AI training parameters on the comprehension of and compliance with prompts. It also paves the way toward new research on noise mining: techniques for identifying the "right" noises with a view to improving AI output.
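The controlled-generation side of the approach can be sketched as follows: prompts are built from a fixed template so that the expected objects, counts, and colors are known in advance and can later be checked against each generated image by an object detector. The template and vocabulary here are illustrative assumptions, not TIAM's actual ones.

```python
# Hedged sketch of controlled prompt generation for text-image alignment
# evaluation (illustrative template and vocabulary, not TIAM's actual ones).
from itertools import product

objects = ["cat", "car", "chair"]
colors = ["red", "blue"]
counts = [1, 2]

prompts = []
for n, color, obj in product(counts, colors, objects):
    noun = obj if n == 1 else obj + "s"
    prompt = f"a photo of {n} {color} {noun}"
    # The expected content is stored alongside the prompt, so a detector can
    # later verify each generated image against it.
    prompts.append({"prompt": prompt, "object": obj, "count": n, "color": color})

print(len(prompts))  # 2 counts x 2 colors x 3 objects = 12 prompts
```

Because every prompt carries its own ground truth, alignment can be scored automatically over many images per prompt, without human annotation of the generated outputs.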


Major projects

The CEA is contributing its research on few-shot named-entity recognition models to the EU VANGUARD, ARIEN, and STARLIGHT projects, which address the field of security. Ultimately, these models could be used to scan text on social media for signs of criminal activity.

Flagship publication

"CEA-List@EvalLLM2024 : prompter un très grand modèle de langue ou affiner un plus petit ?", Robin Armingaud, Arthur Peuvot, Romaric Besançon, Olivier Ferret, Sondes Souihi, and Julien Tourille. EvalLLM2024 : Atelier sur l'évaluation des modèles génératifs (LLM) et challenge d'extraction d'information few-shot, Toulouse, France.

Contributors to this article:

  • Robin Armingaud, Research Engineer
  • Arthur Peuvot, Research Engineer
  • Romaric Besançon, Research Engineer, Expert
  • Olivier Ferret, Research Director, Fellow
  • Sondes Souihi, Research Engineer
  • Julien Tourille, Research Engineer

See also

Research programs

Responsible AI

CEA-List is rolling out an ambitious research program to support the responsible development of AI-based systems for industry and society.
Read more