Evalita-LLM is a new benchmark designed to evaluate Large Language Models (LLMs) on Italian tasks. Its distinguishing features are: (i) all tasks are native Italian, which avoids translation issues and potential cultural biases; (ii) in addition to well-established multiple-choice tasks, the benchmark includes generative tasks, enabling more natural interaction with LLMs; (iii) all tasks are evaluated against multiple prompts, which mitigates model sensitivity to specific prompts and allows a fairer, more objective evaluation.
@misc{magnini2025evalitallmbenchmarkinglargelanguage,
title={Evalita-LLM: Benchmarking Large Language Models on Italian},
author={Bernardo Magnini and Roberto Zanoli and Michele Resta and Martin Cimmino and Paolo Albano and Marco Madeddu and Viviana Patti},
year={2025},
eprint={2502.02289},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.02289},
}

- evalita-mp: All tasks (perplexity and non-perplexity based).
- evalita-mp_gen: Only generative tasks.
- evalita-mp_mc: Only perplexity-based tasks.

The following Evalita-LLM tasks can also be evaluated in isolation:
- evalita-mp_te: Textual Entailment
- evalita-mp_sa: Sentiment Analysis
- evalita-mp_wic: Word in Context
- evalita-mp_hs: Hate Speech Detection
- evalita-mp_at: Admission Tests
- evalita-mp_faq: FAQ
- evalita-mp_sum_fp: Summarization
- evalita-mp_ls: Lexical Substitution
- evalita-mp_ner_group: Named Entity Recognition
- evalita-mp_re: Relation Extraction

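To run a few of these tasks together rather than the whole `evalita-mp` group, one option is to pass a comma-separated list to `--tasks`. The sketch below only assembles such a command; it does not call `lm_eval` itself. The helper name `build_lm_eval_command` and the constant `EVALITA_TASKS` are hypothetical and not part of the harness; the model, device, and batch size defaults are simply the values used in this README's usage command.

```python
import shlex

# Task names copied from the list above.
EVALITA_TASKS = [
    "evalita-mp_te",
    "evalita-mp_sa",
    "evalita-mp_wic",
    "evalita-mp_hs",
    "evalita-mp_at",
    "evalita-mp_faq",
    "evalita-mp_sum_fp",
    "evalita-mp_ls",
    "evalita-mp_ner_group",
    "evalita-mp_re",
]


def build_lm_eval_command(
    tasks,
    pretrained="meta-llama/Llama-2-7b-hf",
    device="cuda:0",
    batch_size="auto",
):
    """Return an argv list for lm_eval (hypothetical helper, not a harness API)."""
    # Reject names that are not in the task list to catch typos early.
    unknown = [t for t in tasks if t not in EVALITA_TASKS]
    if unknown:
        raise ValueError(f"Unknown Evalita-LLM task(s): {unknown}")
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={pretrained}",
        "--tasks", ",".join(tasks),  # comma-separated task list
        "--device", device,
        "--batch_size", batch_size,
    ]


if __name__ == "__main__":
    # Print the command so it can be inspected or pasted into a shell.
    print(shlex.join(build_lm_eval_command(["evalita-mp_te", "evalita-mp_sa"])))
```

The returned list can also be handed to `subprocess.run` unchanged, which avoids shell quoting issues if a model path contains unusual characters.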
lm_eval --model hf --model_args pretrained=meta-llama/Llama-2-7b-hf --tasks evalita-mp --device cuda:0 --batch_size auto

- v0.1: Renamed the evalita-mp_sum_fp small variant group to evalita-mp_sum_fp_small to resolve a duplicate group name conflict with the full variant.

- Is the task an existing benchmark in the literature?
  - Have you referenced the original paper that introduced the task?
  - If yes, does the original paper provide a reference implementation?
    - Yes, original implementation contributed by author of the benchmark

If other tasks on this dataset are already supported:
- Is the "Main" variant of this task clearly denoted?
- Have you provided a short sentence in a README on what each new variant adds / evaluates?
- Have you noted which, if any, published evaluation setups are matched by this variant?