
Evalita-LLM

Paper

Evalita-LLM is a new benchmark designed to evaluate Large Language Models (LLMs) on Italian tasks. The distinguishing and innovative features of Evalita-LLM are the following: (i) all tasks are native Italian, avoiding issues of translating from English and potential cultural biases; (ii) in addition to well-established multiple-choice tasks, the benchmark includes generative tasks, enabling more natural interaction with LLMs; (iii) all tasks are evaluated against multiple prompts, thereby mitigating model sensitivity to specific prompts and allowing a fairer and more objective evaluation.

Citation

@misc{magnini2025evalitallmbenchmarkinglargelanguage,
      title={Evalita-LLM: Benchmarking Large Language Models on Italian},
      author={Bernardo Magnini and Roberto Zanoli and Michele Resta and Martin Cimmino and Paolo Albano and Marco Madeddu and Viviana Patti},
      year={2025},
      eprint={2502.02289},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.02289},
}

Groups

  • evalita-mp: All tasks (both perplexity-based and generative).
  • evalita-mp_gen: Only generative tasks.
  • evalita-mp_mc: Only multiple-choice (perplexity-based) tasks.

Tasks

The following Evalita-LLM tasks can also be evaluated in isolation (an example invocation is given after the list):

  • evalita-mp_te: Textual Entailment
  • evalita-mp_sa: Sentiment Analysis
  • evalita-mp_wic: Word in Context
  • evalita-mp_hs: Hate Speech Detection
  • evalita-mp_at: Admission Tests
  • evalita-mp_faq: FAQ
  • evalita-mp_sum_fp: Summarization
  • evalita-mp_ls: Lexical Substitution
  • evalita-mp_ner_group: Named Entity Recognition
  • evalita-mp_re: Relation Extraction
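
Individual tasks are selected through the same command-line interface as the full suite; as a sketch, the following invocation (reusing the example model from the Usage section below) evaluates only Sentiment Analysis:

lm_eval --model hf --model_args pretrained=meta-llama/Llama-2-7b-hf --tasks evalita-mp_sa --device cuda:0 --batch_size auto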

Usage

lm_eval --model hf --model_args pretrained=meta-llama/Llama-2-7b-hf --tasks evalita-mp --device cuda:0 --batch_size auto
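
Several tasks or groups can be combined in one run by passing a comma-separated list to --tasks, and results can optionally be written to disk with --output_path. The sketch below assumes flag names from recent lm-evaluation-harness releases; check lm_eval --help for the version you have installed:

lm_eval --model hf --model_args pretrained=meta-llama/Llama-2-7b-hf --tasks evalita-mp_te,evalita-mp_sa --device cuda:0 --batch_size auto --output_path results/evalita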

Changelog

  • v0.1: Renamed evalita-mp_sum_fp small variant group to evalita-mp_sum_fp_small to resolve duplicate group name conflict with the full variant.

Checklist

  • Is the task an existing benchmark in the literature?
    • Have you referenced the original paper that introduced the task?
    • If yes, does the original paper provide a reference implementation?
      • Yes, the original implementation was contributed by an author of the benchmark.

If other tasks on this dataset are already supported:

  • Is the "Main" variant of this task clearly denoted?
  • Have you provided a short sentence in a README on what each new variant adds / evaluates?
  • Have you noted which, if any, published evaluation setups are matched by this variant?