If you’ve been following Arabic LLM evaluation, you’ve probably noticed a pattern: more benchmarks pop up every month, but nobody’s really checking whether they measure what they claim to. The tension has been building for a while.
A team from TII (Technology Innovation Institute) decided to do something about it. They built QIMMA (قمّة, Arabic for “summit”), and what they found is sobering: even the most popular Arabic benchmarks have systematic quality issues that quietly corrupt evaluation results.
The Fragmented Mess of Arabic NLP Evaluation
Arabic is spoken by over 400 million people across dozens of dialects and cultural contexts. You’d think the evaluation landscape would be mature by now. It’s not.
Three big problems stand out:
Translation artifacts. A lot of Arabic benchmarks are just English benchmarks translated over. Questions that work fine in English become awkward or culturally nonsensical in Arabic. The distribution shifts, and suddenly you’re measuring how well a model handles bad translations, not how well it understands Arabic.
No quality checks. Even native Arabic benchmarks often ship without rigorous validation. Annotation inconsistencies, wrong gold answers, encoding errors, cultural bias in labels — all documented in established resources. It’s like publishing a math test without checking if the answer key is correct.
Reproducibility? What reproducibility? Evaluation scripts and per-sample outputs rarely get released publicly. Good luck auditing results or building on someone else’s work.
Existing leaderboards cover isolated tasks and narrow domains. OALL v1 and v2? Mixed language sources, no quality validation. BALSAM? Half native Arabic, no validation. AraGen? All native, but no validation either. You get the picture.
Of the leaderboards listed above, QIMMA is the only one that combines all five properties: open source, 99% native Arabic content, systematic quality validation, code evaluation, and public per-sample outputs. None of the others has any quality validation at all.
What’s Actually in QIMMA
The team consolidated 109 subsets from 14 source benchmarks into a unified evaluation suite of over 52,000 samples across 7 domains:
- Cultural: AraDiCE-Culture, ArabCulture, PalmX (multiple choice)
- STEM: ArabicMMLU, GAT, 3LM STEM (multiple choice)
- Legal: ArabLegalQA, MizanQA (multiple choice and QA)
- Medical: MedArabiQ, MedAraBench (multiple choice and QA)
- Safety: AraTrust (multiple choice)
- Poetry & Literature: FannOrFlop (QA)
- Coding: 3LM HumanEval+, 3LM MBPP+ (code generation)
99% of the content is native Arabic. The only exception is code evaluation, which is inherently language-agnostic. This is also the first Arabic leaderboard to include code evaluation: the team adapted HumanEval+ and MBPP+ with Arabic-language problem statements.
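If you want to sanity-check the numbers, the domain breakdown above fits in a small lookup table. This is just a sketch: the dictionary layout and the names `QIMMA_DOMAINS`, `count_benchmarks`, and `domain_of` are my own illustration, not QIMMA's actual configuration format.

```python
# Domain -> source benchmarks, copied from the list above.
# Hypothetical layout for illustration; not QIMMA's real config schema.
QIMMA_DOMAINS = {
    "Cultural": ["AraDiCE-Culture", "ArabCulture", "PalmX"],
    "STEM": ["ArabicMMLU", "GAT", "3LM STEM"],
    "Legal": ["ArabLegalQA", "MizanQA"],
    "Medical": ["MedArabiQ", "MedAraBench"],
    "Safety": ["AraTrust"],
    "Poetry & Literature": ["FannOrFlop"],
    "Coding": ["3LM HumanEval+", "3LM MBPP+"],
}


def count_benchmarks(domains):
    """Total number of source benchmarks across all domains."""
    return sum(len(names) for names in domains.values())


def domain_of(benchmark, domains=QIMMA_DOMAINS):
    """Return the domain a source benchmark belongs to."""
    for domain, names in domains.items():
        if benchmark in names:
            return domain
    raise KeyError(f"unknown benchmark: {benchmark}")


if __name__ == "__main__":
    print(len(QIMMA_DOMAINS), "domains")
    print(count_benchmarks(QIMMA_DOMAINS), "source benchmarks")
    print(domain_of("AraTrust"))
```

The counts line up with the headline figures: 7 domains and 14 source benchmarks.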
The Validation Pipeline: Where the Real Work Happens
This is the methodological core of QIMMA. Before running a single model, they applied a multi-stage validation pipeline to every sample in every benchmark.
Stage 1: Multi-Model Automated Assessment
Each sample got independently evaluated by two state-of-the-art LLMs: Qwen3-235B-A22B-Instruct and DeepSeek-V3-671B. They picked two models with strong Arabic capability but different training data compositions, so their combined judgment is more robust than either alone.
Each model scores a sample against a 10-point rubric, with a binary score (0 or 1) per criterion. A sample is flagged if either model scores it below 7/10. If both models flag it, the sample is dropped immediately. If only one model flags it, it goes to human review.
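The decision rule is easier to see as code. Here’s a minimal sketch, assuming each judge returns a list of ten 0/1 values. The names `Decision`, `rubric_total`, and `triage` are hypothetical, and the actual rubric criteria aren’t spelled out here, so the code treats them as anonymous slots.

```python
from enum import Enum

NUM_CRITERIA = 10    # 10-point rubric, one binary score per criterion
PASS_THRESHOLD = 7   # a total below 7/10 counts as a flag


class Decision(Enum):
    KEEP = "keep"
    DROP = "drop"
    HUMAN_REVIEW = "human_review"


def rubric_total(criteria):
    """Sum one judge's binary criterion scores after validating them."""
    if len(criteria) != NUM_CRITERIA:
        raise ValueError(f"expected {NUM_CRITERIA} criteria, got {len(criteria)}")
    if any(score not in (0, 1) for score in criteria):
        raise ValueError("criterion scores must be 0 or 1")
    return sum(criteria)


def triage(scores_a, scores_b):
    """Combine two judges' rubric scores into a single decision."""
    flagged_a = rubric_total(scores_a) < PASS_THRESHOLD
    flagged_b = rubric_total(scores_b) < PASS_THRESHOLD
    if flagged_a and flagged_b:
        return Decision.DROP          # both judges agree: drop immediately
    if flagged_a or flagged_b:
        return Decision.HUMAN_REVIEW  # judges disagree: escalate to a human
    return Decision.KEEP              # neither judge flagged the sample


if __name__ == "__main__":
    clean = [1] * 10
    borderline = [1] * 7 + [0] * 3   # 7/10 is not below the threshold
    weak = [1] * 6 + [0] * 4         # 6/10 is flagged
    print(triage(clean, borderline))
    print(triage(clean, weak))
    print(triage(weak, weak))
```

Note the asymmetry: a single flag never drops a sample on its own. It only escalates, which keeps one judge’s blind spot from silently deleting valid data.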
Stage 2: Human Annotation and Review
Flagged samples get reviewed by native Arabic speakers with cultural and dialectal familiarity. Human annotators make final calls on cultural context, regional variation, dialectal nuance, and subjective interpretation. For culturally sensitive content, multiple perspectives are considered — because “correctness” can genuinely vary across Arab regions.
What They Found: Systematic Quality Problems
The pipeline revealed recurring quality issues across benchmarks. Not isolated errors — systematic problems.
I don’t have the full breakdown of what they found (the original article was truncated), but the fact that they built this pipeline at all tells you the problem is real. I’ve seen similar issues in other language evaluation efforts, and it’s always worse than people expect.
Why This Matters
QIMMA isn’t just another leaderboard. It’s a methodological correction. The team is saying: we need to stop treating benchmarks as ground truth and start treating them as hypotheses that need validation.
If you’re building Arabic LLMs or evaluating them, this is worth paying attention to. The leaderboard, GitHub repo, and paper are all public. Go check your assumptions.