Cross-Family LLM-Judge Agreement for Institutional RAG

A 5-Family, 9-Judge LLM-as-Judge Ablation

Adisak Sukul · Iowa State University · LLM4Eval @ SIGIR 2027 (under review) · arXiv preprint forthcoming

TL;DR — Open-weight LLM judges win the pairwise-agreement race. Qwen 3.6 Plus ↔ Gemma 4 26B reach κ = 0.80, the highest pairwise quadratic-weighted Cohen's κ in our 9 × 9 matrix — exceeding every commercial within-family pair at ~1% of the cost.

Headline numbers

9 judges across 5 model families
570 institutional RAG pairs (ISU DSpace)
0.80 matrix-max κ (Qwen ↔ Gemma 4)
0.4941 9-judge ensemble κ vs NIST TREC RAG 2024 (95% CI [0.43, 0.56])
65.7% 9-judge precision on BEIR scifact
0.3447 9-judge ensemble κ on TREC-COVID (95% CI [0.24, 0.45])
0.4265 UMBRELA single-judge baseline (we beat it)
~$95 total paid-API spend (all P1-P3)

Four findings

9-judge κ matrix

Figure: 9-judge pairwise quadratic-weighted Cohen's κ heatmap.
Pairwise quadratic-weighted Cohen's κ on 570 ISU DSpace retrieval pairs. Every off-diagonal cell ≥ 0.56 (substantial or moderate by Landis-Koch). Two emergent calibration clusters: reasoning-generous (Sonnet, GPT-5.5, DSV4 Pro) and strict-mid + open-weight (GPT-4o, Opus 4.7, Qwen, Gemma 4). Calibration philosophy partitions judges more cleanly than provider lineage.
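To recompute the matrix from your own judge outputs, the statistic is plain quadratic-weighted Cohen's κ over paired ordinal labels. A minimal sketch with scikit-learn; the judge names and toy labels below are illustrative stand-ins, not our data:

# Minimal sketch: pairwise quadratic-weighted Cohen's kappa matrix from
# per-judge ordinal labels. Judge names and label values are toy stand-ins.
import numpy as np
from sklearn.metrics import cohen_kappa_score

labels = {                                # one label per retrieval pair, same order
    "qwen-3.6":    np.array([3, 2, 0, 1, 3, 2]),
    "gemma-4-26b": np.array([3, 2, 0, 2, 3, 2]),
    "sonnet-4.6":  np.array([3, 3, 1, 2, 3, 3]),
}

judges = list(labels)
kappa = np.eye(len(judges))
for i, a in enumerate(judges):
    for j, b in enumerate(judges):
        if i < j:
            k = cohen_kappa_score(labels[a], labels[b], weights="quadratic")
            kappa[i, j] = kappa[j, i] = k
print(np.round(kappa, 2))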

Why disclosure matters: a 1.9× spread

The same 570 retrieved documents yield wildly different aggregate verdicts depending only on which LLM judges them. nDCG@10 ranges from 0.45 (Gemini 3.1 Pro Preview) to 0.86 (Sonnet 4.6) — a 1.9× spread driven entirely by judge selection. A practitioner using Sonnet concludes the pipeline is excellent; using Gemini 3.1 Prev, the same pipeline appears broken.

Judge            Family        nDCG@10   P@5     MRR     Mean
Sonnet 4.6       Anthropic     0.862     0.649   0.826   1.68
GPT-5.5 (low)    OpenAI        0.846     0.600   0.792   1.63
DSV4 Pro         DeepSeek      0.835     0.702   0.883   1.63
GPT-4o           OpenAI        0.803     0.361   0.575   1.15
Qwen 3.6         Open-weight   0.758     0.379   0.599   1.18
Gemma 4 26B      Open-weight   0.753     0.375   0.573   1.11
Opus 4.7         Anthropic     0.749     0.372   0.536   1.09
Gemini 2.5 Pro   Google        0.624     0.281   0.486   0.76
Gemini 3.1 Prev  Google        0.454     0.158   0.315   0.37
Disclosure standard (C4). We propose the field adopt: "nDCG@10 = X.XX via [family]/[model]/[reasoning-config], N pairs, DATE" — the same way drug trials report dose and duration. Without it, "our retriever scores 0.86" is not a comparable measurement.
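As a concrete illustration of the proposed format, here is a small helper that renders the string; the function name, field order, and exact punctuation are ours, chosen to mirror the template above, not a fixed schema from the paper.

from datetime import date

def disclosure(metric: str, value: float, family: str, model: str,
               reasoning: str, n_pairs: int, run_date: date) -> str:
    # Render the proposed C4 disclosure string (illustrative helper).
    return (f"{metric} = {value:.2f} via {family}/{model}/{reasoning}, "
            f"{n_pairs} pairs, {run_date.isoformat()}")

print(disclosure("nDCG@10", 0.862, "anthropic", "sonnet-4.6",
                 "default", 570, date(2026, 4, 25)))
# nDCG@10 = 0.86 via anthropic/sonnet-4.6/default, 570 pairs, 2026-04-25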

Mechanism: joint-distribution structure

What predicts pairwise κ? Joint-distribution structure of the paired scores — specifically dispersion (variance product) and effective rank — explains 93% of κ variance (R² = 0.93). Structural factors (provider, reasoning mode, model class) are fully mediated: they affect κ only through the joint distribution they induce. The shared-tokenizer hypothesis is refuted: Qwen ↔ Gemma 4 vocabulary Jaccard = 0.066 (lowest in the slate) yet κ = 0.80 (matrix-highest) — convergence happens at the decision-making layer, not in lexical encoding.
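The two joint-distribution features can be computed directly from paired labels. The definitions below are our reading of "dispersion" and "effective rank" (the product of the two judges' label variances, and the exponentiated singular-value entropy of the joint contingency table); treat this as a sketch, not the paper's exact featurization:

# Sketch of the two joint-distribution features under assumed definitions:
# dispersion = var(a) * var(b); effective rank = exp(entropy of the
# normalized singular values) of the joint contingency table.
import numpy as np

def joint_features(a: np.ndarray, b: np.ndarray, n_levels: int = 4):
    joint = np.zeros((n_levels, n_levels))      # joint contingency table
    for x, y in zip(a, b):
        joint[x, y] += 1
    joint /= joint.sum()

    variance_product = a.var() * b.var()

    s = np.linalg.svd(joint, compute_uv=False)  # singular values
    p = s / s.sum()
    p = p[p > 0]
    effective_rank = np.exp(-(p * np.log(p)).sum())
    return variance_product, effective_rank

a = np.array([3, 2, 0, 1, 3, 2, 1, 0])          # toy paired labels
b = np.array([3, 2, 0, 2, 3, 2, 1, 1])
print(joint_features(a, b))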

Figure: Regression decomposition of pairwise-κ predictors. Joint-distribution variance product and effective rank dominate; family and reasoning mode are fully mediated.
Figure: Calibration mechanism (lenient vs strict clusters). Mean-score × agreement plane: two clusters, reasoning-generous vs strict-mid, partition judges more cleanly than provider family.

External validation against NIST human qrels

To address single-corpus exposure, we replicated the 9-judge slate on two public benchmarks with NIST/BEIR-curated relevance judgments.

TREC RAG 2024 (537 stratified-balanced pairs, 0–3 ordinal κ)

BEIR scifact (300 pairs, all-positive qrels → precision)

TREC-COVID, biomedical scientific literature (300 pairs, 0–2 qrels mapped to a 0/2/3 rubric, 9 judges)

Adding the open-weight judges broadens coverage but lowers ensemble κ more on biomedical (−0.10) than on web (−0.025 on TREC RAG 2024). Qwen and Gemma 4 individual κ on biomedical (0.32 / 0.27) trail their TREC RAG 2024 numbers (0.41 / 0.40), suggesting the headline open-weight tradeoff is content-domain dependent. Three independent public corpora — web (TREC RAG 2024), scientific fact-verification (BEIR scifact), and biomedical (TREC-COVID) — all return ensemble agreement above chance, landing in the moderate band on web and the fair-to-near-moderate band on biomedical. A sketch of the ensemble-κ computation follows.
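The ensemble κ and its interval can be reproduced with a short script. The aggregation rule (rounded per-pair median of judge labels) and the percentile bootstrap below are our assumptions, and the arrays are random toy data standing in for real judge outputs and NIST qrels:

# Sketch: ensemble-vs-qrels kappa with a bootstrap CI. Aggregation rule and
# bootstrap are assumptions; arrays are toy data, not the paper's outputs.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
judge_labels = rng.integers(0, 4, size=(9, 537))   # 9 judges x 537 pairs (toy)
qrels = rng.integers(0, 4, size=537)               # human labels (toy)

ensemble = np.round(np.median(judge_labels, axis=0)).astype(int)
point = cohen_kappa_score(ensemble, qrels, labels=[0, 1, 2, 3],
                          weights="quadratic")

boots = []
for _ in range(1000):
    idx = rng.integers(0, len(qrels), size=len(qrels))
    boots.append(cohen_kappa_score(ensemble[idx], qrels[idx],
                                   labels=[0, 1, 2, 3], weights="quadratic"))
lo, hi = np.percentile(boots, [2.5, 97.5])
print(f"ensemble kappa = {point:.3f}, 95% CI [{lo:.2f}, {hi:.2f}]")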

Coverage divergence is a content-domain reliability axis. On TREC RAG 2024 web passages, the Anthropic, OpenAI, Qwen, and Gemma judges return scores on 100% of pairs; Gemini 3.1 Prev covers 24% and Gemini 2.5 Pro 17% (thinking-mode parsing aborts on long passages). The always-works 6-judge subset (≥95% coverage on both corpora) is Anthropic + OpenAI + Qwen + Gemma — and the two cheapest judges in the slate are in it. A selection sketch follows.
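Coverage here means the share of pairs for which a judge returns a parseable score. One way the always-works subset could be selected is sketched below; the data layout, judge names, and 95% threshold are illustrative:

# Sketch: per-judge coverage (share of pairs with a parseable score) and the
# "always-works" subset. Layout, names, and values are toy placeholders.
import numpy as np

def always_works(scores_by_corpus, threshold=0.95):
    # scores_by_corpus[corpus][judge] = parsed scores, np.nan where parsing failed
    judges = next(iter(scores_by_corpus.values())).keys()
    keep = []
    for judge in judges:
        per_corpus = [np.mean(~np.isnan(scores[judge]))
                      for scores in scores_by_corpus.values()]
        if min(per_corpus) >= threshold:
            keep.append(judge)
    return keep

toy = {
    "trec-rag-2024": {"gpt-4o": np.array([2.0, 3.0, 1.0]),
                      "gemini-2.5-pro": np.array([np.nan, 2.0, np.nan])},
    "scifact":       {"gpt-4o": np.array([1.0, 0.0, 2.0]),
                      "gemini-2.5-pro": np.array([1.0, np.nan, np.nan])},
}
print(always_works(toy))   # ['gpt-4o']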

The validation pipeline replaces a per-institution IRB study (~$1,500–3,000, 6–12 weeks) with public NIST evidence (~$56, ~13 hours).

Reproduce

Setup

git clone https://github.com/asukul/RAG-Eval-LLM-Judge
cd RAG-Eval-LLM-Judge
pip install -r requirements.txt
cp .env.template .env       # add OPENAI_API_KEY + OPENROUTER_API_KEY

Within-corpus replication (your own collection)

py -3 src/eval_llm_judge.py \
    --collection <your-collection> \
    --judge-preset p4-frontier
# ~$18.30, ~5.5 h wall

External validation (TREC RAG 2024)

py -3 src/validate_against_trec.py \
    --corpus trec-rag-2024 \
    --judge-preset p4-frontier \
    --max-pairs 537
# ~$35, ~7.5 h wall

Regenerate the κ heatmap

py -3 src/make_kappa_heatmap.py \
    --input results/<your-multijudge-json> \
    --output figures/kappa_matrix_9judge.png

Model versions pinned 2026-04-25. Targeting SIGIR Artifact Badging (Functional/Reusable).

Cite

@inproceedings{sukul2027p4llmjudge,
  title    = {Cross-Family LLM-Judge Agreement for Institutional RAG:
              A 5-Family, 9-Judge Ablation},
  author   = {Sukul, Adisak},
  booktitle = {Proceedings of the LLM4Eval Workshop at SIGIR 2027},
  year     = {2027},
  note     = {arXiv preprint forthcoming},
  url      = {https://github.com/asukul/RAG-Eval-LLM-Judge}
}

See CITATION.cff for machine-readable citation metadata.