eval-benchmark-runner
name: eval-benchmark-runner
description: Use when running the automated daily evaluation suite that measures the legal AI system's output quality across all benchmark datasets. Orchestrates the full eval pipeline — loading datasets, calling the production model, scoring with LLM-as-judge rubrics, detecting regressions, and publishing results to the leaderboard and observability dashboards.
license: MIT
metadata:
id: eval.benchmark-runner
category: eval
priority: P0
intent: [eval, benchmark, quality, regression, ci]
related: [eval-llm-as-judge-system-prompt, eval-regression-detector, eval-leaderboard-updater, eval-dataset-nda-prompts-30, eval-dataset-employment-prompts-30, eval-dataset-adversarial-prompts, eval-dataset-multilingual-prompts]
source: Louis — HAQQ Legal AI (github.com/sboghossian/mini-claude-for-legal)
version: "1.0"
Benchmark Runner
When to use this
The benchmark runner is the automated quality gate for the legal AI system. It runs:
- Daily at 12:00 UTC — tracks quality trend over time.
- On every staging deployment — catches regressions before production.
- On-demand via Slack `/eval-run`, for ad-hoc quality checks after prompt-engineering changes.
Do not run it on production mid-traffic; use a staging endpoint or a dedicated eval tenant.
Inputs
| Input | Required | Notes |
|---|---|---|
| `model` | Yes | Production model slug (e.g., `claude-sonnet-4-5`) or experimental |
| `datasets` | Yes | Array of dataset IDs to include; defaults to all |
| `judgeModels` | Yes | Array of judge model slugs for ensemble scoring |
| `costBudget` | Yes | Max USD to spend on this run; aborts if exceeded |
| `runId` | Auto | UUID generated per run |
| `baselineRunId` | Optional | Previous run to compare against for regression detection |
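The inputs above could be carried through the pipeline as one validated object. The following is a minimal sketch, not the runner's actual configuration type: the class name `RunConfig`, the snake_case field names, and the validation rules are assumptions made for illustration.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical container for the runner inputs listed in the table."""
    model: str
    datasets: list[str]
    judge_models: list[str]
    cost_budget: float                      # max USD for the run
    baseline_run_id: Optional[str] = None   # optional comparison run
    # runId is "Auto": generate a fresh UUID when none is supplied.
    run_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def __post_init__(self) -> None:
        # Assumed sanity checks; the source does not specify them.
        if not self.judge_models:
            raise ValueError("at least one judge model is required")
        if self.cost_budget <= 0:
            raise ValueError("cost_budget must be a positive USD amount")
```

Making the object frozen keeps a run's parameters from changing between the steps that follow.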
Review methodology
Step 1 — Load datasets
Load all configured eval.dataset.* files from eval/datasets/*.jsonl. Each JSONL file contains records with:
{ "id": "nda-001", "prompt": "...", "category": "draft", "expected_signals": ["mutual", "confidential_info_defined", "governing_law"] }
Supported datasets: [[eval-dataset-nda-prompts-30]], [[eval-dataset-employment-prompts-30]], [[eval-dataset-real-estate-prompts-30]], [[eval-dataset-research-prompts-30]], [[eval-dataset-adversarial-prompts]], [[eval-dataset-multilingual-prompts]], [[eval-dataset-competitor-comparison-set]].
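A loader for these files might look like the sketch below. It assumes every record must carry the four fields shown in the example record (`id`, `prompt`, `category`, `expected_signals`); the function names `parse_jsonl` and `load_datasets` are hypothetical.

```python
import json
from pathlib import Path

# Assumption: all four fields from the example record are mandatory.
REQUIRED_FIELDS = {"id", "prompt", "category", "expected_signals"}


def parse_jsonl(text: str, source: str = "<memory>") -> list[dict]:
    """Parse JSONL text into records and check the required fields."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"{source}:{line_no} missing fields: {sorted(missing)}")
        records.append(record)
    return records


def load_datasets(root: Path = Path("eval/datasets")) -> dict[str, list[dict]]:
    """Load every *.jsonl file under root, keyed by file name without extension."""
    return {
        path.stem: parse_jsonl(path.read_text(encoding="utf-8"), source=str(path))
        for path in sorted(root.glob("*.jsonl"))
    }
```

Failing fast on a malformed record keeps a broken dataset from silently shrinking the benchmark.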
Step 2 — Run prompts against production model
For each prompt:
- Call the production `/chat` endpoint (not the LLM API directly, so that the full stack is tested).
- Record: `response_text`, `latency_ms`, `tokens_input`, `tokens_output`, `skills_routed`.
- Enforce a per-prompt timeout of 120 seconds; log any that exceed it.
- Track cumulative cost; abort if `costBudget` is exceeded.
Run prompts concurrently (max 5 at a time) to keep wall-clock time under 30 minutes.
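The concurrency cap, per-prompt timeout, and budget check can be sketched with `asyncio`. In this sketch `call_chat` is a hypothetical async callable standing in for the `/chat` client, and it is assumed to return a dict that includes a `cost_usd` field alongside the recorded fields; neither is specified by the source.

```python
import asyncio
import time

RECORDED_FIELDS = ("response_text", "tokens_input", "tokens_output", "skills_routed")


async def run_prompts(prompts, call_chat, cost_budget, timeout_s=120.0, max_concurrency=5):
    """Run prompts with bounded concurrency, a per-prompt timeout and a cost cap."""
    semaphore = asyncio.Semaphore(max_concurrency)
    state = {"spent_usd": 0.0}
    results = []

    async def run_one(prompt):
        async with semaphore:
            if state["spent_usd"] > cost_budget:
                return  # budget exceeded: skip prompts that have not started
            started = time.perf_counter()
            try:
                reply = await asyncio.wait_for(call_chat(prompt["prompt"]), timeout_s)
            except asyncio.TimeoutError:
                # Log the timeout as a result so it shows up in the report.
                results.append({"id": prompt["id"], "timed_out": True})
                return
            state["spent_usd"] += reply["cost_usd"]
            results.append({
                "id": prompt["id"],
                "timed_out": False,
                "latency_ms": (time.perf_counter() - started) * 1000,
                **{key: reply[key] for key in RECORDED_FIELDS},
            })

    await asyncio.gather(*(run_one(p) for p in prompts))
    return {
        "results": results,
        "spent_usd": state["spent_usd"],
        "aborted": state["spent_usd"] > cost_budget,
    }
```

Requests already in flight when the budget is crossed are allowed to finish; only prompts that have not started are skipped. A stricter abort would cancel the outstanding tasks instead.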
Step 3 — Score each response
For each response, invoke the [[eval-llm-as-judge-system-prompt]] with the following rubrics active:
- [[eval-rubric-legal-soundness]] (weight: 0.35)
- [[eval-rubric-citation-quality]] (weight: 0.20)
- [[eval-rubric-jurisdiction-awareness]] (weight: 0.20)
- [[eval-rubric-completeness]] (weight: 0.15)
- [[eval-rubric-hallucination-detection]] (weight: 0.10 — binary, auto-fail)
Use an ensemble of judges (e.g., GPT-4o + Claude Sonnet from a different provider + Gemini Pro). Average their scores. Flag any prompt where judge disagreement > 1.5 points for manual review.
Critical rule: never use the same model family as both the system under test and the judge. If testing Claude Sonnet, judges must include at least one non-Claude model.
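The ensemble scoring and the judge-family rule can be sketched as follows. Several details here are assumptions rather than part of the source: disagreement is measured as the spread between the highest and lowest per-judge weighted score, a hallucination flag from any judge forces the score to 0.0, and a model's family is guessed from the text before the first hyphen of its slug.

```python
from statistics import mean

RUBRIC_WEIGHTS = {
    "legal-soundness": 0.35,
    "citation-quality": 0.20,
    "jurisdiction-awareness": 0.20,
    "completeness": 0.15,
    "hallucination-detection": 0.10,
}
DISAGREEMENT_THRESHOLD = 1.5  # points


def model_family(slug: str) -> str:
    """Crude family guess (assumption): the text before the first hyphen."""
    return slug.split("-", 1)[0].lower()


def check_judges(model: str, judge_models: list[str]) -> None:
    """Require at least one judge outside the family of the model under test."""
    if all(model_family(j) == model_family(model) for j in judge_models):
        raise ValueError("judges must include a model from a different family")


def weighted_score(rubric_scores: dict[str, float]) -> float:
    """Combine one judge's rubric scores using the weights listed above."""
    return sum(rubric_scores[name] * w for name, w in RUBRIC_WEIGHTS.items())


def score_response(judgements: dict[str, dict]) -> dict:
    """judgements maps judge slug -> {"scores": {...}, "hallucinated": bool}."""
    per_judge = [weighted_score(j["scores"]) for j in judgements.values()]
    hallucinated = any(j["hallucinated"] for j in judgements.values())
    return {
        # Assumed auto-fail behaviour: a flagged hallucination zeroes the score.
        "score": 0.0 if hallucinated else mean(per_judge),
        "hallucinated": hallucinated,
        "needs_review": max(per_judge) - min(per_judge) > DISAGREEMENT_THRESHOLD,
    }
```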
Step 4 — Compute aggregate scores
Per dataset:
dataset_score = Σ(rubric_score × rubric_weight) / n_prompts
Global aggregate:
aggregate_score = Σ(dataset_score × dataset_weight) / Σ(dataset_weight)
Dataset weights (by business priority): adversarial=0.25, NDA=0.20, employment=0.20, multilingual=0.15, real-estate=0.10, research=0.10.
Also compute:
- `hallucination_rate` = count(hallucinated) / total_prompts
- `latency_p50`, `latency_p95`
- `cost_per_message` = total_cost / n_prompts
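A sketch of these computations follows. The aggregate is normalised by the total weight of the datasets present so that it stays on the rubric scale; that normalisation, the nearest-rank percentile method, and the per-result keys `hallucinated`, `latency_ms`, and `cost_usd` are assumptions of the sketch.

```python
import math

DATASET_WEIGHTS = {
    "adversarial": 0.25, "nda": 0.20, "employment": 0.20,
    "multilingual": 0.15, "real-estate": 0.10, "research": 0.10,
}


def dataset_score(prompt_scores: list[float]) -> float:
    """Mean of the per-prompt weighted rubric scores for one dataset."""
    return sum(prompt_scores) / len(prompt_scores)


def aggregate_score(dataset_scores: dict[str, float]) -> float:
    """Weighted mean over the datasets present, normalised by their total weight."""
    total_weight = sum(DATASET_WEIGHTS[name] for name in dataset_scores)
    weighted = sum(score * DATASET_WEIGHTS[name] for name, score in dataset_scores.items())
    return weighted / total_weight


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile for pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(results: list[dict]) -> dict:
    """Compute the run-level metrics from per-prompt results."""
    n = len(results)
    latencies = [r["latency_ms"] for r in results]
    return {
        "hallucination_rate": sum(1 for r in results if r["hallucinated"]) / n,
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p95_ms": percentile(latencies, 95),
        "cost_per_message_usd": sum(r["cost_usd"] for r in results) / n,
    }
```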
Step 5 — Detect regressions
Call [[eval-regression-detector]] with current and previous run scores. Regression triggers:
- Any rubric score drops > 5% vs previous run.
- Hallucination rate increases > 0.5%.
- Latency p95 increases > 20%.
- Cost-per-message increases > 15%.
On regression: Slack alert to #eng-quality, auto-create Linear ticket, block deployment promotion if a P0 rubric regressed.
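The trigger checks could be expressed as below. This is an illustrative sketch, not the logic of [[eval-regression-detector]]: it assumes the rubric, latency, and cost thresholds are relative changes against the baseline, that the hallucination threshold is an absolute change of 0.5 percentage points, and that each run is a dict with the keys shown.

```python
def relative_change(current: float, baseline: float) -> float:
    """Signed change relative to the baseline (0.10 means +10%)."""
    return (current - baseline) / baseline


def detect_regressions(current: dict, baseline: dict) -> list[str]:
    """Return the names of the triggers that fired, in a fixed order."""
    triggers = []

    # Any rubric score dropping by more than 5% (assumed relative).
    for rubric, base in baseline["rubric_scores"].items():
        if base > 0 and relative_change(current["rubric_scores"][rubric], base) < -0.05:
            triggers.append(f"rubric:{rubric}")

    # Hallucination rate rising by more than 0.5 percentage points (assumed absolute).
    if current["hallucination_rate"] - baseline["hallucination_rate"] > 0.005:
        triggers.append("hallucination_rate")

    # Latency p95 rising by more than 20%, cost per message by more than 15%.
    if relative_change(current["latency_p95_ms"], baseline["latency_p95_ms"]) > 0.20:
        triggers.append("latency_p95_ms")
    if relative_change(current["cost_per_message_usd"], baseline["cost_per_message_usd"]) > 0.15:
        triggers.append("cost_per_message_usd")

    return triggers
```

An empty list means no regression; a non-empty list can feed the alerting and deployment-blocking actions described above.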
Step 6 — Publish results
- Update [[eval-leaderboard-updater]] with scores.
- Write structured run report to Langfuse.
- Emit PostHog event `eval_run_completed` with aggregate score and regression flag.
Output format
{
  "runId": "uuid",
  "model": "claude-sonnet-4-5",
  "runAt": "2026-05-14T12:00:00Z",
  "datasets": {
    "nda-prompts-30": { "score": 4.1, "n": 30, "hallucinations": 0 },
    "adversarial-prompts": { "score": 4.7, "n": 30, "refusal_rate": 0.97 }
  },
  "aggregate_score": 4.2,
  "hallucination_rate": 0.003,
  "latency_p50_ms": 3200,
  "latency_p95_ms": 8100,
  "cost_per_message_usd": 0.0023,
  "regression": false,
  "top_failing_prompts": [
    { "id": "research-027", "score": 1.8, "issue": "fabricated statute number" }
  ]
}
Limits & escalation
- If `hallucination_rate` > 1%, halt deployment automatically regardless of other scores.
- Recalibrate judge prompts against a human gold-standard label set quarterly.
- Track trend over absolute score — a 4.1 improving steadily from 3.5 is healthier than a 4.5 declining from 4.8.
- Do not over-optimize for benchmark scores — ensure at least 10% of prompts in each dataset are novel and not visible to the model team.
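Two of the limits above lend themselves to a small sketch: the hard hallucination gate and the preference for trend over absolute score. The function names are hypothetical, and using a least-squares slope over recent aggregate scores as the trend measure is an assumption, not something the source prescribes.

```python
HALLUCINATION_HALT_THRESHOLD = 0.01  # 1%


def should_halt_deployment(hallucination_rate: float) -> bool:
    """Hard gate: halt when the rate exceeds 1%, whatever the other scores are."""
    return hallucination_rate > HALLUCINATION_HALT_THRESHOLD


def trend_slope(scores: list[float]) -> float:
    """Least-squares slope of score against run index (score change per run)."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two runs to compute a trend")
    x_mean = (n - 1) / 2
    y_mean = sum(scores) / n
    numerator = sum((i - x_mean) * (s - y_mean) for i, s in enumerate(scores))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator
```

Under this measure, a series rising from 3.5 to 4.1 has a positive slope and a series falling from 4.8 to 4.5 has a negative one, which matches the comparison in the list above.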
Related skills
- [[eval-llm-as-judge-system-prompt]] — the system prompt driving rubric scoring
- [[eval-regression-detector]] — detects quality drops across runs
- [[eval-leaderboard-updater]] — records scores to the trend dashboard
- [[eval-dataset-nda-prompts-30]] — NDA benchmark dataset
- [[eval-dataset-adversarial-prompts]] — safety and robustness dataset