eval-benchmark-runner



name: eval-benchmark-runner
description: Use when running the automated daily evaluation suite that measures the legal AI system's output quality across all benchmark datasets. Orchestrates the full eval pipeline — loading datasets, calling the production model, scoring with LLM-as-judge rubrics, detecting regressions, and publishing results to the leaderboard and observability dashboards.
license: MIT
metadata:
  id: eval.benchmark-runner
  category: eval
  priority: P0
  intent: [eval, benchmark, quality, regression, ci]
  related: [eval-llm-as-judge-system-prompt, eval-regression-detector, eval-leaderboard-updater, eval-dataset-nda-prompts-30, eval-dataset-employment-prompts-30, eval-dataset-adversarial-prompts, eval-dataset-multilingual-prompts]
  source: Louis — HAQQ Legal AI (github.com/sboghossian/mini-claude-for-legal)
  version: "1.0"

Benchmark Runner

When to use this

The benchmark runner is the automated quality gate for the legal AI system. It runs:

Daily at 12:00 UTC — tracks quality trend over time.
On every staging deployment — catches regressions before production.
On-demand via Slack /eval-run — for ad-hoc quality checks after prompt-engineering changes.

Do not run it on production mid-traffic; use a staging endpoint or a dedicated eval tenant.

Inputs

Input	Required	Notes
`model`	Yes	Slug of the model under test: the production model (e.g., `claude-sonnet-4-5`) or an experimental one
`datasets`	Optional	Array of dataset IDs to include; defaults to all
`judgeModels`	Yes	Array of judge model slugs for ensemble scoring
`costBudget`	Yes	Max USD to spend on this run; aborts if exceeded
`runId`	Auto	UUID generated per run
`baselineRunId`	Optional	Previous run to compare against for regression detection
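
The inputs above could be captured in a small validated container. This is a minimal sketch, not the repository's implementation: the class name `RunConfig` and the snake_case field names are hypothetical, and an empty `datasets` list is assumed to mean "all datasets".

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunConfig:
    """Hypothetical container for the inputs in the table above."""

    model: str
    judge_models: list[str]
    cost_budget_usd: float
    # Assumption: an empty list means "run every configured dataset".
    datasets: list[str] = field(default_factory=list)
    baseline_run_id: Optional[str] = None
    # runId is generated automatically, one UUID per run.
    run_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def __post_init__(self) -> None:
        if not self.model:
            raise ValueError("model is required")
        if not self.judge_models:
            raise ValueError("at least one judge model is required")
        if self.cost_budget_usd <= 0:
            raise ValueError("costBudget must be a positive USD amount")
```

Validating at construction time means a misconfigured run fails before any model call is billed.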

Review methodology

Step 1 — Load datasets

Load all configured eval.dataset.* files from eval/datasets/*.jsonl. Each JSONL file contains records with:

{ "id": "nda-001", "prompt": "...", "category": "draft", "expected_signals": ["mutual", "confidential_info_defined", "governing_law"] }

Supported datasets: [[eval-dataset-nda-prompts-30]], [[eval-dataset-employment-prompts-30]], [[eval-dataset-real-estate-prompts-30]], [[eval-dataset-research-prompts-30]], [[eval-dataset-adversarial-prompts]], [[eval-dataset-multilingual-prompts]], [[eval-dataset-competitor-comparison-set]].
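
A loader for these files might look like the sketch below. The function names are illustrative, and the required-field check assumes every record carries the four keys shown in the sample record.

```python
import json
from pathlib import Path
from typing import Iterator

# Keys taken from the sample record above.
REQUIRED_FIELDS = {"id", "prompt", "category", "expected_signals"}


def load_dataset(path: Path) -> Iterator[dict]:
    """Yield validated records from one JSONL dataset file."""
    with path.open(encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            missing = REQUIRED_FIELDS - set(record)
            if missing:
                raise ValueError(
                    f"{path.name}:{line_no} is missing fields {sorted(missing)}"
                )
            yield record


def load_all(dataset_dir: Path) -> dict[str, list[dict]]:
    """Map each dataset file stem to its list of records."""
    return {
        path.stem: list(load_dataset(path))
        for path in sorted(dataset_dir.glob("*.jsonl"))
    }
```

For example, `load_all(Path("eval/datasets"))` would return one entry per `.jsonl` file in that directory.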

Step 2 — Run prompts against production model

For each prompt:

Call the /chat endpoint of the deployment under test (staging or a dedicated eval tenant), not the LLM API directly, so that the full stack is tested.
Record: response_text, latency_ms, tokens_input, tokens_output, skills_routed.
Enforce a per-prompt timeout of 120 seconds; log any that exceed it.
Track cumulative cost; abort if costBudget is exceeded.

Run prompts concurrently (max 5 at a time) to keep wall-clock time under 30 minutes.
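
The concurrency cap, per-prompt timeout, and budget check can be sketched with `asyncio`. Everything beyond the numbers stated above is an assumption: `call_chat` is a hypothetical async client for the /chat endpoint, and its reply is assumed to be a dict that includes a `cost_usd` field alongside the recorded fields.

```python
import asyncio
import time
from typing import Awaitable, Callable

MAX_CONCURRENCY = 5       # at most 5 prompts in flight
PROMPT_TIMEOUT_S = 120.0  # per-prompt timeout


class BudgetExceeded(RuntimeError):
    """Raised when cumulative spend passes costBudget."""


async def run_prompts(
    prompts: list[dict],
    call_chat: Callable[[str], Awaitable[dict]],  # hypothetical /chat client
    cost_budget_usd: float,
    timeout_s: float = PROMPT_TIMEOUT_S,
) -> list[dict]:
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)
    spent = 0.0
    results: list[dict] = []

    def check_budget() -> None:
        if spent > cost_budget_usd:
            raise BudgetExceeded(f"spent ${spent:.4f} of ${cost_budget_usd:.4f}")

    async def run_one(record: dict) -> None:
        nonlocal spent
        async with semaphore:
            check_budget()  # do not start new calls once over budget
            started = time.perf_counter()
            try:
                reply = await asyncio.wait_for(
                    call_chat(record["prompt"]), timeout=timeout_s
                )
            except asyncio.TimeoutError:
                # Log the timeout as a result instead of failing the run.
                results.append({"id": record["id"], "timed_out": True})
                return
            latency_ms = (time.perf_counter() - started) * 1000
            spent += reply.get("cost_usd", 0.0)  # assumed field name
            results.append(
                {"id": record["id"], "timed_out": False,
                 "latency_ms": latency_ms, **reply}
            )
            check_budget()  # abort the run as soon as the budget is passed

    await asyncio.gather(*(run_one(record) for record in prompts))
    return results
```

Results arrive in completion order rather than prompt order, so downstream steps should join on `id`.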

Step 3 — Score each response

For each response, invoke the [[eval-llm-as-judge-system-prompt]] with the following rubrics active:

[[eval-rubric-legal-soundness]] (weight: 0.35)
[[eval-rubric-citation-quality]] (weight: 0.20)
[[eval-rubric-jurisdiction-awareness]] (weight: 0.20)
[[eval-rubric-completeness]] (weight: 0.15)
[[eval-rubric-hallucination-detection]] (weight: 0.10 — binary, auto-fail)

Use an ensemble of judges (e.g., GPT-4o + Claude Sonnet from a different provider + Gemini Pro). Average their scores. Flag any prompt where judge disagreement > 1.5 points for manual review.

Critical rule: never let the model family under test be the only judge of its own output. If testing Claude Sonnet, the judge ensemble must include at least one non-Claude model.
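
The weighting, ensemble averaging, and disagreement flag can be sketched as follows. Several details are assumptions: the source does not define how disagreement is measured (this sketch uses the spread between the highest and lowest judge totals), how a model family is derived from a slug (here, the text before the first hyphen), or how the binary auto-fail rubric overrides the weighted total (left out here).

```python
from statistics import mean

# Rubric weights from the list above; they sum to 1.0.
RUBRIC_WEIGHTS = {
    "legal-soundness": 0.35,
    "citation-quality": 0.20,
    "jurisdiction-awareness": 0.20,
    "completeness": 0.15,
    "hallucination-detection": 0.10,
}
DISAGREEMENT_THRESHOLD = 1.5  # points


def model_family(slug: str) -> str:
    """Assumption: the family is the slug prefix before the first hyphen."""
    return slug.split("-", 1)[0].lower()


def check_judges(model: str, judge_models: list[str]) -> None:
    """Require at least one judge from a different family than the model."""
    if all(model_family(j) == model_family(model) for j in judge_models):
        raise ValueError("judge ensemble needs a judge from another model family")


def weighted_score(rubric_scores: dict[str, float]) -> float:
    """Combine one judge's rubric scores into a single weighted score."""
    return sum(rubric_scores[name] * w for name, w in RUBRIC_WEIGHTS.items())


def ensemble_score(per_judge: dict[str, dict[str, float]]) -> dict:
    """Average the judges' weighted scores and flag large disagreement."""
    totals = {judge: weighted_score(s) for judge, s in per_judge.items()}
    spread = max(totals.values()) - min(totals.values())
    return {
        "score": mean(totals.values()),
        "needs_manual_review": spread > DISAGREEMENT_THRESHOLD,
    }
```

Calling `check_judges` before any scoring starts keeps a misconfigured ensemble from producing a full run of biased scores.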

Step 4 — Compute aggregate scores

Per dataset:

dataset_score = Σ(rubric_score × rubric_weight) / n_prompts

Global aggregate:

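aggregate_score = Σ(dataset_score × dataset_weight) / Σ(dataset_weight)

Dataset weights (by business priority, summing to 1.0): adversarial=0.25, NDA=0.20, employment=0.20, multilingual=0.15, real-estate=0.10, research=0.10. Normalizing by the total weight keeps the aggregate on the same scale as the dataset scores, including when a run covers only a subset of datasets.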

Also compute:

hallucination_rate = count(hallucinated) / total_prompts
latency_p50, latency_p95
cost_per_message = total_cost / n_prompts
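
A sketch of these aggregates follows. It normalizes the global score by the sum of the weights of the datasets present, which is an assumption chosen so the aggregate stays on the same scale as the dataset scores. The nearest-rank percentile method and the per-result field names `latency_ms` and `hallucinated` are also assumptions.

```python
import math

# Dataset weights as listed above (keys are illustrative short names).
DATASET_WEIGHTS = {
    "adversarial": 0.25,
    "nda": 0.20,
    "employment": 0.20,
    "multilingual": 0.15,
    "real-estate": 0.10,
    "research": 0.10,
}


def aggregate_score(dataset_scores: dict[str, float]) -> float:
    """Weighted mean of dataset scores, normalized by the weights present."""
    total_weight = sum(DATASET_WEIGHTS[name] for name in dataset_scores)
    weighted = sum(
        score * DATASET_WEIGHTS[name] for name, score in dataset_scores.items()
    )
    return weighted / total_weight


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(results: list[dict], total_cost_usd: float) -> dict:
    """Compute the run-level metrics listed above."""
    n = len(results)
    latencies = [r["latency_ms"] for r in results]
    return {
        "hallucination_rate": sum(1 for r in results if r["hallucinated"]) / n,
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p95_ms": percentile(latencies, 95),
        "cost_per_message_usd": total_cost_usd / n,
    }
```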

Step 5 — Detect regressions

Call [[eval-regression-detector]] with current and previous run scores. Regression triggers:

Any rubric score drops > 5% vs previous run.
Hallucination rate increases > 0.5%.
Latency p95 increases > 20%.
Cost-per-message increases > 15%.

On regression: Slack alert to #eng-quality, auto-create Linear ticket, block deployment promotion if a P0 rubric regressed.
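
The four triggers could be checked as in the sketch below. The thresholds come from the list above, but their interpretation is an assumption: rubric, latency, and cost changes are treated as relative to the baseline, while the hallucination threshold is treated as an absolute change of 0.5 percentage points. The run-summary dict shape is hypothetical and is not the actual interface of [[eval-regression-detector]].

```python
def relative_change(current: float, baseline: float) -> float:
    """Fractional change versus baseline; 0.0 when the baseline is zero."""
    return (current - baseline) / baseline if baseline else 0.0


def detect_regressions(current: dict, baseline: dict) -> list[str]:
    """Return the names of all regression triggers that fired."""
    findings: list[str] = []

    # Any rubric score that drops by more than 5% (relative, assumed).
    for rubric, base_score in baseline["rubric_scores"].items():
        drop = -relative_change(current["rubric_scores"][rubric], base_score)
        if drop > 0.05:
            findings.append(f"rubric:{rubric}")

    # Hallucination rate up by more than 0.5 percentage points (assumed).
    delta = current["hallucination_rate"] - baseline["hallucination_rate"]
    if delta > 0.005:
        findings.append("hallucination_rate")

    # Latency p95 up by more than 20%, cost per message up by more than 15%.
    limits = {"latency_p95_ms": 0.20, "cost_per_message_usd": 0.15}
    for key, limit in limits.items():
        if relative_change(current[key], baseline[key]) > limit:
            findings.append(key)

    return findings
```

An empty list means no trigger fired; any entry starting with `rubric:` identifies the rubric whose priority decides whether promotion is blocked.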

Step 6 — Publish results

Update [[eval-leaderboard-updater]] with scores.
Write structured run report to Langfuse.
Emit PostHog event eval_run_completed with aggregate score and regression flag.

Output format

{
  "runId": "uuid",
  "model": "claude-sonnet-4-5",
  "runAt": "2026-05-14T12:00:00Z",
  "datasets": {
    "nda-prompts-30": { "score": 4.1, "n": 30, "hallucinations": 0 },
    "adversarial-prompts": { "score": 4.7, "n": 30, "refusal_rate": 0.97 }
  },
  "aggregate_score": 4.2,
  "hallucination_rate": 0.003,
  "latency_p50_ms": 3200,
  "latency_p95_ms": 8100,
  "cost_per_message_usd": 0.0023,
  "regression": false,
  "top_failing_prompts": [
    { "id": "research-027", "score": 1.8, "issue": "fabricated statute number" }
  ]
}

Limits & escalation

If hallucination_rate > 1%, halt deployment automatically regardless of other scores.
Recalibrate judge prompts against a human gold-standard label set quarterly.
Track trend over absolute score — a 4.1 improving steadily from 3.5 is healthier than a 4.5 declining from 4.8.
Do not over-optimize for benchmark scores — ensure at least 10% of prompts in each dataset are novel and not visible to the model team.

Related skills

[[eval-llm-as-judge-system-prompt]] — the system prompt driving rubric scoring
[[eval-regression-detector]] — detects quality drops across runs
[[eval-leaderboard-updater]] — records scores to the trend dashboard
[[eval-dataset-nda-prompts-30]] — NDA benchmark dataset
[[eval-dataset-adversarial-prompts]] — safety and robustness dataset