eval-leaderboard-updater

name: eval-leaderboard-updater
description: Use when implementing or operating the component that records benchmark run scores to the internal quality leaderboard and weekly AI quality trend report. Maintains the historical score series, computes week-over-week deltas, and surfaces the trend data to the engineering and product teams.
license: MIT
metadata:
  id: eval.leaderboard-updater
  category: eval
  jurisdictions: [multi]
  priority: P2
  intent: [eval, leaderboard, quality-trend, reporting, ci]
  related: [eval-benchmark-runner, eval-regression-detector, eval-llm-as-judge-system-prompt, eval-rubric-legal-soundness]
  source: Louis — HAQQ Legal AI (github.com/sboghossian/mini-claude-for-legal)
  version: "1.0"

Leaderboard Updater

When to use this

The leaderboard updater runs automatically at the end of every [[eval-benchmark-runner]] run. It is also triggered manually when historical scores need to be backfilled or when the scoring methodology changes and requires recalibration.

Inputs / signals

Input              | Source                    | Notes
runId              | eval-benchmark-runner     | UUID of the completed benchmark run
runAt              | eval-benchmark-runner     | ISO 8601 timestamp
model              | eval-benchmark-runner     | Model slug under test
scores             | eval-benchmark-runner     | Per-dataset and per-rubric scores
aggregateScore     | eval-benchmark-runner     | Weighted aggregate
hallucinationRate  | eval-benchmark-runner     | Fraction 0–1
latencyP95Ms       | eval-benchmark-runner     | Infrastructure quality signal
costPerMessageUsd  | eval-benchmark-runner     | Economics signal
regressionDetected | eval-regression-detector  | Boolean
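
As a sketch of how the inputs above might be typed and sanity-checked before anything is written, the dataclass below mirrors the field names in the table. The class name `BenchmarkRunResult`, the `validate` method, and the specific checks are assumptions made for illustration; the source does not prescribe them.

```python
from dataclasses import dataclass
from datetime import datetime
from uuid import UUID


@dataclass(frozen=True)
class BenchmarkRunResult:
    """Hypothetical container for the updater's inputs (field names follow the table)."""

    runId: str
    runAt: str  # ISO 8601 timestamp, e.g. "2026-05-14T12:00:00+00:00"
    model: str
    scores: dict
    aggregateScore: float
    hallucinationRate: float  # fraction in [0, 1]
    latencyP95Ms: int
    costPerMessageUsd: float
    regressionDetected: bool = False

    def validate(self) -> None:
        """Raise ValueError on malformed input; these checks are illustrative."""
        UUID(self.runId)  # raises ValueError if runId is not a UUID
        datetime.fromisoformat(self.runAt)  # raises ValueError if not ISO 8601
        if not 0.0 <= self.hallucinationRate <= 1.0:
            raise ValueError("hallucinationRate must be a fraction between 0 and 1")
        if self.latencyP95Ms < 0 or self.costPerMessageUsd < 0:
            raise ValueError("latency and cost must be non-negative")
```

Rejecting a bad payload here keeps malformed rows out of the historical series, where they would otherwise distort later deltas.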

Logic

Step 1 — Persist to leaderboard table

INSERT INTO eval_leaderboard (
  run_id, run_at, model, aggregate_score, hallucination_rate,
  latency_p95_ms, cost_per_message_usd, regression_detected,
  dataset_scores, rubric_scores, created_at
) VALUES (...)
ON CONFLICT (run_id) DO NOTHING;

The dataset_scores and rubric_scores columns are JSONB, preserving the full per-dataset breakdown.
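
The idempotency that `ON CONFLICT (run_id) DO NOTHING` provides can be tried locally with a minimal sketch. It uses Python's built-in `sqlite3` as a stand-in for the production database (SQLite 3.24 or later is assumed for the `ON CONFLICT` clause), stores the JSON column as text, and trims the table to a few columns. The function name `persist_run` is hypothetical.

```python
import json
import sqlite3

# SQLite stand-in for the leaderboard table; the JSONB column becomes TEXT here.
DDL = """
CREATE TABLE eval_leaderboard (
  run_id TEXT UNIQUE NOT NULL,
  run_at TEXT NOT NULL,
  model TEXT NOT NULL,
  aggregate_score REAL,
  dataset_scores TEXT
)
"""

INSERT = """
INSERT INTO eval_leaderboard (run_id, run_at, model, aggregate_score, dataset_scores)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT (run_id) DO NOTHING
"""


def persist_run(conn, run_id, run_at, model, aggregate_score, dataset_scores):
    """Insert one run; return True if a row was written, False if run_id already existed."""
    before = conn.total_changes
    conn.execute(
        INSERT, (run_id, run_at, model, aggregate_score, json.dumps(dataset_scores))
    )
    conn.commit()
    return conn.total_changes > before


conn = sqlite3.connect(":memory:")
conn.execute(DDL)
first = persist_run(conn, "run-1", "2026-05-14T12:00:00+00:00", "model-a", 4.2, {"contracts": 4.3})
# Re-submitting the same run_id is a no-op, so retries and backfills are safe.
second = persist_run(conn, "run-1", "2026-05-14T12:00:00+00:00", "model-a", 1.0, {})
```

Because the duplicate insert is silently skipped, the first recorded score for a run is the one that survives; a recalibration that must overwrite scores would need a different conflict action.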

Step 2 — Compute trend deltas

-- Get the previous run for the same model.
-- $1 = model slug, $2 = run_id of the current run (positional parameters).
SELECT aggregate_score AS prev_score, hallucination_rate AS prev_halluc
FROM eval_leaderboard
WHERE model = $1 AND run_id != $2
ORDER BY run_at DESC LIMIT 1;

Compute:

  • score_delta = current aggregate - previous aggregate
  • hallucination_delta = current hallucination_rate - previous hallucination_rate
  • trend = improving | stable | declining (based on 3-run moving average)
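
The three quantities above can be sketched as follows. The source does not say how far the moving average must move before the trend stops being "stable", so the `STABLE_BAND` threshold, the function names, and the rule of comparing the latest 3-run average with the one immediately before it are all assumptions.

```python
from statistics import mean

STABLE_BAND = 0.05  # hypothetical threshold; the source does not specify one


def compute_deltas(current, previous):
    """Return (score_delta, hallucination_delta), each computed as current minus previous."""
    return (
        round(current["aggregate_score"] - previous["aggregate_score"], 4),
        round(current["hallucination_rate"] - previous["hallucination_rate"], 4),
    )


def classify_trend(scores, window=3, band=STABLE_BAND):
    """Compare the latest `window`-run moving average with the one just before it.

    `scores` is ordered oldest to newest. With fewer than window + 1 runs there
    is no earlier average to compare against, so the trend is reported as stable.
    """
    if len(scores) < window + 1:
        return "stable"
    latest_avg = mean(scores[-window:])
    earlier_avg = mean(scores[-window - 1:-1])
    change = latest_avg - earlier_avg
    if change > band:
        return "improving"
    if change < -band:
        return "declining"
    return "stable"
```

Note the sign convention: a negative `hallucination_delta` is an improvement, while a negative `score_delta` is a decline.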

Step 3 — Update the weekly AI quality trend report

Aggregate all runs in the current week and update the weekly_quality_summary table:

{
  "week": "2026-W20",
  "avg_aggregate_score": 4.2,
  "best_run_score": 4.4,
  "worst_run_score": 3.9,
  "hallucination_incidents": 0,
  "regressions_detected": 1,
  "regressions_resolved": 1
}

This data feeds the internal dashboard and the report.weekly-AI-quality-trend report.
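
A minimal sketch of the weekly aggregation, assuming runs are available as dictionaries keyed by the leaderboard column names. The helper names are hypothetical. The `hallucination_incidents` and `regressions_resolved` fields are left out because the source does not define how they are counted.

```python
from datetime import datetime
from statistics import mean


def iso_week_label(run_at):
    """Format an ISO 8601 timestamp as an ISO week label such as '2026-W20'."""
    year, week, _ = datetime.fromisoformat(run_at).isocalendar()
    return f"{year}-W{week:02d}"


def summarize_week(runs, week):
    """Aggregate the runs whose run_at falls in `week`; return None if there are none."""
    in_week = [r for r in runs if iso_week_label(r["run_at"]) == week]
    if not in_week:
        return None
    scores = [r["aggregate_score"] for r in in_week]
    return {
        "week": week,
        "avg_aggregate_score": round(mean(scores), 2),
        "best_run_score": max(scores),
        "worst_run_score": min(scores),
        "regressions_detected": sum(1 for r in in_week if r["regression_detected"]),
    }
```

Using the ISO calendar year rather than the timestamp's calendar year keeps runs from late December or early January in the correct week bucket.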

Step 4 — Emit leaderboard update notification

Post to Slack #eng-quality with a summary card:

Model quality run: claude-sonnet-4-5 @ 2026-05-14 12:00 UTC
Aggregate: 4.2 / 5.0 (+0.1 vs prev) ✓
Hallucinations: 0 ✓
Regression: None ✓
[View full report → Langfuse link]

If a regression is detected, post to both #eng-quality and #eng-on-call.
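
The card text and the channel routing can be sketched as two pure functions, which keeps them testable without a Slack client. The function names and the `✗` failure mark are assumptions (the source only shows the passing case), and the report link line is omitted because its URL is not given.

```python
def format_summary_card(model, run_at, aggregate, score_delta, hallucinations, regression):
    """Render the plain-text card body; the layout follows the example above."""
    ok, bad = "✓", "✗"  # the failure mark is an assumption; the source only shows ✓
    score_mark = ok if score_delta >= 0 else bad
    halluc_mark = ok if hallucinations == 0 else bad
    regression_text = f"Detected {bad}" if regression else f"None {ok}"
    lines = [
        f"Model quality run: {model} @ {run_at}",
        f"Aggregate: {aggregate:.1f} / 5.0 ({score_delta:+.1f} vs prev) {score_mark}",
        f"Hallucinations: {hallucinations} {halluc_mark}",
        f"Regression: {regression_text}",
    ]
    return "\n".join(lines)


def target_channels(regression):
    """Regressions go to both channels; everything else goes only to #eng-quality."""
    return ["#eng-quality", "#eng-on-call"] if regression else ["#eng-quality"]
```

Actually posting the message is left to whatever Slack integration the deployment already uses.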

Output

{
  "leaderboardRowId": "uuid",
  "scoreDelta": 0.1,
  "trend": "improving",
  "weekSummaryUpdated": true,
  "slackNotified": true
}

Leaderboard schema

CREATE TABLE eval_leaderboard (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  run_id UUID UNIQUE NOT NULL,
  run_at TIMESTAMPTZ NOT NULL,
  model TEXT NOT NULL,
  aggregate_score NUMERIC(3,2),
  hallucination_rate NUMERIC(5,4),
  latency_p95_ms INT,
  cost_per_message_usd NUMERIC(8,6),
  regression_detected BOOLEAN NOT NULL DEFAULT FALSE,
  dataset_scores JSONB,
  rubric_scores JSONB,
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX ON eval_leaderboard (model, run_at DESC);

Why this matters

A single aggregate score per run is not enough information to improve the product. The leaderboard preserves the full historical series so that:

  • Engineers can see whether a prompt-engineering change improved one rubric while degrading another.
  • Product can report "model quality improved 12% over the past quarter."
  • Teams can detect and reverse regressions promptly rather than discovering them in user complaints.
  • The trend (moving average) is more meaningful than any single run's absolute score.

Caveats & currency

Recalibrate rubric weights in [[eval-benchmark-runner]] when the product's practice-area mix changes significantly (e.g., if real-estate usage grows to 40% of queries, its dataset weight should increase). When rubric weights change, historical scores are no longer directly comparable: record the change in a notes column on the leaderboard (the schema above does not define one, so it would need to be added) and restart the moving average.
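
One way to restart the moving average is to ignore every run recorded before the most recent weight change. The sketch below assumes each run carries an optional `notes` value and that a non-empty value marks the first run scored under new weights. Both that convention and the function name are hypothetical.

```python
from statistics import mean


def moving_average_since_change(runs, window=3):
    """Average the latest `window` scores recorded since the most recent weight change.

    `runs` is ordered oldest to newest. A run with a non-empty `notes` value is
    treated as the first run under new rubric weights (a hypothetical convention).
    Returns None until `window` comparable runs exist.
    """
    start = 0
    for i, run in enumerate(runs):
        if run.get("notes"):
            start = i  # keep the index of the latest marked run
    comparable = [r["aggregate_score"] for r in runs[start:]]
    if len(comparable) < window:
        return None
    return round(mean(comparable[-window:]), 2)
```

Returning `None` while fewer than `window` comparable runs exist makes the gap explicit instead of blending scores produced under different weightings.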

Related skills

  • [[eval-benchmark-runner]]: the upstream process that calls this updater
  • [[eval-regression-detector]]: provides the regressionDetected signal
  • [[eval-llm-as-judge-system-prompt]]: the scoring engine whose output feeds into scores
  • [[eval-rubric-legal-soundness]]: primary rubric whose trend is most closely tracked