eval-regression-detector


network_access, automation_control

name: eval-regression-detector
description: Use when implementing or triggering the component that compares current benchmark run scores to the previous run and raises alerts when quality drops below threshold. Defines the regression rules, alert thresholds, escalation actions, and investigation flow for quality regressions in the legal AI system.
license: MIT
metadata:
  id: eval.regression-detector
  category: eval
  priority: P0
  intent: [eval, regression, quality-gate, alerting, ci]
  related: [eval-benchmark-runner, eval-llm-as-judge-system-prompt, eval-leaderboard-updater, eval-rubric-legal-soundness, eval-rubric-hallucination-detection]
  source: Louis — HAQQ Legal AI (github.com/sboghossian/mini-claude-for-legal)
  version: "1.0"

Regression Detector

When to use this

The regression detector is called automatically at the end of every [[eval-benchmark-runner]] run, after scores are computed and before [[eval-leaderboard-updater]] records them. It compares current scores to the previous run for the same model and raises alerts if quality has degraded.

It is also the system that blocks automatic promotion from staging to production if a blocking-rubric regression is detected.

Inputs

| Input | Source |
| --- | --- |
| Current run scores (per-rubric, per-dataset) | eval-benchmark-runner |
| Previous run scores (same model) | eval-leaderboard table |
| runId, model, runAt | eval-benchmark-runner |
| Deployment pipeline hook | CI/CD system |
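
Selecting the "previous run (same model)" from the leaderboard history could look like the sketch below. The `RunRecord` shape and the `findPreviousRun` helper are hypothetical names introduced for illustration; the source only states that the previous scores come from the eval-leaderboard table and that each run carries runId, model, and runAt.

```typescript
// Hypothetical shape of one leaderboard row (assumption, not from the source).
interface RunRecord {
  runId: string;
  model: string;
  runAt: string; // ISO 8601 timestamp, assumed to be UTC
  rubrics: Record<string, number>;
}

// Returns the most recent run for `model` that finished strictly before `runAt`,
// or undefined when the model has no earlier run to compare against.
function findPreviousRun(
  history: RunRecord[],
  model: string,
  runAt: string,
): RunRecord | undefined {
  const cutoff = Date.parse(runAt);
  let best: RunRecord | undefined;
  for (const run of history) {
    if (run.model !== model) continue; // only compare like with like
    const t = Date.parse(run.runAt);
    if (t >= cutoff) continue; // ignore the current run and anything later
    if (best === undefined || t > Date.parse(best.runAt)) best = run;
  }
  return best;
}
```

When the helper returns undefined (the first run for a model), there is no baseline, so one reasonable choice is to skip detection and record the run as the new baseline.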

Logic

Regression thresholds

| Metric | Threshold | Action | Severity |
| --- | --- | --- | --- |
| Any rubric score | Drops > 5% vs previous run | Alert | P1 (investigate before promoting) |
| legal_soundness score | Drops > 3% | Alert + block deployment | P0 (hard block) |
| hallucination_rate | Increases > 0.5 percentage points | Alert + block deployment | P0 (hard block) |
| citation_quality score | Drops > 5% | Alert | P1 |
| jurisdiction_awareness score | Drops > 5% | Alert | P1 |
| Latency p95 | Increases > 20% vs previous run | Alert | P2 (ops issue) |
| Cost-per-message | Increases > 15% | Alert | P2 (economics issue) |

A hard block means the CI pipeline returns a non-zero exit code and the deployment is not promoted. The block can be lifted only by a manual override from a senior engineer, accompanied by a written justification.
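
As a rough illustration of the hard-block rule, the gate below maps a block decision to a CI exit code and accepts an override only when both an approver and a written justification are present. `ManualOverride` and `ciExitCode` are hypothetical names; how overrides are actually recorded and how seniority is verified is not specified in the source.

```typescript
// Hypothetical record of a manual override (assumption, not from the source).
interface ManualOverride {
  approvedBy: string; // the senior engineer who approved the override
  justification: string; // the written justification
}

// Returns the exit code the CI step should use: 0 = promote, 1 = hard block.
function ciExitCode(blockDeployment: boolean, override?: ManualOverride): number {
  if (!blockDeployment) return 0;
  const overrideIsValid =
    override !== undefined &&
    override.approvedBy.trim().length > 0 &&
    override.justification.trim().length > 0;
  return overrideIsValid ? 0 : 1;
}
```

In a Node-based CI script the result could be assigned to `process.exitCode`, so the step fails without cutting off pending log output.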

Regression computation

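// Rubrics where a higher value is worse. These are compared by absolute change
// (percentage points), matching the hallucination_rate row in the threshold table.
// Assumption: hallucination_rate is stored as a percentage (2.5 means 2.5%).
const HIGHER_IS_WORSE: readonly string[] = ["hallucination_rate"];

function detectRegressions(current: RunScores, previous: RunScores): RegressionReport {
  const regressions: Regression[] = [];

  for (const rubric of RUBRICS) {
    const currentScore = current.rubrics[rubric];
    const prevScore = previous.rubrics[rubric];
    // Skip rubrics missing from either run (for example, a newly added rubric).
    if (currentScore === undefined || prevScore === undefined) continue;

    const delta = currentScore - prevScore;
    // Relative change; guard against division by zero when the previous score is 0.
    const pct = prevScore !== 0 ? delta / prevScore : 0;

    const regressed = HIGHER_IS_WORSE.includes(rubric)
      ? delta > THRESHOLDS[rubric] // absolute increase, in percentage points
      : pct < -THRESHOLDS[rubric]; // relative drop, e.g. 0.05 for 5%

    if (regressed) {
      regressions.push({
        rubric,
        currentScore,
        prevScore,
        delta,
        pctChange: pct,
        severity: SEVERITIES[rubric],
        blocking: BLOCKING_RUBRICS.includes(rubric),
      });
    }
  }

  return {
    hasRegressions: regressions.length > 0,
    blockDeployment: regressions.some(r => r.blocking),
    regressions,
  };
}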
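
The function above reads RUBRICS, THRESHOLDS, SEVERITIES, and BLOCKING_RUBRICS without defining them. One possible definition, transcribed from the threshold table, is sketched below. The rubric key names and the `Severity` type are assumptions; latency and cost are left out because the source does not say whether they are stored alongside rubric scores. The hallucination_rate threshold is expressed in absolute percentage points, so it needs an absolute comparison rather than a relative one.

```typescript
type Severity = "P0" | "P1" | "P2";

// Rubric keys are assumed to match the names used in the threshold table.
const RUBRICS = [
  "legal_soundness",
  "hallucination_rate",
  "citation_quality",
  "jurisdiction_awareness",
] as const;
type Rubric = (typeof RUBRICS)[number];

// Relative thresholds are fractions (0.05 = 5%).
const THRESHOLDS: Record<Rubric, number> = {
  legal_soundness: 0.03,
  hallucination_rate: 0.5, // absolute percentage points, not a fraction
  citation_quality: 0.05,
  jurisdiction_awareness: 0.05,
};

const SEVERITIES: Record<Rubric, Severity> = {
  legal_soundness: "P0",
  hallucination_rate: "P0",
  citation_quality: "P1",
  jurisdiction_awareness: "P1",
};

// Regressions on these rubrics block deployment.
const BLOCKING_RUBRICS: readonly Rubric[] = ["legal_soundness", "hallucination_rate"];
```

Keeping the three maps keyed by the same `Rubric` type means the compiler reports a missing threshold or severity as soon as a rubric is added.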

Outputs

On regression detected:

  1. Slack alert to #eng-quality with diff:
    ⚠️ Quality regression detected — claude-sonnet-4-5
    legal_soundness: 4.2 → 3.9 (-7.1%) [BLOCKING]
    citation_quality: 3.8 → 3.6 (-5.3%)
    Run: 2026-05-14 12:00 UTC
    [View details → Langfuse]
    
  2. Linear ticket auto-created with:
    • Title: "Quality regression — {model} — {rubric} -{pct}%"
    • Description: regression context, top failing prompts, link to run report
    • Assignee: on-call engineer
    • Priority: P0 (for blocking) or P1 (for non-blocking)
  3. Block deployment if blockDeployment = true — CI pipeline exits non-zero.
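
The per-rubric lines of the Slack alert in step 1 could be produced with a formatter along these lines. This is a sketch only: `AlertRegression`, `formatRegressionLine`, and `formatRegressionLines` are hypothetical names, and listing blocking regressions first is a design choice rather than something the source prescribes. Posting to Slack and the Langfuse link are left out.

```typescript
// Fields needed to render one alert line (a subset of the regression record).
interface AlertRegression {
  rubric: string;
  currentScore: number;
  prevScore: number;
  pctChange: number; // relative change as a fraction, e.g. -0.071 for -7.1%
  blocking: boolean;
}

// Renders one line such as "legal_soundness: 4.2 → 3.9 (-7.1%) [BLOCKING]".
function formatRegressionLine(r: AlertRegression): string {
  const pct = r.pctChange * 100;
  const sign = pct > 0 ? "+" : ""; // negative values already carry their sign
  const tag = r.blocking ? " [BLOCKING]" : "";
  return `${r.rubric}: ${r.prevScore.toFixed(1)} → ${r.currentScore.toFixed(1)} (${sign}${pct.toFixed(1)}%)${tag}`;
}

// Renders all lines, with blocking regressions listed first.
function formatRegressionLines(regressions: AlertRegression[]): string[] {
  const ordered = [...regressions].sort(
    (a, b) => Number(b.blocking) - Number(a.blocking),
  );
  return ordered.map(formatRegressionLine);
}
```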

On no regression:

  • Update leaderboard silently (no alert needed).
  • Post a brief ✓ to #eng-quality only on the weekly summary, not every run.

Investigation flow

When a regression is detected, follow this procedure before requesting a manual override:

  1. Identify which prompts regressed — sort prompts by score delta; focus on the bottom 10%.
  2. Compare current vs previous responses side-by-side — often a single prompt change reveals the root cause.
  3. Inspect skill routing — did the router send certain prompts to a different skill? Check skills_routed in the run log.
  4. Check for model API changes — did the underlying model version or API behavior change (e.g., new system-prompt behavior, token limits)?
  5. Check for config drift — did a system prompt, context window size, or temperature change?
  6. Roll back if P0: if legal_soundness or hallucination_rate regressed and the cause is not identified within 2 hours, roll back the deployment.
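
Step 1 of the procedure above can be sketched as follows, assuming per-prompt scores are available as maps from prompt ID to score for both runs. The map shape and the `worstRegressedPrompts` helper are hypothetical; the source does not describe how per-prompt scores are stored.

```typescript
interface PromptDelta {
  promptId: string;
  delta: number; // current score minus previous score; negative means worse
}

// Returns the prompts with the largest score drops, limited to the given
// fraction of all comparable prompts (default: the bottom 10%).
function worstRegressedPrompts(
  current: Record<string, number>,
  previous: Record<string, number>,
  fraction = 0.1,
): PromptDelta[] {
  const deltas: PromptDelta[] = [];
  for (const promptId of Object.keys(current)) {
    const prev = previous[promptId];
    if (prev === undefined) continue; // prompt was not in the previous run
    deltas.push({ promptId, delta: current[promptId] - prev });
  }
  deltas.sort((a, b) => a.delta - b.delta); // most negative delta first
  const count = Math.max(1, Math.ceil(deltas.length * fraction));
  return deltas.slice(0, count);
}
```

The returned prompt IDs are the ones to open side by side in step 2.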

Observability

Connect to:

  • PostHog: emit eval_regression_detected event with rubric and delta properties.
  • Langfuse: link regression report to the run trace for full observability.

Both signals feed the weekly engineering review.
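
A minimal sketch of the eval_regression_detected payload follows. Only the event name and the rubric and delta properties come from the text above; runId, model, and blocking are extra properties added here as assumptions, and `toRegressionEvents` is a hypothetical helper. The sketch builds plain objects and leaves the actual call to the PostHog client out.

```typescript
interface RegressionSummary {
  rubric: string;
  delta: number; // current score minus previous score
  blocking: boolean;
}

interface RegressionEvent {
  event: "eval_regression_detected";
  properties: {
    rubric: string;
    delta: number;
    blocking: boolean; // assumption: not listed in the source
    runId: string; // assumption: lets the event be joined to the run trace
    model: string; // assumption: lets the weekly review group by model
  };
}

// Builds one event per regressed rubric so each can be charted separately.
function toRegressionEvents(
  runId: string,
  model: string,
  regressions: RegressionSummary[],
): RegressionEvent[] {
  return regressions.map(
    (r): RegressionEvent => ({
      event: "eval_regression_detected",
      properties: { rubric: r.rubric, delta: r.delta, blocking: r.blocking, runId, model },
    }),
  );
}
```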

Caveats & currency

  • Some score variance between runs (~1–2%) is expected because the LLM judge is non-deterministic. Treat a threshold breach as a genuine regression only once it has persisted across 2+ runs.
  • Recalibrate thresholds annually as the product matures. A threshold that suited early development may be too tight once model quality plateaus at a high level.
  • Track trend over absolute score — a model running at 4.2 for 3 months is more trustworthy than one that scored 4.5 once and is declining.
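
The "2+ runs" caveat can be expressed as a small confirmation check, sketched below. `isConfirmedRegression` is a hypothetical helper, and the sketch assumes the caller keeps one breach flag per consecutive run for a given rubric, oldest first.

```typescript
// breaches: one flag per consecutive run, oldest first; true means the rubric
// exceeded its regression threshold in that run.
function isConfirmedRegression(breaches: boolean[], requiredRuns = 2): boolean {
  if (requiredRuns < 1 || breaches.length < requiredRuns) return false;
  // Only the most recent `requiredRuns` runs count, and all must be breaches.
  return breaches.slice(-requiredRuns).every((breached) => breached);
}
```

With the default of two runs, a single noisy breach is not confirmed, while two breaches in a row are.
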

Related skills

  • [[eval-benchmark-runner]]: calls this detector after each run
  • [[eval-llm-as-judge-system-prompt]]: produces the scores that are compared
  • [[eval-leaderboard-updater]]: records results after this detector runs
  • [[eval-rubric-legal-soundness]]: primary blocking rubric
  • [[eval-rubric-hallucination-detection]]: secondary blocking rubric