eval-regression-detector
name: eval-regression-detector
description: Use when implementing or triggering the component that compares current benchmark run scores to the previous run and raises alerts when quality drops below threshold. Defines the regression rules, alert thresholds, escalation actions, and investigation flow for quality regressions in the legal AI system.
license: MIT
metadata:
id: eval.regression-detector
category: eval
priority: P0
intent: [eval, regression, quality-gate, alerting, ci]
related: [eval-benchmark-runner, eval-llm-as-judge-system-prompt, eval-leaderboard-updater, eval-rubric-legal-soundness, eval-rubric-hallucination-detection]
source: Louis — HAQQ Legal AI (github.com/sboghossian/mini-claude-for-legal)
version: "1.0"
Regression Detector
When to use this
The regression detector is called automatically at the end of every [[eval-benchmark-runner]] run, after scores are computed and before [[eval-leaderboard-updater]] records them. It compares current scores to the previous run for the same model and raises alerts if quality has degraded.
It also blocks automatic promotion from staging to production when a blocking-rubric regression is detected.
Inputs
| Input | Source |
|---|---|
| Current run scores (per-rubric, per-dataset) | eval-benchmark-runner |
| Previous run scores (same model) | eval-leaderboard table |
| `runId`, `model`, `runAt` | eval-benchmark-runner |
| Deployment pipeline hook | CI/CD system |
Logic
Regression thresholds
| Condition | Action | Severity |
|---|---|---|
| Any rubric score drops > 5% vs previous run | Alert | P1 (investigate before promoting) |
| `legal_soundness` score drops > 3% | Alert + block deployment | P0 (hard block) |
| `hallucination_rate` increases > 0.5 percentage points | Alert + block deployment | P0 (hard block) |
| `citation_quality` score drops > 5% | Alert | P1 |
| `jurisdiction_awareness` score drops > 5% | Alert | P1 |
| Latency p95 increases > 20% vs previous run | Alert | P2 (ops issue) |
| Cost-per-message increases > 15% | Alert | P2 (economics issue) |
A hard block means the CI pipeline returns a non-zero exit code and the deployment is not promoted. The block can be lifted only by a manual override from a senior engineer, accompanied by a written justification.
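The hard-block rule can be sketched as a small gate function. This is a minimal sketch under stated assumptions, not the project's pipeline code: `GateReport`, `ManualOverride`, `promotionExitCode`, and all field names are hypothetical. The only behavior taken from the text is that a blocking regression yields a non-zero exit code unless a senior engineer supplies a written justification.

```typescript
// Hypothetical shapes for illustration; not taken from the source repo.
interface GateReport {
  blockDeployment: boolean;
}

interface ManualOverride {
  approvedBy: string;        // name of the approving engineer
  approverIsSenior: boolean; // assumption: seniority is known to the caller
  justification: string;     // written justification required by the policy
}

// Returns the exit code the CI step should use: 0 promotes, 1 blocks.
function promotionExitCode(report: GateReport, override?: ManualOverride): number {
  if (!report.blockDeployment) return 0;
  const overrideIsValid =
    override !== undefined &&
    override.approverIsSenior &&
    override.justification.trim().length > 0;
  return overrideIsValid ? 0 : 1;
}
```

A CI wrapper would pass the returned value to its own exit call, for example `process.exit(promotionExitCode(report))` in Node.js.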
Regression computation
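    // Supporting types and constants are not shown in the source; the ones below
    // are assumptions that mirror the thresholds table.
    type Severity = "P0" | "P1";
    interface RunScores { rubrics: Record<string, number> }
    interface Regression {
      rubric: string; currentScore: number; prevScore: number;
      delta: number; pctChange: number; severity: Severity; blocking: boolean;
    }
    interface RegressionReport { hasRegressions: boolean; blockDeployment: boolean; regressions: Regression[] }

    const RUBRICS = ["legal_soundness", "hallucination_rate", "citation_quality", "jurisdiction_awareness"];
    const THRESHOLDS: Record<string, number> = {
      legal_soundness: 0.03, citation_quality: 0.05, jurisdiction_awareness: 0.05,
      hallucination_rate: 0.5, // percentage points; assumes the rate is stored as a percentage
    };
    const BLOCKING_RUBRICS = ["legal_soundness", "hallucination_rate"];
    const SEVERITIES: Record<string, Severity> = {
      legal_soundness: "P0", hallucination_rate: "P0", citation_quality: "P1", jurisdiction_awareness: "P1",
    };
    const HIGHER_IS_WORSE = ["hallucination_rate"];

    function detectRegressions(current: RunScores, previous: RunScores): RegressionReport {
      const regressions: Regression[] = [];
      for (const rubric of RUBRICS) {
        const currentScore = current.rubrics[rubric];
        const prevScore = previous.rubrics[rubric];
        if (currentScore === undefined || prevScore === undefined) continue; // no baseline to compare
        const delta = currentScore - prevScore;
        const pct = prevScore !== 0 ? delta / prevScore : 0; // avoid division by zero
        // Rates regress when they rise (absolute points); scores regress when they fall (relative).
        const regressed = HIGHER_IS_WORSE.includes(rubric)
          ? delta > THRESHOLDS[rubric]
          : pct < -THRESHOLDS[rubric];
        if (regressed) {
          regressions.push({
            rubric, currentScore, prevScore, delta,
            pctChange: pct,
            severity: SEVERITIES[rubric],
            blocking: BLOCKING_RUBRICS.includes(rubric),
          });
        }
      }
      return {
        hasRegressions: regressions.length > 0,
        blockDeployment: regressions.some((r) => r.blocking),
        regressions,
      };
    }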
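The function above covers rubric scores only. The latency and cost rows of the thresholds table describe metrics where an increase is the bad direction, so they need a separate check. The following is a minimal sketch of one way to do that; `OpsMetrics`, `OpsAlert`, `OPS_LIMITS`, `detectOpsRegressions`, and the field names are hypothetical, and only the 20% and 15% limits come from the table.

```typescript
// Hypothetical metric shape; field names are assumptions for illustration.
interface OpsMetrics {
  latencyP95Ms: number;
  costPerMessageUsd: number;
}

interface OpsAlert {
  metric: keyof OpsMetrics;
  previous: number;
  current: number;
  pctIncrease: number; // 0.25 means +25%
}

// Relative-increase limits from the thresholds table (both are P2, non-blocking).
const OPS_LIMITS: Record<keyof OpsMetrics, number> = {
  latencyP95Ms: 0.2,
  costPerMessageUsd: 0.15,
};

function detectOpsRegressions(current: OpsMetrics, previous: OpsMetrics): OpsAlert[] {
  const alerts: OpsAlert[] = [];
  for (const metric of Object.keys(OPS_LIMITS) as (keyof OpsMetrics)[]) {
    const prev = previous[metric];
    const curr = current[metric];
    if (prev <= 0) continue; // no meaningful baseline; skip rather than divide by zero
    const pctIncrease = (curr - prev) / prev;
    if (pctIncrease > OPS_LIMITS[metric]) {
      alerts.push({ metric, previous: prev, current: curr, pctIncrease });
    }
  }
  return alerts;
}
```

Keeping these checks separate from the rubric loop avoids mixing "lower is worse" and "higher is worse" metrics in one comparison.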
Outputs
On regression detected:
- Slack alert to `#eng-quality` with a diff:

      ⚠️ Quality regression detected — claude-sonnet-4-5
      legal_soundness: 4.2 → 3.9 (-7.1%) [BLOCKING]
      citation_quality: 3.8 → 3.6 (-5.3%)
      Run: 2026-05-14 12:00 UTC
      [View details → Langfuse]

- Linear ticket auto-created with:
  - Title: "Quality regression — {model} — {rubric} -{pct}%"
  - Description: regression context, top failing prompts, link to run report
  - Assignee: on-call engineer
  - Priority: P0 (for blocking) or P1 (for non-blocking)
- Block deployment if `blockDeployment = true`; the CI pipeline exits non-zero.
On no regression:
- Update leaderboard silently (no alert needed).
- Post a brief ✓ to `#eng-quality` only in the weekly summary, not after every run.
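The per-rubric lines of the Slack diff can be produced with a small formatter. This is a sketch only: `AlertRegression` and `formatRegressionLine` are hypothetical names, and the one-decimal rounding is an assumption inferred from the sample alert rather than a documented format.

```typescript
// Hypothetical shape; mirrors the fields used in the sample alert.
interface AlertRegression {
  rubric: string;
  currentScore: number;
  prevScore: number;
  pctChange: number; // relative change, e.g. -0.071 for a 7.1% drop
  blocking: boolean;
}

// Formats one diff line, e.g. "rubric: 4.2 → 3.9 (-7.1%) [BLOCKING]".
function formatRegressionLine(r: AlertRegression): string {
  const pct = (r.pctChange * 100).toFixed(1);
  const sign = r.pctChange > 0 ? "+" : ""; // negative values already carry "-"
  const tag = r.blocking ? " [BLOCKING]" : "";
  return `${r.rubric}: ${r.prevScore.toFixed(1)} → ${r.currentScore.toFixed(1)} (${sign}${pct}%)${tag}`;
}

// Joins the lines for all regressions into the body of the alert.
function formatRegressionDiff(regressions: AlertRegression[]): string {
  return regressions.map(formatRegressionLine).join("\n");
}
```

Listing blocking rubrics first (for instance by sorting on `blocking` before formatting) would make the most urgent line the first thing an on-call engineer reads.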
Investigation flow
When a regression is detected, follow this procedure before manual override:
- Identify which prompts regressed: sort prompts by score delta and focus on the bottom 10%.
- Compare current and previous responses side by side; often a single prompt change reveals the root cause.
- Inspect skill routing: did the router send certain prompts to a different skill? Check `skills_routed` in the run log.
- Check for model API changes: did the underlying model version or API behavior change (e.g., new system-prompt behavior, token limits)?
- Check for config drift: did a system prompt, context window size, or temperature change?
- Roll back if P0: if `legal_soundness` or hallucination regressed and the cause is not identified within 2 hours, roll back the deployment.
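The first step of the flow above, sorting prompts by score delta and taking the bottom 10%, can be sketched as follows. `PromptScore` and `worstRegressedPrompts` are hypothetical names, and "bottom 10%" is interpreted here as the tenth of matched prompts with the largest score drops.

```typescript
// Hypothetical per-prompt score record; not taken from the source repo.
interface PromptScore {
  promptId: string;
  score: number;
}

interface PromptDelta {
  promptId: string;
  delta: number; // current minus previous; negative means the prompt got worse
}

function worstRegressedPrompts(
  current: PromptScore[],
  previous: PromptScore[],
  fraction = 0.1,
): PromptDelta[] {
  const prevById = new Map<string, number>(
    previous.map((p): [string, number] => [p.promptId, p.score]),
  );
  const deltas = current
    .filter((p) => prevById.has(p.promptId)) // only prompts present in both runs
    .map((p) => ({ promptId: p.promptId, delta: p.score - (prevById.get(p.promptId) as number) }))
    .sort((a, b) => a.delta - b.delta); // most negative delta first
  const count = Math.max(1, Math.ceil(deltas.length * fraction));
  return deltas.slice(0, count);
}
```

Prompts that exist in only one of the two runs are ignored, since they have no delta to compare.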
Observability
Connect to:
- PostHog: emit
eval_regression_detectedevent with rubric and delta properties. - Langfuse: link regression report to the run trace for full observability.
Both signals feed the weekly engineering review.
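As an illustration of the PostHog signal, the sketch below builds one `eval_regression_detected` event per regression and hands it to a sender callback. `EventSink` and `emitRegressionEvents` are hypothetical names; the sink abstraction is an assumption chosen so the example does not depend on any particular analytics client.

```typescript
// Hypothetical sender signature; a real implementation would wrap an analytics client.
type EventSink = (event: string, properties: Record<string, string | number>) => void;

interface RegressionSummary {
  rubric: string;
  delta: number;
}

// Emits one event per regression and returns how many were sent.
function emitRegressionEvents(regressions: RegressionSummary[], send: EventSink): number {
  for (const r of regressions) {
    send("eval_regression_detected", { rubric: r.rubric, delta: r.delta });
  }
  return regressions.length;
}
```

With the `posthog-node` client, the sink could wrap its `capture` call, which also expects a `distinctId`; which identifier to use for an automated eval job is left open here.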
Caveats & currency
- Some score variance between runs (~1–2%) is expected due to LLM judge non-determinism. Treat a drop as a genuine regression only if it exceeds the threshold consistently across 2+ runs.
- Recalibrate thresholds annually as the product matures. A threshold that suited early development may no longer fit once model quality plateaus at a high level.
- Track trend over absolute score — a model running at 4.2 for 3 months is more trustworthy than one that scored 4.5 once and is declining.
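The "2+ runs" caveat can be expressed as a confirmation check over recent scores. This is a sketch under assumptions: `isConfirmedRegression` is a hypothetical helper, and "consistently" is interpreted as every one of the most recent `minRuns` scores falling below the baseline by more than the relative threshold.

```typescript
// baseline: score before the suspected regression.
// recentScores: scores from subsequent runs, oldest first.
// threshold: relative drop, e.g. 0.05 for 5%.
function isConfirmedRegression(
  baseline: number,
  recentScores: number[],
  threshold: number,
  minRuns = 2,
): boolean {
  // Without a positive baseline or enough runs, nothing can be confirmed.
  if (baseline <= 0 || recentScores.length < minRuns) return false;
  const lastRuns = recentScores.slice(-minRuns);
  return lastRuns.every((score) => (score - baseline) / baseline < -threshold);
}
```

For example, with a baseline of 4.2 and a 5% threshold, scores of 3.9 and 3.95 on two successive runs would confirm the regression, while 3.9 followed by 4.15 would not.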
Related skills
- [[eval-benchmark-runner]] — calls this detector after each run
- [[eval-llm-as-judge-system-prompt]] — produces the scores that are compared
- [[eval-leaderboard-updater]] — records results after this detector runs
- [[eval-rubric-legal-soundness]] — primary blocking rubric
- [[eval-rubric-hallucination-detection]] — secondary blocking rubric