eng-langfuse-eval-runner
name: eng-langfuse-eval-runner
description: Use when setting up or operating automated quality evaluation of legal AI skill outputs using Langfuse. Covers the evaluation dataset structure for legal domains, judge-model configuration, scoring rubrics specific to legal drafting and analysis, how to run batch evals across skill versions, and how to connect eval results to feature-flag promotion decisions. Engineering skill for legal AI quality assurance.
license: MIT
metadata:
id: eng.langfuse-eval-runner
category: eng
jurisdictions: [multi]
priority: P2
intent: [eval, quality-assurance, langfuse, scoring, testing, LLM-evaluation]
related:
- eng-langfuse-trace-inspector
- eng-feature-flag-rollout-skills
- eng-latency-slo-by-skill
- eng-cost-per-message-tracker
source: Louis — HAQQ Legal AI (github.com/sboghossian/mini-claude-for-legal)
version: "1.0"
Langfuse Eval Runner
What it does
The eval runner executes systematic quality evaluations of skill outputs, using Langfuse as the orchestration layer. For each skill under test, it:
- Takes a dataset of reference input/output pairs (golden set).
- Runs the current skill version against the inputs.
- Scores the outputs against the golden outputs using a judge model or custom rubric.
- Writes scores back to Langfuse traces.
- Computes aggregated pass rates and quality metrics.
- Feeds results into the feature-flag promotion decision ([[eng-feature-flag-rollout-skills]]).
In a legal AI product, where the cost of a wrong output is potential malpractice exposure, systematic evaluation is non-optional for any skill that produces client-facing legal content.
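The steps listed above can be sketched as a plain loop. This is a minimal illustration, not the runner's actual implementation: `run_eval_loop`, `ItemResult`, `stub_skill`, and `stub_judge` are hypothetical names, the skill and judge are injected as callables, and the write-back to Langfuse is left as a comment because it depends on the SDK version in use.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ItemResult:
    item_id: str
    overall: float
    passed: bool
    notes: str = ""


def run_eval_loop(items: list[dict],
                  run_skill: Callable[[dict], dict],
                  judge: Callable[[dict, dict], dict]) -> list[ItemResult]:
    """Run the skill on every golden item and score each output with the judge."""
    results = []
    for item in items:
        actual = run_skill(item["input"])                 # run the current skill version
        verdict = judge(actual, item["expected_output"])  # score against the golden output
        results.append(ItemResult(item["item_id"], verdict["overall"],
                                  verdict["pass"], verdict.get("notes", "")))
        # Writing the score back to the Langfuse trace would happen here (omitted).
    return results


def pass_rate(results: list[ItemResult]) -> float:
    """Share of items that passed; 0.0 for an empty run."""
    return sum(r.passed for r in results) / len(results) if results else 0.0


# Toy stand-ins (hypothetical): a skill that always answers CONCERN
# and a judge that only checks for an exact match on the result field.
def stub_skill(skill_input: dict) -> dict:
    return {"result": "CONCERN"}


def stub_judge(actual: dict, expected: dict) -> dict:
    ok = actual["result"] == expected["result"]
    return {"overall": 1.0 if ok else 0.0, "pass": ok,
            "notes": "" if ok else "wrong result"}


golden = [
    {"item_id": "a", "input": {}, "expected_output": {"result": "CONCERN"}},
    {"item_id": "b", "input": {}, "expected_output": {"result": "CLEAN"}},
]
results = run_eval_loop(golden, stub_skill, stub_judge)
print(pass_rate(results))
```

Injecting the skill and judge as callables keeps the loop testable without a model or a Langfuse project.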
Evaluation dataset structure
A Langfuse dataset item for a legal skill:
{
  "dataset_id": "efirm-conflict-check-eval-v1",
  "item_id": "ulid",
  "input": {
    "skill_id": "efirm-conflict-check",
    "context": {
      "new_client": "Al Baraka Holdings Ltd",
      "counterparties": ["Noor Capital SAOC", "Ali Hassan (individual)"],
      "matter_description": "Acquisition of minority stake in a KSA-incorporated company"
    },
    "user_message": "Run a conflict check for this new matter."
  },
  "expected_output": {
    "result": "CONCERN",
    "dimension_flagged": "former_client_check",
    "description": "Noor Capital SAOC was represented by the firm in matter 2023-UAE-0081 which may be substantially related"
  },
  "metadata": {
    "skill_version": "1.0",
    "practice_area": "corporate",
    "jurisdiction": "KSA",
    "difficulty": "medium",
    "created_by": "legal-qa-team",
    "reviewed_by": "partner-id"
  }
}
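Before an item enters a golden set, it can help to check that it has the shape shown above. The sketch below is one possible check, assuming every field in the example is required; `validate_item` and the `REQUIRED_*` tuples are hypothetical names, not part of Langfuse.

```python
# Field lists mirror the example item; treating all of them as required is an assumption.
REQUIRED_TOP = ("dataset_id", "item_id", "input", "expected_output", "metadata")
REQUIRED_INPUT = ("skill_id", "context", "user_message")
REQUIRED_METADATA = ("skill_version", "practice_area", "jurisdiction",
                     "difficulty", "created_by", "reviewed_by")


def validate_item(item: dict) -> list[str]:
    """Return a list of problems; an empty list means the item looks well-formed."""
    problems = [f"missing {key}" for key in REQUIRED_TOP if key not in item]
    for section, keys in (("input", REQUIRED_INPUT), ("metadata", REQUIRED_METADATA)):
        body = item.get(section) or {}
        for key in keys:
            if not body.get(key):  # absent, None, or empty counts as a problem
                problems.append(f"missing or empty {section}.{key}")
    return problems


item = {
    "dataset_id": "efirm-conflict-check-eval-v1",
    "item_id": "ulid",
    "input": {"skill_id": "efirm-conflict-check",
              "context": {"new_client": "Al Baraka Holdings Ltd"},
              "user_message": "Run a conflict check for this new matter."},
    "expected_output": {"result": "CONCERN"},
    "metadata": {"skill_version": "1.0", "practice_area": "corporate",
                 "jurisdiction": "KSA", "difficulty": "medium",
                 "created_by": "legal-qa-team", "reviewed_by": ""},
}
print(validate_item(item))
```

An empty `reviewed_by` is flagged on purpose: an item that no practitioner has signed off should not count as golden.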
Rubrics by skill category
Conflict check rubrics
| Dimension | Scoring criterion | Weight |
|---|---|---|
| Correct result classification | CLEAN / CONCERN / CONFLICT matches expected | 40% |
| Correct dimension identification | Right conflict dimension flagged | 25% |
| No false negatives | Did not miss a flagged party | 25% |
| Description quality | Actionable, precise description | 10% |
Drafting skill rubrics (engagement letter, NDA, fee quote)
| Dimension | Scoring criterion | Weight |
|---|---|---|
| Completeness | All required sections present | 25% |
| Accuracy | No fabricated statute numbers, incorrect jurisdiction references | 25% |
| Auto-population | All available fields correctly populated | 20% |
| Plain language | Jargon explained; sentence length within standard | 15% |
| Format compliance | Correct headings, numbering, version stamp | 15% |
Advisory / analysis skill rubrics
| Dimension | Scoring criterion | Weight |
|---|---|---|
| Issue identification | Key legal issues identified | 30% |
| Legal accuracy | No invented legal rules; correct jurisdiction reference | 35% |
| Actionability | Concrete recommendations, not hedged generalities | 20% |
| Appropriate scope | Does not exceed what can be reliably stated | 15% |
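The three tables translate directly into weight maps. The sketch below assumes the overall score is the weighted sum of per-dimension scores, which the tables imply but do not state. The conflict-check keys follow the eval result schema; the drafting and advisory keys are hypothetical snake_case names chosen for illustration.

```python
RUBRICS = {
    "conflict_check": {
        "result_classification": 0.40,
        "dimension_identification": 0.25,
        "no_false_negatives": 0.25,
        "description_quality": 0.10,
    },
    "drafting": {
        "completeness": 0.25,
        "accuracy": 0.25,
        "auto_population": 0.20,
        "plain_language": 0.15,
        "format_compliance": 0.15,
    },
    "advisory": {
        "issue_identification": 0.30,
        "legal_accuracy": 0.35,
        "actionability": 0.20,
        "appropriate_scope": 0.15,
    },
}


def weighted_overall(category: str, scores: dict[str, float]) -> float:
    """Weighted sum of dimension scores (each in 0..1) for one rubric category."""
    weights = RUBRICS[category]
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    return sum(weights[d] * scores[d] for d in weights)


# Each rubric's weights should sum to 100%.
for name, weights in RUBRICS.items():
    assert abs(sum(weights.values()) - 1.0) < 1e-9, name

example = weighted_overall("conflict_check", {
    "result_classification": 1.0,
    "dimension_identification": 1.0,
    "no_false_negatives": 0.5,
    "description_quality": 0.8,
})
print(round(example, 3))
```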
Judge model configuration
For automated scoring (as opposed to human review), the judge model evaluates each output:
judge_config:
  model: claude-sonnet-4-6
  system_prompt: |
    You are a senior legal quality reviewer. Evaluate the AI output against the
    reference output and the rubric. Score each dimension 0–1 (continuous).
    Return JSON: {dimension: score, overall: float, pass: bool, notes: string}.
    Legal accuracy is paramount. A single fabricated statute number or incorrect
    jurisdiction claim is an automatic fail regardless of other scores.
  pass_threshold: 0.75  # overall score ≥ 0.75 = pass
  legal_accuracy_hard_fail: true  # any legal_accuracy < 0.9 = fail regardless
The judge model should never be the model under evaluation; use a different model, ideally a more capable one, as judge.
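Because the judge returns its own `pass` flag, it can be safer to recompute the decision from the configured thresholds rather than trust the model's arithmetic. A minimal sketch, assuming the judge reply is a JSON string and that the legal-accuracy dimension is keyed `legal_accuracy` (both are assumptions; `apply_pass_rule` is a hypothetical helper):

```python
import json

PASS_THRESHOLD = 0.75        # pass_threshold from judge_config
LEGAL_ACCURACY_FLOOR = 0.9   # floor implied by legal_accuracy_hard_fail


def apply_pass_rule(raw_judge_reply: str) -> dict:
    """Parse the judge's JSON reply and recompute pass/fail from the config values."""
    verdict = json.loads(raw_judge_reply)
    overall = float(verdict["overall"])
    legal_accuracy = verdict.get("legal_accuracy")  # absent for rubrics without it
    hard_fail = legal_accuracy is not None and float(legal_accuracy) < LEGAL_ACCURACY_FLOOR
    verdict["hard_fail"] = hard_fail
    verdict["pass"] = overall >= PASS_THRESHOLD and not hard_fail
    return verdict


reply = '{"legal_accuracy": 0.85, "overall": 0.88, "pass": true, "notes": "cites wrong article"}'
print(apply_pass_rule(reply)["pass"])
```

In this example the judge reported a pass, but the legal-accuracy score is below the floor, so the recomputed result is a fail.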
Running an eval
Single skill, current version
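# run_eval is presumably a project-level wrapper on the Langfuse client,
# not a stock Langfuse SDK method.
langfuse.run_eval(
    dataset_name="efirm-conflict-check-eval-v1",
    skill_id="efirm-conflict-check",
    skill_version="current",
    judge_config="legal-qa-rubric",
    run_name="conflict-check-v1-2025-05-14",
)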
A/B comparison (two skill versions)
langfuse.run_eval(
    dataset_name="efirm-engagement-letter-eval-v2",
    experiments=[
        {"skill_version": "1.0", "name": "control"},
        {"skill_version": "1.1-test", "name": "treatment"},
    ],
    judge_config="drafting-rubric",
    significance_threshold=0.05,
)
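The text does not say which statistical test sits behind `significance_threshold`. One common choice for comparing two pass rates is a two-sided two-proportion z-test, sketched below with the standard library only. The function name and the example counts are hypothetical.

```python
import math


def two_proportion_p_value(pass_a: int, n_a: int, pass_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two pass rates (pooled z-test)."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # both arms all-pass or all-fail: no detectable difference
    z = (p_b - p_a) / se
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


# Hypothetical counts: control passes 40 of 50 items, treatment passes 47 of 50.
p = two_proportion_p_value(40, 50, 47, 50)
print(p < 0.05)
```

Since both versions run on the same golden items, a paired test such as McNemar's may fit the data better; the unpaired test here is the simpler starting point.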
Eval result schema (written back to Langfuse)
{
  "run_id": "run_xxx",
  "skill_id": "efirm-conflict-check",
  "skill_version": "1.0",
  "dataset_id": "efirm-conflict-check-eval-v1",
  "n_items": 50,
  "pass_rate": 0.88,
  "avg_score": 0.83,
  "scores_by_dimension": {
    "result_classification": 0.92,
    "dimension_identification": 0.86,
    "no_false_negatives": 0.84,
    "description_quality": 0.79
  },
  "failures": [
    {"item_id": "xxx", "score": 0.61, "notes": "Missed corporate group adversity"}
  ],
  "promotion_recommendation": "PASS: pass_rate ≥ 0.85 and avg_score ≥ 0.80",
  "run_timestamp": "ISO-8601"
}
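The aggregate fields in this schema can be derived from per-item judge results. The sketch below assumes each item result is a dict with `item_id`, `overall`, `pass`, `notes`, and a per-dimension `scores` map; that per-item shape and the name `summarize_run` are assumptions for illustration.

```python
from statistics import mean


def summarize_run(item_results: list[dict]) -> dict:
    """Build the aggregate part of the eval result record from per-item results."""
    if not item_results:
        raise ValueError("cannot summarize an empty run")
    dimensions = sorted({d for r in item_results for d in r["scores"]})
    return {
        "n_items": len(item_results),
        "pass_rate": round(sum(1 for r in item_results if r["pass"]) / len(item_results), 2),
        "avg_score": round(mean(r["overall"] for r in item_results), 2),
        "scores_by_dimension": {
            d: round(mean(r["scores"][d] for r in item_results if d in r["scores"]), 2)
            for d in dimensions
        },
        "failures": [
            {"item_id": r["item_id"], "score": r["overall"], "notes": r["notes"]}
            for r in item_results if not r["pass"]
        ],
    }


items = [
    {"item_id": "a", "overall": 0.9, "pass": True, "notes": "",
     "scores": {"result_classification": 1.0, "description_quality": 0.8}},
    {"item_id": "b", "overall": 0.6, "pass": False,
     "notes": "Missed corporate group adversity",
     "scores": {"result_classification": 0.0, "description_quality": 0.6}},
]
summary = summarize_run(items)
print(summary["failures"])
```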
Promotion gate
Integrate with [[eng-feature-flag-rollout-skills]]:
- If pass_rate ≥ 0.85 AND avg_score ≥ 0.80: auto-promote to next rollout stage.
- If pass_rate < 0.75 OR any legal-accuracy hard fail: block promotion; alert product team.
- Between 0.75 and 0.85: manual review by legal QA partner before promotion.
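The gate can be expressed as a small pure function. This sketch makes one assumption the rules leave open: a run with pass_rate ≥ 0.85 but avg_score < 0.80 is routed to manual review. The function name and the return labels are hypothetical.

```python
def promotion_decision(pass_rate: float, avg_score: float,
                       legal_accuracy_hard_fail: bool = False) -> str:
    """Map run-level metrics to a rollout decision."""
    # Blocking conditions are checked first so a hard fail can never auto-promote.
    if legal_accuracy_hard_fail or pass_rate < 0.75:
        return "BLOCK"
    if pass_rate >= 0.85 and avg_score >= 0.80:
        return "AUTO_PROMOTE"
    # Covers 0.75 <= pass_rate < 0.85, plus the unspecified case of a high
    # pass rate with avg_score below 0.80 (assumption: treat as manual review).
    return "MANUAL_REVIEW"


print(promotion_decision(0.88, 0.83))
print(promotion_decision(0.80, 0.90))
print(promotion_decision(0.90, 0.90, legal_accuracy_hard_fail=True))
```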
Golden set maintenance
- Golden sets must be reviewed and updated by a legal practitioner (not just engineering) at least every 6 months.
- When new legal developments occur (new legislation, regulatory guidance), update affected golden items.
- Target: minimum 30 items per skill for statistically meaningful results; 50–100 items for P0 skills.
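To see what the item-count targets buy, it helps to look at the uncertainty around a measured pass rate. The sketch below uses the Wilson score interval at 95% confidence. The choice of interval and the example counts are assumptions, not something the runner is known to compute.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% confidence."""
    p = passes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical runs with a similar observed pass rate (about 0.87) at two set sizes.
low_30, high_30 = wilson_interval(26, 30)
low_100, high_100 = wilson_interval(87, 100)
print(round(low_30, 2), round(high_30, 2))
print(round(low_100, 2), round(high_100, 2))
```

With 30 items the interval is wide enough to straddle the 0.75 blocking threshold, while with 100 items its lower bound clears it. That is one way to motivate the larger target for P0 skills.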
Related skills
- [[eng-langfuse-trace-inspector]]
- [[eng-feature-flag-rollout-skills]]
- [[eng-latency-slo-by-skill]]
- [[eng-cost-per-message-tracker]]