eng-langfuse-eval-runner


name: eng-langfuse-eval-runner
description: Use when setting up or operating automated quality evaluation of legal AI skill outputs using Langfuse. Covers the evaluation dataset structure for legal domains, judge-model configuration, scoring rubrics specific to legal drafting and analysis, how to run batch evals across skill versions, and how to connect eval results to feature-flag promotion decisions. Engineering skill for legal AI quality assurance.
license: MIT
metadata:
  id: eng.langfuse-eval-runner
  category: eng
  jurisdictions: [multi]
  priority: P2
  intent: [eval, quality-assurance, langfuse, scoring, testing, LLM-evaluation]
  related:
    - eng-langfuse-trace-inspector
    - eng-feature-flag-rollout-skills
    - eng-latency-slo-by-skill
    - eng-cost-per-message-tracker
source: Louis — HAQQ Legal AI (github.com/sboghossian/mini-claude-for-legal)
version: "1.0"

Langfuse Eval Runner

What it does

The eval runner executes systematic quality evaluations of skill outputs, using Langfuse as the orchestration layer. For each skill under test, it:

  1. Takes a dataset of reference input/output pairs (golden set).
  2. Runs the current skill version against the inputs.
  3. Scores the outputs against the golden outputs using a judge model or custom rubric.
  4. Writes scores back to Langfuse traces.
  5. Computes aggregated pass rates and quality metrics.
  6. Feeds results into the feature-flag promotion decision ([[eng-feature-flag-rollout-skills]]).
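The six steps above can be sketched as a plain loop. This is a minimal illustration under stated assumptions, not the project's implementation: `run_skill`, `judge`, and `write_score` are hypothetical callables supplied by the caller, standing in for the skill runtime, the judge model, and the Langfuse score write-back.

```python
from typing import Any, Callable


def run_eval_loop(
    golden_set: list[dict[str, Any]],
    run_skill: Callable[[dict], Any],          # hypothetical: executes the skill version under test
    judge: Callable[[Any, Any], dict],         # hypothetical: returns {"overall": float, "pass": bool}
    write_score: Callable[[str, dict], None],  # hypothetical: persists the verdict, e.g. to Langfuse
) -> dict[str, Any]:
    """Run every golden item through the skill, score it, and aggregate."""
    if not golden_set:
        raise ValueError("golden set is empty")

    verdicts = []
    for item in golden_set:                               # step 1: golden set
        output = run_skill(item["input"])                 # step 2: run the skill
        verdict = judge(output, item["expected_output"])  # step 3: score against the golden output
        write_score(item["item_id"], verdict)             # step 4: write the score back
        verdicts.append(verdict)

    # Step 5: aggregate. The returned summary is what step 6 hands to the promotion gate.
    n = len(verdicts)
    return {
        "n_items": n,
        "pass_rate": sum(1 for v in verdicts if v["pass"]) / n,
        "avg_score": sum(v["overall"] for v in verdicts) / n,
    }
```

Passing the three collaborators in as arguments keeps the loop testable without a live judge model or a Langfuse connection.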

In a legal AI product, where the cost of a wrong output is potential malpractice exposure, systematic evaluation is non-optional for any skill that produces client-facing legal content.

Evaluation dataset structure

A Langfuse dataset item for a legal skill:

{
  "dataset_id": "efirm-conflict-check-eval-v1",
  "item_id": "ulid",
  "input": {
    "skill_id": "efirm-conflict-check",
    "context": {
      "new_client": "Al Baraka Holdings Ltd",
      "counterparties": ["Noor Capital SAOC", "Ali Hassan (individual)"],
      "matter_description": "Acquisition of minority stake in a KSA-incorporated company"
    },
    "user_message": "Run a conflict check for this new matter."
  },
  "expected_output": {
    "result": "CONCERN",
    "dimension_flagged": "former_client_check",
    "description": "Noor Capital SAOC was represented by the firm in matter 2023-UAE-0081 which may be substantially related"
  },
  "metadata": {
    "skill_version": "1.0",
    "practice_area": "corporate",
    "jurisdiction": "KSA",
    "difficulty": "medium",
    "created_by": "legal-qa-team",
    "reviewed_by": "partner-id"
  }
}
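Before uploading items, it can help to check that each one has the shape shown above. The sketch below assumes every field in the example item is required; that is an inference from the example, not a rule stated by the source, and `validate_item` is a hypothetical helper.

```python
# Required fields, inferred from the example item above (an assumption).
REQUIRED_TOP = ("dataset_id", "item_id", "input", "expected_output", "metadata")
REQUIRED_INPUT = ("skill_id", "context", "user_message")
REQUIRED_METADATA = (
    "skill_version", "practice_area", "jurisdiction",
    "difficulty", "created_by", "reviewed_by",
)


def validate_item(item: dict) -> list[str]:
    """Return a list of problems; an empty list means the item is well-formed."""
    problems = [f"missing top-level field: {k}" for k in REQUIRED_TOP if k not in item]
    for section, required in (("input", REQUIRED_INPUT), ("metadata", REQUIRED_METADATA)):
        if section not in item:
            continue  # already reported as a missing top-level field
        body = item[section]
        if not isinstance(body, dict):
            problems.append(f"{section} must be an object")
            continue
        problems += [f"missing {section}.{k}" for k in required if k not in body]
    return problems
```

Requiring `reviewed_by` at upload time is one way to enforce that a legal practitioner has signed off on every golden item.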

Rubrics by skill category

Conflict check rubrics

| Dimension | Scoring criterion | Weight |
|---|---|---|
| Correct result classification | CLEAN / CONCERN / CONFLICT matches expected | 40% |
| Correct dimension identification | Right conflict dimension flagged | 25% |
| No false negatives | Did not miss a flagged party | 25% |
| Description quality | Actionable, precise description | 10% |

Drafting skill rubrics (engagement letter, NDA, fee quote)

| Dimension | Scoring criterion | Weight |
|---|---|---|
| Completeness | All required sections present | 25% |
| Accuracy | No fabricated statute numbers or incorrect jurisdiction references | 25% |
| Auto-population | All available fields correctly populated | 20% |
| Plain language | Jargon explained; sentence length within standard | 15% |
| Format compliance | Correct headings, numbering, version stamp | 15% |

Advisory / analysis skill rubrics

| Dimension | Scoring criterion | Weight |
|---|---|---|
| Issue identification | Key legal issues identified | 30% |
| Legal accuracy | No invented legal rules; correct jurisdiction reference | 35% |
| Actionability | Concrete recommendations, not hedged generalities | 20% |
| Appropriate scope | Does not exceed what can be reliably stated | 15% |
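The three rubrics above can be held as weight maps and combined into a single overall score. The sketch below assumes the overall score is the weighted sum of per-dimension scores; the source gives the weights but does not state the formula. The conflict-check keys follow the result schema used elsewhere in this skill; the drafting and advisory keys are hypothetical snake_case names.

```python
import math

# Weights transcribed from the rubric tables; each rubric should sum to 1.0.
RUBRICS: dict[str, dict[str, float]] = {
    "conflict_check": {
        "result_classification": 0.40,
        "dimension_identification": 0.25,
        "no_false_negatives": 0.25,
        "description_quality": 0.10,
    },
    "drafting": {
        "completeness": 0.25,
        "accuracy": 0.25,
        "auto_population": 0.20,
        "plain_language": 0.15,
        "format_compliance": 0.15,
    },
    "advisory": {
        "issue_identification": 0.30,
        "legal_accuracy": 0.35,
        "actionability": 0.20,
        "appropriate_scope": 0.15,
    },
}


def invalid_rubrics(rubrics: dict[str, dict[str, float]]) -> list[str]:
    """Return the names of rubrics whose weights do not sum to 1.0."""
    return [
        name for name, weights in rubrics.items()
        if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9)
    ]


def weighted_overall(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores, each expected in [0, 1]."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"score for {name} is outside [0, 1]")
    return sum(weights[d] * scores[d] for d in weights)
```

Checking the weight sums at load time catches a rubric edit that silently changes the scale of the overall score.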

Judge model configuration

For automated scoring (as opposed to human review), the judge model evaluates each output:

judge_config:
  model: claude-sonnet-4-6
  system_prompt: |
    You are a senior legal quality reviewer. Evaluate the AI output against the 
    reference output and the rubric. Score each dimension 0–1 (continuous). 
    Return JSON: {dimension: score, overall: float, pass: bool, notes: string}.
    
    Legal accuracy is paramount. A single fabricated statute number or incorrect 
    jurisdiction claim is an automatic fail regardless of other scores.
    
  pass_threshold: 0.75        # overall score ≥ 0.75 = pass
  legal_accuracy_hard_fail: true   # any legal_accuracy < 0.9 = fail regardless

The judge model should never be the model under evaluation. Use a different model, or a more capable one, as the judge.
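The thresholds in the judge configuration can be applied to the judge's JSON reply as sketched below. This is an illustration under assumptions: the reply is assumed to carry a `legal_accuracy` dimension key, and `apply_judge_gate` is a hypothetical helper. The pass decision is recomputed locally from the configured thresholds rather than trusting the `pass` field the judge returns.

```python
import json


def apply_judge_gate(
    raw_reply: str,
    pass_threshold: float = 0.75,
    legal_accuracy_floor: float = 0.9,
    legal_accuracy_hard_fail: bool = True,
) -> dict:
    """Parse the judge's JSON reply and apply the pass threshold and hard-fail rule."""
    verdict = json.loads(raw_reply)
    overall = float(verdict["overall"])

    # Hard fail: legal accuracy below the floor fails the item regardless of overall.
    legal = verdict.get("legal_accuracy")  # assumed key name for the legal-accuracy dimension
    hard_fail = bool(
        legal_accuracy_hard_fail
        and legal is not None
        and float(legal) < legal_accuracy_floor
    )

    return {
        "overall": overall,
        "pass": overall >= pass_threshold and not hard_fail,
        "hard_fail": hard_fail,
        "notes": verdict.get("notes", ""),
    }
```

Recomputing `pass` in code means a judge that is lenient in its own verdict cannot override the configured thresholds.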

Running an eval

Single skill, current version

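# run_eval is assumed to be this project's wrapper around Langfuse datasets;
# it is not a method on the stock Langfuse Python SDK client.
langfuse.run_eval(
    dataset_name="efirm-conflict-check-eval-v1",
    skill_id="efirm-conflict-check",
    skill_version="current",
    judge_config="legal-qa-rubric",
    run_name="conflict-check-v1-2025-05-14"
)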

A/B comparison (two skill versions)

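# Same assumed project-level wrapper; each experiment runs over the same dataset items.
langfuse.run_eval(
    dataset_name="efirm-engagement-letter-eval-v2",
    experiments=[
        {"skill_version": "1.0", "name": "control"},
        {"skill_version": "1.1-test", "name": "treatment"}
    ],
    judge_config="drafting-rubric",
    significance_threshold=0.05  # p-value below which the difference counts as significant
)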
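The source does not say which statistical test sits behind `significance_threshold`. Because both versions are scored on the same dataset items, one reasonable choice is an exact McNemar test on the paired pass/fail outcomes. The sketch below is that swapped-in technique, not the runner's confirmed method, and `mcnemar_exact_p` is a hypothetical helper.

```python
from math import comb


def mcnemar_exact_p(control_pass: list[bool], treatment_pass: list[bool]) -> float:
    """Two-sided exact McNemar p-value for paired pass/fail outcomes.

    Only discordant pairs (items where the two versions disagree) carry
    information about which version is better.
    """
    if len(control_pass) != len(treatment_pass):
        raise ValueError("both versions must be scored on the same items")

    control_only = sum(1 for c, t in zip(control_pass, treatment_pass) if c and not t)
    treatment_only = sum(1 for c, t in zip(control_pass, treatment_pass) if t and not c)
    n_discordant = control_only + treatment_only
    if n_discordant == 0:
        return 1.0  # the versions never disagree, so there is no evidence of a difference

    # Under the null hypothesis each discordant pair is a fair coin flip.
    k = min(control_only, treatment_only)
    tail = sum(comb(n_discordant, i) for i in range(k + 1)) / 2 ** n_discordant
    return min(1.0, 2 * tail)
```

A treatment version would count as significantly different when the returned p-value falls below `significance_threshold`. With golden sets of 30 to 100 items, only large differences will clear that bar.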

Eval result schema (written back to Langfuse)

{
  "run_id": "run_xxx",
  "skill_id": "efirm-conflict-check",
  "skill_version": "1.0",
  "dataset_id": "efirm-conflict-check-eval-v1",
  "n_items": 50,
  "pass_rate": 0.88,
  "avg_score": 0.83,
  "scores_by_dimension": {
    "result_classification": 0.92,
    "dimension_identification": 0.86,
    "no_false_negatives": 0.84,
    "description_quality": 0.79
  },
  "failures": [
    {"item_id": "xxx", "score": 0.61, "notes": "Missed corporate group adversity"}
  ],
  "promotion_recommendation": "PASS: pass_rate 0.88 >= 0.85 and avg_score 0.83 >= 0.80",
  "run_timestamp": "ISO-8601"
}
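A run summary of this shape can be built from per-item judge verdicts. In the sketch below the per-item record layout (`item_id`, `overall`, `pass`, `notes`, `dimensions`) is a hypothetical intermediate format, and every item is assumed to be scored on the same dimensions.

```python
from statistics import mean


def summarize_run(item_results: list[dict]) -> dict:
    """Aggregate per-item verdicts into the run-level fields of the result schema."""
    if not item_results:
        raise ValueError("no item results to summarize")

    # Assumes every item carries the same dimension keys.
    dimensions = sorted(item_results[0]["dimensions"])
    return {
        "n_items": len(item_results),
        "pass_rate": mean([1.0 if r["pass"] else 0.0 for r in item_results]),
        "avg_score": mean([r["overall"] for r in item_results]),
        "scores_by_dimension": {
            d: mean([r["dimensions"][d] for r in item_results]) for d in dimensions
        },
        # Keep only what a reviewer needs to triage each failed item.
        "failures": [
            {"item_id": r["item_id"], "score": r["overall"], "notes": r["notes"]}
            for r in item_results
            if not r["pass"]
        ],
    }
```

Run-level identifiers such as `run_id`, `skill_id`, and `run_timestamp` would be attached by the caller.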

Promotion gate

Integrate with [[eng-feature-flag-rollout-skills]]:

  • If pass_rate ≥ 0.85 AND avg_score ≥ 0.80, with no legal-accuracy hard fail: auto-promote to next rollout stage.
  • If pass_rate < 0.75 OR any legal-accuracy hard fail: block promotion; alert product team.
  • If 0.75 ≤ pass_rate < 0.85: manual review by legal QA partner before promotion.
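The gate above can be expressed as a small decision function. One case is not covered by the rules: a pass_rate of at least 0.85 with an avg_score below 0.80. The sketch assumes that case goes to manual review; `promotion_decision` and the `Decision` names are hypothetical.

```python
from enum import Enum


class Decision(str, Enum):
    AUTO_PROMOTE = "auto_promote"
    MANUAL_REVIEW = "manual_review"
    BLOCK = "block"


def promotion_decision(
    pass_rate: float, avg_score: float, legal_accuracy_hard_fails: int
) -> Decision:
    """Map run-level metrics to a promotion decision."""
    # Blocking conditions are checked first so a hard fail can never be auto-promoted.
    if legal_accuracy_hard_fails > 0 or pass_rate < 0.75:
        return Decision.BLOCK
    if pass_rate >= 0.85 and avg_score >= 0.80:
        return Decision.AUTO_PROMOTE
    # Assumption: everything else, including a high pass_rate with a low
    # avg_score, is routed to manual review by the legal QA partner.
    return Decision.MANUAL_REVIEW
```

Defaulting every unlisted combination to manual review keeps the function conservative: a run is promoted automatically only when it meets both thresholds.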

Golden set maintenance

  • Golden sets must be reviewed and updated by a legal practitioner (not just engineering) at least every 6 months.
  • When new legal developments occur (new legislation, regulatory guidance), update affected golden items.
  • Target: minimum 30 items per skill for statistically meaningful results; 50–100 items for P0 skills.
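The size and review-age rules above lend themselves to an automated check. The sketch below makes two assumptions: "6 months" is approximated as 183 days, and the minimum is 50 items for P0 skills and 30 for all others. `golden_set_issues` is a hypothetical helper.

```python
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=183)  # assumption: "every 6 months" approximated as 183 days
MIN_ITEMS_P0 = 50                      # lower bound of the 50-100 target for P0 skills
MIN_ITEMS_DEFAULT = 30


def golden_set_issues(
    n_items: int, last_legal_review: date, priority: str, today: date
) -> list[str]:
    """Return maintenance problems for a golden set; an empty list means healthy."""
    issues = []
    minimum = MIN_ITEMS_P0 if priority == "P0" else MIN_ITEMS_DEFAULT
    if n_items < minimum:
        issues.append(f"only {n_items} items; need at least {minimum}")
    if today - last_legal_review > REVIEW_INTERVAL:
        issues.append("legal practitioner review overdue")
    return issues
```

Taking `today` as a parameter instead of calling `date.today()` inside the function keeps the check deterministic in tests.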

Related skills

  • [[eng-langfuse-trace-inspector]]
  • [[eng-feature-flag-rollout-skills]]
  • [[eng-latency-slo-by-skill]]
  • [[eng-cost-per-message-tracker]]