eval-rubric-legal-soundness

name: eval-rubric-legal-soundness
description: Use when scoring AI legal output on whether it states the law correctly and applies it correctly to the facts presented. A 0–5 rubric covering rule-statement accuracy, application reasoning, citation reliability, jurisdiction fit, and currency of the law. The primary quality rubric, and a deployment-blocking gate when the aggregate score drops by more than 3% compared with the previous run.
license: MIT
metadata:
  id: eval.rubric.legal-soundness
  category: eval
  priority: P0
  intent: [eval, legal-accuracy, rubric, soundness, quality]
  related: [eval-rubric-citation-quality, eval-rubric-jurisdiction-awareness, eval-rubric-completeness, eval-rubric-hallucination-detection, eval-llm-as-judge-system-prompt, eval-benchmark-runner]
  source: Louis — HAQQ Legal AI (github.com/sboghossian/mini-claude-for-legal)
  version: "1.0"

Eval Rubric — Legal Soundness (0–5)

When to use this

Apply to every legal AI output that makes substantive legal assertions — analysis, advice, research, and drafting. Legal soundness is the primary quality dimension: a well-written response stating wrong law is dangerous; a poorly formatted response stating correct law is at least safe to act on.

This rubric is a deployment-blocking gate: if the aggregate legal soundness score drops by more than 3% compared with the previous run, [[eval-regression-detector]] blocks the deployment pending investigation.
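
As a minimal sketch of the gate check, the comparison could look like the following. It assumes the 3% threshold is a relative drop in the aggregate score; the rubric does not say whether the drop is relative or absolute. The function name `blocks_deployment` and its parameters are hypothetical and do not describe the actual interface of [[eval-regression-detector]].

```python
def blocks_deployment(previous: float, current: float, max_drop: float = 0.03) -> bool:
    """Return True when the aggregate score fell by more than max_drop.

    Assumption: the drop is measured relative to the previous run's
    aggregate score, so 4.0 -> 3.8 is a 5% drop.
    """
    if previous <= 0:
        raise ValueError("previous aggregate score must be positive")
    relative_drop = (previous - current) / previous
    return relative_drop > max_drop


if __name__ == "__main__":
    print(blocks_deployment(previous=4.0, current=3.8))  # 5% drop
    print(blocks_deployment(previous=4.0, current=3.9))  # 2.5% drop
```

An improvement (a negative drop) never blocks under this sketch, and a drop of exactly 3% would need an explicit tolerance if it should block.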

Run via [[eval-llm-as-judge-system-prompt]] using an ensemble of judge models.
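
The rubric does not specify how the ensemble's scores are combined. One plausible approach, shown here purely as an assumption, is to take the median of the judges' 0–5 scores and to flag wide disagreement for human review. The names `aggregate_judge_scores`, `judges_disagree`, and the `max_spread` threshold are hypothetical.

```python
from statistics import median


def aggregate_judge_scores(scores: list[int]) -> float:
    """Combine per-judge 0-5 scores into one score using the median.

    Assumption: median aggregation. The source only says an ensemble
    of judge models is used, not how their scores are merged.
    """
    if not scores:
        raise ValueError("at least one judge score is required")
    for score in scores:
        if not 0 <= score <= 5:
            raise ValueError(f"score {score} is outside the 0-5 rubric range")
    return float(median(scores))


def judges_disagree(scores: list[int], max_spread: int = 2) -> bool:
    """Flag an item when the highest and lowest scores differ by max_spread or more."""
    return max(scores) - min(scores) >= max_spread
```

The median is used here because a single outlier judge cannot move it far, which matters when one model in the ensemble misreads the output.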

Scoring (0–5)

| Score | Label | Criteria |
|---|---|---|
| 5 | Excellent | All legal propositions correctly stated; citations real and accurate; application reasoning sound and properly tied to the stated facts; counter-arguments or alternative interpretations addressed where they exist; law is current as of the response date |
| 4 | Good | Legal propositions mostly correct; minor citation inaccuracy (formatting issue or slightly wrong article number) or one missed nuance that does not materially affect the conclusion |
| 3 | Acceptable | Substantial correct content with at least one moderate error: a wrong article number, a missed exception to a rule, or an application that reaches the right conclusion by imperfect reasoning |
| 2 | Poor | Significant legal errors or missing an applicable rule that would materially change the advice; would mislead a practitioner who relies on it without independent verification |
| 1 | Very poor | Multiple serious errors; foundational rule stated wrong; wrong law applied to the jurisdiction |
| 0 | Wrong / harmful | Materially incorrect to the point of being dangerous to act on, e.g., stating that a type of clause is enforceable when it is void in the stated jurisdiction; advising that no formality is required when notarization is mandatory |
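
For pipelines that need to validate a judge's raw output against these bands, the scale can be represented as an enumeration. This is an illustrative sketch only: `SoundnessScore`, `LABELS`, and `parse_score` are hypothetical names, not part of the rubric or its tooling.

```python
from enum import IntEnum


class SoundnessScore(IntEnum):
    """The six bands of the legal soundness rubric."""
    WRONG_OR_HARMFUL = 0
    VERY_POOR = 1
    POOR = 2
    ACCEPTABLE = 3
    GOOD = 4
    EXCELLENT = 5


# Human-readable labels, copied from the rubric's Label column.
LABELS = {
    SoundnessScore.WRONG_OR_HARMFUL: "Wrong / harmful",
    SoundnessScore.VERY_POOR: "Very poor",
    SoundnessScore.POOR: "Poor",
    SoundnessScore.ACCEPTABLE: "Acceptable",
    SoundnessScore.GOOD: "Good",
    SoundnessScore.EXCELLENT: "Excellent",
}


def parse_score(raw: int) -> SoundnessScore:
    """Convert a judge's raw integer into a band; raises ValueError outside 0-5."""
    return SoundnessScore(raw)
```

Rejecting out-of-range values at parse time keeps a malformed judge response from silently skewing the aggregate.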

Sub-criteria

Rule statement

Is the cited rule the right one, and is it correctly articulated?

  • Is the legal rule identified at the right level of specificity? (Not just "UAE law governs" but the specific statute and principle)
  • Is the rule accurately paraphrased? (Not overstated, not understated)
  • Are exceptions to the rule mentioned where they are material?
  • Is the rule the one that actually applies to the stated facts? (Not a general principle when a specific rule exists)

Application

Does the analysis correctly apply the rule to the facts?

  • Is the application logical? Does it follow from the stated rule?
  • Are key counter-considerations addressed? (e.g., "The contract says X, but UAE courts have the power to modify penalty clauses under Article 390 of the Civil Transactions Law")
  • Is the conclusion justified by the stated rule and facts?

Citations

Are the cited authorities real, correctly attributed, and not fabricated?
(Note: deep citation quality analysis is in [[eval-rubric-citation-quality]]; this sub-criterion is a light check to catch obvious hallucinations)

Jurisdiction fit

Does the answer apply the law of the correct jurisdiction?
(Note: deep jurisdiction analysis is in [[eval-rubric-jurisdiction-awareness]]; this sub-criterion catches cases where clearly wrong law is applied)

Currency

Is the law as stated current as of the response date?

  • Laws that have been repealed or significantly amended without acknowledgment score lower.
  • A response that cites the old UAE Labour Law (Federal Law No. 8 of 1980) without noting it was replaced by Federal Decree-Law No. 33 of 2021 scores ≤ 3.
  • For changes after the model's training cutoff: the model is not expected to know about them, but it must say "as of my knowledge cutoff" wherever there is a reasonable risk that the law has changed.

The following are common failure points for generic LLMs on MENA legal matters. Grade strictly on these:

| Issue | Failure mode |
|---|---|
| End-of-service gratuity (EOSG) calculation | Wrong formula (21-day vs 30-day, partial vs full for short tenure); ignoring DIFC vs onshore distinction |
| Penalty clauses | Stating they are per se enforceable without noting UAE/Lebanon courts' power to reduce |
| Non-competes | Stating they are easily enforceable in KSA (they are not) |
| Interest | Not flagging Shariah prohibition in KSA; not noting UAE Civil Transactions Law position |
| Company formation | Confusing DIFC, ADGM, onshore, and free-zone rules |
| Property ownership | Not flagging foreign ownership restrictions in UAE onshore / KSA |
| Choice of law | Not noting that UAE Labour Law protections cannot be waived by choice of foreign law for UAE-sited employees |

Outputs that make any of these errors score ≤ 3 on legal soundness regardless of overall quality.
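
The ceiling described above can be applied as a post-processing step on the judge's raw score. The sketch below assumes the judge (or a separate checker) returns the set of failure-point issues it detected; the issue keys, `KNOWN_FAILURE_ISSUES`, and `apply_failure_cap` are hypothetical names introduced for illustration.

```python
FAILURE_MODE_CAP = 3  # ceiling stated in the rubric for these errors

# Hypothetical keys, one per row of the failure-point table.
KNOWN_FAILURE_ISSUES = frozenset({
    "eosg_calculation",
    "penalty_clauses",
    "non_competes",
    "interest",
    "company_formation",
    "property_ownership",
    "choice_of_law",
})


def apply_failure_cap(raw_score: int, flagged_issues: set[str]) -> int:
    """Cap the score at 3 when any listed failure point was flagged.

    A score already at or below the cap is returned unchanged: the cap
    only lowers scores, it never raises them.
    """
    unknown = flagged_issues - KNOWN_FAILURE_ISSUES
    if unknown:
        raise ValueError(f"unrecognized failure issues: {sorted(unknown)}")
    if flagged_issues:
        return min(raw_score, FAILURE_MODE_CAP)
    return raw_score
```

For example, an otherwise excellent answer that treats a penalty clause as per se enforceable would go from 5 to 3, while an answer already scored 2 stays at 2.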

Use in automated scoring

Inject this rubric definition into [[eval-llm-as-judge-system-prompt]]. Weight it 0.35 (highest weight) in the composite score:

composite_score = 0.35 × legal_soundness
               + 0.20 × citation_quality
               + 0.20 × jurisdiction_awareness
               + 0.15 × completeness
               + 0.10 × (binary hallucination gate)
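
A sketch of this weighting in Python follows. The formula mixes 0–5 rubric scores with a binary gate, and the source does not say how the scales are reconciled. This example assumes each rubric score is divided by 5 and the gate contributes 1.0 when passed and 0.0 when failed, giving a composite in the range 0 to 1. The names `WEIGHTS` and `composite_score` are hypothetical.

```python
WEIGHTS = {
    "legal_soundness": 0.35,
    "citation_quality": 0.20,
    "jurisdiction_awareness": 0.20,
    "completeness": 0.15,
    "hallucination_gate": 0.10,
}


def composite_score(rubric_scores: dict[str, float], hallucination_gate_passed: bool) -> float:
    """Weighted composite in [0, 1].

    Assumptions: rubric_scores holds the four 0-5 rubric scores keyed as
    in WEIGHTS, each normalized by dividing by 5; the binary gate counts
    as 1.0 when passed and 0.0 when failed.
    """
    normalized = {name: score / 5.0 for name, score in rubric_scores.items()}
    normalized["hallucination_gate"] = 1.0 if hallucination_gate_passed else 0.0
    missing = WEIGHTS.keys() - normalized.keys()
    if missing:
        raise KeyError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * normalized[name] for name in WEIGHTS)


if __name__ == "__main__":
    example = {
        "legal_soundness": 4,
        "citation_quality": 3,
        "jurisdiction_awareness": 5,
        "completeness": 4,
    }
    print(round(composite_score(example, hallucination_gate_passed=True), 2))
```

Because the five weights sum to 1.0, a response with perfect rubric scores and a passed gate reaches the maximum composite under this normalization.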

Limits & escalation

A legal soundness score alone does not determine whether an output is safe to act on. Always pair with [[eval-rubric-hallucination-detection]] (existence gate) and [[eval-rubric-jurisdiction-awareness]] (applicability gate). A score of 4/5 on legal soundness from a model that regularly fabricates citations is meaningless without the hallucination gate.

Related skills

  • [[eval-rubric-citation-quality]]: deep citation quality analysis
  • [[eval-rubric-jurisdiction-awareness]]: jurisdiction accuracy
  • [[eval-rubric-completeness]]: whether all applicable rules were addressed
  • [[eval-rubric-hallucination-detection]]: binary fabrication gate
  • [[eval-llm-as-judge-system-prompt]]: applies this rubric in the automated pipeline
  • [[eval-benchmark-runner]]: runs this rubric across all benchmark datasets