name: eval-rubric-legal-soundness
description: Use when scoring AI legal output on whether it states the law accurately and applies it correctly to the facts presented. A 0–5 rubric covering rule statement accuracy, application reasoning, citation reliability, jurisdiction fit, and currency of the law. The primary quality rubric, and a deployment-blocking gate if the aggregate score drops by more than 3% against the previous run.
license: MIT
metadata:
id: eval.rubric.legal-soundness
category: eval
priority: P0
intent: [eval, legal-accuracy, rubric, soundness, quality]
related: [eval-rubric-citation-quality, eval-rubric-jurisdiction-awareness, eval-rubric-completeness, eval-rubric-hallucination-detection, eval-llm-as-judge-system-prompt, eval-benchmark-runner]
source: Louis — HAQQ Legal AI (github.com/sboghossian/mini-claude-for-legal)
version: "1.0"
Eval Rubric — Legal Soundness (0–5)
When to use this
Apply to every legal AI output that makes substantive legal assertions — analysis, advice, research, and drafting. Legal soundness is the primary quality dimension: a well-written response stating wrong law is dangerous; a poorly formatted response stating correct law is at least safe to act on.
This rubric is a deployment-blocking gate: if the aggregate legal soundness score drops by more than 3% compared with the previous run, [[eval-regression-detector]] blocks the deployment pending investigation.
Run via [[eval-llm-as-judge-system-prompt]] using an ensemble of judge models.
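The regression gate described above can be sketched as a single comparison. This is a minimal illustration, not the actual [[eval-regression-detector]] logic: the function name `soundness_regressed` is hypothetical, and the sketch assumes the 3% threshold is a relative drop against the previous run's aggregate score (the text does not say whether it is relative or in absolute points).

```python
def soundness_regressed(previous: float, current: float, max_drop: float = 0.03) -> bool:
    """Return True when the aggregate score fell by more than max_drop.

    Assumption: the drop is measured relative to the previous run's score.
    """
    if previous <= 0:
        # No meaningful baseline to compare against, so nothing can regress.
        return False
    relative_drop = (previous - current) / previous
    return relative_drop > max_drop


if __name__ == "__main__":
    # Aggregate fell from 4.0 to 3.8, a 5% relative drop: the gate would block.
    print(soundness_regressed(4.0, 3.8))
```

If the threshold is meant as absolute points on the 0–5 scale instead, the comparison becomes `(previous - current) > max_drop` with `max_drop` expressed in points.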
Scoring (0–5)
| Score | Label | Criteria |
|---|---|---|
| 5 | Excellent | All legal propositions correctly stated; citations real and accurate; application reasoning sound and properly tied to the stated facts; counter-arguments or alternative interpretations addressed where they exist; law is current as of the response date |
| 4 | Good | Legal propositions mostly correct; minor citation inaccuracy (formatting issue or slightly wrong article number) or one missed nuance that does not materially affect the conclusion |
| 3 | Acceptable | Substantial correct content with at least one moderate error — a wrong article number, a missed exception to a rule, or an application that reaches the right conclusion by imperfect reasoning |
| 2 | Poor | Significant legal errors, or omission of an applicable rule that would materially change the advice; would mislead a practitioner who relies on it without independent verification |
| 1 | Very poor | Multiple serious errors; a foundational rule is stated incorrectly; the wrong law is applied for the jurisdiction |
| 0 | Wrong / harmful | Materially incorrect to the point of being dangerous to act on — e.g., stating that a type of clause is enforceable when it is void in the stated jurisdiction; advising that no formality is required when notarization is mandatory |
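When judge output is parsed by a script, the scale above can be represented as an enumeration so that out-of-range or non-integer scores are rejected early. This is a sketch under assumptions: `SoundnessScore`, `LABELS`, and `parse_score` are hypothetical names, and the labels are copied from the table.

```python
from enum import IntEnum


class SoundnessScore(IntEnum):
    WRONG_OR_HARMFUL = 0
    VERY_POOR = 1
    POOR = 2
    ACCEPTABLE = 3
    GOOD = 4
    EXCELLENT = 5


# Human-readable labels, matching the "Label" column of the rubric table.
LABELS = {
    SoundnessScore.WRONG_OR_HARMFUL: "Wrong / harmful",
    SoundnessScore.VERY_POOR: "Very poor",
    SoundnessScore.POOR: "Poor",
    SoundnessScore.ACCEPTABLE: "Acceptable",
    SoundnessScore.GOOD: "Good",
    SoundnessScore.EXCELLENT: "Excellent",
}


def parse_score(raw: object) -> SoundnessScore:
    """Convert a judge's raw score (int or integer string) to a SoundnessScore.

    Raises ValueError for anything that is not a whole number from 0 to 5.
    """
    if isinstance(raw, bool) or not isinstance(raw, (int, str)):
        raise ValueError(f"score must be an integer 0-5, got {raw!r}")
    return SoundnessScore(int(raw))
```

Rejecting floats such as `4.5` rather than truncating them keeps a judge's malformed output from silently turning into a valid score.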
Sub-criteria
Rule statement
Is the cited rule the right one, and is it correctly articulated?
- Is the legal rule identified at the right level of specificity? (Not just "UAE law governs" but the specific statute and principle)
- Is the rule accurately paraphrased? (Not overstated, not understated)
- Are exceptions to the rule mentioned where they are material?
- Is the rule the one that actually applies to the stated facts? (Not a general principle when a specific rule exists)
Application
Does the analysis correctly apply the rule to the facts?
- Is the application logical? Does it follow from the stated rule?
- Are key counter-considerations addressed? (e.g., "The contract says X, but UAE courts have the power to modify penalty clauses under Article 390 of the Civil Transactions Law")
- Is the conclusion justified by the stated rule and facts?
Citations
Are the cited authorities real, correctly attributed, and not fabricated?
(Note: deep citation quality analysis is in [[eval-rubric-citation-quality]]; this sub-criterion is a light check to catch obvious hallucinations)
Jurisdiction fit
Does the answer apply the law of the right jurisdiction?
(Note: deep jurisdiction analysis is in [[eval-rubric-jurisdiction-awareness]]; this sub-criterion catches cases where clearly wrong law is applied)
Currency
Is the law as stated current as of the response date?
- A response that relies on a repealed or significantly amended law without acknowledging the change scores lower.
- A response that cites the old UAE Labour Law (Federal Law No. 8 of 1980) without noting it was replaced by Federal Decree-Law No. 33 of 2021 scores ≤ 3.
- For post-training-cutoff changes: the model is not expected to know, but must say "as of my knowledge cutoff" when there is reasonable risk of change.
MENA-specific legal soundness checkpoints
The following are common failure points for generic LLMs on MENA legal matters — grade strictly on these:
| Issue | Failure mode |
|---|---|
| End-of-service gratuity (EOSG) calculation | Wrong formula (21-day vs 30-day, partial vs full for short tenure); ignoring DIFC vs onshore distinction |
| Penalty clauses | Stating they are per se enforceable without noting UAE/Lebanon courts' power to reduce |
| Non-competes | Stating they are easily enforceable in KSA (they are not) |
| Interest | Not flagging Shariah prohibition in KSA; not noting UAE Civil Transactions Law position |
| Company formation | Confusing DIFC, ADGM, onshore, and free-zone rules |
| Property ownership | Not flagging foreign ownership restrictions in UAE onshore / KSA |
| Choice of law | Not noting that UAE Labour Law protections cannot be waived by choice of foreign law for UAE-sited employees |
Outputs that make any of these errors score ≤ 3 on legal soundness regardless of overall quality.
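The cap in the rule above can be applied as a post-processing step after the judge has assigned a raw score. The sketch below is illustrative only: the checkpoint identifiers and the function `apply_checkpoint_cap` are hypothetical names, chosen to mirror the rows of the table.

```python
from typing import Iterable

# Hypothetical identifiers, one per row of the checkpoint table.
MENA_CHECKPOINTS = frozenset({
    "eosg_calculation",
    "penalty_clauses",
    "non_competes",
    "interest",
    "company_formation",
    "property_ownership",
    "choice_of_law",
})

CHECKPOINT_CAP = 3  # "score <= 3 on legal soundness regardless of overall quality"


def apply_checkpoint_cap(raw_score: int, failed_checkpoints: Iterable[str]) -> int:
    """Cap the soundness score at 3 when any MENA checkpoint was failed."""
    failed = set(failed_checkpoints)
    unknown = failed - MENA_CHECKPOINTS
    if unknown:
        raise ValueError(f"unknown checkpoint(s): {sorted(unknown)}")
    if failed:
        return min(raw_score, CHECKPOINT_CAP)
    return raw_score
```

The cap only ever lowers a score: an output already rated 2 stays at 2 even when a checkpoint is flagged.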
Use in automated scoring
Inject this rubric definition into [[eval-llm-as-judge-system-prompt]]. Weight it 0.35 (highest weight) in the composite score:

    composite_score = 0.35 × legal_soundness
                    + 0.20 × citation_quality
                    + 0.20 × jurisdiction_awareness
                    + 0.15 × completeness
                    + 0.10 × (binary hallucination gate)
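The weighted sum above could be computed as follows. This is a sketch under two assumptions that the formula itself does not state: that all four rubric scores are on the same 0–5 scale and are normalized to 0–1 before weighting, and that the hallucination gate contributes 1 when passed and 0 when failed. The names `WEIGHTS` and `composite_score` are illustrative.

```python
WEIGHTS = {
    "legal_soundness": 0.35,
    "citation_quality": 0.20,
    "jurisdiction_awareness": 0.20,
    "completeness": 0.15,
    "hallucination_gate": 0.10,
}

MAX_RUBRIC_SCORE = 5  # assumption: every rubric uses the 0-5 scale


def composite_score(
    legal_soundness: float,
    citation_quality: float,
    jurisdiction_awareness: float,
    completeness: float,
    hallucination_gate_passed: bool,
) -> float:
    """Return the weighted composite in the range 0.0 to 1.0."""
    rubric_scores = {
        "legal_soundness": legal_soundness,
        "citation_quality": citation_quality,
        "jurisdiction_awareness": jurisdiction_awareness,
        "completeness": completeness,
    }
    for name, score in rubric_scores.items():
        if not 0 <= score <= MAX_RUBRIC_SCORE:
            raise ValueError(f"{name} must be between 0 and 5, got {score}")
    # Normalize each 0-5 score to 0-1, then apply its weight.
    total = sum(
        WEIGHTS[name] * score / MAX_RUBRIC_SCORE
        for name, score in rubric_scores.items()
    )
    # The gate is binary: it adds its full weight when passed, nothing otherwise.
    total += WEIGHTS["hallucination_gate"] * (1.0 if hallucination_gate_passed else 0.0)
    return total
```

Under these assumptions a response that scores 5 on every rubric but fails the hallucination gate tops out at about 0.90.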
Limits & escalation
A legal soundness score alone does not determine whether an output is safe to act on. Always pair with [[eval-rubric-hallucination-detection]] (existence gate) and [[eval-rubric-jurisdiction-awareness]] (applicability gate). A score of 4/5 on legal soundness from a model that regularly fabricates citations is meaningless without the hallucination gate.
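One way to enforce this pairing in a pipeline is to refuse to report a soundness score unless both gates have passed. The sketch below is an assumption about how the pairing might be wired, not a description of the existing pipeline; `gated_soundness` is a hypothetical helper.

```python
from typing import Optional


def gated_soundness(
    score: int,
    hallucination_gate_passed: bool,
    jurisdiction_gate_passed: bool,
) -> Optional[int]:
    """Return the soundness score only when both gates passed, else None.

    Returning None (rather than 0) signals "not interpretable" instead of
    "scored as harmful", which are different outcomes.
    """
    if hallucination_gate_passed and jurisdiction_gate_passed:
        return score
    return None
```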
Related skills
- [[eval-rubric-citation-quality]] — deep citation quality analysis
- [[eval-rubric-jurisdiction-awareness]] — jurisdiction accuracy
- [[eval-rubric-completeness]] — whether all applicable rules were addressed
- [[eval-rubric-hallucination-detection]] — binary fabrication gate
- [[eval-llm-as-judge-system-prompt]] — applies this rubric in the automated pipeline
- [[eval-benchmark-runner]] — runs this rubric across all benchmark datasets