openclaw-eval-harness-shared
name: openclaw-eval-harness-shared
description: Use when evaluating the quality, accuracy, or safety of a legal AI skill against a standardized benchmark. The shared eval harness provides community-maintained datasets (NDA, employment, real estate, research, corporate), rubrics for legal soundness and hallucination detection, multi-judge scoring to reduce bias, and a public leaderboard for comparing skill quality across providers and versions.
license: MIT
metadata:
  id: openclaw.eval-harness-shared
  category: openclaw
  priority: P2
  intent: [openclaw, eval, benchmark, legal-ai-quality, rubric, hallucination]
  related: [openclaw-public-skill-registry, openclaw-contrib-template, openclaw-skill-portability-claude-codex-gemini]
  source: Louis — HAQQ Legal AI (github.com/sboghossian/mini-claude-for-legal)
  version: "1.0"
OpenClaw — Shared Eval Harness
Purpose
Legal AI skills need to be evaluated against a consistent, reproducible quality bar before they are trusted in professional practice. The OpenClaw Shared Eval Harness is a community-maintained open-source framework that lets skill authors, AI vendors, and legal professionals benchmark legal AI quality across four dimensions: legal soundness, citation quality, hallucination rate, and jurisdictional accuracy.
The harness is intentionally open: anyone can contribute datasets, rubrics, or judge prompts. Vendors may run their models against the harness to produce publicly comparable scores.
Components
1. Datasets
Community-contributed prompt/answer pairs organized by practice area:
| Dataset | Description | Coverage |
|---|---|---|
| NDA dataset | Drafting, review, red-flag analysis | UAE (onshore + DIFC), LB, UK, US |
| Employment dataset | Contracts, non-competes, termination analysis | UAE, KSA, LB, UK, US |
| Real estate dataset | Lease review, SPA analysis, title issues | UAE, LB, EG, UK |
| Research dataset | Regulatory questions, statute lookup, comparative law | Multi-jurisdiction |
| Corporate dataset | SHA, MOU, acquisition terms | DIFC, ADGM, GCC |
Each entry in a dataset contains:
- prompt: the user input (in the most realistic form possible)
- reference_answer: the expected correct response, authored or reviewed by an admitted lawyer
- jurisdiction: the applicable legal system
- practice_area: the skill category being tested
- difficulty: easy / medium / hard / trap (where the correct answer is counterintuitive)
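As a minimal sketch of the entry fields listed above, the record could be modelled as a small dataclass. The class name `DatasetEntry`, the validation rules, and the sample values are assumptions for illustration; the repository may store entries in a different shape.

```python
from dataclasses import dataclass

# Difficulty labels named in the field list above.
DIFFICULTIES = {"easy", "medium", "hard", "trap"}


@dataclass(frozen=True)
class DatasetEntry:
    """One prompt/answer pair (hypothetical model of a dataset entry)."""

    prompt: str
    reference_answer: str
    jurisdiction: str
    practice_area: str
    difficulty: str

    def __post_init__(self) -> None:
        # Reject difficulty labels that the harness does not define.
        if self.difficulty not in DIFFICULTIES:
            raise ValueError(f"unknown difficulty: {self.difficulty!r}")
        # Every text field must carry some content.
        for name in ("prompt", "reference_answer", "jurisdiction", "practice_area"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} must not be empty")


# Placeholder values only; real entries need a lawyer-reviewed answer.
entry = DatasetEntry(
    prompt="Review this NDA for red flags.",
    reference_answer="(lawyer-reviewed reference answer)",
    jurisdiction="UK",
    practice_area="nda",
    difficulty="medium",
)
```

Validating at construction time means a malformed entry fails when it is loaded rather than partway through a scoring run.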
2. Rubrics
Evaluation rubrics define what a correct response looks like across dimensions:
Legal soundness (0–5 scale)
- 5: Fully accurate, complete, and actionable; no material omission
- 4: Accurate with minor gaps that would not mislead a practitioner
- 3: Mostly accurate but missing at least one material point
- 2: Partially accurate; some claims are wrong or misleading
- 1: Mostly wrong; could lead a practitioner to a harmful conclusion
- 0: Completely wrong or hallucinates legal authority
Citation quality (0–3 scale)
- 3: All citations accurate and correctly formatted for the jurisdiction
- 2: Citations present but minor formatting or pin-cite errors
- 1: Citations present but at least one is fabricated or wrong
- 0: No citations where they were required, or all citations fabricated
Hallucination rate (binary per claim)
- A claim is a hallucination if it asserts a specific legal fact (statute number, article, case name, threshold) that is factually wrong or non-existent.
- Report hallucination rate as: (number of hallucinated claims) / (total verifiable claims).
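The ratio above can be sketched as a short function. The function name and the choice to return `None` when a response contains no verifiable claims are assumptions; the rubric does not say how that case is handled.

```python
from typing import List, Optional


def hallucination_rate(claim_is_hallucinated: List[bool]) -> Optional[float]:
    """Return hallucinated claims / total verifiable claims.

    Each list item is the binary judgement for one verifiable claim.
    Returns None when there are no verifiable claims (an assumption).
    """
    total = len(claim_is_hallucinated)
    if total == 0:
        return None
    # True counts as 1, so sum() gives the number of hallucinated claims.
    return sum(claim_is_hallucinated) / total


# One hallucinated claim out of four verifiable claims.
rate = hallucination_rate([True, False, False, False])
```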
Jurisdictional accuracy (0–2 scale)
- 2: Response correctly identifies and applies the applicable jurisdiction
- 1: Response applies the wrong jurisdiction, but the advice is technically correct for the jurisdiction it applied
- 0: Response conflates jurisdictions or applies a clearly wrong legal system
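A judge's output can be checked against the three graded scales above before it is recorded. This is a sketch under assumptions: the key names in `RUBRIC_MAX` and the function `validate_scores` are hypothetical, and only the maxima (5, 3, 2) come from the rubrics.

```python
# Maximum score per graded rubric, taken from the scales above.
# The key names are hypothetical.
RUBRIC_MAX = {
    "legal_soundness": 5,
    "citation_quality": 3,
    "jurisdictional_accuracy": 2,
}


def validate_scores(scores: dict) -> dict:
    """Raise ValueError unless every rubric has an in-range integer score."""
    for rubric, maximum in RUBRIC_MAX.items():
        value = scores.get(rubric)
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{rubric}: expected an integer score, got {value!r}")
        if not 0 <= value <= maximum:
            raise ValueError(f"{rubric}: {value} is outside 0..{maximum}")
    return scores


checked = validate_scores(
    {"legal_soundness": 4, "citation_quality": 2, "jurisdictional_accuracy": 2}
)
```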
3. Judge models
Single-judge evaluation introduces model-specific bias. To counter this, the harness uses a panel of at least three judge models (e.g., Claude, GPT-4, Gemini), each scoring every response independently. The final score for a response is the median of the judges' scores after outliers are excluded.
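The aggregation step might look like the sketch below. The text does not define what counts as an outlier, so the rule used here (drop any score more than `max_deviation` points from the panel median) and its default value are assumptions, not the harness's documented behaviour.

```python
from statistics import median
from typing import List


def panel_score(judge_scores: List[float], max_deviation: float = 2.0) -> float:
    """Median of judge scores after dropping outliers.

    Outlier rule (an assumption): a score is dropped when it lies more
    than `max_deviation` points from the median of the full panel.
    """
    if len(judge_scores) < 3:
        raise ValueError("the panel needs at least three judge scores")
    centre = median(judge_scores)
    kept = [s for s in judge_scores if abs(s - centre) <= max_deviation]
    # With an even, evenly split panel every score can be far from the
    # centre; fall back to the plain median in that case.
    if not kept:
        return centre
    return median(kept)


# The third judge disagrees sharply and is excluded before the median.
score = panel_score([4, 4, 1])
```

The median is already fairly robust to a single extreme score, so the exclusion step mostly matters for larger panels.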
Judge prompts are templated and version-controlled in the repository. They include:
- The rubric being applied
- The reference answer (for grounded evaluation)
- Instructions to score independently without knowing which vendor produced the candidate answer
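To illustrate the three ingredients listed above, a judge prompt could be assembled with the standard library's `string.Template`. The wording of the template and the names `JUDGE_TEMPLATE` and `build_judge_prompt` are hypothetical; the version-controlled prompts in the repository are the authoritative text.

```python
from string import Template

# Hypothetical template: rubric, reference answer, and a blind-scoring
# instruction, with the candidate answer passed in without vendor identity.
JUDGE_TEMPLATE = Template(
    "You are scoring a candidate answer to a legal question.\n"
    "Score independently. You are not told which vendor produced the answer.\n\n"
    "Rubric:\n$rubric\n\n"
    "Reference answer:\n$reference_answer\n\n"
    "Candidate answer:\n$candidate_answer\n\n"
    "Reply with a single integer score."
)


def build_judge_prompt(rubric: str, reference_answer: str, candidate_answer: str) -> str:
    """Fill the template; substitute() raises KeyError if a field is missing."""
    return JUDGE_TEMPLATE.substitute(
        rubric=rubric,
        reference_answer=reference_answer,
        candidate_answer=candidate_answer,
    )


prompt_text = build_judge_prompt(
    rubric="Jurisdictional accuracy (0-2 scale)",
    reference_answer="(lawyer-reviewed reference answer)",
    candidate_answer="(model output under test)",
)
```

Keeping the vendor name out of the function signature is one simple way to make sure it never reaches the judge.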
4. Leaderboard
Aggregate scores per vendor/model/skill version are published to the OpenClaw public leaderboard. The leaderboard shows:
- Overall score per dataset
- Per-rubric breakdown
- Version history (so regressions are visible)
- Whether the run was community-verified or vendor-self-reported
Leaderboard entries that are vendor-self-reported are labelled as such; community-verified runs (where an independent reviewer replicated the evaluation) receive a verified badge.
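Version history makes regressions visible; a reader could also detect them programmatically. The sketch below assumes a hypothetical history format, a list of `(version, score)` pairs in release order, which is not necessarily how the leaderboard stores its data.

```python
from typing import List, Tuple


def find_regressions(history: List[Tuple[str, float]]) -> List[Tuple[str, str, float]]:
    """Return (previous_version, version, drop) for each score decrease.

    `history` is assumed to be ordered from oldest to newest release.
    """
    regressions = []
    # Compare each release with the one immediately before it.
    for (prev_version, prev_score), (version, score) in zip(history, history[1:]):
        if score < prev_score:
            regressions.append((prev_version, version, prev_score - score))
    return regressions


# Illustrative scores only, not real leaderboard data.
drops = find_regressions([("1.0", 3.5), ("1.1", 4.0), ("1.2", 3.0)])
```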
Running the harness
```bash
# Install
git clone https://github.com/sboghossian/mini-claude-for-legal
cd mini-claude-for-legal/eval

# Run against a specific skill + dataset
python run_eval.py \
  --skill draft-nda-unilateral \
  --dataset nda \
  --model claude-sonnet-4-5 \
  --judges claude,gpt-4o,gemini-1.5-pro \
  --output results/my-run.json
```
Results are written to a JSON file with per-prompt scores and the aggregated summary. To submit to the leaderboard, open a PR against eval/results/ with your run file.
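Before opening a PR, it may help to confirm that a run file parses and contains both parts described above. The exact JSON schema is not documented here, so the key names `per_prompt` and `summary`, the sample record, and the function `check_run` are assumptions for illustration only.

```python
import json

# Hypothetical top-level keys for per-prompt scores and the summary.
REQUIRED_KEYS = ("per_prompt", "summary")


def check_run(raw_json: str) -> dict:
    """Parse a run file's text and verify the assumed top-level keys."""
    run = json.loads(raw_json)
    if not isinstance(run, dict):
        raise ValueError("run file must contain a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in run]
    if missing:
        raise ValueError(f"run file is missing: {', '.join(missing)}")
    return run


# Illustrative content only; a real file comes from run_eval.py.
sample = json.dumps(
    {
        "per_prompt": [{"prompt_id": "nda-001", "legal_soundness": 4}],
        "summary": {"legal_soundness": 4.0},
    }
)
run = check_run(sample)
```

For a file on disk, the text can be read with `pathlib.Path("results/my-run.json").read_text()` and passed to the same function.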
Contributing datasets and rubrics
Contributors should:
- Author reference answers in consultation with an admitted lawyer in the relevant jurisdiction.
- Label each entry with accurate jurisdiction and difficulty tags.
- Flag "trap" entries — cases where the obvious (but wrong) answer is a common AI failure mode.
- Submit via a PR with a brief description of the gap being filled.
Do not submit synthetic reference answers generated purely by AI without practitioner review — the harness is only as good as its ground truth.
Caveats
- Eval scores measure skill output quality at a point in time against a fixed dataset. They are not a guarantee of performance in production.
- Legal standards change. Datasets should be reviewed for currency at least annually. Outdated reference answers are labelled with a staleness warning.
- The harness tests the AI output, not the underlying law. Verify any regulatory claims against primary sources before acting on them in practice.
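The annual currency review could be automated with a simple date check. This sketch assumes each entry records a hypothetical `last_reviewed` date and treats "annually" as 365 days; neither detail is specified by the harness.

```python
from datetime import date, timedelta

# "At least annually", interpreted here as 365 days (an assumption).
REVIEW_INTERVAL = timedelta(days=365)


def is_stale(last_reviewed: date, today: date) -> bool:
    """True when the last review is older than the review interval."""
    return today - last_reviewed > REVIEW_INTERVAL


# An entry last reviewed 17 months before the check date is stale.
stale = is_stale(date(2023, 1, 1), date(2024, 6, 1))
```

Passing `today` explicitly keeps the check deterministic, which makes it easier to test than calling `date.today()` inside the function.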
Related skills
- [[openclaw-public-skill-registry]] — the registry of skills being evaluated
- [[openclaw-contrib-template]] — how to contribute skills and datasets
- [[openclaw-skill-portability-claude-codex-gemini]] — test portability across providers