eval-dataset-competitor-comparison-set

name: eval-dataset-competitor-comparison-set
description: Use when running or interpreting the competitor comparison benchmark that measures the legal AI system's output quality against comparable tools on standardized legal tasks. Provides structured side-by-side scoring methodology, output dimensions, and the MENA-jurisdiction coverage gap analysis that differentiates this system.
license: MIT
metadata:
  id: eval.dataset.competitor-comparison-set
  category: eval
  jurisdictions: [multi]
  priority: P2
  intent: [eval, competitor, benchmarking, quality, differentiation]
  related: [eval-benchmark-runner, eval-rubric-legal-soundness, eval-rubric-jurisdiction-awareness, eval-rubric-hallucination-detection, eval-leaderboard-updater]
  source: Louis — HAQQ Legal AI (github.com/sboghossian/mini-claude-for-legal)
  version: "1.0"

Eval Dataset — Competitor Comparison Set

Scope

A curated set of legal prompts run against this system and comparable tools (Harvey, Clio Duo, LexisNexis AI, and general-purpose Claude/GPT-4o) to produce a structured quality comparison. The primary hypothesis is that this system outperforms general-purpose LLMs and US-centric legal AI tools on MENA-jurisdiction tasks (UAE, KSA, Lebanon, DIFC/ADGM) while remaining competitive on standard tasks.

Results feed [[eval-leaderboard-updater]] and the weekly AI quality trend report.

How to use this pack

1. Select 20–30 prompts from the comparison set (see categories below).
2. Run each prompt through:
   - This system (production endpoint)
   - The baseline tool(s), via API or UI; record which access method was used
3. Score each output with [[eval-llm-as-judge-system-prompt]] using [[eval-rubric-legal-soundness]], [[eval-rubric-jurisdiction-awareness]], and [[eval-rubric-hallucination-detection]].
4. Record scores and qualitative notes in the comparison spreadsheet.
5. Publish to the internal leaderboard monthly.

Never publish raw competitor output until legal has reviewed that competitor's terms of service for restrictions on benchmarking.
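The steps above can be sketched as a small loop. This is a minimal illustration, not the actual runner: `ComparisonRow`, `run_comparison`, and the `systems` and `judge` callables are hypothetical names standing in for the production endpoint, the baseline tools, and the LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ComparisonRow:
    """One scored output: a single prompt run through a single system."""
    prompt_id: str
    system: str
    output: str
    scores: Dict[str, float]
    notes: str = ""  # qualitative notes, filled in by a reviewer later


def run_comparison(
    prompts: Dict[str, str],
    systems: Dict[str, Callable[[str], str]],
    judge: Callable[[str, str], Dict[str, float]],
) -> List[ComparisonRow]:
    """Run every prompt through every system and score each output.

    `systems` maps a system name to a function that returns its answer;
    `judge` takes (prompt, output) and returns rubric scores. Both are
    assumptions made for this sketch.
    """
    rows: List[ComparisonRow] = []
    for prompt_id, prompt in prompts.items():
        for system_name, generate in systems.items():
            output = generate(prompt)
            rows.append(
                ComparisonRow(prompt_id, system_name, output, judge(prompt, output))
            )
    return rows
```

Keeping the generators and the judge as plain callables lets the same loop cover API-based and manually collected (UI) outputs, for example by passing a function that looks up a saved transcript.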

Dataset categories

Category A — MENA-jurisdiction tasks (primary differentiator, 10 prompts)

Tasks where MENA-specific knowledge matters:

| Prompt type | Sample prompt | Why it differentiates |
| --- | --- | --- |
| EOSG calculation | "Calculate end-of-service gratuity for a UAE employee on AED 25,000/month with 7 years' service, terminated without cause." | DIFC and onshore UAE formulas differ; generic LLMs tend to conflate them |
| DIFC contract review | "Review this clause for compliance with DIFC Contract Law." | US- and UK-centric tools tend to apply the wrong statute |
| Lebanese Commercial Code | "What are the notice requirements for dissolving a Lebanese LLC under the Commercial Code?" | Obscure to general-purpose tools |
| Notarization requirement | "Does this MOU need to be notarized (Tawtheeq) in Abu Dhabi to be enforceable?" | Tawtheeq nuance absent from US-centric tools |
| Arabic legal drafting | "Draft a non-disclosure clause in Arabic for use in a Saudi Arabia agreement." | Arabic output quality gap |

Category B — Standard legal tasks (parity check, 10 prompts)

Tasks where all tools should perform comparably:

Draft a mutual NDA under New York law.
Summarize the material adverse change clause in this M&A agreement.
What is the statute of limitations for breach of contract in England & Wales?
Flag missing clauses in this employment contract.
Explain the difference between an indemnity and a guarantee.

Scoring on these tasks validates that MENA specialization did not degrade baseline competence.
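One way to turn the parity check into a pass/fail signal is to compare mean Category B scores against a baseline with a tolerance. The sketch below assumes 0–5 rubric scores; the function names and the 0.5-point default tolerance are illustrative assumptions, not values defined by this pack.

```python
from statistics import mean
from typing import Sequence


def parity_gap(ours: Sequence[float], baseline: Sequence[float]) -> float:
    """Mean score difference on Category B (positive means we score higher)."""
    return mean(ours) - mean(baseline)


def passes_parity(
    ours: Sequence[float],
    baseline: Sequence[float],
    tolerance: float = 0.5,  # assumed threshold; tune to the team's own bar
) -> bool:
    """True if our mean is no more than `tolerance` points below the baseline."""
    return parity_gap(ours, baseline) >= -tolerance
```

With only 10 prompts per category the mean is noisy, so a failed check is better treated as a prompt to inspect individual outputs than as proof of regression.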

Category C — Hallucination resistance (5 prompts)

Prompts that invite fabrication:

"Cite the leading DIFC Court case on force majeure." (real cases exist; test if sources are accurate)
"What does the SAMA circular on crypto custody say?" (real document exists; test accuracy)

These measure citation accuracy, not just refusal. A wrong-but-confident citation is worse than an admission of uncertainty.
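A simple pre-check before the rubric is applied could compare the citations an output makes against a human-verified reference list. The sketch below is an assumption about how such a check might look, not the logic of [[eval-rubric-hallucination-detection]]; `label_citations` is a hypothetical helper, and the mapping of "no citations" to `uncertain` is a simplification.

```python
from typing import Iterable, Set


def label_citations(cited: Iterable[str], verified: Set[str]) -> str:
    """Label an output as clean / hallucinated / uncertain.

    `cited` holds citation identifiers extracted from the output;
    `verified` is a reviewer-maintained set of citations known to exist
    and to say what the prompt asks about.
    """
    normalized = [c.strip().lower() for c in cited]
    if not normalized:
        # Assumption: an output that cites nothing is treated as uncertain,
        # e.g. because it declined to name a source.
        return "uncertain"
    known = {v.strip().lower() for v in verified}
    # A single unverified citation is enough to flag the whole output.
    return "clean" if all(c in known for c in normalized) else "hallucinated"
```

The identifiers are deliberately opaque here (for example `"case-001"`); building the verified set requires a human checking each source.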

Category D — Tone and usability (5 prompts)

Prompts evaluated on response quality beyond legal accuracy:

Explanation of a complex concept to a non-lawyer client.
Structured comparison table of jurisdiction options.
Bilingual (AR/EN) clause drafting.
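The four categories above add up to a 30-prompt set (10 + 10 + 5 + 5). A minimal sketch of how the set could be represented and checked against those quotas follows; `PromptItem`, `CATEGORY_QUOTAS`, and `quota_gaps` are hypothetical names introduced for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List

# Prompt counts per category, as stated in the category headings.
CATEGORY_QUOTAS: Dict[str, int] = {"A": 10, "B": 10, "C": 5, "D": 5}


@dataclass(frozen=True)
class PromptItem:
    prompt_id: str
    category: str  # one of "A", "B", "C", "D"
    text: str


def quota_gaps(items: List[PromptItem]) -> Dict[str, int]:
    """Return categories whose count differs from the quota.

    Positive values mean prompts are missing; negative values mean surplus.
    """
    counts = Counter(item.category for item in items)
    unknown = set(counts) - set(CATEGORY_QUOTAS)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return {
        cat: quota - counts[cat]
        for cat, quota in CATEGORY_QUOTAS.items()
        if counts[cat] != quota
    }
```

An empty result means the full dataset matches the stated composition; a run that selects only 20–30 prompts would apply its own, smaller quotas.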

Scoring dimensions

For each prompt, record:

| Dimension | Method | Scale / unit |
| --- | --- | --- |
| Legal soundness | LLM judge using [[eval-rubric-legal-soundness]] | 0–5 |
| Jurisdiction accuracy | LLM judge using [[eval-rubric-jurisdiction-awareness]] | 0–5 |
| Hallucination | [[eval-rubric-hallucination-detection]] | clean / hallucinated / uncertain |
| Response completeness | LLM judge using [[eval-rubric-completeness]] | 0–5 |
| Language quality | Human rater for Arabic; LLM judge for English | 0–5 |
| Time to first token | Measured | ms |
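The dimensions above map naturally onto one validated record per prompt and system. The following is a sketch under assumed names (`ScoreRecord`, `side_by_side`, and the field names are not taken from the repo); it enforces the 0–5 scales and the three hallucination labels, then averages the scaled dimensions per system for a side-by-side view.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List

HALLUCINATION_LABELS = {"clean", "hallucinated", "uncertain"}
SCALED_DIMENSIONS = (
    "legal_soundness",
    "jurisdiction_accuracy",
    "completeness",
    "language_quality",
)


@dataclass
class ScoreRecord:
    prompt_id: str
    system: str
    legal_soundness: int
    jurisdiction_accuracy: int
    completeness: int
    language_quality: int
    hallucination: str
    ttft_ms: float  # time to first token, in milliseconds

    def __post_init__(self) -> None:
        # Reject values outside the scales listed in the table.
        for name in SCALED_DIMENSIONS:
            value = getattr(self, name)
            if not 0 <= value <= 5:
                raise ValueError(f"{name} must be in 0..5, got {value}")
        if self.hallucination not in HALLUCINATION_LABELS:
            raise ValueError(f"unknown hallucination label: {self.hallucination}")
        if self.ttft_ms < 0:
            raise ValueError("ttft_ms must be non-negative")


def side_by_side(records: List[ScoreRecord]) -> Dict[str, Dict[str, float]]:
    """Mean of each 0-5 dimension, grouped by system."""
    by_system: Dict[str, List[ScoreRecord]] = {}
    for record in records:
        by_system.setdefault(record.system, []).append(record)
    return {
        system: {dim: mean(getattr(r, dim) for r in rows) for dim in SCALED_DIMENSIONS}
        for system, rows in by_system.items()
    }
```

The hallucination label is kept categorical rather than folded into the averages, since a count of `hallucinated` outputs per system is easier to interpret than a blended score.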

Caveats & currency

Competitor tools update frequently, so scores can shift without any change to this system. Re-run the full comparison at least quarterly.
The comparison is directional, not definitive. Methodology differences (prompt wording, system prompt visibility, temperature) affect results.
Do not use competitor output in marketing materials without legal review.
This dataset is for internal improvement signal, not for published claims.
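Because methodology differences affect results, it can help to store the run conditions next to each set of scores and to flag runs that are due for a refresh. The sketch below is illustrative only: `RunManifest`, its fields, and the 92-day approximation of a quarter are assumptions.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class RunManifest:
    """Conditions under which one tool was benchmarked."""
    tool: str
    access: str                    # "api" or "ui"
    run_date: date
    prompt_wording_id: str         # which wording of the prompt set was used
    temperature: Optional[float]   # None when the tool does not expose it
    system_prompt_visible: bool


def is_stale(
    manifest: RunManifest,
    today: date,
    max_age: timedelta = timedelta(days=92),  # assumed length of a quarter
) -> bool:
    """True if the run is older than the re-run interval."""
    return today - manifest.run_date > max_age
```

Two runs are only directly comparable when their manifests match on wording and access method; where a field is `None` or unknown, the comparison should be read as directional.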

Related

- [[eval-benchmark-runner]]: runs this dataset as part of the CI eval suite
- [[eval-rubric-legal-soundness]]: primary quality scoring rubric
- [[eval-rubric-jurisdiction-awareness]]: MENA jurisdiction gap measurement
- [[eval-leaderboard-updater]]: records scores to the internal trend dashboard