eng-embedding-model-choice-legal

name: eng-embedding-model-choice-legal
description: Use when selecting an embedding model for semantic search, document similarity, or retrieval-augmented generation (RAG) in a legal AI product. Covers the criteria specific to legal text (multilingual Arabic/English, domain vocabulary, long-document chunking), a comparison of leading embedding models, and configuration recommendations for legal retrieval pipelines serving MENA and common-law jurisdictions.
license: MIT
metadata:
id: eng.embedding-model-choice-legal
category: eng
jurisdictions: [multi]
priority: P2
intent: [embeddings, RAG, semantic-search, vector-database, multilingual]
related:
- eng-context-cache-key-design
- eng-fallback-model-cascade
- eng-langfuse-eval-runner
- eng-mcp-tool-registry
source: Louis — HAQQ Legal AI (github.com/sboghossian/mini-claude-for-legal)
version: "1.0"

Embedding Model Choice — Legal

What it does

Embedding models convert text into dense vector representations that enable semantic search: "find documents that are conceptually similar to this query, even if they don't share keywords." In legal AI products, embeddings power:

Precedent retrieval (find the most similar prior NDA to the current draft)
KB search (retrieve the most relevant knowledge-base document for this user query)
Conflict-check similarity (find matters involving conceptually similar parties or issues)
Document classification (identify the document type from its content)
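As a rough illustration of what "conceptually similar" means in practice, the sketch below ranks documents by cosine similarity between a query vector and stored document vectors. The three-dimensional vectors and document ids are made up for the example; a real model would produce vectors with hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat as unrelated
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 3):
    """Return the k (doc_id, score) pairs with the highest similarity."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors (hypothetical); the ids are illustrative only.
docs = {
    "nda_2023": [0.9, 0.1, 0.0],
    "spa_2022": [0.1, 0.9, 0.1],
    "nda_2021": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]
print(top_k(query, docs, k=2))
```

With these toy values the two NDA vectors rank above the SPA vector, which is the behavior precedent retrieval relies on.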

The choice of embedding model strongly affects retrieval quality. Legal text is domain-specific, often multilingual (Arabic and English in MENA), and frequently long. General-purpose models trained mostly on web text tend to underperform on legal vocabulary unless they are fine-tuned or paired with careful chunking.

Key selection criteria for legal text

Criterion	Why it matters for legal
Multilingual quality (AR + EN)	MENA contracts are often bilingual; queries in Arabic must retrieve English documents and vice versa
Max input length	Contracts can exceed 10,000 tokens; short-context models require aggressive chunking that loses clause context
Domain vocabulary	Legal terms ("indemnification", "escrow", "قضاء", "تعويض") must be represented precisely
Embedding dimension	Higher dimensions → more expressive but higher storage/compute cost
Cost per token	Embedding all firm documents is a large one-time and recurring cost
Latency	Real-time search requires low embedding latency; batch indexing can tolerate higher latency
Data privacy	Some models require sending text to a third-party API; sensitive legal documents may require local/self-hosted models
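The dimension and cost criteria can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes raw float32 storage (4 bytes per value) and ignores index overhead; the chunk count, token count, and price are placeholders, not real vendor figures.

```python
def index_size_bytes(num_chunks: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw vector storage only; vector-index overhead is not included."""
    return num_chunks * dimensions * bytes_per_value


def embedding_cost(total_tokens: int, price_per_million_tokens: float) -> float:
    """One-off cost of embedding a corpus at a given per-million-token price."""
    return total_tokens / 1_000_000 * price_per_million_tokens


# Placeholder corpus: one million chunks (assumption for illustration).
for dims in (3072, 1024, 256):
    gigabytes = index_size_bytes(1_000_000, dims) / 1e9
    print(f"{dims} dims: {gigabytes:.3f} GB")

# Placeholder price, not a quoted vendor rate.
print(embedding_cost(total_tokens=500_000_000, price_per_million_tokens=0.10))
```

Under these assumptions one million chunks take about 12.3 GB at 3072 dimensions and about 4.1 GB at 1024, which is why dimension count matters once a firm's whole document store is indexed.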

Model comparison

Model	Max tokens	Dimensions	Multilingual AR	Notes
`text-embedding-3-large` (OpenAI)	8191	3072 (configurable)	Good	Strong general legal performance; high cost at scale
`text-embedding-3-small` (OpenAI)	8191	1536	Good	Lower cost; slight quality trade-off
`voyage-law-2` (Voyage AI)	16000	1024	Moderate	Specifically trained on legal text; best English legal recall
`voyage-multilingual-2` (Voyage AI)	32000	1024	Strong	Best multilingual option tested on AR-EN legal text
`jina-embeddings-v3` (Jina AI)	8192	1024	Strong	Open-weights available; good AR support; self-hostable
`e5-mistral-7b-instruct` (Microsoft)	4096	4096	Moderate	Open weights; good legal quality if fine-tuned
`multilingual-e5-large`	512	1024	Strong	Short context limit (512 tokens); requires chunking; good AR quality

Recommendation for MENA legal AI:

Primary: voyage-multilingual-2 for AR+EN retrieval; or voyage-law-2 for English-only legal retrieval.
Privacy-sensitive deployments: jina-embeddings-v3 (self-hostable) or multilingual-e5-large (self-hostable).
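Cost-sensitive: text-embedding-3-small, with its default 1536 dimensions reduced to 256 to 512 through the API's dimensions parameter. The text-embedding-3 models are trained with Matryoshka Representation Learning (MRL), so a shortened vector is designed to remain usable on its own.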
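For vectors that were already stored at full size, dimension reduction on an MRL-trained model amounts to keeping the leading components and re-normalizing. The helper below is a minimal sketch of that step; the function name is illustrative, and whether truncation is acceptable for a given model should be checked against that model's documentation.

```python
import math


def truncate_embedding(vec: list[float], target_dim: int) -> list[float]:
    """Keep the first target_dim components and rescale to unit length."""
    if not 0 < target_dim <= len(vec):
        raise ValueError("target_dim must be between 1 and len(vec)")
    head = vec[:target_dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale
    return [x / norm for x in head]


# Toy 3-dimensional vector shortened to 2 dimensions.
short = truncate_embedding([3.0, 4.0, 12.0], target_dim=2)
print(short)
```

Re-normalizing matters because cosine and dot-product scores are only comparable across vectors when every stored vector has the same length convention.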

Chunking strategy for legal documents

Legal documents benefit from chunking that follows their structure. Three strategies, from weakest to strongest:

Fixed-window chunking (simplest, weakest)

Split at N tokens with K-token overlap. Loses clause structure. Use only for short documents.
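A minimal sketch of fixed-window chunking follows. It works on any list of tokens; the example uses single characters as stand-in tokens, whereas a real pipeline would use the embedding model's own tokenizer.

```python
def fixed_window_chunks(tokens: list[str], window: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of `window` tokens sharing `overlap` tokens."""
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must be in [0, window)")
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


# Stand-in tokens: ten single characters, N = 4, K = 1.
for chunk in fixed_window_chunks(list("abcdefghij"), window=4, overlap=1):
    print("".join(chunk))
```

Each window starts `window - overlap` tokens after the previous one, so a clause that straddles a boundary is cut wherever the window happens to end.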

Semantic / structure-aware chunking (recommended)

Split at structural boundaries:

Headings: split at H1/H2/H3 (contract sections, article numbers).
Clause boundaries: each numbered clause is a chunk candidate.
Paragraph boundaries: fallback if clause structure is not present.

For contracts with defined-term sections: embed the definitions chunk as a prefix to every subsequent chunk that uses those terms (increases retrieval precision for clause-level queries).
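The sketch below illustrates clause-boundary splitting and the definitions prefix. The regular expressions are assumptions about how clauses and definitions are written ("Article 5", "2.", "2.1." at line start; `"Term" means ...`); real contracts vary and would need patterns tuned per template.

```python
import re

# Assumed clause openers at the start of a line: "Article 5 ", "2. ", "2.1. ".
CLAUSE_START = re.compile(r"^(?:Article\s+\d+|\d+(?:\.\d+)*\.)\s", re.MULTILINE)
# Assumed definition style: "Term" means ...
DEFINED_TERM = re.compile(r'"([^"]+)"\s+means')


def split_clauses(text: str) -> list[str]:
    """Split at clause openers; fall back to blank-line paragraphs."""
    starts = [m.start() for m in CLAUSE_START.finditer(text)]
    if not starts:
        return [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    preamble = text[:starts[0]].strip()
    if preamble:
        chunks.append(preamble)
    bounds = starts + [len(text)]
    for begin, end in zip(bounds, bounds[1:]):
        chunks.append(text[begin:end].strip())
    return chunks


def prefix_definitions(clauses: list[str]) -> list[str]:
    """Prepend the definitions clause to later clauses that use a defined term."""
    out, definitions, terms = [], None, []
    for clause in clauses:
        if definitions is None:
            found = DEFINED_TERM.findall(clause)
            if found:
                definitions, terms = clause, found
            out.append(clause)
        elif any(term in clause for term in terms):
            out.append(definitions + "\n\n" + clause)
        else:
            out.append(clause)
    return out


contract = (
    "This Agreement is made between A and B.\n"
    "1. Definitions\n"
    '"Confidential Information" means any non-public information.\n'
    "2. Confidentiality\n"
    "The Recipient shall protect Confidential Information.\n"
    "3. Term\n"
    "This Agreement lasts two years.\n"
)
chunks = prefix_definitions(split_clauses(contract))
for chunk in chunks:
    print(repr(chunk))
```

Prefixing lengthens every affected chunk, so the combined text still has to fit within the chosen model's input limit.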

Hierarchical chunking (best for long contracts)

Embed at two granularities:

Document level: one embedding per document, summarizing the whole.
Clause level: one embedding per clause.

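On retrieval: first match at document level (coarse), then re-rank at clause level (fine) using only the clauses of the shortlisted documents. For large corpora this two-stage retrieval is cheaper, because most clause vectors are never scored, and it tends to be more accurate, because clause matches are drawn from documents that are relevant as a whole.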
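A minimal in-memory sketch of the coarse-then-fine flow is shown here. The data layout (`doc_vecs`, `clause_vecs`) and the toy two-dimensional vectors are assumptions for illustration; a production system would run both stages against a vector database.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def two_stage_search(query_vec, doc_vecs, clause_vecs, top_docs=2, top_clauses=3):
    """doc_vecs: {doc_id: vec}; clause_vecs: {doc_id: {clause_id: vec}}."""
    # Stage 1 (coarse): shortlist documents by document-level similarity.
    shortlisted = sorted(
        doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True
    )[:top_docs]
    # Stage 2 (fine): score only the clauses of shortlisted documents.
    candidates = [
        (doc_id, clause_id, cosine(query_vec, vec))
        for doc_id in shortlisted
        for clause_id, vec in clause_vecs.get(doc_id, {}).items()
    ]
    candidates.sort(key=lambda c: c[2], reverse=True)
    return candidates[:top_clauses]


# Toy data (hypothetical ids and vectors).
doc_vecs = {"nda_a": [1.0, 0.0], "nda_b": [0.8, 0.6], "lease": [0.0, 1.0]}
clause_vecs = {
    "nda_a": {"c1": [0.9, 0.1], "c2": [0.2, 0.9]},
    "nda_b": {"c1": [1.0, 0.0]},
    "lease": {"c1": [1.0, 0.05]},
}
print(two_stage_search([1.0, 0.0], doc_vecs, clause_vecs, top_docs=2, top_clauses=2))
```

Note the trade-off visible in the toy data: the lease has a clause that matches the query closely, but it is never scored because the lease was dropped in the coarse stage. The `top_docs` cutoff should be generous enough that this rarely hides a relevant clause.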

Metadata to store alongside embeddings

In the vector database, store these fields alongside each chunk's embedding:

{
  "chunk_id": "ulid",
  "document_id": "doc_xxx",
  "matter_id": "matter_xxx | kb_doc",
  "org_id": "org_xxx",
  "document_type": "NDA | SPA | Opinion | ...",
  "jurisdiction": "UAE | DIFC | KSA | ...",
  "governing_law": "UAE | English | ...",
  "language": "en | ar | bilingual",
  "section_title": "Article 5 — Indemnification",
  "chunk_index": 3,
  "total_chunks": 12,
  "approved_at": "ISO-8601",
  "access_tier": "firm-wide | practice | partner-only"
}
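One way to keep these fields consistent at indexing time is to validate them before writing to the vector database. The dataclass below is a sketch that mirrors the fields listed above; the allowed values for `language` and `access_tier` are taken from that listing, while the class name and the validation rules are assumptions.

```python
from dataclasses import dataclass, asdict
from datetime import datetime

LANGUAGES = {"en", "ar", "bilingual"}
ACCESS_TIERS = {"firm-wide", "practice", "partner-only"}


@dataclass(frozen=True)
class ChunkMetadata:
    chunk_id: str
    document_id: str
    matter_id: str  # a matter id, or "kb_doc" for knowledge-base documents
    org_id: str
    document_type: str
    jurisdiction: str
    governing_law: str
    language: str
    section_title: str
    chunk_index: int
    total_chunks: int
    approved_at: str  # ISO-8601 timestamp
    access_tier: str

    def __post_init__(self) -> None:
        if not self.org_id:
            raise ValueError("org_id is required for tenant isolation")
        if self.language not in LANGUAGES:
            raise ValueError(f"unknown language: {self.language!r}")
        if self.access_tier not in ACCESS_TIERS:
            raise ValueError(f"unknown access_tier: {self.access_tier!r}")
        datetime.fromisoformat(self.approved_at)  # raises ValueError if malformed


# Example values are placeholders.
meta = ChunkMetadata(
    chunk_id="01HXAMPLE", document_id="doc_1", matter_id="matter_9",
    org_id="org_1", document_type="NDA", jurisdiction="DIFC",
    governing_law="English", language="bilingual",
    section_title="Article 5: Indemnification", chunk_index=3,
    total_chunks=12, approved_at="2024-05-01T09:30:00+00:00",
    access_tier="practice",
)
print(asdict(meta)["access_tier"])
```

Rejecting a chunk with a missing `org_id` or an unknown `access_tier` at write time means the retrieval filters never have to deal with unclassified chunks.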

Use metadata filters on retrieval to enforce access control — never return a chunk to a user who lacks access to the parent document.

Access control on retrieval

The embedding search pipeline must enforce tenant and document-access controls:

Apply org_id = current_org filter to every query — never allow cross-tenant retrieval.
Apply access_tier IN (user's_permitted_tiers) filter.
For matter-specific documents: apply matter_id = current_matter OR matter_id = 'kb_doc'.

These filters must be applied at the vector database query layer, not post-retrieval in application code.
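The three rules above can be combined into a single filter object that is passed to the vector database with every query. The sketch below uses a generic Mongo-style operator syntax (`$and`, `$or`, `$eq`, `$in`) purely for illustration; the actual filter syntax depends on the vector database in use, and the function name is hypothetical.

```python
def build_retrieval_filter(org_id: str, permitted_tiers: list[str], current_matter: str) -> dict:
    """Build the mandatory metadata filter; refuse to build an open one."""
    if not org_id:
        raise ValueError("org_id is required: cross-tenant retrieval is never allowed")
    if not permitted_tiers:
        raise ValueError("user has no permitted access tiers")
    if not current_matter:
        raise ValueError("current_matter is required")
    return {
        "$and": [
            {"org_id": {"$eq": org_id}},
            {"access_tier": {"$in": sorted(set(permitted_tiers))}},
            {"$or": [
                {"matter_id": {"$eq": current_matter}},
                {"matter_id": {"$eq": "kb_doc"}},
            ]},
        ]
    }


# Placeholder identifiers.
print(build_retrieval_filter("org_1", ["practice", "firm-wide"], "matter_9"))
```

Raising on a missing `org_id` or an empty tier list makes the pipeline fail closed: a bug upstream produces an error rather than an unfiltered search.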

Evaluation

Evaluate embedding model quality on a legal domain test set using:

NDCG@10: ranking quality of retrieved results.
Recall@5: the fraction of all relevant documents that appear in the top 5 results.
MRR: mean reciprocal rank of the first relevant result.
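For reference, the three metrics can be computed as in the sketch below. It assumes each test case provides a ranked list of retrieved ids plus either a set of relevant ids (Recall, MRR) or graded relevance scores (NDCG); the NDCG variant shown uses linear gain with a log2 discount, which is one common convention.

```python
import math


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant ids that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(cases: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (ranked, relevant) test cases."""
    return sum(reciprocal_rank(r, rel) for r, rel in cases) / len(cases)


def ndcg_at_k(ranked: list[str], gains: dict[str, float], k: int) -> float:
    """DCG of the top k divided by the DCG of the ideal ordering."""
    dcg = sum(gains.get(doc_id, 0.0) / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy test case (hypothetical ids).
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}
print(recall_at_k(ranked, relevant, k=3))
print(reciprocal_rank(ranked, relevant))
print(ndcg_at_k(ranked, {"d1": 3.0, "d2": 1.0}, k=10))
```

In the toy case only one of three relevant documents is in the top 3, so Recall@3 is one third, and the first relevant hit is at rank 2, so the reciprocal rank is 0.5.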

Use Langfuse (see [[eng-langfuse-eval-runner]]) to run systematic retrieval quality evaluations with legal practitioner-curated test cases.

[[eng-context-cache-key-design]]
[[eng-fallback-model-cascade]]
[[eng-langfuse-eval-runner]]
[[eng-mcp-tool-registry]]