eng-embedding-model-choice-legal
name: eng-embedding-model-choice-legal
description: Use when selecting an embedding model for semantic search, document similarity, or retrieval-augmented generation (RAG) in a legal AI product. Covers the criteria specific to legal text (multilingual Arabic/English, domain vocabulary, long-document chunking), a comparison of leading embedding models, and configuration recommendations for legal retrieval pipelines serving MENA and common-law jurisdictions.
license: MIT
metadata:
id: eng.embedding-model-choice-legal
category: eng
jurisdictions: [multi]
priority: P2
intent: [embeddings, RAG, semantic-search, vector-database, multilingual]
related:
- eng-context-cache-key-design
- eng-fallback-model-cascade
- eng-langfuse-eval-runner
- eng-mcp-tool-registry
source: Louis — HAQQ Legal AI (github.com/sboghossian/mini-claude-for-legal)
version: "1.0"
Embedding Model Choice — Legal
What it does
Embedding models convert text into dense vector representations that enable semantic search: "find documents that are conceptually similar to this query, even if they don't share keywords." In legal AI products, embeddings power:
- Precedent retrieval (find the most similar prior NDA to the current draft)
- KB search (retrieve the most relevant knowledge-base document for this user query)
- Conflict-check similarity (find matters involving conceptually similar parties or issues)
- Document classification (identify the document type from its content)
The choice of embedding model has a large impact on retrieval quality. Legal text is domain-specific, often multilingual (Arabic + English in MENA), and involves long documents. Generic models trained on web text often underperform on legal vocabulary unless they are fine-tuned or paired with careful chunking.
Key selection criteria for legal text
| Criterion | Why it matters for legal |
|---|---|
| Multilingual quality (AR + EN) | MENA contracts are often bilingual; queries in Arabic must retrieve English documents and vice versa |
| Max input length | Contracts can exceed 10,000 tokens; short-context models require aggressive chunking that loses clause context |
| Domain vocabulary | Legal terms ("indemnification", "escrow", "قضاء", "تعويض") must be represented precisely |
| Embedding dimension | Higher dimensions → more expressive but higher storage/compute cost |
| Cost per token | Embedding all firm documents is a large one-time and recurring cost |
| Latency | Real-time search requires low embedding latency; batch indexing can tolerate higher latency |
| Data privacy | Some models require sending text to a third-party API; sensitive legal documents may require local/self-hosted models |
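The dimension and cost criteria above can be made concrete with a rough storage estimate. This is a minimal sketch: the corpus size of one million chunks is a hypothetical figure, `index_size_bytes` is an illustrative helper, and the estimate assumes float32 vectors while ignoring index overhead and metadata.

```python
def index_size_bytes(num_chunks: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw vector storage only; real indexes add overhead for graphs and metadata."""
    return num_chunks * dimensions * bytes_per_value


# Hypothetical corpus: 1,000,000 clause-level chunks stored as float32.
chunks = 1_000_000
for dims in (256, 1024, 3072):
    gib = index_size_bytes(chunks, dims) / 2**30
    print(f"{dims:>5} dims: {gib:.2f} GiB")
# prints 0.95 GiB, 3.81 GiB and 11.44 GiB respectively
```

Storage grows linearly with dimension, so a 3072-dimension index needs three times the raw space of a 1024-dimension one for the same corpus.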
Model comparison
| Model | Max tokens | Dimensions | Multilingual AR | Notes |
|---|---|---|---|---|
| `text-embedding-3-large` (OpenAI) | 8191 | 3072 (configurable) | Good | Strong general legal performance; high cost at scale |
| `text-embedding-3-small` (OpenAI) | 8191 | 1536 | Good | Lower cost; slight quality trade-off |
| `voyage-law-2` (Voyage AI) | 16000 | 1024 | Moderate | Specifically trained on legal text; best English legal recall |
| `voyage-multilingual-2` (Voyage AI) | 32000 | 1024 | Strong | Best multilingual option tested on AR-EN legal text |
| `jina-embeddings-v3` (Jina AI) | 8192 | 1024 | Strong | Open-weights available; good AR support; self-hostable |
| `e5-mistral-7b-instruct` (Microsoft) | 4096 | 4096 | Moderate | Open weights; good legal quality if fine-tuned |
| `multilingual-e5-large` | 512 | 1024 | Strong | Short context limit; requires chunking; good AR quality |
Recommendation for MENA legal AI:
- Primary: `voyage-multilingual-2` for AR+EN retrieval, or `voyage-law-2` for English-only legal retrieval.
- Privacy-sensitive deployments: `jina-embeddings-v3` (self-hostable) or `multilingual-e5-large` (self-hostable).
- Cost-sensitive: `text-embedding-3-small` with 1536 dimensions reduced to 256–512 (OpenAI supports dimension reduction via MRL training).
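For the cost-sensitive option, OpenAI's embeddings endpoint accepts a `dimensions` parameter for the `text-embedding-3` models. When full-length vectors are already stored, a common client-side alternative is to keep the leading components and re-normalize. The sketch below shows that step only; `shorten_embedding` is an illustrative helper, and the approach assumes the model was trained with MRL, since truncating vectors from other models generally degrades quality much faster.

```python
import math


def shorten_embedding(vec: list[float], target_dims: int) -> list[float]:
    """Keep the first target_dims components and rescale to unit length."""
    if not 0 < target_dims <= len(vec):
        raise ValueError("target_dims must be between 1 and len(vec)")
    head = vec[:target_dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


# Toy 3-dimensional vector standing in for a 1536-dimension embedding.
full = [3.0, 4.0, 12.0]
short = shorten_embedding(full, 2)
print(short)  # two components, unit length
```

Re-normalizing matters because cosine and dot-product search both assume comparable vector lengths; a truncated vector is no longer unit-length on its own.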
Chunking strategy for legal documents
Legal documents require structure-aware chunking. The three approaches below are listed from weakest to strongest:
Fixed-window chunking (simplest, weakest)
Split at N tokens with K-token overlap. Loses clause structure. Use only for short documents.
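A minimal sketch of fixed-window chunking with overlap follows. It operates on a pre-tokenized list; in practice the tokens would come from the embedding model's own tokenizer, and `fixed_window_chunks` is an illustrative name rather than a library function.

```python
def fixed_window_chunks(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


tokens = list("abcdefghij")  # ten single-character stand-in tokens
for chunk in fixed_window_chunks(tokens, size=4, overlap=1):
    print("".join(chunk))
# prints abcd, defg, ghij
```

The windows ignore clause boundaries entirely, which is why a single indemnification clause can end up split across two chunks.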
Semantic / structure-aware chunking (recommended)
Split at structural boundaries:
- Headings: split at H1/H2/H3 (contract sections, article numbers).
- Clause boundaries: each numbered clause is a chunk candidate.
- Paragraph boundaries: fallback if clause structure is not present.
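The boundary rules above can be sketched with a regular expression over plain text. The heading shapes matched here ("Article 5", "Section 2", "Clause 7", or "1." and "2.3." at the start of a line) are assumptions for illustration; real contracts vary widely and usually need patterns tuned per template. `split_by_structure` is a hypothetical helper name.

```python
import re

# Assumed clause-heading shapes at the start of a line; adjust per contract template.
CLAUSE_START = re.compile(
    r"^(?:(?:Article|Section|Clause)\s+\d+|\d+(?:\.\d+)*\.)\s",
    re.IGNORECASE | re.MULTILINE,
)


def split_by_structure(text: str) -> list[str]:
    """Split at clause headings; fall back to blank-line paragraphs if none are found."""
    starts = [m.start() for m in CLAUSE_START.finditer(text)]
    if not starts:
        return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks = []
    preamble = text[: starts[0]].strip()
    if preamble:
        chunks.append(preamble)  # recitals or parties block before the first clause
    bounds = starts + [len(text)]
    for begin, end in zip(bounds, bounds[1:]):
        chunks.append(text[begin:end].strip())
    return chunks


sample = (
    "This Agreement is made between the parties.\n\n"
    "1. Definitions\n\"Affiliate\" means any entity under common control.\n\n"
    "2. Confidentiality\nEach party shall protect Confidential Information.\n\n"
    "Article 3 Indemnification\nThe Supplier shall indemnify the Customer."
)
for chunk in split_by_structure(sample):
    print(repr(chunk[:30]))
```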
For contracts with a defined-terms section: prepend the definitions chunk to every subsequent chunk that uses those terms before embedding it (this increases retrieval precision for clause-level queries).
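One way to apply the definitions prefix is sketched below. It is a variation on the idea rather than a literal implementation: instead of prepending the whole definitions chunk, it prepends only the definitions whose terms appear in the chunk, which keeps the prefix short. The `definitions` mapping is assumed to have been extracted beforehand, matching is case-sensitive because defined terms are normally capitalized, and `prefix_definitions` is a hypothetical name.

```python
import re


def prefix_definitions(chunk: str, definitions: dict[str, str]) -> str:
    """Prepend the definitions of any defined terms that occur in the chunk."""
    used = [
        f'"{term}" means {meaning}'
        for term, meaning in definitions.items()
        if re.search(rf"\b{re.escape(term)}\b", chunk)
    ]
    if not used:
        return chunk
    return "Definitions: " + " ".join(used) + "\n\n" + chunk


definitions = {
    "Confidential Information": "all non-public information disclosed by a party.",
    "Affiliate": "any entity under common control with a party.",
}
clause = "The Recipient shall protect Confidential Information for five years."
print(prefix_definitions(clause, definitions))
```

The text passed to the embedding model is the prefixed string; the stored chunk text shown to users can remain the original clause.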
Hierarchical chunking (best for long contracts)
Embed at two granularities:
- Document level: one embedding per document, summarizing the whole.
- Clause level: one embedding per clause.
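On retrieval: first match at document level (coarse), then rank clauses within the matched documents (fine). This two-stage retrieval is typically more accurate, and cheaper for large corpora because the clause-level search runs only over the shortlisted documents.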
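The coarse-then-fine flow can be sketched in memory with toy vectors. This is an illustration under stated assumptions: the two dictionaries stand in for two vector-database collections, the 2-dimensional vectors are made up, and `two_stage_search` is a hypothetical helper.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def two_stage_search(query, doc_vecs, clause_vecs, top_docs=2, top_clauses=3):
    """Stage 1 shortlists documents; stage 2 ranks only the clauses inside them."""
    shortlisted = sorted(doc_vecs, key=lambda d: cosine(query, doc_vecs[d]), reverse=True)[:top_docs]
    candidates = [
        (cosine(query, vec), doc_id, clause_id)
        for doc_id in shortlisted
        for clause_id, vec in clause_vecs.get(doc_id, [])
    ]
    candidates.sort(key=lambda item: item[0], reverse=True)
    return [(doc_id, clause_id) for _, doc_id, clause_id in candidates[:top_clauses]]


# Toy data: document-level and clause-level embeddings.
doc_vecs = {"nda_1": [1.0, 0.0], "spa_1": [0.0, 1.0], "nda_2": [0.9, 0.1]}
clause_vecs = {
    "nda_1": [("cl_1", [1.0, 0.0]), ("cl_2", [0.0, 1.0])],
    "nda_2": [("cl_1", [0.7, 0.7])],
    "spa_1": [("cl_1", [1.0, 0.0])],
}
print(two_stage_search([1.0, 0.0], doc_vecs, clause_vecs, top_docs=2, top_clauses=2))
# prints [('nda_1', 'cl_1'), ('nda_2', 'cl_1')]
```

Note the trade-off visible in the toy data: the clause in `spa_1` matches the query exactly but is never scored, because its parent document was not shortlisted in stage 1.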
Metadata to store alongside embeddings
In the vector database, store these fields alongside each chunk's embedding:
{
"chunk_id": "ulid",
"document_id": "doc_xxx",
"matter_id": "matter_xxx | kb_doc",
"org_id": "org_xxx",
"document_type": "NDA | SPA | Opinion | ...",
"jurisdiction": "UAE | DIFC | KSA | ...",
"governing_law": "UAE | English | ...",
"language": "en | ar | bilingual",
"section_title": "Article 5 — Indemnification",
"chunk_index": 3,
"total_chunks": 12,
"approved_at": "ISO-8601",
"access_tier": "firm-wide | practice | partner-only"
}
Use metadata filters on retrieval to enforce access control — never return a chunk to a user who lacks access to the parent document.
Access control on retrieval
The embedding search pipeline must enforce tenant and document-access controls:
- Apply an `org_id = current_org` filter to every query; never allow cross-tenant retrieval.
- Apply an `access_tier IN (user's_permitted_tiers)` filter.
- For matter-specific documents: apply `matter_id = current_matter OR matter_id = 'kb_doc'`.
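These filters must be applied at the vector database query layer, not post-retrieval in application code. Post-filtering pulls restricted chunks into application memory and can leave fewer than the requested number of results once disallowed chunks are dropped.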
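The three filters can be sketched as a single predicate that runs before similarity scoring. This in-memory version is only a stand-in for the semantics: in a real deployment the same conditions would be expressed in the vector database's native filter syntax. `Caller`, `is_visible`, and `filtered_search` are hypothetical names, and the vectors are toy values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Caller:
    org_id: str
    permitted_tiers: frozenset[str]
    current_matter: str


def is_visible(meta: dict, caller: Caller) -> bool:
    """Tenant, tier, and matter checks; all three must pass."""
    return (
        meta["org_id"] == caller.org_id
        and meta["access_tier"] in caller.permitted_tiers
        and meta["matter_id"] in (caller.current_matter, "kb_doc")
    )


def filtered_search(query, index, caller: Caller, k: int = 5) -> list[str]:
    """Filter first, then score, so restricted chunks never enter the ranking."""
    visible = [(vec, meta) for vec, meta in index if is_visible(meta, caller)]
    visible.sort(key=lambda item: sum(q * v for q, v in zip(query, item[0])), reverse=True)
    return [meta["chunk_id"] for _, meta in visible[:k]]


index = [
    ([1.0, 0.0], {"chunk_id": "c1", "org_id": "org_b", "access_tier": "firm-wide", "matter_id": "kb_doc"}),
    ([0.9, 0.1], {"chunk_id": "c2", "org_id": "org_a", "access_tier": "partner-only", "matter_id": "matter_1"}),
    ([0.5, 0.5], {"chunk_id": "c3", "org_id": "org_a", "access_tier": "firm-wide", "matter_id": "matter_1"}),
    ([0.4, 0.6], {"chunk_id": "c4", "org_id": "org_a", "access_tier": "firm-wide", "matter_id": "matter_2"}),
    ([0.3, 0.7], {"chunk_id": "c5", "org_id": "org_a", "access_tier": "firm-wide", "matter_id": "kb_doc"}),
]
caller = Caller("org_a", frozenset({"firm-wide", "practice"}), "matter_1")
print(filtered_search([1.0, 0.0], index, caller))
# prints ['c3', 'c5']
```

The two best-scoring chunks (`c1` from another tenant, `c2` partner-only) are excluded before ranking, so they cannot crowd permitted results out of the top k.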
Evaluation
Evaluate embedding model quality on a legal domain test set using:
- NDCG@10: ranking quality of retrieved results.
- Recall@5: the fraction of all relevant documents that appear in the top 5 results.
- MRR: mean reciprocal rank of the first relevant result.
Use Langfuse (see [[eng-langfuse-eval-runner]]) to run systematic retrieval quality evaluations with legal practitioner-curated test cases.
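For reference, the three metrics can be computed in a few lines. This sketch is independent of Langfuse and uses made-up document IDs; the NDCG variant shown uses linear gains with a log2 discount, which is one common formulation among several.

```python
import math


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: list[str], gains: dict[str, float], k: int) -> float:
    """DCG of the ranking divided by the DCG of the ideal ordering (linear gains)."""
    dcg = sum(gains.get(d, 0.0) / math.log2(i + 1) for i, d in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# One hypothetical query: system ranking vs. practitioner-labelled relevance.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}
gains = {"d1": 3.0, "d2": 1.0, "d9": 2.0}
print(recall_at_k(ranked, relevant, k=3))
print(reciprocal_rank(ranked, relevant))
print(round(ndcg_at_k(ranked, gains, k=10), 3))
```

MRR is the mean of `reciprocal_rank` over all test queries; NDCG@10 and Recall@5 are likewise averaged per query.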
Related skills
- [[eng-context-cache-key-design]]
- [[eng-fallback-model-cascade]]
- [[eng-langfuse-eval-runner]]
- [[eng-mcp-tool-registry]]