eng-embedding-model-choice-legal

name: eng-embedding-model-choice-legal
description: Use when selecting an embedding model for semantic search, document similarity, or retrieval-augmented generation (RAG) in a legal AI product. Covers the criteria specific to legal text (multilingual Arabic/English, domain vocabulary, long-document chunking), a comparison of leading embedding models, and configuration recommendations for legal retrieval pipelines serving MENA and common-law jurisdictions.
license: MIT
metadata:
id: eng.embedding-model-choice-legal
category: eng
jurisdictions: [multi]
priority: P2
intent: [embeddings, RAG, semantic-search, vector-database, multilingual]
related:
- eng-context-cache-key-design
- eng-fallback-model-cascade
- eng-langfuse-eval-runner
- eng-mcp-tool-registry
source: Louis — HAQQ Legal AI (github.com/sboghossian/mini-claude-for-legal)
version: "1.0"

Embedding Model Choice — Legal

What it does

Embedding models convert text into dense vector representations that enable semantic search: "find documents that are conceptually similar to this query, even if they don't share keywords." In legal AI products, embeddings power:

  • Precedent retrieval (find the most similar prior NDA to the current draft)
  • KB search (retrieve the most relevant knowledge-base document for this user query)
  • Conflict-check similarity (find matters involving conceptually similar parties or issues)
  • Document classification (identify the document type from its content)

The choice of embedding model largely determines retrieval quality. Legal text is domain-specific, often multilingual (Arabic + English in MENA), and involves long documents. Generic models trained mostly on web text tend to underperform on legal vocabulary unless they are fine-tuned, and on long contracts unless chunking is done carefully.

| Criterion | Why it matters for legal |
| --- | --- |
| Multilingual quality (AR + EN) | MENA contracts are often bilingual; queries in Arabic must retrieve English documents and vice versa |
| Max input length | Contracts can exceed 10,000 tokens; short-context models require aggressive chunking that loses clause context |
| Domain vocabulary | Legal terms ("indemnification", "escrow", "قضاء", "تعويض") must be represented precisely |
| Embedding dimension | Higher dimensions → more expressive but higher storage/compute cost |
| Cost per token | Embedding all firm documents is a large one-time and recurring cost |
| Latency | Real-time search requires low embedding latency; batch indexing can tolerate higher latency |
| Data privacy | Some models require sending text to a third-party API; sensitive legal documents may require local/self-hosted models |
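To make the embedding-dimension trade-off concrete, here is a minimal sketch of the raw storage arithmetic. It assumes float32 vectors (4 bytes per value) and ignores index overhead and metadata; the function name and the one-million-chunk corpus are illustrative, not figures from a real deployment.

```python
def index_size_bytes(num_chunks: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw vector storage only: chunks x dimensions x bytes per value.

    Assumes float32 by default; real indexes add overhead on top of this.
    """
    if num_chunks < 0 or dimensions <= 0 or bytes_per_value <= 0:
        raise ValueError("sizes must be positive")
    return num_chunks * dimensions * bytes_per_value


def to_gib(size_in_bytes: int) -> float:
    """Convert bytes to GiB for easier reading."""
    return size_in_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical corpus of one million chunks at three common dimensions.
    for dims in (256, 1024, 3072):
        size = index_size_bytes(1_000_000, dims)
        print(f"{dims:>5} dims: {to_gib(size):.2f} GiB")
```

Storage grows linearly with dimension, so a 3072-dimension index needs three times the raw space of a 1024-dimension one for the same corpus.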

Model comparison

| Model | Max tokens | Dimensions | Multilingual AR | Notes |
| --- | --- | --- | --- | --- |
| text-embedding-3-large (OpenAI) | 8191 | 3072 (configurable) | Good | Strong general legal performance; high cost at scale |
| text-embedding-3-small (OpenAI) | 8191 | 1536 | Good | Lower cost; slight quality trade-off |
| voyage-law-2 (Voyage AI) | 16000 | 1024 | Moderate | Specifically trained on legal text; best English legal recall |
| voyage-multilingual-2 (Voyage AI) | 32000 | 1024 | Strong | Best multilingual option tested on AR-EN legal text |
| jina-embeddings-v3 (Jina AI) | 8192 | 1024 | Strong | Open weights available (non-commercial license; check terms before commercial self-hosting); good AR support; self-hostable |
| e5-mistral-7b-instruct (Microsoft) | 4096 | 4096 | Moderate | Open weights; good legal quality if fine-tuned |
| multilingual-e5-large | 512 | 1024 | Strong | Short context limit; requires chunking; good AR quality |

Recommendation for MENA legal AI:

  • Primary: voyage-multilingual-2 for AR+EN retrieval; or voyage-law-2 for English-only legal retrieval.
  • Privacy-sensitive deployments: jina-embeddings-v3 (self-hostable) or multilingual-e5-large (self-hostable).
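  • Cost-sensitive: text-embedding-3-small, with its default 1536 dimensions reduced to 256–512 through the API's dimensions parameter. The text-embedding-3 models are trained with Matryoshka Representation Learning (MRL), so shortened vectors keep most of their retrieval quality.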
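The dimension reduction mentioned for the cost-sensitive option can also be done locally on vectors you have already stored. The sketch below keeps the first N values and re-normalizes to unit length, which is the usual way to shorten an MRL-trained embedding. The function name is hypothetical, and the quality at a given size should be measured on your own test set rather than assumed.

```python
import math


def truncate_embedding(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` values of an MRL-style embedding and L2-normalize.

    Re-normalizing matters because cosine and dot-product search assume
    unit-length vectors, and a truncated vector is no longer unit length.
    """
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # all-zero prefix: nothing to normalize
    return [x / norm for x in head]


if __name__ == "__main__":
    full = [3.0, 4.0, 12.0]  # toy vector standing in for a 1536-dim embedding
    print(truncate_embedding(full, 2))
```

Truncating at query time and at index time must use the same target size, otherwise the vectors are not comparable.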

Legal documents require chunking that respects document structure. The strategies below run from weakest to strongest:

Fixed-window chunking (simplest, weakest)

Split at N tokens with K-token overlap. Loses clause structure. Use only for short documents.
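A minimal sketch of fixed-window chunking with overlap follows. It operates on a pre-tokenized list; in practice the tokens would come from the embedding model's own tokenizer, which is assumed here rather than shown. The function name is illustrative.

```python
def fixed_window_chunks(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split `tokens` into windows of `size` tokens sharing `overlap` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


if __name__ == "__main__":
    # Whitespace split is a stand-in for a real tokenizer.
    words = "The Receiving Party shall keep all Confidential Information secret".split()
    for chunk in fixed_window_chunks(words, size=4, overlap=1):
        print(" ".join(chunk))
```

Note that the window boundaries fall wherever the token count dictates, which is exactly why a clause can end up split across two chunks.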

Structure-aware chunking

Split at structural boundaries, in this order of preference:

  1. Headings: split at H1/H2/H3 (contract sections, article numbers).
  2. Clause boundaries: each numbered clause is a chunk candidate.
  3. Paragraph boundaries: fallback if clause structure is not present.

For contracts with defined-term sections: prepend the relevant definitions to every subsequent chunk that uses those terms before embedding it, so the chunk's vector carries the meaning of its defined terms (this increases retrieval precision for clause-level queries).
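The clause and paragraph levels of this approach, plus the definitions prefix, can be sketched as below. This is a naive heuristic under stated assumptions: clauses are detected by a line starting with a number such as "1." or "2.3", heading-level splitting is left out, and the definitions arrive as an already-parsed dictionary. All function names are hypothetical.

```python
import re

# Assumption: a clause starts on a line like "1. Term" or "2.3 Remedies".
CLAUSE_HEADING = re.compile(r"^\d+(?:\.\d+)*\.?\s+\S")


def split_at_clauses(text: str) -> list[str]:
    """Start a new chunk at every line that looks like a numbered clause."""
    chunks: list[str] = []
    current: list[str] = []
    for line in text.splitlines():
        if CLAUSE_HEADING.match(line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]


def split_at_paragraphs(text: str) -> list[str]:
    """Fallback: split on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def structural_chunks(text: str) -> list[str]:
    """Prefer clause boundaries; fall back to paragraphs if none are found."""
    has_clauses = any(CLAUSE_HEADING.match(line) for line in text.splitlines())
    return split_at_clauses(text) if has_clauses else split_at_paragraphs(text)


def with_definitions(chunks: list[str], definitions: dict[str, str]) -> list[str]:
    """Prepend only the defined terms that actually occur in each chunk."""
    result: list[str] = []
    for chunk in chunks:
        used = [f"{term}: {meaning}" for term, meaning in definitions.items() if term in chunk]
        prefix = ("Definitions:\n" + "\n".join(used) + "\n\n") if used else ""
        result.append(prefix + chunk)
    return result
```

Prepending only the terms a chunk uses keeps the prefix short, which matters for models with small input limits. A production splitter would also need to handle numbering styles this regular expression misses, such as "Article 5" or Arabic-script numerals.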

Hierarchical chunking (best for long contracts)

Embed at two granularities:

  • Document level: one embedding per document, summarizing the whole.
  • Clause level: one embedding per clause.

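On retrieval, first match at document level (coarse) to shortlist candidate documents, then rank the clause-level embeddings of only those documents (fine). On large corpora this two-stage retrieval avoids scanning every clause embedding, and it keeps clause matches anchored to documents that are relevant as a whole.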
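A minimal in-memory sketch of the two-stage lookup follows. It assumes the document and clause embeddings are already computed and held in plain dictionaries; a real system would run both stages as vector database queries. The function and parameter names are illustrative.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def two_stage_search(
    query_vec: list[float],
    doc_vecs: dict[str, list[float]],
    clause_vecs: dict[str, dict[str, list[float]]],
    top_docs: int = 2,
    top_clauses: int = 3,
) -> list[tuple[str, str, float]]:
    """Shortlist documents, then rank clauses from the shortlist only."""
    # Stage 1 (coarse): best-matching documents by document-level embedding.
    shortlisted = sorted(
        doc_vecs, key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]), reverse=True
    )[:top_docs]
    # Stage 2 (fine): score clauses, but only those in shortlisted documents.
    scored = [
        (cosine(query_vec, vec), doc_id, clause_id)
        for doc_id in shortlisted
        for clause_id, vec in clause_vecs.get(doc_id, {}).items()
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(doc_id, clause_id, score) for score, doc_id, clause_id in scored[:top_clauses]]
```

The trade-off is that a relevant clause inside a document that scores poorly as a whole is never seen by the second stage, so top_docs should be set generously and checked against the evaluation set.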

Metadata to store alongside embeddings

In the vector database, store these fields alongside each chunk's embedding:

{
  "chunk_id": "ulid",
  "document_id": "doc_xxx",
  "matter_id": "matter_xxx | kb_doc",
  "org_id": "org_xxx",
  "document_type": "NDA | SPA | Opinion | ...",
  "jurisdiction": "UAE | DIFC | KSA | ...",
  "governing_law": "UAE | English | ...",
  "language": "en | ar | bilingual",
  "section_title": "Article 5 — Indemnification",
  "chunk_index": 3,
  "total_chunks": 12,
  "approved_at": "ISO-8601",
  "access_tier": "firm-wide | practice | partner-only"
}

Use metadata filters on retrieval to enforce access control — never return a chunk to a user who lacks access to the parent document.
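The record above could be modeled as a typed structure that rejects malformed values before they reach the index. This is a sketch, not a schema from the source repository: the allowed values for language and access_tier are taken from the example record, and treating chunk_index as zero-based is an assumption.

```python
from dataclasses import asdict, dataclass

LANGUAGES = {"en", "ar", "bilingual"}
ACCESS_TIERS = {"firm-wide", "practice", "partner-only"}


@dataclass(frozen=True)
class ChunkMetadata:
    chunk_id: str
    document_id: str
    matter_id: str  # a matter identifier, or "kb_doc" for knowledge-base content
    org_id: str
    document_type: str
    jurisdiction: str
    governing_law: str
    language: str
    section_title: str
    chunk_index: int
    total_chunks: int
    approved_at: str  # ISO-8601 timestamp, kept as a string in this sketch
    access_tier: str

    def __post_init__(self) -> None:
        if not self.org_id:
            raise ValueError("org_id is required for tenant isolation")
        if self.language not in LANGUAGES:
            raise ValueError(f"unknown language: {self.language!r}")
        if self.access_tier not in ACCESS_TIERS:
            raise ValueError(f"unknown access_tier: {self.access_tier!r}")
        # Assumption: chunk_index is zero-based.
        if not 0 <= self.chunk_index < self.total_chunks:
            raise ValueError("chunk_index must be within [0, total_chunks)")

    def to_payload(self) -> dict:
        """Plain dict suitable for storing next to the vector."""
        return asdict(self)
```

Validating at write time means the access-control filters can rely on org_id and access_tier always being present and well-formed.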

Access control on retrieval

The embedding search pipeline must enforce tenant and document-access controls:

  1. Apply org_id = current_org filter to every query — never allow cross-tenant retrieval.
  2. Apply access_tier IN (user's_permitted_tiers) filter.
  3. For matter-specific documents: apply matter_id = current_matter OR matter_id = 'kb_doc'.

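These filters must be applied at the vector database query layer, not post-retrieval in application code: post-filtering pulls restricted chunks into application memory and can leave fewer results than requested once they are removed.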
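The three rules above can be combined into one filter expression that is passed along with the query vector. The sketch below builds a generic operator-style dictionary; the operator names ($and, $or, $eq, $in) are illustrative, and the exact filter syntax depends on the vector database in use. The function name is hypothetical.

```python
from typing import Optional


def build_retrieval_filter(
    org_id: str,
    permitted_tiers: set[str],
    matter_id: Optional[str] = None,
) -> dict:
    """Build a metadata filter enforcing tenant, tier, and matter scoping."""
    if not org_id:
        raise ValueError("org_id is mandatory: never query across tenants")
    if not permitted_tiers:
        raise ValueError("user has no permitted access tiers")
    clauses: list[dict] = [
        {"org_id": {"$eq": org_id}},                         # rule 1: tenant
        {"access_tier": {"$in": sorted(permitted_tiers)}},   # rule 2: tier
    ]
    if matter_id is not None:
        # Rule 3: the current matter's documents plus shared KB documents.
        clauses.append(
            {"$or": [
                {"matter_id": {"$eq": matter_id}},
                {"matter_id": {"$eq": "kb_doc"}},
            ]}
        )
    return {"$and": clauses}


if __name__ == "__main__":
    print(build_retrieval_filter("org_123", {"firm-wide", "practice"}, "matter_9"))
```

Raising on a missing org_id or empty tier set makes the query fail closed rather than run unfiltered.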

Evaluation

Evaluate embedding model quality on a legal domain test set using:

  • NDCG@10: ranking quality of retrieved results.
  • Recall@5: the fraction of all relevant documents that appear in the top 5 results.
  • MRR: mean reciprocal rank of the first relevant result.
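For reference, the three metrics can be computed as in the sketch below. It assumes each test case provides a ranked list of retrieved IDs plus either a set of relevant IDs (Recall, MRR) or graded gains per ID (NDCG). The NDCG here uses the linear-gain form with a log2 discount; other variants exist. Function names are illustrative.

```python
import math


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(cases: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (ranked, relevant) test cases."""
    if not cases:
        return 0.0
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in cases) / len(cases)


def ndcg_at_k(ranked: list[str], gains: dict[str, float], k: int) -> float:
    """DCG of the top k divided by the DCG of the ideal ordering."""
    dcg = sum(gains.get(item, 0.0) / math.log2(i + 2) for i, item in enumerate(ranked[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(i + 2) for i, gain in enumerate(ideal))
    return dcg / idcg if idcg else 0.0
```

Reporting all three together is useful because they fail differently: Recall@5 ignores ordering within the top 5, MRR looks only at the first hit, and NDCG@10 rewards placing the most relevant clauses highest.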

Use Langfuse (see [[eng-langfuse-eval-runner]]) to run systematic retrieval quality evaluations with legal practitioner-curated test cases.

Related skills

  • [[eng-context-cache-key-design]]
  • [[eng-fallback-model-cascade]]
  • [[eng-langfuse-eval-runner]]
  • [[eng-mcp-tool-registry]]