eng-rag-chunking-rules-legal-docs
name: eng-rag-chunking-rules-legal-docs
description: Use when implementing the document ingestion pipeline that splits legal texts into retrieval-optimized chunks for embedding and vector storage. Defines chunk-size targets, boundary rules, overlap strategy, and metadata tagging specific to the structure of legal documents (contracts, legislation, court decisions) across English, Arabic, and French.
license: MIT
metadata:
id: eng.RAG-chunking-rules-legal-docs
category: eng
jurisdictions: [multi]
priority: P2
intent: [eng, rag, chunking, embedding, document-processing]
related: [eng-pii-redaction-preprocessor, eng-supabase-index-knowledge-pipeline, eng-supabase-edge-functions-patterns, eng-tenant-isolation-row-level-security]
source: Louis — HAQQ Legal AI (github.com/sboghossian/mini-claude-for-legal)
version: "1.0"
RAG Chunking Rules — Legal Documents
What it does
Legal documents have a rigid hierarchical structure (Parts → Articles → Clauses → Sub-clauses) that generic text splitters destroy. When a 200-word Article is split across two chunks, neither chunk contains the full legal proposition, so a query can retrieve a rule without its conditions or exceptions. This skill defines the chunking strategy for the document ingestion pipeline so that retrieval returns complete, coherent legal units.
Setup / auth
No separate auth. The chunker runs as part of the document ingestion pipeline, after PII redaction ([[eng-pii-redaction-preprocessor]]) and before embedding/storage ([[eng-supabase-index-knowledge-pipeline]]).
Dependencies: a PDF/DOCX text extractor (e.g., pdf-parse, mammoth), a boundary-detection function, and a vector DB insert (Supabase pgvector).
Capabilities
Chunk-size targets
| Document type | Target chunk size | Max (tokens) | Overlap |
|---|---|---|---|
| Contract clause (short) | 300–500 tokens | 600 | 50 tokens |
| Contract clause (long) | 500–800 tokens | 1000 | 100 tokens |
| Legislative article | 200–600 tokens | 800 | 0 (articles are atomic) |
| Court judgment paragraph | 400–700 tokens | 900 | 80 tokens |
| Recital / preamble | Keep as single chunk | 500 | 0 |
| Definition section | Split by defined term | 200–400 | 0 |
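Use token counts (cl100k_base or equivalent), not character counts. Character counts fail badly for Arabic, which produces more tokens per character than English under such tokenizers, so a character budget tuned on English overshoots the token limits.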
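A minimal sketch of how the numeric rows of the table could be encoded and checked. All names here (ChunkKind, CHUNK_LIMITS, classifySize) are hypothetical, and the default counter is a whitespace word count standing in for a real cl100k_base tokenizer. The recital and definition rows are left out because they have no numeric target range.

```typescript
// Hypothetical names; the numbers mirror the chunk-size table above.
type ChunkKind =
  | "contract_clause_short"
  | "contract_clause_long"
  | "legislative_article"
  | "judgment_paragraph";

interface ChunkLimits {
  targetMin: number; // lower end of the target range, in tokens
  targetMax: number; // upper end of the target range, in tokens
  hardMax: number; // chunks must not exceed this many tokens
  overlap: number; // tokens carried over from the previous chunk
}

const CHUNK_LIMITS: Record<ChunkKind, ChunkLimits> = {
  contract_clause_short: { targetMin: 300, targetMax: 500, hardMax: 600, overlap: 50 },
  contract_clause_long: { targetMin: 500, targetMax: 800, hardMax: 1000, overlap: 100 },
  legislative_article: { targetMin: 200, targetMax: 600, hardMax: 800, overlap: 0 },
  judgment_paragraph: { targetMin: 400, targetMax: 700, hardMax: 900, overlap: 80 },
};

type TokenCounter = (text: string) => number;

// Stand-in counter (assumption): whitespace-separated words. A real
// tokenizer typically reports more tokens than this, especially for Arabic.
const roughTokenCount: TokenCounter = (text) => {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
};

type SizeVerdict = "below_target" | "on_target" | "above_target" | "over_hard_max";

// Compares a chunk's token count with the limits for its document type.
function classifySize(
  text: string,
  kind: ChunkKind,
  count: TokenCounter = roughTokenCount
): SizeVerdict {
  const limits = CHUNK_LIMITS[kind];
  const n = count(text);
  if (n > limits.hardMax) return "over_hard_max";
  if (n > limits.targetMax) return "above_target";
  if (n < limits.targetMin) return "below_target";
  return "on_target";
}
```

Passing the counter as a parameter keeps the limits table independent of the tokenizer, so the same check works once a real token counter is plugged in.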
Boundary detection rules — in priority order
- Article / Clause heading: any line matching patterns such as Article \d+, المادة \d+, Clause \d+, Section \d+, Article \d+ (French). Always start a new chunk at a heading boundary.
- Numbered paragraph: (\d+\.\d+) or (\d+)\. at line start within an article. Sub-clauses ≥ 100 tokens get their own chunk; shorter sub-clauses are merged into the parent clause chunk.
- Double newline: blank line between paragraphs in a judgment. Treat as a soft boundary: merge with next paragraph if combined length < target.
- Sentence boundary (fallback only): split at period + capitalized word only when a chunk would otherwise exceed the hard maximum. Never split within a sentence.
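The first rule, always starting a new chunk at a heading, can be sketched as a line scan. This is an illustration under assumptions rather than the pipeline's actual detector: Section, HEADING_PATTERN and splitAtHeadings are hypothetical names, and the pattern only covers the keywords listed above with ASCII digits.

```typescript
interface Section {
  heading: string | null; // null for text that precedes the first heading
  body: string;
}

// Keywords from the rule list above. The French "Article" shares the English
// pattern. ASCII digits only (assumption); extend for other numbering styles.
const HEADING_PATTERN = /^(Article|Clause|Section|المادة)\s+\d+/;

// Walks the text line by line and opens a new section at every heading line.
function splitAtHeadings(text: string): Section[] {
  const sections: Section[] = [];
  let current: Section = { heading: null, body: "" };
  const flush = (): void => {
    // Skip an empty preamble, but keep headed sections even if they are empty.
    if (current.heading !== null || current.body.trim() !== "") {
      sections.push({ heading: current.heading, body: current.body.trim() });
    }
  };
  for (const line of text.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (HEADING_PATTERN.test(trimmed)) {
      flush();
      current = { heading: trimmed, body: "" };
    } else {
      current.body += (current.body === "" ? "" : "\n") + line;
    }
  }
  flush();
  return sections;
}
```

A body line that happens to begin with a cross-reference such as "Article 5 applies to..." would be treated as a heading by this sketch, so a production detector likely needs extra signals (line length, formatting, numbering sequence).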
Metadata to attach to every chunk
interface ChunkMetadata {
  docId: string;          // UUID of parent document
  tenantId: string;       // firm/workspace UUID
  chunkIndex: number;     // 0-based sequence within doc
  chunkType: "clause" | "article" | "recital" | "definition" | "judgment_para" | "other";
  headingText?: string;   // nearest parent heading (e.g., "Article 7 — Confidentiality")
  pageNumber?: number;    // from PDF extractor
  language: "ar" | "en" | "fr" | "mixed";
  jurisdiction?: string;  // if detectable from doc metadata
  documentType: "contract" | "legislation" | "judgment" | "regulation" | "unknown";
  charStart: number;      // character offset in original text
  charEnd: number;
}
The headingText field is the single most important retrieval aid: it tells the LLM "this chunk is from Article 7 — Confidentiality" before the user even asks.
Arabic-specific rules
- Arabic contracts number articles right-to-left in the source PDF but the text extraction linearizes them left-to-right. Detect Arabic headings with: /^(المادة|البند|الفقرة)\s+\d+/.
- Arabic legal text is denser (more information per token) than English. Use the smaller end of the target range.
- Mixed AR/EN contracts (Arabizi headers + English body): detect language per paragraph and tag language: "mixed".
- Diacritics (tashkeel) inflate token counts but add no retrieval value. Strip before embedding; keep in stored raw text.
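The rules above can be sketched as three small helpers. The function names are hypothetical, and several choices are assumptions rather than part of the skill: the tashkeel range is limited to U+064B..U+0652, the heading pattern additionally accepts Arabic-Indic digits (JavaScript's \d matches only ASCII 0-9), and script detection is a simple character count that cannot tell English from French.

```typescript
// Tashkeel marks U+064B..U+0652 (fathatan through sukun). Assumption: this
// range is sufficient for the corpus; extend it if other marks appear.
const TASHKEEL = /[\u064B-\u0652]/g;

// Removes diacritics for the embedding copy; callers keep the raw text.
function stripTashkeel(text: string): string {
  return text.replace(TASHKEEL, "");
}

// Same keywords as the heading regex above, also accepting Arabic-Indic
// digits (U+0660..U+0669), which \d alone would miss.
const ARABIC_HEADING = /^(المادة|البند|الفقرة)\s+[0-9\u0660-\u0669]+/;

function isArabicHeading(line: string): boolean {
  return ARABIC_HEADING.test(stripTashkeel(line.trim()));
}

type Lang = "ar" | "en" | "fr" | "mixed";

// Classifies one paragraph by whichever script contributes more letters.
function scriptOf(paragraph: string): "ar" | "latin" | "none" {
  const arabic = (paragraph.match(/[\u0600-\u06FF]/g) || []).length;
  const latin = (paragraph.match(/[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]/g) || []).length;
  if (arabic === 0 && latin === 0) return "none";
  return arabic >= latin ? "ar" : "latin";
}

// Tags the document "mixed" when its paragraphs disagree on script.
// Latin-script text falls back to latinDefault, since separating English
// from French needs a real language detector.
function tagLanguage(text: string, latinDefault: "en" | "fr" = "en"): Lang {
  const scripts = text.split(/\n\s*\n/).map((p) => scriptOf(p));
  const hasArabic = scripts.indexOf("ar") !== -1;
  const hasLatin = scripts.indexOf("latin") !== -1;
  if (hasArabic && hasLatin) return "mixed";
  if (hasArabic) return "ar";
  return latinDefault;
}
```

Headings that spell the number out as an ordinal word rather than digits would not match ARABIC_HEADING and would need a separate pattern.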
Overlap strategy
- Use overlap only for contract clauses and judgment paragraphs — not for legislative articles (which are self-contained by design).
- Overlap is a trailing tail: prepend the last k tokens of chunk i to the start of chunk i+1. Do not repeat the heading, because duplicate headings confuse rerankers.
- Never use overlap across article/clause heading boundaries. The heading itself acts as the semantic bridge.
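A minimal sketch of these overlap rules, assuming each piece already carries the heading it belongs to. Piece, lastTokens and applyOverlap are hypothetical names, and tokens are approximated by whitespace-separated words; a real pipeline would slice on tokenizer output instead.

```typescript
interface Piece {
  heading: string | null; // heading this piece belongs to
  text: string; // body text, without the heading
}

// Stand-in tokenization (assumption): whitespace-separated words.
function lastTokens(text: string, k: number): string {
  if (k <= 0) return "";
  const words = text.trim().split(/\s+/).filter((w) => w !== "");
  return words.slice(Math.max(0, words.length - k)).join(" ");
}

// Prepends the tail of the previous piece to the next one, but only when both
// pieces sit under the same heading. The tail is taken from the original
// text, so overlap never cascades, and the heading is never copied.
function applyOverlap(pieces: Piece[], k: number): Piece[] {
  return pieces.map((piece, i) => {
    const prev = i > 0 ? pieces[i - 1] : undefined;
    if (prev === undefined || prev.heading !== piece.heading) return piece;
    const tail = lastTokens(prev.text, k);
    return tail === "" ? piece : { heading: piece.heading, text: tail + " " + piece.text };
  });
}
```

For legislative articles the caller would pass k = 0, which leaves every piece unchanged.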
Usage patterns
Basic pipeline
async function ingestDocument(file: File, tenantId: string): Promise<void> {
  const rawText = await extractText(file); // PDF/DOCX → string
  const { redacted } = redactPII(rawText, { mode: "hash" }); // safety gate
  const chunks = chunkLegalDocument(redacted, {
    language: detectLanguage(redacted),
    documentType: classifyDocumentType(redacted),
  });
  const embeddings = await embedBatch(chunks.map(c => c.text));
  await storeChunks(chunks, embeddings, tenantId);
}
chunkLegalDocument function contract
function chunkLegalDocument(
  text: string,
  options: { language: "ar" | "en" | "fr" | "mixed"; documentType: string }
): Array<{ text: string; metadata: ChunkMetadata }>;
Retrieval hint
When retrieving, always include headingText and documentType in the returned context so the LLM can attribute the source clause in its response.
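One way to follow this hint is to prefix each retrieved chunk with a short source label before handing it to the LLM. The sketch below is illustrative only: RetrievedChunk, formatForContext, buildContext and the bracketed label format are assumptions, and the page-number fallback mirrors the mitigation listed under failure modes.

```typescript
interface RetrievedChunk {
  text: string;
  headingText?: string;
  documentType: "contract" | "legislation" | "judgment" | "regulation" | "unknown";
  pageNumber?: number;
}

// Builds a "[documentType | source]" label followed by the chunk text.
// Falls back to the page number when no heading was captured.
function formatForContext(chunk: RetrievedChunk): string {
  let source = "unlabelled section";
  if (chunk.headingText !== undefined) {
    source = chunk.headingText;
  } else if (chunk.pageNumber !== undefined) {
    source = "page " + chunk.pageNumber;
  }
  return "[" + chunk.documentType + " | " + source + "]\n" + chunk.text;
}

// Joins the labelled chunks into one context string, in retrieval order.
function buildContext(chunks: RetrievedChunk[]): string {
  return chunks.map((c) => formatForContext(c)).join("\n\n");
}
```

Keeping the label on its own line lets the model quote the heading verbatim when it attributes a clause.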
Permissions & safety
- Chunking must occur after PII redaction. Never embed PII.
- Each chunk's tenantId must be set before database insert; the RLS policy (see [[eng-tenant-isolation-row-level-security]]) enforces that only the owning tenant can retrieve it.
- Chunks of documents the user has not shared with the AI should never be returned by vector search. Apply a WHERE tenant_id = current_tenant_id() AND doc_id = ANY() filter.
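The database filter and the RLS policy remain the primary controls. As a defence-in-depth illustration, the same two conditions can also be checked in application code before insert and after retrieval. Everything in this sketch is hypothetical, including the StoredChunk shape and the sharedDocIds list, which stands for however the application tracks the documents shared with the AI (the source does not specify this).

```typescript
interface StoredChunk {
  text: string;
  tenantId: string;
  docId: string;
}

// Pre-insert guard: refuse any chunk whose tenantId is empty.
function assertTenantSet(chunks: StoredChunk[]): void {
  for (const chunk of chunks) {
    if (chunk.tenantId.trim() === "") {
      throw new Error("chunk for doc " + chunk.docId + " has no tenantId");
    }
  }
}

// Post-query check mirroring the tenant and shared-document conditions.
// This does not replace the SQL filter or RLS; it only catches regressions.
function keepAuthorized(
  results: StoredChunk[],
  tenantId: string,
  sharedDocIds: string[]
): StoredChunk[] {
  return results.filter(
    (c) => c.tenantId === tenantId && sharedDocIds.indexOf(c.docId) !== -1
  );
}
```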
Failure modes
| Failure | Impact | Mitigation |
|---|---|---|
| Oversized chunks | Retrieval returns too much irrelevant context; LLM loses focus | Enforce the hard max token count; split recursively if needed |
| Undersized chunks | Single clause split across multiple chunks; proposition incomplete | Merge short adjacent sub-clauses until target met |
| Missing headingText | LLM can't attribute source | Regex heading detection pass before chunking; fallback to page number |
| Arabic text not detected | Chunk-size targets and heading patterns chosen for the wrong language | Run language detection before choosing chunk-size targets |
| Chunking before PII redaction | PII in vector store | Always run redaction first in pipeline |
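The first two mitigations can be sketched as a pair of helpers: one packs whole sentences into chunks that stay under the hard maximum, the other merges short neighbours until a minimum size is reached. The names are hypothetical, tokens are approximated by whitespace-separated words, and the sentence splitter uses the "period + capitalized word" fallback rule, so it does not handle Arabic or abbreviations followed by a capital.

```typescript
type CountTokens = (text: string) => number;

// Stand-in counter (assumption): whitespace-separated words.
const wordCount: CountTokens = (text) => {
  const t = text.trim();
  return t === "" ? 0 : t.split(/\s+/).length;
};

// Splits after ., ! or ? when followed by whitespace and a capital letter.
function splitSentences(text: string): string[] {
  const boundary = /[.!?]\s+(?=[A-Z])/g;
  const sentences: string[] = [];
  let start = 0;
  let m: RegExpExecArray | null;
  while ((m = boundary.exec(text)) !== null) {
    const end = m.index + 1; // keep the terminating punctuation
    sentences.push(text.slice(start, end).trim());
    start = end;
  }
  const rest = text.slice(start).trim();
  if (rest !== "") sentences.push(rest);
  return sentences;
}

// Greedily packs whole sentences into chunks of at most hardMax tokens.
// A single sentence longer than hardMax is kept whole, because sentences
// are never split; such a chunk still needs manual review.
function splitToHardMax(text: string, hardMax: number, count: CountTokens = wordCount): string[] {
  if (count(text) <= hardMax) return [text.trim()];
  const chunks: string[] = [];
  let current = "";
  for (const sentence of splitSentences(text)) {
    const candidate = current === "" ? sentence : current + " " + sentence;
    if (current !== "" && count(candidate) > hardMax) {
      chunks.push(current);
      current = sentence;
    } else {
      current = candidate;
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}

// Merges each part into its predecessor while the predecessor is still below
// targetMin. Pass only the sub-clauses of one parent clause; a short final
// part stays on its own, and results should be re-checked against hardMax.
function mergeShort(parts: string[], targetMin: number, count: CountTokens = wordCount): string[] {
  const merged: string[] = [];
  for (const part of parts) {
    const last = merged.length > 0 ? merged[merged.length - 1] : undefined;
    if (last !== undefined && count(last) < targetMin) {
      merged[merged.length - 1] = last + "\n" + part;
    } else {
      merged.push(part);
    }
  }
  return merged;
}
```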
Related skills
- [[eng-pii-redaction-preprocessor]] — must run before chunking
- [[eng-supabase-index-knowledge-pipeline]] — stores the chunks and embeddings
- [[eng-tenant-isolation-row-level-security]] — RLS that scopes chunk retrieval
- [[eng-supabase-edge-functions-patterns]] — deployment pattern for the ingestion worker