eng-supabase-index-knowledge-pipeline
Rating is derived from the repo's GitHub stars and shown for reference.
name: eng-supabase-index-knowledge-pipeline
description: Use when building or maintaining the pipeline that ingests legal knowledge (uploaded contracts, legislation, regulatory guidance) into the Supabase pgvector index used by the RAG retrieval system. Covers the full flow from document upload through text extraction, chunking, embedding, and vector storage, with multi-tenant isolation and audit logging.
license: MIT
metadata:
id: eng.supabase-index-knowledge-pipeline
category: eng
jurisdictions: [multi]
priority: P2
intent: [eng, supabase, vector, rag, knowledge-base, embedding]
related: [eng-rag-chunking-rules-legal-docs, eng-pii-redaction-preprocessor, eng-tenant-isolation-row-level-security, eng-supabase-edge-functions-patterns]
source: Louis — HAQQ Legal AI (github.com/sboghossian/mini-claude-for-legal)
version: "1.0"
Supabase Index Knowledge Pipeline
What it does
The index knowledge pipeline is the end-to-end process that takes a raw document (PDF, DOCX, text) uploaded by a legal user and turns it into searchable vector embeddings stored in Supabase with pgvector. When a user asks a question about "our standard supplier NDA" the retrieval system queries these embeddings to find the relevant clauses and passes them to the LLM as context.
The pipeline must be reliable, tenant-isolated, and auditable because it processes client-confidential legal documents.
Setup / auth
Supabase prerequisites:
-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;
-- Documents table
CREATE TABLE documents (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id UUID NOT NULL REFERENCES workspaces(id),
uploaded_by UUID NOT NULL REFERENCES auth.users(id),
filename TEXT NOT NULL,
doc_type TEXT NOT NULL, -- 'contract' | 'legislation' | 'judgment' | 'kb'
language TEXT NOT NULL,
status TEXT NOT NULL DEFAULT 'pending', -- 'pending' | 'indexing' | 'ready' | 'error'
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
indexed_at TIMESTAMPTZ
);
-- Chunks table with vector column
CREATE TABLE document_chunks (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
doc_id UUID NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
tenant_id UUID NOT NULL,
chunk_index INT NOT NULL,
chunk_type TEXT,
heading_text TEXT,
content TEXT NOT NULL,
embedding VECTOR(1536), -- OpenAI text-embedding-3-small dimensions
metadata JSONB,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Index for fast ANN search
CREATE INDEX ON document_chunks USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
-- RLS
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;
ALTER TABLE document_chunks ENABLE ROW LEVEL SECURITY;
CREATE POLICY "tenant isolation documents"
ON documents FOR ALL
USING (tenant_id = (SELECT workspace_id FROM memberships WHERE user_id = auth.uid()));
CREATE POLICY "tenant isolation chunks"
ON document_chunks FOR ALL
USING (tenant_id = (SELECT workspace_id FROM memberships WHERE user_id = auth.uid()));
Capabilities
Pipeline stages
Upload → Extract → Redact → Chunk → Embed → Store → Index-ready
- Upload — file lands in Supabase Storage bucket
documents/{tenantId}/{docId}. File size limit: 20 MB. Accepted types: PDF, DOCX, TXT. - Extract — trigger:
storage.objects.insertwebhook → Edge Functionprocess-document. Extract raw text usingpdf-parseormammoth. - Redact — call [[eng-pii-redaction-preprocessor]] with
mode: "hash". Store audit log. - Chunk — apply [[eng-rag-chunking-rules-legal-docs]] rules. Produce
Chunk[]with metadata. - Embed — batch-call the embedding API (OpenAI
text-embedding-3-smallor Cohereembed-multilingual-v3). Batch size: 100 chunks per call. For Arabic text, prefer Cohere multilingual or a model trained on Arabic data. - Store — bulk insert into
document_chunks. Updatedocuments.status = 'ready'andindexed_at. - Notify — emit PostHog event
document_indexed; send in-app notification to uploader.
Embedding model selection
| Language | Recommended model | Dimension |
|---|---|---|
| English | text-embedding-3-small |
1536 |
| Arabic | embed-multilingual-v3 (Cohere) |
1024 |
| French | text-embedding-3-small |
1536 |
| Mixed | embed-multilingual-v3 (Cohere) |
1024 |
For mixed-language corpora, standardize on the multilingual model across all languages to ensure cosine similarity is comparable across chunks.
Retrieval query pattern
async function retrieveRelevantChunks(
query: string,
tenantId: string,
docIds?: string[],
topK: number = 5
): Promise<Chunk[]> {
const embedding = await embed(query);
const { data } = await supabase.rpc("match_chunks", {
query_embedding: embedding,
tenant_id: tenantId,
doc_ids: docIds ?? null,
match_count: topK,
});
return data;
}
SQL function match_chunks:
CREATE FUNCTION match_chunks(
query_embedding VECTOR(1536),
tenant_id UUID,
doc_ids UUID[],
match_count INT
) RETURNS TABLE(id UUID, content TEXT, heading_text TEXT, metadata JSONB, similarity FLOAT)
LANGUAGE SQL AS $
SELECT id, content, heading_text, metadata,
1 - (embedding <=> query_embedding) AS similarity
FROM document_chunks
WHERE document_chunks.tenant_id = match_chunks.tenant_id
AND (doc_ids IS NULL OR doc_id = ANY(doc_ids))
ORDER BY embedding <=> query_embedding
LIMIT match_count;
$;
Usage patterns
Triggering reindex
If a document's chunks need to be refreshed (e.g., PII hash key rotated):
await supabase.functions.invoke("reindex-document", { body: { docId, tenantId } });
This deletes existing chunks (ON DELETE CASCADE handles it via the document delete + reinsert) and reruns from the extract step.
Incremental knowledge base updates (legislation)
For system-wide knowledge (legislation, regulatory guidance) that is not tenant-specific:
- Store in a dedicated
system_chunkstable withtenant_id = null. - Apply a separate RLS policy:
system_chunksare readable by all authenticated users. - Update on a weekly cron (see scheduled tasks).
Permissions & safety
- Raw (pre-redaction) text must never be written to any database table.
- Storage bucket
documents/must have RLS; only the owning tenant can read their files. - Embedding API calls must not include document metadata fields that contain PII (filename may contain party names — hash it before logging).
- The
pii_audit_logentry must be written before the chunk insert, not after. If chunking fails, the audit log still proves redaction ran.
Failure modes
| Failure | Impact | Mitigation |
|---|---|---|
| PDF text extraction returns empty | Zero chunks indexed; document silently not searchable | Validate extracted text length > 100 chars; set status = 'error' and notify user |
| Embedding API rate limit | Batch fails mid-way | Retry with exponential backoff; store last successful chunk index for resumption |
| Vector dimension mismatch | Insert fails | Validate embedding dimension against table definition before insert |
| Large doc > 20 MB | Upload rejected | Enforce client-side size limit; suggest splitting document |
| RLS misconfigured | Cross-tenant chunk retrieval | Integration test: verify that tenant B cannot retrieve tenant A's chunks |
Related skills
- [[eng-rag-chunking-rules-legal-docs]] — produces the chunks this pipeline stores
- [[eng-pii-redaction-preprocessor]] — runs before chunking in this pipeline
- [[eng-tenant-isolation-row-level-security]] — RLS policies applied to documents and chunks tables
- [[eng-supabase-edge-functions-patterns]] — the Edge Function patterns used by this pipeline