eng-supabase-index-knowledge-pipeline

Category: Documents Risk: Medium risk ★ 3.9 · Rating 3.9/5 (8) sboghossian/mini-claude-for-legal MIT

Rating is derived from the repo's GitHub stars and shown for reference.

network_accessfilesystem_access

name: eng-supabase-index-knowledge-pipeline
description: Use when building or maintaining the pipeline that ingests legal knowledge (uploaded contracts, legislation, regulatory guidance) into the Supabase pgvector index used by the RAG retrieval system. Covers the full flow from document upload through text extraction, chunking, embedding, and vector storage, with multi-tenant isolation and audit logging.
license: MIT
metadata:
id: eng.supabase-index-knowledge-pipeline
category: eng
jurisdictions: [multi]
priority: P2
intent: [eng, supabase, vector, rag, knowledge-base, embedding]
related: [eng-rag-chunking-rules-legal-docs, eng-pii-redaction-preprocessor, eng-tenant-isolation-row-level-security, eng-supabase-edge-functions-patterns]
source: Louis — HAQQ Legal AI (github.com/sboghossian/mini-claude-for-legal)
version: "1.0"

Supabase Index Knowledge Pipeline

What it does

The index knowledge pipeline is the end-to-end process that takes a raw document (PDF, DOCX, text) uploaded by a legal user and turns it into searchable vector embeddings stored in Supabase with pgvector. When a user asks a question about "our standard supplier NDA" the retrieval system queries these embeddings to find the relevant clauses and passes them to the LLM as context.

The pipeline must be reliable, tenant-isolated, and auditable because it processes client-confidential legal documents.

Setup / auth

Supabase prerequisites:

-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Documents table
CREATE TABLE documents (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id UUID NOT NULL REFERENCES workspaces(id),
  uploaded_by UUID NOT NULL REFERENCES auth.users(id),
  filename TEXT NOT NULL,
  doc_type TEXT NOT NULL,    -- 'contract' | 'legislation' | 'judgment' | 'kb'
  language TEXT NOT NULL,
  status TEXT NOT NULL DEFAULT 'pending',   -- 'pending' | 'indexing' | 'ready' | 'error'
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  indexed_at TIMESTAMPTZ
);

-- Chunks table with vector column
CREATE TABLE document_chunks (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  doc_id UUID NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
  tenant_id UUID NOT NULL,
  chunk_index INT NOT NULL,
  chunk_type TEXT,
  heading_text TEXT,
  content TEXT NOT NULL,
  embedding VECTOR(1536),    -- OpenAI text-embedding-3-small dimensions
  metadata JSONB,
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Index for fast ANN search
CREATE INDEX ON document_chunks USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);

-- RLS
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;
ALTER TABLE document_chunks ENABLE ROW LEVEL SECURITY;

CREATE POLICY "tenant isolation documents"
  ON documents FOR ALL
  USING (tenant_id = (SELECT workspace_id FROM memberships WHERE user_id = auth.uid()));

CREATE POLICY "tenant isolation chunks"
  ON document_chunks FOR ALL
  USING (tenant_id = (SELECT workspace_id FROM memberships WHERE user_id = auth.uid()));

Capabilities

Pipeline stages

Upload → Extract → Redact → Chunk → Embed → Store → Index-ready
  1. Upload — file lands in Supabase Storage bucket documents/{tenantId}/{docId}. File size limit: 20 MB. Accepted types: PDF, DOCX, TXT.
  2. Extract — trigger: storage.objects.insert webhook → Edge Function process-document. Extract raw text using pdf-parse or mammoth.
  3. Redact — call [[eng-pii-redaction-preprocessor]] with mode: "hash". Store audit log.
  4. Chunk — apply [[eng-rag-chunking-rules-legal-docs]] rules. Produce Chunk[] with metadata.
  5. Embed — batch-call the embedding API (OpenAI text-embedding-3-small or Cohere embed-multilingual-v3). Batch size: 100 chunks per call. For Arabic text, prefer Cohere multilingual or a model trained on Arabic data.
  6. Store — bulk insert into document_chunks. Update documents.status = 'ready' and indexed_at.
  7. Notify — emit PostHog event document_indexed; send in-app notification to uploader.

Embedding model selection

Language Recommended model Dimension
English text-embedding-3-small 1536
Arabic embed-multilingual-v3 (Cohere) 1024
French text-embedding-3-small 1536
Mixed embed-multilingual-v3 (Cohere) 1024

For mixed-language corpora, standardize on the multilingual model across all languages to ensure cosine similarity is comparable across chunks.

Retrieval query pattern

async function retrieveRelevantChunks(
  query: string,
  tenantId: string,
  docIds?: string[],
  topK: number = 5
): Promise<Chunk[]> {
  const embedding = await embed(query);

  const { data } = await supabase.rpc("match_chunks", {
    query_embedding: embedding,
    tenant_id: tenantId,
    doc_ids: docIds ?? null,
    match_count: topK,
  });
  return data;
}

SQL function match_chunks:

CREATE FUNCTION match_chunks(
  query_embedding VECTOR(1536),
  tenant_id UUID,
  doc_ids UUID[],
  match_count INT
) RETURNS TABLE(id UUID, content TEXT, heading_text TEXT, metadata JSONB, similarity FLOAT)
LANGUAGE SQL AS $
  SELECT id, content, heading_text, metadata,
         1 - (embedding <=> query_embedding) AS similarity
  FROM document_chunks
  WHERE document_chunks.tenant_id = match_chunks.tenant_id
    AND (doc_ids IS NULL OR doc_id = ANY(doc_ids))
  ORDER BY embedding <=> query_embedding
  LIMIT match_count;
$;

Usage patterns

Triggering reindex

If a document's chunks need to be refreshed (e.g., PII hash key rotated):

await supabase.functions.invoke("reindex-document", { body: { docId, tenantId } });

This deletes existing chunks (ON DELETE CASCADE handles it via the document delete + reinsert) and reruns from the extract step.

Incremental knowledge base updates (legislation)

For system-wide knowledge (legislation, regulatory guidance) that is not tenant-specific:

  • Store in a dedicated system_chunks table with tenant_id = null.
  • Apply a separate RLS policy: system_chunks are readable by all authenticated users.
  • Update on a weekly cron (see scheduled tasks).

Permissions & safety

  • Raw (pre-redaction) text must never be written to any database table.
  • Storage bucket documents/ must have RLS; only the owning tenant can read their files.
  • Embedding API calls must not include document metadata fields that contain PII (filename may contain party names — hash it before logging).
  • The pii_audit_log entry must be written before the chunk insert, not after. If chunking fails, the audit log still proves redaction ran.

Failure modes

Failure Impact Mitigation
PDF text extraction returns empty Zero chunks indexed; document silently not searchable Validate extracted text length > 100 chars; set status = 'error' and notify user
Embedding API rate limit Batch fails mid-way Retry with exponential backoff; store last successful chunk index for resumption
Vector dimension mismatch Insert fails Validate embedding dimension against table definition before insert
Large doc > 20 MB Upload rejected Enforce client-side size limit; suggest splitting document
RLS misconfigured Cross-tenant chunk retrieval Integration test: verify that tenant B cannot retrieve tenant A's chunks
  • [[eng-rag-chunking-rules-legal-docs]] — produces the chunks this pipeline stores
  • [[eng-pii-redaction-preprocessor]] — runs before chunking in this pipeline
  • [[eng-tenant-isolation-row-level-security]] — RLS policies applied to documents and chunks tables
  • [[eng-supabase-edge-functions-patterns]] — the Edge Function patterns used by this pipeline