eng-pii-redaction-preprocessor

name: eng-pii-redaction-preprocessor
description: Use when building or configuring the PII redaction layer that sanitizes user-submitted text before it is sent to an LLM or stored in a database. Covers entity detection patterns, redaction strategies, audit logging, and reconstruction for legal AI pipelines where client-confidential data must never leak into training or third-party model calls.
license: MIT
metadata:
  id: eng.PII-redaction-preprocessor
  category: eng
  jurisdictions: [multi]
  priority: P2
  intent: [eng, pii, redaction, privacy, preprocessing]
  related: [eng-tenant-isolation-row-level-security, eng-supabase-edge-functions-patterns, eng-rag-chunking-rules-legal-docs, safety-client-confidentiality-cross-tenant]
  source: Louis — HAQQ Legal AI (github.com/sboghossian/mini-claude-for-legal)
  version: "1.0"

PII Redaction Preprocessor

What it does

The PII redaction preprocessor is a text-transformation pipeline stage that detects personally identifiable information (PII) in user-submitted legal content and removes or masks it before that content is forwarded to an external LLM API, stored in a search index, or written to a shared database table. In a multi-tenant legal AI product this is a critical safety control: counterparty names, passport numbers, or contract figures belonging to a client of Firm A must never end up in a vector store that another tenant can retrieve.

The preprocessor runs synchronously in the request path (≤ 50 ms budget) or, for large document uploads, as an async pre-indexing job.

Setup / auth

No external auth required. The preprocessor is a pure-text pipeline component that can be deployed as:

A Supabase Edge Function (functions/redact-pii/index.ts) invoked before any call to the LLM or embedding API.
A middleware layer in the Express/Hono backend, mounted before the chat-route handler.
A standalone Node.js worker for batch document ingestion.

Dependencies: Microsoft Presidio, called over the REST API of its analyzer and anonymizer services, or a custom regex+NER approach. For MENA content, a regex-first approach with Arabic-aware patterns is more reliable than English-trained NER models.

Capabilities

Entity types to detect and redact

Category	Examples	Redaction token
Full name	"Ahmad Al-Rashidi", "Marie Dupont"	`[PERSON]`
National ID / Iqama / Emirates ID	784-XXXX-XXXXXXX-X	`[NATIONAL_ID]`
Passport number	7–9 alphanumeric characters, matched near a keyword such as "passport" to limit false positives	`[PASSPORT]`
Phone number	+961-X-XXX-XXXX, +971-5X-XXXXXXX	`[PHONE]`
Email address	RFC 5321 pattern	`[EMAIL]`
IBAN / bank account	LB62XXXX…	`[FINANCIAL_ACCOUNT]`
Company registration number	Lebanon: ش.م.م XXXXXX, UAE: CN-XXXXXXXX	`[COMPANY_REG]`
Address / property number	plot numbers, parcel IDs	`[ADDRESS]`
Date of birth	explicit DOB patterns	`[DOB]`
Contract monetary amounts	optional — flag rather than redact	`[AMOUNT]`
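
A few rows of the table can be sketched as a regex-first detector. This is a minimal illustration, not the repository's implementation: the names `EntitySpan`, `PATTERNS`, and `detectEntities` are hypothetical, and the patterns are simplified assumptions that would need tuning against real documents.

```typescript
// Illustrative regex-first detector. All names and patterns are assumptions.
type EntityType = "EMAIL" | "NATIONAL_ID" | "PHONE" | "FINANCIAL_ACCOUNT";

interface EntitySpan {
  type: EntityType;
  start: number; // inclusive offset into the input string
  end: number; // exclusive offset
}

const PATTERNS: Array<[EntityType, RegExp]> = [
  ["EMAIL", /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g],
  // Emirates ID layout: 784-XXXX-XXXXXXX-X
  ["NATIONAL_ID", /\b784-\d{4}-\d{7}-\d\b/g],
  // Lebanese IBAN: "LB", two check digits, then 24 alphanumerics
  ["FINANCIAL_ACCOUNT", /\bLB\d{2}[A-Z0-9]{24}\b/g],
  // +961 / +971 numbers with optional dash or space separators
  ["PHONE", /\+(?:961|971)[-\s]?\d{1,2}[-\s]?\d{3}[-\s]?\d{3,4}/g],
];

function detectEntities(text: string): EntitySpan[] {
  const found: EntitySpan[] = [];
  for (const [type, pattern] of PATTERNS) {
    pattern.lastIndex = 0; // global regexes keep state between calls
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(text)) !== null) {
      found.push({ type, start: match.index, end: match.index + match[0].length });
    }
  }
  // Sort by position (longer span first on ties), then drop overlapping spans.
  found.sort((a, b) => a.start - b.start || b.end - a.end);
  const spans: EntitySpan[] = [];
  for (const span of found) {
    const last = spans[spans.length - 1];
    if (last === undefined || span.start >= last.end) spans.push(span);
  }
  return spans;
}
```

Returning offsets rather than matched strings keeps the detector's output safe to pass to the audit log, which records positions only.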

Redaction modes

Replace (default): substitute with typed token ([PERSON]). Preserves sentence structure for downstream LLM comprehension.
Mask: overwrite with █ characters of equal length. Used for rendered PDFs.
Hash: replace with a consistent HMAC-SHA256 keyed hash. Allows cross-document entity resolution without revealing the value. Use for analytics pipelines where the same entity must stay linkable across documents. The hash cannot be reversed; a holder of the controlled key can only confirm a candidate value by recomputing its HMAC.
Delete: remove entity and surrounding whitespace. Use only when the surrounding sentence still makes sense.
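
The four modes can be sketched as a single function over detected spans. This is a hedged illustration: `applyRedaction`, `Span`, and the `[TYPE:digest]` token layout for hash mode are assumptions rather than the project's API. Only `createHmac` from Node's built-in `node:crypto` module is a real library call.

```typescript
import { createHmac } from "node:crypto";

type RedactionMode = "replace" | "mask" | "hash" | "delete";

interface Span {
  type: string;
  start: number; // inclusive
  end: number; // exclusive
}

// Applies one redaction mode to every span. Spans are assumed to be sorted by
// start offset and non-overlapping.
function applyRedaction(text: string, spans: Span[], mode: RedactionMode, hashKey = ""): string {
  let out = "";
  let cursor = 0;
  for (const span of spans) {
    out += text.slice(cursor, span.start);
    const value = text.slice(span.start, span.end);
    cursor = span.end;
    if (mode === "replace") {
      out += `[${span.type}]`;
    } else if (mode === "mask") {
      // One block character per UTF-16 code unit of the original value.
      out += "█".repeat(value.length);
    } else if (mode === "hash") {
      if (hashKey === "") throw new Error("hash mode requires a key");
      const digest = createHmac("sha256", hashKey).update(value, "utf8").digest("hex");
      out += `[${span.type}:${digest}]`; // token layout is an assumption
    } else {
      // delete: drop the entity and, if that would leave two spaces, one of them.
      if (out.endsWith(" ") && text[cursor] === " ") cursor += 1;
    }
  }
  return out + text.slice(cursor);
}
```

Because the HMAC is deterministic for a given key, the same name yields the same token in every document of a tenant, while a different tenant key yields an unrelated token.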

Usage patterns

Pattern 1 — Inline chat message preprocessing

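// supabase/functions/chat/index.ts (simplified)
import { redactPII } from "../_shared/pii-redactor.ts";
// detectLocale and callClaude are app-level helpers defined elsewhere in the project.

Deno.serve(async (req: Request) => {
  // An Edge Function receives a standard Request, so the JSON body is parsed explicitly.
  const { message: userMessage } = await req.json();
  const { redacted, auditLog } = redactPII(userMessage, {
    mode: "replace",
    locale: detectLocale(userMessage), // "ar" | "en" | "fr"
  });

  // Forward only the redacted text to the LLM; persist auditLog for compliance (see Pattern 3).
  const llmResponse = await callClaude(redacted);
  return Response.json({ reply: llmResponse }); // response shape is illustrative
});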

Pattern 2 — Document ingestion before embedding

// Before chunking and embedding an uploaded contract PDF
const rawText = await extractTextFromPDF(file);
// TENANT_HASH_KEY is the per-tenant HMAC key, loaded at runtime from Supabase Vault.
const { redacted, auditLog } = redactPII(rawText, { mode: "hash", hashKey: TENANT_HASH_KEY });
const chunks = chunkLegalDocument(redacted); // see eng-rag-chunking-rules-legal-docs
await embedAndStore(chunks, tenantId);
// Persist auditLog as described in Pattern 3.

Pattern 3 — Audit trail

Every redaction run should emit a structured log entry:

{
  "runId": "uuid",
  "tenantId": "t_xxx",
  "userId": "u_xxx",
  "entitiesFound": [
    { "type": "PERSON", "count": 3, "positions": [[12, 24], [88, 102], [211, 225]] },
    { "type": "PHONE", "count": 1, "positions": [[400, 413]] }
  ],
  "mode": "replace",
  "inputLengthChars": 4820,
  "processedAt": "2026-05-14T09:00:00Z"
}

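Store each entry in the pii_audit_log table (append-only, tenant-scoped, RLS-protected). As in the example above, an entry records entity types and offsets only, never the matched values.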
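
A log entry of this shape could be assembled from detected spans roughly as follows. This is a sketch under assumptions: `buildAuditEntry` and its `ctx` parameter are hypothetical names, and the field names simply mirror the JSON example. `randomUUID` comes from Node's built-in `node:crypto` module.

```typescript
import { randomUUID } from "node:crypto";

interface Span {
  type: string;
  start: number;
  end: number;
}

interface EntitySummary {
  type: string;
  count: number;
  positions: Array<[number, number]>;
}

interface AuditEntry {
  runId: string;
  tenantId: string;
  userId: string;
  entitiesFound: EntitySummary[];
  mode: string;
  inputLengthChars: number;
  processedAt: string;
}

// Groups spans by entity type. Only types and offsets are recorded, so the
// entry never contains the matched PII itself.
function buildAuditEntry(
  input: string,
  spans: Span[],
  ctx: { tenantId: string; userId: string; mode: string },
): AuditEntry {
  const byType = new Map<string, EntitySummary>();
  for (const span of spans) {
    let summary = byType.get(span.type);
    if (summary === undefined) {
      summary = { type: span.type, count: 0, positions: [] };
      byType.set(span.type, summary);
    }
    summary.count += 1;
    summary.positions.push([span.start, span.end]);
  }
  return {
    runId: randomUUID(),
    tenantId: ctx.tenantId,
    userId: ctx.userId,
    entitiesFound: Array.from(byType.values()),
    mode: ctx.mode,
    inputLengthChars: input.length,
    processedAt: new Date().toISOString(),
  };
}
```

Calling the builder with an empty span list still returns a complete entry with an empty `entitiesFound` array, so runs that find nothing can be logged like any other run.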

Permissions & safety

The redactor must run before any call to the external LLM API. Never pass raw user text containing PII to Claude/GPT/Gemini without this gate.
The audit log must be stored even when redaction finds zero entities (zero-finding logs help detect evasion).
Hash mode keys must be per-tenant, stored in Supabase Vault (not in code or .env files).
For bilingual AR/EN documents, run two passes: Arabic-aware patterns first, then English patterns.
GDPR, the UAE PDPL (Federal Decree-Law No. 45 of 2021), and Lebanon's Law No. 81 of 2018 on electronic transactions and personal data all expect documented evidence that PII is handled appropriately. The audit log provides that evidence for the redaction step.
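
The two-pass rule for bilingual documents can be sketched as an ordered list of detectors. Everything here is illustrative: `Detector`, `regexDetector`, and `detectInPasses` are hypothetical names, the honorific-based name patterns are toy examples, and the rule that an earlier pass wins on overlap is an assumption about how the passes should be merged.

```typescript
interface Span {
  type: string;
  start: number;
  end: number;
}

type Detector = (text: string) => Span[];

// Wraps a regex as a detector that reports match offsets.
function regexDetector(type: string, pattern: RegExp): Detector {
  return (text) => {
    const spans: Span[] = [];
    const re = new RegExp(pattern.source, "g");
    let m: RegExpExecArray | null;
    while ((m = re.exec(text)) !== null) {
      if (m[0].length === 0) { re.lastIndex += 1; continue; } // guard against empty matches
      spans.push({ type, start: m.index, end: m.index + m[0].length });
    }
    return spans;
  };
}

// Runs detectors in order. A span from an earlier pass wins over any
// overlapping span from a later pass.
function detectInPasses(text: string, passes: Detector[]): Span[] {
  const accepted: Span[] = [];
  for (const detect of passes) {
    for (const span of detect(text)) {
      const overlaps = accepted.some((a) => span.start < a.end && a.start < span.end);
      if (!overlaps) accepted.push(span);
    }
  }
  return accepted.sort((a, b) => a.start - b.start);
}

// Toy Arabic pass: the honorific "al-sayyid" (optionally feminine) plus one
// Arabic-script word. JavaScript's \b and \w are ASCII-based, so the pattern
// uses an explicit Arabic letter range instead.
const arabicPass = regexDetector("PERSON", /\u0627\u0644\u0633\u064A\u062F\u0629?\s[\u0621-\u064A]+/);
// Toy English pass: honorific plus one capitalized word.
const englishPass = regexDetector("PERSON", /\b(?:Mr|Mrs|Ms)\.\s[A-Z][a-z]+/);

const bilingualDetect = (text: string): Span[] => detectInPasses(text, [arabicPass, englishPass]);
```

Keeping the passes as plain functions makes it straightforward to add a French pass or swap a regex pass for an NER call without changing the merge logic.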

Failure modes

Failure	Impact	Mitigation
False negative (PII not detected)	PII sent to external model	Ensemble detection: regex + NER; review audit samples weekly
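False positive (non-PII redacted)	LLM loses useful context	Use typed tokens so the LLM can still infer the entity type; tighten regex patterns and NER confidence thresholds
Performance degradation	Chat latency spikes	Run async for large docs; on the request path, time out at 200 ms, log the event, and fail closed rather than forward unredacted text
Hash key rotation	Hashes from before and after rotation no longer match, breaking entity linkage	Version hash keys; store the key version alongside each hash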
Arabic diacritic variants	Names missed	Normalize Unicode before pattern matching
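
The Unicode normalization mitigation in the last row might look like the sketch below. `normalizeForMatching` and `ARABIC_MARKS` are hypothetical names, and the exact set of characters to strip is an assumption; `String.prototype.normalize` is the standard built-in.

```typescript
// Arabic short-vowel marks and related diacritics (U+064B to U+0652), the
// superscript alef (U+0670), and the tatweel elongation character (U+0640).
const ARABIC_MARKS = /[\u064B-\u0652\u0670\u0640]/g;

// Returns a form of the text in which spelling variants of the same Arabic
// name compare equal. NFKC folds presentation forms, which some PDF
// extractors emit, back into their base letters.
function normalizeForMatching(text: string): string {
  return text.normalize("NFKC").replace(ARABIC_MARKS, "");
}
```

Stripping characters shifts offsets, so either run detection and redaction on the normalized string or keep a mapping back to the original positions before writing them to the audit log.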

[[eng-tenant-isolation-row-level-security]] — RLS ensures redacted data is tenant-scoped in the database
[[eng-rag-chunking-rules-legal-docs]] — chunking runs after redaction in the document pipeline
[[eng-supabase-edge-functions-patterns]] — deployment pattern for the redactor as an Edge Function
[[safety-client-confidentiality-cross-tenant]] — policy that mandates this technical control