eng-pii-redaction-preprocessor
name: eng-pii-redaction-preprocessor
description: Use when building or configuring the PII redaction layer that sanitizes user-submitted text before it is sent to an LLM or stored in a database. Covers entity detection patterns, redaction strategies, audit logging, and reconstruction for legal AI pipelines where client-confidential data must never leak into training or third-party model calls.
license: MIT
metadata:
  id: eng.PII-redaction-preprocessor
  category: eng
  jurisdictions: [multi]
  priority: P2
  intent: [eng, pii, redaction, privacy, preprocessing]
  related: [eng-tenant-isolation-row-level-security, eng-supabase-edge-functions-patterns, eng-rag-chunking-rules-legal-docs, safety-client-confidentiality-cross-tenant]
  source: Louis — HAQQ Legal AI (github.com/sboghossian/mini-claude-for-legal)
  version: "1.0"
PII Redaction Preprocessor
What it does
The PII redaction preprocessor is a text-transformation pipeline stage that detects personally identifiable information (PII) in user-submitted legal content and removes or masks it before that content is forwarded to an external LLM API, stored in a search index, or written to a shared database table. In a multi-tenant legal AI product this is a critical safety control: a client of Firm A must never have their counterparty names, passport numbers, or contract figures embedded in a vector store that another tenant can retrieve.
The preprocessor runs synchronously in the request path (≤ 50 ms budget) or, for large document uploads, as an async pre-indexing job.
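
The split between the synchronous path and the async job can be expressed as a simple size check plus a timing wrapper. This is a minimal sketch, not the project's implementation: `chooseRoute`, `timed`, and `SYNC_CHAR_LIMIT` are hypothetical names, and the 20,000-character cutoff is a placeholder to be replaced by a value measured against the 50 ms budget.

```typescript
type RedactionRoute = "sync" | "async-job";

// Hypothetical cutoff (assumption): tune it by measuring redaction latency on real inputs.
const SYNC_CHAR_LIMIT = 20_000;

// Small inputs stay in the request path; large uploads go to the pre-indexing job.
function chooseRoute(text: string, limit: number = SYNC_CHAR_LIMIT): RedactionRoute {
  return text.length <= limit ? "sync" : "async-job";
}

// Wraps a synchronous call and reports elapsed wall-clock time,
// so runs that exceed the latency budget can be logged.
function timed<T>(fn: () => T): { result: T; elapsedMs: number } {
  const start = Date.now();
  const result = fn();
  return { result, elapsedMs: Date.now() - start };
}
```

A caller would compare `elapsedMs` against the budget and emit a warning when it is exceeded, rather than silently absorbing the latency.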
Setup / auth
No external auth required. The preprocessor is a pure-text pipeline component that can be deployed as:
- A Supabase Edge Function (functions/redact-pii/index.ts) invoked before any call to the LLM or embedding API.
- A middleware layer in the Express/Hono backend, mounted before the chat-route handler.
- A standalone Node.js worker for batch document ingestion.
Dependencies: @presidio/presidio-js (if using Microsoft Presidio via REST), or a custom regex+NER approach. For MENA content, a regex-first approach with Arabic-aware patterns is more reliable than English-trained NER models.
Capabilities
Entity types to detect and redact
| Category | Examples | Redaction token |
|---|---|---|
| Full name | "Ahmad Al-Rashidi", "Marie Dupont" | [PERSON] |
| National ID / Iqama / Emirates ID | 784-XXXX-XXXXXXX-X | [NATIONAL_ID] |
| Passport number | any 7–9 char alphanumeric | [PASSPORT] |
| Phone number | +961-X-XXX-XXXX, +971-5X-XXXXXXX | [PHONE] |
| Email address | RFC 5321 pattern | [EMAIL] |
| IBAN / bank account | LB62XXXX… | [FINANCIAL_ACCOUNT] |
| Company registration number | Lebanon: ش.م.م XXXXXX, UAE: CN-XXXXXXXX | [COMPANY_REG] |
| Address / property number | plot numbers, parcel IDs | [ADDRESS] |
| Date of birth | explicit DOB patterns | [DOB] |
| Contract monetary amounts | optional — flag rather than redact | [AMOUNT] |
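
A regex-first detector for a few of the categories in the table might look like the sketch below. The patterns are illustrative assumptions, not the project's actual rules: they cover only email, Emirates ID, international phone numbers, and IBAN-shaped strings, and they skip checksum validation. `Detection`, `PATTERNS`, and `detectEntities` are hypothetical names.

```typescript
interface Detection {
  type: string;
  start: number; // inclusive offset into the input
  end: number;   // exclusive offset into the input
  value: string;
}

// Illustrative patterns only (assumption); production rules need locale-specific tuning.
const PATTERNS: Array<{ type: string; regex: RegExp }> = [
  { type: "EMAIL", regex: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/ },
  { type: "NATIONAL_ID", regex: /\b784-\d{4}-\d{7}-\d\b/ }, // Emirates ID layout
  { type: "PHONE", regex: /\+\d{1,3}[- ]?\d{1,3}(?:[- ]?\d{3,7}){1,2}/ },
  { type: "FINANCIAL_ACCOUNT", regex: /\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b/ }, // IBAN shape, no checksum
];

function detectEntities(text: string): Detection[] {
  const found: Detection[] = [];
  for (const { type, regex } of PATTERNS) {
    // Build a fresh global regex per run so no lastIndex state is shared.
    const re = new RegExp(regex.source, "g");
    let m: RegExpExecArray | null;
    while ((m = re.exec(text)) !== null) {
      found.push({ type, start: m.index, end: m.index + m[0].length, value: m[0] });
    }
  }
  // Return detections in document order.
  return found.sort((a, b) => a.start - b.start);
}
```

Names and addresses are deliberately absent: they rarely follow a fixed shape, which is where an NER model or a curated dictionary would have to supplement the regex pass.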
Redaction modes
- Replace (default): substitute with typed token ([PERSON]). Preserves sentence structure for downstream LLM comprehension.
- Mask: overwrite with █ characters of equal length. Used for rendered PDFs.
- Hash: replace with a consistent HMAC-SHA256 keyed hash. Allows cross-document entity resolution without revealing the value. Use for analytics pipelines where re-identification must remain possible under a controlled key.
- Delete: remove entity and surrounding whitespace. Use only when the surrounding sentence still makes sense.
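
The four modes can be sketched as a single function over pre-detected spans. This is an illustration under stated assumptions: `applyRedaction`, `Span`, and the `[TYPE:digest]` hash token format are hypothetical, spans are assumed not to overlap, and the HMAC comes from Node's built-in `node:crypto` module.

```typescript
import { createHmac } from "node:crypto";

type Mode = "replace" | "mask" | "hash" | "delete";
interface Span { type: string; start: number; end: number } // end is exclusive

function applyRedaction(text: string, spans: Span[], mode: Mode, hashKey = ""): string {
  if (mode === "hash" && hashKey === "") {
    throw new Error("hash mode requires a per-tenant key");
  }
  // Work right to left so earlier offsets stay valid after each substitution.
  const ordered = spans.slice().sort((a, b) => b.start - a.start);
  let out = text;
  for (const s of ordered) {
    const value = out.slice(s.start, s.end);
    let token = ""; // "delete" keeps the empty string
    if (mode === "replace") {
      token = `[${s.type}]`;
    } else if (mode === "mask") {
      token = "█".repeat(value.length);
    } else if (mode === "hash") {
      // Same key + same value always yields the same digest.
      const digest = createHmac("sha256", hashKey).update(value).digest("hex");
      token = `[${s.type}:${digest}]`; // token format is an assumption
    }
    out = out.slice(0, s.start) + token + out.slice(s.end);
  }
  // Collapse the double space that deleting a mid-sentence entity leaves behind.
  return mode === "delete" ? out.replace(/ {2,}/g, " ").trim() : out;
}
```

Because the digest depends on the key, two tenants hashing the same name get different tokens, while one tenant gets the same token across documents.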
Usage patterns
Pattern 1 — Inline chat message preprocessing
// supabase/functions/chat/index.ts (simplified)
import { redactPII } from "../_shared/pii-redactor.ts";

// Inside the request handler: a Fetch API Request exposes its JSON body via req.json()
const { message: userMessage } = await req.json();
const { redacted, auditLog } = redactPII(userMessage, {
  mode: "replace",
  locale: detectLocale(userMessage), // "ar" | "en" | "fr"
});
// Forward only the redacted text to the LLM; store auditLog for compliance
const llmResponse = await callClaude(redacted);
Pattern 2 — Document ingestion before embedding
// Before chunking and embedding an uploaded contract PDF
const rawText = await extractTextFromPDF(file);
const { redacted, auditLog } = redactPII(rawText, { mode: "hash", hashKey: TENANT_HASH_KEY });
const chunks = chunkLegalDocument(redacted); // see eng-rag-chunking-rules-legal-docs
await embedAndStore(chunks, tenantId);
Pattern 3 — Audit trail
Every redaction run should emit a structured log entry:
{
"runId": "uuid",
"tenantId": "t_xxx",
"userId": "u_xxx",
"entitiesFound": [
{ "type": "PERSON", "count": 3, "positions": [[12, 24], [88, 102], [211, 225]] },
{ "type": "PHONE", "count": 1, "positions": [[400, 413]] }
],
"mode": "replace",
"inputLengthChars": 4820,
"processedAt": "2026-05-14T09:00:00Z"
}
Store each entry in the pii_audit_log table (append-only, tenant-scoped, RLS-protected).
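
An entry of this shape can be assembled from the detector output by grouping spans per entity type. The sketch below is an assumption about how that might be done: `buildAuditEntry` and its parameter names are hypothetical, and positions are taken to be `[start, end]` offsets with an exclusive end. Note that only types, counts, and offsets are recorded, never the detected values themselves.

```typescript
import { randomUUID } from "node:crypto";

interface Detection { type: string; start: number; end: number }
interface EntitySummary { type: string; count: number; positions: Array<[number, number]> }
interface AuditEntry {
  runId: string;
  tenantId: string;
  userId: string;
  entitiesFound: EntitySummary[];
  mode: string;
  inputLengthChars: number;
  processedAt: string;
}

function buildAuditEntry(
  input: string,
  detections: Detection[],
  ctx: { tenantId: string; userId: string; mode: string },
  now: Date = new Date(),
): AuditEntry {
  // Group detections by entity type, preserving first-seen order.
  const byType: Record<string, EntitySummary> = {};
  for (const d of detections) {
    if (!byType[d.type]) {
      byType[d.type] = { type: d.type, count: 0, positions: [] };
    }
    byType[d.type].count += 1;
    byType[d.type].positions.push([d.start, d.end]);
  }
  return {
    runId: randomUUID(),
    tenantId: ctx.tenantId,
    userId: ctx.userId,
    entitiesFound: Object.keys(byType).map((k) => byType[k]), // empty array when nothing was found
    mode: ctx.mode,
    inputLengthChars: input.length,
    processedAt: now.toISOString(),
  };
}
```

Calling it with an empty detection list still returns a complete entry, which is what makes zero-finding runs visible in the log.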
Permissions & safety
- The redactor must run before any call to the external LLM API. Never pass raw user text containing PII to Claude/GPT/Gemini without this gate.
- The audit log must be stored even when redaction finds zero entities (zero-finding logs help detect evasion).
- Hash mode keys must be per-tenant, stored in Supabase Vault (not in code or .env files).
- For bilingual AR/EN documents, run two passes: one with Arabic-aware patterns first, then English.
- GDPR, the UAE PDPL (Federal Decree-Law No. 45 of 2021), and Lebanon Law 81 on electronic transactions each require documented evidence that PII is handled appropriately. The audit log is that evidence.
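
The two-pass rule for bilingual documents can be sketched as a pipeline of passes applied in order. One concrete reason a separate Arabic-aware pass is needed: in JavaScript regular expressions `\d` matches only ASCII `0-9`, so numbers written with Arabic-Indic digits (U+0660 to U+0669) slip past English-oriented patterns. The names `arabicPass`, `englishPass`, and `redactBilingual`, and the simplified 7 to 15 digit phone pattern, are assumptions for illustration only.

```typescript
type Pass = (text: string) => string;

// Hypothetical Arabic-aware pass: phone-like runs of Arabic-Indic digits.
const arabicPass: Pass = (t) => t.replace(/\+?[\u0660-\u0669]{7,15}/g, "[PHONE]");

// Hypothetical English/Latin pass: phone-like runs of ASCII digits.
const englishPass: Pass = (t) => t.replace(/\+?\d{7,15}/g, "[PHONE]");

// Runs the passes left to right; each pass sees the output of the previous one.
function redactBilingual(text: string, passes: Pass[] = [arabicPass, englishPass]): string {
  return passes.reduce((acc, pass) => pass(acc), text);
}
```

Because each pass only inserts bracketed tokens, the second pass does not re-match what the first one already redacted.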
Failure modes
| Failure | Impact | Mitigation |
|---|---|---|
| False negative (PII not detected) | PII sent to external model | Ensemble detection: regex + NER; review audit samples weekly |
| False positive (non-PII redacted) | LLM loses useful context | Use typed tokens so LLM infers entity type; tighten regex patterns and tune NER confidence thresholds |
| Performance degradation | Chat latency spikes | Run async for large docs; timeout at 200 ms and log |
| Hash key rotation | Re-identification broken | Version hash keys; store version alongside hash |
| Arabic diacritic variants | Names missed | Normalize Unicode before pattern matching |
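
The Unicode normalization mitigation in the last row could be sketched as follows. This is one possible approach, not a confirmed part of the project: `normalizeArabic` is a hypothetical helper, and folding alef variants to bare alef is an aggressive choice that should be validated against real names. Because normalization changes string length, offsets found on the normalized text do not map directly onto the original.

```typescript
// Normalizes Arabic text before pattern matching (sketch under stated assumptions).
function normalizeArabic(text: string): string {
  return text
    .normalize("NFKC")                            // canonical + compatibility composition
    .replace(/[\u064B-\u0652\u0670]/g, "")        // strip harakat and superscript alef
    .replace(/\u0640/g, "")                       // strip tatweel (kashida)
    .replace(/[\u0622\u0623\u0625]/g, "\u0627");  // fold alef variants to bare alef
}
```

With this in place, a name written with and without vowel marks, or with different alef forms, reduces to the same string before the detector runs.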
Related skills
- [[eng-tenant-isolation-row-level-security]] — RLS ensures redacted data is tenant-scoped in the database
- [[eng-rag-chunking-rules-legal-docs]] — chunking runs after redaction in the document pipeline
- [[eng-supabase-edge-functions-patterns]] — deployment pattern for the redactor as an Edge Function
- [[safety-client-confidentiality-cross-tenant]] — policy that mandates this technical control