import-pdf-processing-anthropic

Category: Documents · Risk: High · Repository: sboghossian/mini-claude-for-legal · License: MIT

network_access, credential_access

name: import-pdf-processing-anthropic
description: Use when migrating a PDF document-processing skill originally built for the Anthropic Claude API into the mini-claude-for-legal format. The adapter maps legacy PDF extraction configuration — text layer parsing, OCR fallback, table extraction, form field reading, and annotation handling — into the standard skill model. Critical for legal workflows involving court filings, regulatory submissions, notarised documents, and scanned contracts.
license: MIT
metadata:
  id: import.pdf-processing-anthropic
  category: import
  jurisdictions: [multi]
  priority: P3
  intent: [import, pdf, document-processing, migration, anthropic]
  related: [import-docx-processing-anthropic, import-pptx-processing-anthropic, import-contract-review-anthropic, multimodal-document-ingestion]
  source: Louis — HAQQ Legal AI (github.com/sboghossian/mini-claude-for-legal)
  version: "1.0"

Import: PDF Processing (Anthropic)

What it does

This import adapter migrates a PDF document-processing skill originally built for Anthropic's Claude API into the mini-claude-for-legal standard format. PDF is the dominant format for legal documents globally: court filings, notarised instruments, regulatory submissions, signed contracts, government-issued licences, and certified translations all arrive as PDFs.

Unlike DOCX, PDFs may or may not have a selectable text layer. Scanned documents require OCR; encrypted PDFs require a password; form PDFs have interactive fields; certified documents may have a digital signature that must be preserved. The Anthropic-native skill may have used Claude's native PDF vision capability, a dedicated extraction library, or both.

Import config

| Field | Source mapping | Default if absent |
| --- | --- | --- |
| extraction_mode | Legacy mode | text_first_ocr_fallback |
| ocr_engine | Legacy ocr | auto (system default) |
| tables | Legacy extract_tables boolean | true |
| form_fields | Legacy extract_forms boolean | true |
| annotations | Legacy extract_annotations boolean | true |
| digital_signature_check | Legacy check_signature boolean | true |
| language | Legacy lang | auto-detect |
| rtl_support | Legacy rtl boolean | true (for MENA) |
| chunk_size | Legacy chunk_tokens | 2000 |
| overlap | Legacy overlap_tokens | 200 |
| output_format | Legacy format | markdown |
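
The field mapping above can be sketched as a small lookup table with defaults. This is a minimal illustration, not the adapter's actual code: the helper name `map_legacy_config`, the use of a plain dict for the legacy config, and the check that overlap is smaller than the chunk size are all assumptions.

```python
from typing import Any

# (new field, legacy key, default if absent), copied from the import config table.
FIELD_MAP = [
    ("extraction_mode", "mode", "text_first_ocr_fallback"),
    ("ocr_engine", "ocr", "auto"),
    ("tables", "extract_tables", True),
    ("form_fields", "extract_forms", True),
    ("annotations", "extract_annotations", True),
    ("digital_signature_check", "check_signature", True),
    ("language", "lang", "auto-detect"),
    ("rtl_support", "rtl", True),
    ("chunk_size", "chunk_tokens", 2000),
    ("overlap", "overlap_tokens", 200),
    ("output_format", "format", "markdown"),
]


def map_legacy_config(legacy: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: translate a legacy config dict into the new field names."""
    mapped = {}
    for new_key, legacy_key, default in FIELD_MAP:
        # A missing key falls back to the default; an explicit False is preserved.
        mapped[new_key] = legacy.get(legacy_key, default)
    # Assumed sanity check (not stated in the source): overlap must fit inside a chunk.
    if mapped["overlap"] >= mapped["chunk_size"]:
        raise ValueError("overlap must be smaller than chunk_size")
    return mapped
```

For example, `map_legacy_config({"mode": "ocr_only", "rtl": False})` keeps the two explicit values and fills the other nine fields from the defaults column.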

Dry-run preview

IMPORT PREVIEW — pdf-processing-anthropic
Source shape     : Anthropic PDF extraction config
Mode             : text_first_ocr_fallback
OCR              : auto
Tables           : extracted
Form fields      : extracted
Annotations      : extracted
Digital signature: checked (not verified cryptographically — visual only)
RTL              : enabled (Arabic/Hebrew support)
Chunk            : 2000 tokens / 200 overlap
Output           : markdown
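
A preview like the one above could be produced from the mapped config with a few lines of formatting. The sketch below is illustrative only: `render_preview_rows` is a hypothetical name, the header line is left to the caller, the digital-signature wording is abbreviated, and the wording for disabled options ("skipped", "not checked", "disabled") is an assumption, since the source shows only the enabled case.

```python
def render_preview_rows(cfg: dict) -> str:
    """Hypothetical renderer: one aligned 'Label: value' row per import setting."""

    def state(enabled: bool, on: str, off: str) -> str:
        return on if enabled else off

    rows = [
        ("Source shape", "Anthropic PDF extraction config"),
        ("Mode", cfg["extraction_mode"]),
        ("OCR", cfg["ocr_engine"]),
        ("Tables", state(cfg["tables"], "extracted", "skipped")),
        ("Form fields", state(cfg["form_fields"], "extracted", "skipped")),
        ("Annotations", state(cfg["annotations"], "extracted", "skipped")),
        ("Digital signature",
         state(cfg["digital_signature_check"], "checked (visual only)", "not checked")),
        ("RTL", state(cfg["rtl_support"], "enabled (Arabic/Hebrew support)", "disabled")),
        ("Chunk", f"{cfg['chunk_size']} tokens / {cfg['overlap']} overlap"),
        ("Output", cfg["output_format"]),
    ]
    # Pad every label to the longest one so the colons line up.
    width = max(len(label) for label, _ in rows)
    return "\n".join(f"{label:<{width}}: {value}" for label, value in rows)
```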

Extraction pipeline (post-import)

  1. Detect PDF type:

    • Native (selectable text) → direct text extraction
    • Scanned image → OCR pipeline
    • Mixed (some pages scanned, some native) → page-by-page detection
    • Encrypted → request password; log attempt; never store password
  2. Page-by-page extraction: maintain page-number metadata for every extracted element; legal documents frequently reference "page X, clause Y".

  3. Table extraction: detect table boundaries; serialize to markdown pipe tables; flag tables with merged cells as [COMPLEX TABLE — manual verification recommended].

  4. Form field extraction: identify interactive PDF form fields; extract field names and values; flag unsigned or blank required fields.

  5. Annotation extraction: extract highlighted text, comments, and sticky notes; attribute to author if metadata available; append as footnote block.

  6. Digital signature check: detect presence of digital signature (visual check only — not cryptographic validation); flag signed documents separately.

  7. RTL handling: detect right-to-left text (Arabic, Hebrew); preserve paragraph ordering; ensure column order in tables reflects RTL layout.

  8. Chunking: split by token count with overlap; preserve clause boundaries where possible.
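
Steps 1 and 8 of the pipeline above can be sketched in plain Python. This is a simplified illustration under stated assumptions: the per-page character counts are taken as given (how they are obtained depends on the extraction library in use), the `min_chars` threshold is arbitrary, tokens are a plain list of strings standing in for a real tokenizer, and clause-boundary preservation is not attempted. Both function names are hypothetical.

```python
def classify_pdf(page_text_lengths: list[int], encrypted: bool = False,
                 min_chars: int = 20) -> str:
    """Classify a PDF as native, scanned, mixed, or encrypted (step 1).

    page_text_lengths holds the number of characters found in each page's
    text layer; min_chars is an assumed threshold, not a value from the source.
    """
    if encrypted:
        return "encrypted"
    if not page_text_lengths:
        raise ValueError("document has no pages")
    has_text = [n >= min_chars for n in page_text_lengths]
    if all(has_text):
        return "native"
    if not any(has_text):
        return "scanned"
    return "mixed"  # fall back to page-by-page detection


def chunk_tokens(tokens: list[str], chunk_size: int = 2000,
                 overlap: int = 200) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens (step 8)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks
```

With the default settings each chunk starts 1800 tokens after the previous one, so the last 200 tokens of one chunk are repeated at the start of the next.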

| Document type | Special requirement |
| --- | --- |
| Court filing (Lebanon / UAE) | Page/paragraph numbering must be preserved; filing stamps and seals should be flagged |
| Notarised document | Notary signature block and certification text must be extracted intact |
| Certified translation | Translator certification page must be preserved as a separate block |
| Government licence / certificate | Issue date, expiry date, and licence number should be extracted as structured metadata |
| Board resolution / POA | Signatory block and witness/notary endorsement are legally significant |
| Regulatory submission | Submission reference number and date should be extracted as metadata |

Arabic / bilingual PDF considerations

PDFs from MENA jurisdictions frequently mix Arabic (RTL) and English (LTR) text:

  • Set rtl_support: true to handle RTL paragraphs
  • In bilingual contracts, Arabic is often the prevailing language (for example, in UAE and KSA government contracts); record which language prevails and flag it in the output
  • OCR quality for Arabic varies by font and scan quality; flag low-confidence passages
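
One way to detect RTL paragraphs in a bilingual document is to count characters by their Unicode bidirectional class, which Python's standard `unicodedata` module exposes. The sketch below is a rough heuristic rather than the adapter's method: the function names are hypothetical, and a simple majority vote over strong characters is an assumption that may misjudge short or heavily mixed paragraphs.

```python
import unicodedata


def dominant_direction(text: str) -> str:
    """Return 'rtl', 'ltr', or 'neutral' based on strong bidirectional characters."""
    rtl = ltr = 0
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):   # Hebrew letters are R, Arabic letters are AL
            rtl += 1
        elif bidi == "L":         # Latin and other left-to-right letters
            ltr += 1
    if rtl == 0 and ltr == 0:
        return "neutral"          # digits, punctuation, or whitespace only
    return "rtl" if rtl > ltr else "ltr"


def tag_paragraphs(paragraphs: list[str]) -> list[tuple[str, str]]:
    """Pair each paragraph with its detected direction, keeping document order."""
    return [(dominant_direction(p), p) for p in paragraphs]
```

Tagging paragraph by paragraph keeps the original ordering intact, so Arabic and English clauses of a bilingual contract stay side by side in the output.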

Failure modes

| Error | Likely cause | Resolution |
| --- | --- | --- |
| no_text_layer | Scanned PDF without OCR | Switch to OCR mode; warn user of quality risk |
| password_protected | Encrypted PDF | Prompt for password; never log or store it |
| rtl_reversal | RTL paragraphs extracted as LTR | Ensure rtl_support: true |
| table_parse_fail | Complex merged-cell tables | Flag for manual review |
| signature_not_found | Digital signature present but not detected | Flag as [SIGNATURE STATUS UNKNOWN] |
| arabic_ocr_low_quality | Poor scan of Arabic text | Warn user; recommend re-scan at 300 DPI minimum |
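
The failure-mode table lends itself to a simple dispatch map. The following is a minimal sketch, assuming the error codes arrive as plain strings: `handle_failure`, the logger name, and the fallback message for unknown codes are all hypothetical, and the resolution strings paraphrase the table.

```python
import logging

logger = logging.getLogger("pdf_import")  # hypothetical logger name

RESOLUTIONS = {
    "no_text_layer": "Switch to OCR mode; warn user of quality risk",
    "password_protected": "Prompt for password; never log or store it",
    "rtl_reversal": "Ensure rtl_support: true",
    "table_parse_fail": "Flag for manual review",
    "signature_not_found": "Flag as [SIGNATURE STATUS UNKNOWN]",
    "arabic_ocr_low_quality": "Warn user; recommend re-scan at 300 DPI minimum",
}


def handle_failure(error_code: str, filename: str) -> str:
    """Hypothetical handler: log the failure and return the suggested resolution."""
    # Assumed fallback for codes that are not in the table.
    resolution = RESOLUTIONS.get(error_code, "Unrecognised error; flag for manual review")
    # Only the file name and error code are logged; a password must never reach this call.
    logger.warning("%s: %s -> %s", filename, error_code, resolution)
    return resolution
```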

Related skills

  • [[import-docx-processing-anthropic]]
  • [[import-pptx-processing-anthropic]]
  • [[import-contract-review-anthropic]]
  • [[multimodal-document-ingestion]]
  • [[review-contract-generic]]