import-pdf-processing-anthropic

Category: Documents · Risk: High risk · Rating: 3.9/5 (8) · sboghossian/mini-claude-for-legal · MIT

Rating is derived from the repo's GitHub stars and shown for reference.

network_access, credential_access


name: import-pdf-processing-anthropic
description: Use when migrating a PDF document-processing skill originally built for the Anthropic Claude API into the mini-claude-for-legal format. The adapter maps legacy PDF extraction configuration — text layer parsing, OCR fallback, table extraction, form field reading, and annotation handling — into the standard skill model. Critical for legal workflows involving court filings, regulatory submissions, notarised documents, and scanned contracts.
license: MIT
metadata:
  id: import.pdf-processing-anthropic
  category: import
  jurisdictions: [multi]
  priority: P3
  intent: [import, pdf, document-processing, migration, anthropic]
  related: [import-docx-processing-anthropic, import-pptx-processing-anthropic, import-contract-review-anthropic, multimodal-document-ingestion]
  source: Louis — HAQQ Legal AI (github.com/sboghossian/mini-claude-for-legal)
  version: "1.0"

Import: PDF Processing (Anthropic)

What it does

This import adapter migrates a PDF document-processing skill originally built for Anthropic's Claude API into the mini-claude-for-legal standard format. PDF is the dominant format for legal documents globally: court filings, notarised instruments, regulatory submissions, signed contracts, government-issued licences, and certified translations all arrive as PDFs.

Unlike DOCX, PDFs may or may not have a selectable text layer. Scanned documents require OCR; encrypted PDFs require a password; form PDFs have interactive fields; certified documents may have a digital signature that must be preserved. The Anthropic-native skill may have used Claude's native PDF vision capability, a dedicated extraction library, or both.

Import config

| Field | Source mapping | Default if absent |
| --- | --- | --- |
| `extraction_mode` | Legacy `mode` | `text_first_ocr_fallback` |
| `ocr_engine` | Legacy `ocr` | `auto` (system default) |
| `tables` | Legacy `extract_tables` boolean | `true` |
| `form_fields` | Legacy `extract_forms` boolean | `true` |
| `annotations` | Legacy `extract_annotations` boolean | `true` |
| `digital_signature_check` | Legacy `check_signature` boolean | `true` |
| `language` | Legacy `lang` | `auto-detect` |
| `rtl_support` | Legacy `rtl` boolean | `true` (for MENA) |
| `chunk_size` | Legacy `chunk_tokens` | `2000` |
| `overlap` | Legacy `overlap_tokens` | `200` |
| `output_format` | Legacy `format` | `markdown` |
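The mapping above amounts to a key rename with per-field defaults. The following is a minimal sketch of that idea, not the adapter's actual code: `FIELD_MAP` and `map_legacy_config` are hypothetical names, and the legacy config is assumed to arrive as a flat dictionary.

```python
from typing import Any

# (new field, legacy key, default if absent), copied from the import config table.
FIELD_MAP = [
    ("extraction_mode", "mode", "text_first_ocr_fallback"),
    ("ocr_engine", "ocr", "auto"),
    ("tables", "extract_tables", True),
    ("form_fields", "extract_forms", True),
    ("annotations", "extract_annotations", True),
    ("digital_signature_check", "check_signature", True),
    ("language", "lang", "auto-detect"),
    ("rtl_support", "rtl", True),
    ("chunk_size", "chunk_tokens", 2000),
    ("overlap", "overlap_tokens", 200),
    ("output_format", "format", "markdown"),
]


def map_legacy_config(legacy: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: rename each legacy key, using the default when it is absent."""
    return {new: legacy.get(old, default) for new, old, default in FIELD_MAP}


# An empty legacy config yields the defaults shown in the dry-run preview.
defaults = map_legacy_config({})
```

Because `dict.get` only falls back when the key is missing, an explicit legacy value of `False` (for example `rtl: false`) is preserved rather than overwritten by the default.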

Dry-run preview

IMPORT PREVIEW — pdf-processing-anthropic
Source shape     : Anthropic PDF extraction config
Mode             : text_first_ocr_fallback
OCR              : auto
Tables           : extracted
Form fields      : extracted
Annotations      : extracted
Digital signature: checked (not verified cryptographically — visual only)
RTL              : enabled (Arabic/Hebrew support)
Chunk            : 2000 tokens / 200 overlap
Output           : markdown

Extraction pipeline (post-import)

1. Detect PDF type:
   - Native (selectable text) → direct text extraction
   - Scanned image → OCR pipeline
   - Mixed (some pages scanned, some native) → page-by-page detection
   - Encrypted → request password; log attempt; never store password
2. Page-by-page extraction: maintain page-number metadata for every extracted element; legal documents frequently reference "page X, clause Y".
3. Table extraction: detect table boundaries; serialize to markdown pipe tables; flag tables with merged cells as [COMPLEX TABLE — manual verification recommended].
4. Form field extraction: identify interactive PDF form fields; extract field names and values; flag unsigned or blank required fields.
5. Annotation extraction: extract highlighted text, comments, and sticky notes; attribute to author if metadata available; append as footnote block.
6. Digital signature check: detect presence of digital signature (visual check only, not cryptographic validation); flag signed documents separately.
7. RTL handling: detect right-to-left text (Arabic, Hebrew); preserve paragraph ordering; ensure column order in tables reflects RTL layout.
8. Chunking: split by token count with overlap; preserve clause boundaries where possible.
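Two of the steps above, PDF type detection and chunking, can be sketched in a few lines. This is an illustration under stated assumptions, not the adapter's implementation: `classify_pdf`, `chunk_tokens`, and the `min_chars` threshold are hypothetical, the per-page character counts are assumed to come from whichever extraction library is in use, and a plain list of strings stands in for real tokenizer output. The sketch does not attempt to preserve clause boundaries.

```python
def classify_pdf(page_char_counts: list[int], encrypted: bool = False,
                 min_chars: int = 20) -> str:
    """Classify a PDF from the count of selectable characters on each page.

    min_chars is an assumed threshold: pages below it are treated as scanned.
    """
    if encrypted:
        return "encrypted"
    if not page_char_counts:
        raise ValueError("document has no pages")
    has_text = [count >= min_chars for count in page_char_counts]
    if all(has_text):
        return "native"
    if not any(has_text):
        return "scanned"
    return "mixed"  # decide text extraction vs OCR page by page


def chunk_tokens(tokens: list[str], size: int = 2000,
                 overlap: int = 200) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks
```

With the default `chunk_size` of 2000 and `overlap` of 200, each chunk starts 1800 tokens after the previous one.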

Legal document types and special handling

| Document type | Special requirement |
| --- | --- |
| Court filing (Lebanon / UAE) | Page/paragraph numbering must be preserved; filing stamps and seals should be flagged |
| Notarised document | Notary signature block and certification text must be extracted intact |
| Certified translation | Translator certification page must be preserved as a separate block |
| Government licence / certificate | Issue date, expiry date, and licence number should be extracted as structured metadata |
| Board resolution / POA | Signatory block and witness/notary endorsement are legally significant |
| Regulatory submission | Submission reference number and date should be extracted as metadata |

Arabic / bilingual PDF considerations

PDFs from MENA jurisdictions frequently mix Arabic (RTL) and English (LTR) text:

- Set `rtl_support: true` to handle RTL paragraphs.
- In bilingual contracts, Arabic is often the prevailing language (for example in UAE and KSA government contracts); record which language prevails.
- OCR quality for Arabic varies by font and scan quality; flag low-confidence passages.
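Detecting whether a paragraph is RTL or LTR can be approximated with the Unicode bidirectional class of each character, available in Python's standard `unicodedata` module. The sketch below is a simple majority heuristic rather than the full Unicode Bidirectional Algorithm, and `paragraph_direction` is a hypothetical helper, not part of the adapter.

```python
import unicodedata


def paragraph_direction(text: str) -> str:
    """Return "rtl", "ltr", or "neutral" from a count of strong characters.

    Bidi classes "R" (e.g. Hebrew) and "AL" (Arabic letters) count as RTL,
    class "L" counts as LTR; digits, spaces, and punctuation are ignored.
    """
    rtl = sum(1 for ch in text if unicodedata.bidirectional(ch) in ("R", "AL"))
    ltr = sum(1 for ch in text if unicodedata.bidirectional(ch) == "L")
    if rtl == 0 and ltr == 0:
        return "neutral"
    # Ties resolve to "rtl"; this is an arbitrary choice for the sketch.
    return "rtl" if rtl >= ltr else "ltr"
```

Running this per paragraph in a bilingual contract separates the Arabic and English blocks, so each block can be tagged with its direction before chunking.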

Failure modes

| Error | Likely cause | Resolution |
| --- | --- | --- |
| `no_text_layer` | Scanned PDF without OCR | Switch to OCR mode; warn user of quality risk |
| `password_protected` | Encrypted PDF | Prompt for password; never log or store the password itself |
| `rtl_reversal` | RTL paragraphs extracted as LTR | Ensure `rtl_support: true` |
| `table_parse_fail` | Complex merged-cell tables | Flag for manual review |
| `signature_not_found` | Digital signature present but not detected | Flag as `[SIGNATURE STATUS UNKNOWN]` |
| `arabic_ocr_low_quality` | Poor scan of Arabic text | Warn user; recommend re-scan at 300 DPI minimum |

[[import-docx-processing-anthropic]]
[[import-pptx-processing-anthropic]]
[[import-contract-review-anthropic]]
[[multimodal-document-ingestion]]
[[review-contract-generic]]