eng-context-cache-key-design

Category: Design Risk: Medium risk ★ 3.9 · Rating 3.9/5 (8) sboghossian/mini-claude-for-legal MIT

Rating is derived from the repo's GitHub stars and shown for reference.

network_access

Download zip View source

name: eng-context-cache-key-design
description: Use when designing the context-caching layer for a legal AI product built on Claude or similar LLMs. Defines how to construct cache keys that maximize prefix-cache hit rates, how to partition context by scope (system prompt, skill, matter, user), how to handle cache invalidation when legal content changes, and how to measure cache efficiency. Engineering skill with significant cost and latency implications for legal AI deployments.
license: MIT
metadata:
id: eng.context-cache-key-design
category: eng
jurisdictions: [multi]
priority: P2
intent: [context-cache, prompt-cache, performance, cost-reduction, LLM]
related:
- eng-cost-per-message-tracker
- eng-latency-slo-by-skill
- eng-fallback-model-cascade
- eng-embedding-model-choice-legal
source: Louis — HAQQ Legal AI (github.com/sboghossian/mini-claude-for-legal)
version: "1.0"

Context Cache Key Design

What it does

Context (prompt) caching allows an LLM provider to reuse the KV-cache for a prefix of the prompt that has been seen before, avoiding recomputation. For Anthropic's Claude, prompt caching is available via the cache_control parameter; tokens in cached prefixes are charged at a lower rate and processed faster.

In a legal AI product with large system prompts, extensive skill libraries, and per-matter context documents, caching is a primary lever for:

Cost reduction: cached input tokens are typically priced at ~10% of uncached input token cost.
Latency reduction: cache hits skip the prefill computation for cached tokens.
Predictability: stable prefixes ensure consistent latency SLOs by skill.

Context layers in a legal AI product

A typical request context in a legal AI platform has four layers, ordered from most-stable (best to cache) to least-stable (cannot cache):

Layer 1: System prompt (MOST STABLE)
  — Firm identity, core behavior rules, safety guardrails
  — Changes: at deployment; on firm onboarding
  — Cache lifetime: days to weeks

Layer 2: Skill content (STABLE)
  — Loaded SKILL.md files for the current skill set
  — Changes: on skill version updates
  — Cache lifetime: hours to days

Layer 3: Matter context (SEMI-STABLE)
  — Engagement letter, matter summary, key documents for this matter
  — Changes: as matter progresses
  — Cache lifetime: minutes to hours (within a session)

Layer 4: User turn (EPHEMERAL)
  — The specific user message in this request
  — Changes: every request
  — Cache lifetime: not cached

Cache key construction

A cache key identifies a unique, cacheable prefix. The key must encode all dimensions that affect the prefix content.

Layer 1 — system prompt key

cache_key = SHA256(
  org_id +
  system_prompt_version +
  firm_profile_hash
)

The system prompt should be stored as a versioned artifact. Any change to the system prompt (including firm profile data) must increment the version, invalidating the layer-1 cache.

Layer 2 — skill content key

cache_key = SHA256(
  org_id +
  system_prompt_version +           # includes layer 1
  skill_set_hash                    # hash of all loaded skill IDs + versions
)

The skill set hash is computed from the sorted list of {skill_id}:{version} pairs loaded for this request. If the same skill set is always loaded, this key is stable and the cache hit rate is high. Avoid per-request dynamic skill loading — it destroys layer-2 cache efficiency.

Layer 3 — matter context key

cache_key = SHA256(
  org_id +
  system_prompt_version +
  skill_set_hash +
  matter_id +
  matter_context_version            # increments when matter documents change
)

Matter context documents (engagement letter, key filings, current draft under review) change as the matter progresses. The matter_context_version is a monotonic counter managed by the eFirm system, incremented each time any matter-context document changes.

Layer 4 — user turn

Not cached. The user message is always appended fresh.

Prefix ordering rule

The cache prefix must be assembled in strict layer order (1 → 2 → 3 → 4). The provider's cache only hits if the prefix is an exact byte-for-byte match of a previously seen prefix. Any reordering of tokens, even if semantically equivalent, is a cache miss.

Final prompt = [Layer 1: system] + [Layer 2: skills] + [Layer 3: matter] + [Layer 4: user]

Always construct in this order. Never interleave layers.

Cache invalidation triggers

Event	Layer invalidated	Action
System prompt update	L1, L2, L3	Increment system_prompt_version; all caches must be rebuilt
Skill version update	L2, L3	Increment skill version in the skill registry
New skill loaded for org	L2, L3	Recompute skill_set_hash
Matter document updated	L3 only	Increment matter_context_version for that matter
Firm profile changed	L1, L2, L3	Treat as system prompt update

Measuring cache efficiency

Track per-request:

Metric	Description	Target
`cache_hit_rate_l1`	% of requests where Layer 1 was a cache hit	>95%
`cache_hit_rate_l2`	% of requests where Layer 2 was a cache hit	>80%
`cache_hit_rate_l3`	% of requests where Layer 3 was a cache hit	>40% (varies by matter activity level)
`input_tokens_billed_uncached`	Tokens billed at full price	Minimize
`input_tokens_billed_cached`	Tokens billed at reduced price	Maximize
`cache_miss_cost_spike`	Request where all three layers are misses	Alert if > [threshold] in an hour

These metrics feed into [[eng-cost-per-message-tracker]].

Legal product specifics

Privilege isolation: matter context (Layer 3) is per-matter and per-client. Never share a Layer-3 cache prefix across different clients. Enforce tenant isolation at the cache key level — org_id must be in every key.
Confidentiality on cache servers: if using Anthropic's prompt caching, review the data-processing terms. For highly sensitive matters, consider whether to opt into caching at all (accept the cost/latency trade-off for maximum data isolation).
Skill library size: legal AI products load many skills. If the total skill content exceeds the provider's max cacheable prefix (e.g., ~200K tokens for Claude), prioritize which skills are in the "always loaded" set vs. loaded on demand. Always-loaded skills benefit from the Layer-2 cache; on-demand skills are always cache misses at Layer 2.

[[eng-cost-per-message-tracker]]
[[eng-latency-slo-by-skill]]
[[eng-fallback-model-cascade]]
[[eng-embedding-model-choice-legal]]