eng-fallback-model-cascade
name: eng-fallback-model-cascade
description: Use when designing the model-selection and fallback logic for a legal AI product — defining which model to use for which skill tier, how to cascade to a cheaper or faster model when the primary model is unavailable or over budget, and how to handle failures gracefully without exposing errors to legal practitioners. Engineering skill with direct impact on availability SLOs and cost management.
license: MIT
metadata:
id: eng.fallback-model-cascade
category: eng
jurisdictions: [multi]
priority: P2
intent: [fallback, model-selection, availability, resilience, cascade]
related:
- eng-latency-slo-by-skill
- eng-cost-per-message-tracker
- eng-context-cache-key-design
- eng-feature-flag-rollout-skills
source: Louis — HAQQ Legal AI (github.com/sboghossian/mini-claude-for-legal)
version: "1.0"
Fallback Model Cascade
What it does
A fallback model cascade is the routing logic that selects which LLM to use for a given request and what to do when the preferred model is unavailable, slow, or over the cost budget. In a legal AI product, where lawyers depend on reliable responses for deadline-critical work, a cascade must:
- Serve the best model for the skill tier (heavier skills get the capable model; lightweight skills can run on a faster, cheaper model).
- Degrade gracefully when the primary model returns a 529 (overload), 500, or timeout.
- Stay within cost budgets if a tenant is on a metered plan.
- Be transparent to the user when a fallback occurred — a legal practitioner may care that a "draft NDA" was produced by a less capable model.
Model tier definitions
For a Claude-based legal AI product, define three tiers:
| Tier | Models (examples) | Use case |
|---|---|---|
| Tier 1 (Primary) | claude-opus-4, claude-sonnet-4-6 | Complex drafting, legal analysis, multi-step reasoning, P0 skills |
| Tier 2 (Secondary) | claude-haiku-3-5, claude-sonnet-3-7 | Shorter form content, classification, routing, summarization |
| Tier 3 (Minimal) | Fast inference models, local models | Intent classification, short Q&A, health checks |
Assign each skill a minimum tier:
- efirm-conflict-check, efirm-engagement-letter-draft: Tier 1 minimum (legal accuracy critical).
- efirm-client-update-email-draft, efirm-deadline-tracker: Tier 2 acceptable.
- Routing/classification skills: Tier 3 acceptable.
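The tier assignment above can be sketched as two lookup tables and a check. This is a minimal illustration, not the product's actual registry: the dictionary names, the `model_allowed` helper, and the fail-safe defaults for unknown entries are all assumptions.

```python
# Hypothetical registry: skill id -> minimum acceptable tier (1 = most capable).
SKILL_MIN_TIER = {
    "efirm-conflict-check": 1,
    "efirm-engagement-letter-draft": 1,
    "efirm-client-update-email-draft": 2,
    "efirm-deadline-tracker": 2,
}

# Hypothetical registry: example model name -> tier, copied from the tier table.
MODEL_TIER = {
    "claude-opus-4": 1,
    "claude-sonnet-4-6": 1,
    "claude-haiku-3-5": 2,
    "claude-sonnet-3-7": 2,
}


def model_allowed(skill_id: str, model: str) -> bool:
    """True if the model's tier is at least as capable as the skill requires."""
    # Assumption: an unregistered skill is treated as Tier 1 so it fails safe.
    min_tier = SKILL_MIN_TIER.get(skill_id, 1)
    # Assumption: an unregistered model is treated as Tier 3 (least capable).
    return MODEL_TIER.get(model, 3) <= min_tier
```

With these tables, a Tier 2 model is rejected for the conflict check but accepted for the deadline tracker, and a Tier 1 model is accepted for both.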
Cascade sequence
For a given request:
1. Determine minimum_tier from skill configuration
2. Try primary_model for that tier
   → On success: use response
   → On 429 (rate limit): wait retry_delay; retry once; then cascade to next model
   → On 529 (overload): immediately cascade (no wait)
   → On 500: retry once after 2s (per the Retry policy table); then cascade
   → On 503: immediately cascade
   → On timeout (> latency_slo_ms + 2000): cascade
3. Try secondary_model (if tier permits)
   → Same error handling
4. If all models fail: return structured error (see below)
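A minimal Python sketch of this loop follows. It assumes a caller-supplied `call_model` function that enforces its own timeout and raises a hypothetical `ModelError` carrying the status code; neither name comes from a real SDK. Which statuses earn a single retry is a parameter, and the default here follows the Retry policy table (429 and 500).

```python
import time
from typing import Callable, Optional


class ModelError(Exception):
    """Hypothetical error carrying the HTTP-style status of a failed call."""

    def __init__(self, status: int):
        super().__init__(f"model call failed with status {status}")
        self.status = status


# Statuses that get one retry on the same model, with the wait in seconds.
# Assumption: values taken from the Retry policy table (60s is the 429 default).
RETRY_ONCE_WAIT = {429: 60.0, 500: 2.0}


def run_cascade(
    models: list[str],
    call_model: Callable[[str], str],
    retry_wait: dict[int, float] = RETRY_ONCE_WAIT,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[Optional[str], Optional[str], list[tuple[str, int]]]:
    """Try each model in order; return (response, model_used, failures)."""
    failures: list[tuple[str, int]] = []
    for model in models:
        retried = False
        while True:
            try:
                return call_model(model), model, failures
            except ModelError as err:
                failures.append((model, err.status))
                # Retryable status, first failure: wait, then retry once.
                if err.status in retry_wait and not retried:
                    retried = True
                    sleep(retry_wait[err.status])
                    continue
                # Anything else (or a second failure): move to the next model.
                break
    # Every model failed; the caller builds the structured error.
    return None, None, failures
```

The `failures` list gives the caller what it needs to log each fallback reason without leaking raw errors to the user.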
Cascade configuration (per skill)
skills:
  efirm-conflict-check:
    min_tier: 1
    preferred_model: claude-sonnet-4-6
    fallback_models:
      - claude-opus-4      # higher capability fallback (if primary is overloaded)
      - claude-haiku-3-5   # cost fallback only if budget exceeded
    fallback_disclosure: true   # Tell user which model was used
    budget_fallback: true       # Allow tier-2 if org over budget
  efirm-deadline-tracker:
    min_tier: 2
    preferred_model: claude-haiku-3-5
    fallback_models:
      - claude-sonnet-4-6
    fallback_disclosure: false
    budget_fallback: false
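One way to turn a skill entry like these into an ordered candidate list is sketched below. The `candidate_models` helper and the `MODEL_TIER` lookup are hypothetical, and the entry is passed as a plain dict so the sketch needs no YAML parser. Whether an over-budget tenant should skip straight to the cheaper model is a policy choice this sketch does not make; it keeps the configured order and only decides which models are eligible.

```python
# Hypothetical tier lookup for the example model names used in this document.
MODEL_TIER = {
    "claude-opus-4": 1,
    "claude-sonnet-4-6": 1,
    "claude-haiku-3-5": 2,
    "claude-sonnet-3-7": 2,
}


def candidate_models(skill_cfg: dict, over_budget: bool) -> list[str]:
    """Return the models to try, in configured order, for one skill entry."""
    ordered = [skill_cfg["preferred_model"], *skill_cfg.get("fallback_models", [])]
    min_tier = skill_cfg["min_tier"]
    # A below-minimum model is eligible only as a budget fallback.
    budget_ok = over_budget and skill_cfg.get("budget_fallback", False)
    return [m for m in ordered if MODEL_TIER[m] <= min_tier or budget_ok]


conflict_check = {
    "min_tier": 1,
    "preferred_model": "claude-sonnet-4-6",
    "fallback_models": ["claude-opus-4", "claude-haiku-3-5"],
    "budget_fallback": True,
}
```

For the conflict-check entry, the Tier 2 model appears in the candidate list only when the organisation is over budget.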
Error handling
When all cascade options are exhausted:
{
  "error": {
    "code": "MODEL_UNAVAILABLE",
    "message": "Legal AI is temporarily unavailable. All models in the cascade have returned errors.",
    "user_message": "Our AI service is temporarily experiencing high demand. Your request has been queued. Please try again in 2–3 minutes, or contact support if this persists.",
    "retry_after_seconds": 120,
    "trace_id": "..."
  }
}
Never expose raw API error messages to end users (no "Error 529 Overloaded" in the UI). Always translate to a professional, calm user message appropriate for a legal professional audience.
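A sketch of how the payload might be built while keeping raw provider errors out of it. The function name, the logger name, and the use of a random UUID as `trace_id` are assumptions; the field names and default messages are taken from the example payload.

```python
import logging
import uuid

logger = logging.getLogger("cascade")  # hypothetical logger name

DEFAULT_USER_MESSAGE = (
    "Our AI service is temporarily experiencing high demand. "
    "Your request has been queued. Please try again in 2–3 minutes, "
    "or contact support if this persists."
)


def build_unavailable_error(
    raw_errors: list[str],
    retry_after_seconds: int = 120,
    user_message: str = DEFAULT_USER_MESSAGE,
) -> dict:
    """Build the structured error; raw errors are logged, never returned."""
    trace_id = uuid.uuid4().hex  # assumption: any unique id would do
    # Raw provider errors go to the server log, correlated by trace_id.
    logger.error("cascade exhausted trace_id=%s errors=%s", trace_id, raw_errors)
    return {
        "error": {
            "code": "MODEL_UNAVAILABLE",
            "message": (
                "Legal AI is temporarily unavailable. "
                "All models in the cascade have returned errors."
            ),
            "user_message": user_message,
            "retry_after_seconds": retry_after_seconds,
            "trace_id": trace_id,
        }
    }
```

Support staff can then look up the raw errors by `trace_id` without the UI ever showing them.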
Fallback disclosure
When a lower-tier model is used as a fallback on a P0 skill (conflict check, engagement letter), the system should:
- Log the fallback event with: {skill_id, preferred_model, actual_model, reason}.
- Optionally display a subtle notice in the UI: "This response was generated by [Model X] (backup mode). Review carefully."
- Alert engineering if fallback rate exceeds the threshold (see [[eng-latency-slo-by-skill]]).
For P2/P3 skills, silent fallback is acceptable — the quality difference between tiers is smaller.
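The disclosure rule could be sketched as a single function that always produces the log event and only sometimes produces a UI notice. The function name and its parameters are hypothetical; the caller is assumed to know the skill's priority and whether the model actually used sits in a lower tier than the preferred one.

```python
from typing import Optional


def record_fallback(
    skill_id: str,
    priority: str,
    preferred_model: str,
    actual_model: str,
    reason: str,
    is_lower_tier: bool,
    disclose: bool = True,
) -> tuple[dict, Optional[str]]:
    """Return (log_event, ui_notice); ui_notice is None for silent fallback."""
    # Every fallback is logged, whatever the priority.
    event = {
        "skill_id": skill_id,
        "preferred_model": preferred_model,
        "actual_model": actual_model,
        "reason": reason,
    }
    notice = None
    # Only a lower-tier fallback on a P0 skill is surfaced to the user.
    if priority == "P0" and is_lower_tier and disclose:
        notice = (
            f"This response was generated by {actual_model} (backup mode). "
            "Review carefully."
        )
    return event, notice
```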
BYO key considerations
Under the bring-your-own-key (BYO-key) model, the user's API key may sit in a lower rate-limit tier than the platform key. The cascade must:
- Detect 429 errors on the user's key.
- Surface a helpful message: "Your Anthropic API key has reached its rate limit for this model. Upgrade your Anthropic plan or wait [retry_after] seconds."
- Not silently fall back to a platform key (that would create unexpected cost for the platform).
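A minimal sketch of the branch this implies when a 429 arrives. The `key_source` values and the returned action names are assumptions made for illustration; the point is that the BYO path surfaces the limit to the user and never switches keys.

```python
def handle_rate_limit(key_source: str, retry_after: int) -> dict:
    """Decide what to do after a 429, depending on whose key was used."""
    if key_source == "byo":
        # The user's own key hit its limit: tell them, and keep using
        # their key only. Never swap in the platform key silently.
        return {
            "action": "surface_to_user",
            "user_message": (
                "Your Anthropic API key has reached its rate limit for this "
                f"model. Upgrade your Anthropic plan or wait {retry_after} seconds."
            ),
        }
    # Platform key: the normal retry-then-cascade path applies.
    return {"action": "retry_then_cascade"}
```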
Retry policy
| Error type | Retry | Wait |
|---|---|---|
| 429 Rate limit | Yes, once | Retry-After header value, or 60s |
| 529 Overload | No — cascade immediately | — |
| 500 Server error | Yes, once | 2s exponential backoff |
| 503 Unavailable | No — cascade immediately | — |
| Network timeout | Yes, once | 1s |
| Connection error | Yes, once | 1s |
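Cap total retry + cascade time at the skill's latency SLO (see [[eng-latency-slo-by-skill]]). If the next retry or cascade step would push the request past the SLO, fail fast and return the structured error rather than leaving the user waiting on a request that will time out.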
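The table and the SLO cap can be combined into one decision function, sketched below. The string keys, the `next_step` name, and the action labels are hypothetical; the retry flags and waits are copied from the table, with 60s as the 429 default when no Retry-After value is available.

```python
from typing import Optional

# error type -> (retry once?, default wait in seconds), from the table above.
RETRY_POLICY = {
    "429": (True, 60.0),
    "529": (False, 0.0),
    "500": (True, 2.0),
    "503": (False, 0.0),
    "timeout": (True, 1.0),
    "connection": (True, 1.0),
}


def next_step(
    error_type: str,
    attempts: int,
    elapsed_s: float,
    slo_s: float,
    retry_after: Optional[float] = None,
) -> tuple[str, float]:
    """Return (action, wait_s); action is 'retry', 'cascade', or 'fail_fast'.

    attempts is the number of calls already made to the current model.
    """
    # The SLO budget is already spent: stop instead of timing out the user.
    if elapsed_s >= slo_s:
        return ("fail_fast", 0.0)
    retry, wait = RETRY_POLICY[error_type]
    # A Retry-After value from the provider overrides the 429 default.
    if error_type == "429" and retry_after is not None:
        wait = retry_after
    # Retry once, and only if the wait still fits inside the SLO budget.
    if retry and attempts < 2 and elapsed_s + wait < slo_s:
        return ("retry", wait)
    return ("cascade", 0.0)
```

Note that with a 10-second SLO, a 429 without a Retry-After value cascades immediately, because the 60-second default wait cannot fit in the budget.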
Monitoring
Track in metrics:
- fallback_events_total by {org_id, skill_id, from_model, to_model, reason}
- cascade_failure_rate (all models failed): alert if > 0.1% of requests
- fallback_rate_by_model: alert if primary model fallback rate > 5% over 15 min
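The two alert thresholds can be checked with a small function like the one below. It is a sketch only: the function name and its count-based inputs are assumptions, and the counts are presumed to be pre-aggregated over the relevant window (15 minutes for the fallback rate) by whatever metrics system is in use.

```python
CASCADE_FAILURE_THRESHOLD = 0.001  # 0.1% of requests
PRIMARY_FALLBACK_THRESHOLD = 0.05  # 5% over the 15-minute window


def fired_alerts(
    total_requests: int,
    cascade_failures: int,
    primary_calls: int,
    primary_fallbacks: int,
) -> list[str]:
    """Return the names of the alerts whose thresholds are exceeded."""
    fired = []
    # Guard against division by zero when the window saw no traffic.
    if total_requests and cascade_failures / total_requests > CASCADE_FAILURE_THRESHOLD:
        fired.append("cascade_failure_rate")
    if primary_calls and primary_fallbacks / primary_calls > PRIMARY_FALLBACK_THRESHOLD:
        fired.append("fallback_rate_by_model")
    return fired
```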
Related skills
- [[eng-latency-slo-by-skill]]
- [[eng-cost-per-message-tracker]]
- [[eng-context-cache-key-design]]
- [[eng-feature-flag-rollout-skills]]