eng-fallback-model-cascade

name: eng-fallback-model-cascade
description: Use when designing the model-selection and fallback logic for a legal AI product — defining which model to use for which skill tier, how to cascade to a cheaper or faster model when the primary model is unavailable or over budget, and how to handle failures gracefully without exposing errors to legal practitioners. Engineering skill with direct impact on availability SLOs and cost management.
license: MIT
metadata:
id: eng.fallback-model-cascade
category: eng
jurisdictions: [multi]
priority: P2
intent: [fallback, model-selection, availability, resilience, cascade]
related:
- eng-latency-slo-by-skill
- eng-cost-per-message-tracker
- eng-context-cache-key-design
- eng-feature-flag-rollout-skills
source: Louis — HAQQ Legal AI (github.com/sboghossian/mini-claude-for-legal)
version: "1.0"

Fallback Model Cascade

What it does

A fallback model cascade is the routing logic that selects which LLM to use for a given request and what to do when the preferred model is unavailable, slow, or over the cost budget. In a legal AI product, where lawyers depend on reliable responses for deadline-critical work, a cascade must:

Serve the best model for the skill tier (heavier skills get the capable model; lightweight skills can run on a faster, cheaper model).
Degrade gracefully when the primary model returns a 529 (overload) or 500, or times out.
Stay within cost budgets if a tenant is on a metered plan.
Be transparent to the user when a fallback occurred — a legal practitioner may care that a "draft NDA" was produced by a less capable model.

Model tier definitions

For a Claude-based legal AI product, define three tiers:

Tier	Models (examples)	Use case
Tier 1 (Primary)	claude-opus-4, claude-sonnet-4-6	Complex drafting, legal analysis, multi-step reasoning, P0 skills
Tier 2 (Secondary)	claude-haiku-3-5, claude-sonnet-3-7	Shorter form content, classification, routing, summarization
Tier 3 (Minimal)	Fast inference models, local models	Intent classification, short Q&A, health checks

Assign each skill a minimum tier:

efirm-conflict-check, efirm-engagement-letter-draft: Tier 1 minimum (legal accuracy critical).
efirm-client-update-email-draft, efirm-deadline-tracker: Tier 2 acceptable.
Routing/classification skills: Tier 3 acceptable.

Cascade sequence

For a given request:
1. Determine minimum_tier from skill configuration
2. Try primary_model for that tier
   → On success: use response
   → On 429 (rate limit): wait retry_delay; retry once; then cascade to next model
   → On 529 (overload): immediately cascade (no wait)
   → On 500: wait 2s; retry once; then cascade
   → On 503: immediately cascade
   → On timeout (> latency_slo_ms + 2000): cascade
3. Try secondary_model (if tier permits)
   → Same error handling
4. If all models fail: return structured error (see below)
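The sequence above can be sketched as a small loop. This is a minimal illustration, not the project's implementation: `ModelError`, `run_cascade`, and the injected `call` function are hypothetical names, and which statuses earn one retry (with what wait) is passed in as `retry_once` so the loop can follow whatever retry policy is configured.

```python
import time
from typing import Callable, Dict, Optional, Sequence


class ModelError(Exception):
    """Hypothetical error carrying an HTTP-like status; None means timeout."""

    def __init__(self, status: Optional[int]):
        super().__init__(f"model call failed: {status}")
        self.status = status


def run_cascade(
    models: Sequence[str],
    call: Callable[[str], str],
    retry_once: Dict[Optional[int], float],
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Try each model in order; retry once for statuses listed in retry_once."""
    attempts = []
    for model in models:
        retried = False
        while True:
            try:
                text = call(model)
                return {"ok": True, "model": model, "text": text, "attempts": attempts}
            except ModelError as err:
                attempts.append((model, err.status))
                # Retryable status: wait, then retry the same model one time.
                if err.status in retry_once and not retried:
                    retried = True
                    sleep(retry_once[err.status])
                    continue
                # Non-retryable status, or the retry also failed: cascade.
                break
    # Every model failed: the caller turns this into the structured error.
    return {"ok": False, "code": "MODEL_UNAVAILABLE", "attempts": attempts}
```

Injecting `call` and `sleep` keeps the loop testable without network access or real waits. A production version would also enforce the per-attempt timeout and the latency budget.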

Cascade configuration (per skill)

skills:
  efirm-conflict-check:
    min_tier: 1
    preferred_model: claude-sonnet-4-6
    fallback_models:
      - claude-opus-4          # higher capability fallback (if primary is overloaded)
      - claude-haiku-3-5       # cost fallback only if budget exceeded
    fallback_disclosure: true  # Tell user which model was used
    budget_fallback: true      # Allow tier-2 if org over budget

  efirm-deadline-tracker:
    min_tier: 2
    preferred_model: claude-haiku-3-5
    fallback_models:
      - claude-sonnet-4-6
    fallback_disclosure: false
    budget_fallback: false
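One way the per-skill configuration could be turned into an ordered candidate list is sketched below. The field names mirror the YAML above; `SkillCascade`, `MODEL_TIER`, and `candidate_models` are hypothetical. The sketch assumes that a model below the skill's `min_tier` is admitted only when `budget_fallback` is true and the org is over budget, and that the configured order is otherwise preserved.

```python
from dataclasses import dataclass, field
from typing import List

# Tier of each example model (lower number = more capable tier).
MODEL_TIER = {
    "claude-opus-4": 1,
    "claude-sonnet-4-6": 1,
    "claude-haiku-3-5": 2,
    "claude-sonnet-3-7": 2,
}


@dataclass
class SkillCascade:
    min_tier: int
    preferred_model: str
    fallback_models: List[str] = field(default_factory=list)
    fallback_disclosure: bool = False
    budget_fallback: bool = False


def candidate_models(cfg: SkillCascade, over_budget: bool) -> List[str]:
    """Return the models to try, in order, for one request."""
    ordered = [cfg.preferred_model, *cfg.fallback_models]
    # Assumption: below-min-tier models are allowed only as a budget fallback.
    allow_below_tier = cfg.budget_fallback and over_budget
    return [m for m in ordered if MODEL_TIER[m] <= cfg.min_tier or allow_below_tier]
```

With the `efirm-conflict-check` settings shown above, the Tier 2 model only enters the list once the org is over budget.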

Error handling

When all cascade options are exhausted:

{
  "error": {
    "code": "MODEL_UNAVAILABLE",
    "message": "Legal AI is temporarily unavailable. All models in the cascade have returned errors.",
    "user_message": "Our AI service is temporarily experiencing high demand. Please try again in 2–3 minutes, or contact support if this persists.",
    "retry_after_seconds": 120,
    "trace_id": "..."
  }
}

Never expose raw API error messages to end users (no "Error 529 Overloaded" in the UI). Always translate to a professional, calm user message appropriate for a legal professional audience.
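A sketch of how the translation step might look, assuming the raw provider errors are kept in an internal log and only the calm, structured payload reaches the client. `cascade_error` and its arguments are hypothetical; the payload fields follow the JSON above.

```python
import uuid
from typing import Iterable, Optional, Tuple

USER_MESSAGE = (
    "Our AI service is temporarily experiencing high demand. "
    "Please try again in 2-3 minutes, or contact support if this persists."
)


def cascade_error(
    raw_errors: Iterable[str],
    retry_after_seconds: int = 120,
    trace_id: Optional[str] = None,
) -> Tuple[dict, dict]:
    """Build (client payload, internal log record) for an exhausted cascade."""
    trace_id = trace_id or uuid.uuid4().hex
    payload = {
        "error": {
            "code": "MODEL_UNAVAILABLE",
            "message": "Legal AI is temporarily unavailable. "
                       "All models in the cascade have returned errors.",
            "user_message": USER_MESSAGE,
            "retry_after_seconds": retry_after_seconds,
            "trace_id": trace_id,
        }
    }
    # Raw provider errors stay server-side, joined to the payload by trace_id.
    internal_log = {"trace_id": trace_id, "raw_errors": list(raw_errors)}
    return payload, internal_log
```

Sharing the `trace_id` between the two records lets support staff look up the underlying errors without the UI ever showing them.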

Fallback disclosure

When a lower-tier model is used as a fallback on a P0 skill (conflict check, engagement letter), the system should:

Log the fallback event with: {skill_id, preferred_model, actual_model, reason}.
Optionally display a subtle notice in the UI: "This response was generated by [Model X] (backup mode). Review carefully."
Alert engineering if fallback rate exceeds the threshold (see [[eng-latency-slo-by-skill]]).

For P2/P3 skills, silent fallback is acceptable — the quality difference between tiers is smaller.
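The disclosure rules above could be expressed roughly as follows. `FallbackEvent` and `fallback_notice` are hypothetical names. The sketch assumes the caller already knows the skill's priority and whether the model actually used sits in a lower tier than the preferred one.

```python
import json
import logging
from dataclasses import asdict, dataclass
from typing import Optional

logger = logging.getLogger("cascade")


@dataclass
class FallbackEvent:
    skill_id: str
    preferred_model: str
    actual_model: str
    reason: str


def fallback_notice(
    event: FallbackEvent,
    skill_priority: str,
    lower_tier: bool,
    disclosure_enabled: bool = True,
) -> Optional[str]:
    """Log every fallback; return a UI notice only for disclosed P0 downgrades."""
    logger.warning("fallback_event %s", json.dumps(asdict(event), sort_keys=True))
    if skill_priority == "P0" and lower_tier and disclosure_enabled:
        return (
            f"This response was generated by {event.actual_model} "
            "(backup mode). Review carefully."
        )
    # P2/P3 skills, or same-tier fallbacks: silent fallback.
    return None
```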

BYO key considerations

Under a bring-your-own-key (BYO-key) model, the user's API key may have lower rate limits than the platform key. The cascade must:

Detect 429 errors on the user's key.
Surface a helpful message: "Your Anthropic API key has reached its rate limit for this model. Upgrade your Anthropic plan or wait [retry_after] seconds."
Not silently fall back to a platform key (that would create unexpected cost for the platform).
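A minimal sketch of the BYO-key branch, assuming response headers arrive as a plain dict with lowercase keys and that `retry-after` holds a whole number of seconds. `byo_key_rate_limit_message` is a hypothetical helper; the 60-second default is the fallback wait from the retry policy table.

```python
from typing import Mapping, Optional


def byo_key_rate_limit_message(
    status: int,
    headers: Mapping[str, str],
    default_wait: int = 60,
) -> Optional[str]:
    """Return the user-facing message for a 429 on the user's own key.

    Returns None for any other status. The caller must NOT retry the request
    with a platform key; it should surface this message and stop.
    """
    if status != 429:
        return None
    raw = str(headers.get("retry-after", ""))
    # Assumption: the header is integer seconds; anything else uses the default.
    wait = int(raw) if raw.isdigit() else default_wait
    return (
        "Your Anthropic API key has reached its rate limit for this model. "
        f"Upgrade your Anthropic plan or wait {wait} seconds."
    )
```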

Retry policy

Error type	Retry	Wait
429 Rate limit	Yes, once	`Retry-After` header value, or 60s
529 Overload	No — cascade immediately	—
500 Server error	Yes, once	2s (backoff base delay)
503 Unavailable	No — cascade immediately	—
Network timeout	Yes, once	1s
Connection error	Yes, once	1s

Cap total retry + cascade time at the skill's latency SLO (see [[eng-latency-slo-by-skill]]). If the next retry or cascade step would push the request past the SLO, fail fast rather than leave the user waiting on a timeout.
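The table and the SLO cap can be combined into one retry decision, sketched below. `RETRY_POLICY`, `Deadline`, and `should_retry` are hypothetical names, and `expected_call_ms` is an assumed estimate of how long the next model call takes. For 429 the sketch uses the 60s default; a real implementation would prefer the `Retry-After` value when present.

```python
import time
from typing import Callable

# (retries allowed, wait in ms) per error type, copied from the table above.
RETRY_POLICY = {
    "429": (1, 60_000),
    "529": (0, 0),
    "500": (1, 2_000),
    "503": (0, 0),
    "network_timeout": (1, 1_000),
    "connection_error": (1, 1_000),
}


class Deadline:
    """Tracks the time left in a skill's latency SLO."""

    def __init__(self, slo_ms: float, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._end = clock() + slo_ms / 1000.0

    def remaining_ms(self) -> float:
        return max(0.0, (self._end - self._clock()) * 1000.0)

    def allows(self, wait_ms: float, expected_call_ms: float) -> bool:
        """True if waiting and then calling a model still fits in the SLO."""
        return wait_ms + expected_call_ms <= self.remaining_ms()


def should_retry(error_type: str, deadline: Deadline, expected_call_ms: float) -> bool:
    """Retry only if the policy permits it and the SLO budget can absorb it."""
    retries, wait_ms = RETRY_POLICY[error_type]
    return retries > 0 and deadline.allows(wait_ms, expected_call_ms)
```

Note that under a short SLO, the 60s wait for a 429 never fits the budget, so in this sketch a rate-limited request moves straight on to the next step.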

Monitoring

Track in metrics:

fallback_events_total by {org_id, skill_id, from_model, to_model, reason}
cascade_failure_rate (all models failed) — alert if > 0.1% of requests
fallback_rate_by_model — alert if primary model fallback rate > 5% over 15 min
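The two alert thresholds above reduce to simple ratio checks. The sketch below assumes the caller supplies counts already aggregated over the relevant window (15 minutes for the fallback rate); `cascade_alerts` and its parameter names are hypothetical.

```python
from typing import List

CASCADE_FAILURE_THRESHOLD = 0.001  # 0.1% of requests
PRIMARY_FALLBACK_THRESHOLD = 0.05  # 5% over the 15-minute window


def cascade_alerts(
    total_requests: int,
    cascade_failures: int,
    primary_requests: int,
    primary_fallbacks: int,
) -> List[str]:
    """Return the names of the alerts that should fire for these counts."""
    alerts = []
    # Guard against empty windows to avoid dividing by zero.
    if total_requests and cascade_failures / total_requests > CASCADE_FAILURE_THRESHOLD:
        alerts.append("cascade_failure_rate")
    if primary_requests and primary_fallbacks / primary_requests > PRIMARY_FALLBACK_THRESHOLD:
        alerts.append("fallback_rate_by_model")
    return alerts
```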

[[eng-latency-slo-by-skill]]
[[eng-cost-per-message-tracker]]
[[eng-context-cache-key-design]]
[[eng-feature-flag-rollout-skills]]