eng-latency-slo-by-skill


name: eng-latency-slo-by-skill
description: Use when defining, measuring, or debugging latency Service Level Objectives (SLOs) for individual legal AI skills. Different skills have different acceptable latency — a conflict check must be fast; a full engagement letter draft can take longer. Covers the SLO definition framework, measurement approach, p50/p95/p99 targets by skill tier, alerting setup, and common latency root causes in legal AI products.
license: MIT
metadata:
  id: eng.latency-slo-by-skill
  category: eng
  jurisdictions: [multi]
  priority: P2
  intent: [SLO, latency, performance, observability, p99, alerting]
  related:
    - eng-langfuse-trace-inspector
    - eng-fallback-model-cascade
    - eng-cost-per-message-tracker
    - eng-context-cache-key-design
  source: Louis — HAQQ Legal AI (github.com/sboghossian/mini-claude-for-legal)
  version: "1.0"

Latency SLO by Skill

What it does

A latency SLO (Service Level Objective) defines the acceptable response time for a skill. Without per-skill SLOs, the only latency metric is "it felt slow" — which cannot drive engineering action. With per-skill SLOs:

  • Product knows the experience guarantee for each skill.
  • Engineering has a measurable target and an alert threshold.
  • The [[eng-fallback-model-cascade]] can use SLO breach as a trigger for model switching.
  • Users — law firm staff — understand why some skills stream longer than others.

SLO taxonomy

Latency is measured end-to-end at the skill level. The clock starts when the API gateway receives the user's message and produces two readings: time to first byte (TTFB), taken when the first response byte is sent, and total, taken when the full response is complete.

Metric     | Definition
──────────────────────────────────────────────────────────────────────────
p50_ttfb   | Median time to first byte; 50% of requests are faster
p95_ttfb   | 95th percentile TTFB; the "most users" experience
p99_ttfb   | 99th percentile TTFB; worst 1% of requests
p95_total  | 95th percentile total response time
error_rate | Percentage of requests returning an error
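
As a rough illustration of the metrics above, the sketch below computes them from a list of per-request samples. The `RequestSample` type, its field names, the nearest-rank percentile method, and the choice to exclude failed requests from the latency percentiles are all assumptions made for this example, not something the skill prescribes.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestSample:
    ttfb_ms: float   # time to first byte, in milliseconds
    total_ms: float  # time to full response, in milliseconds
    is_error: bool


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples: list[RequestSample]) -> dict[str, float]:
    # Assumption: latency percentiles cover successful requests only.
    ok = [s for s in samples if not s.is_error]
    ttfb = [s.ttfb_ms for s in ok]
    total = [s.total_ms for s in ok]
    return {
        "p50_ttfb": percentile(ttfb, 50),
        "p95_ttfb": percentile(ttfb, 95),
        "p99_ttfb": percentile(ttfb, 99),
        "p95_total": percentile(total, 95),
        "error_rate": 100 * sum(s.is_error for s in samples) / len(samples),
    }
```

Exact percentiles like these need every sample in memory, which is workable for a weekly report but usually too costly for live alerting.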

SLO targets by skill tier

Assign each skill to a latency tier based on expected usage context:

Tier              | Description                                                   | p50 TTFB | p95 TTFB | p99 TTFB | p95 total
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Interactive (T-I) | Real-time chat; user is watching                              | <500ms   | <1.5s    | <3s      | <8s
Workflow (T-W)    | Triggered by action; user expects slight wait                 | <1s      | <3s      | <6s      | <20s
Document (T-D)    | Full document generation; user expects wait; streaming visible | <2s      | <5s      | <10s     | <60s
Batch (T-B)       | Background job; user not waiting at terminal                  | <10s     | <30s     | <60s     | <300s
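
The tier targets above can be kept as data so that alerting and reporting read the same numbers. The following is a minimal sketch; `TierTargets`, `TIER_TARGETS`, and `breached_targets` are hypothetical names introduced for illustration, and every value is in milliseconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierTargets:
    p50_ttfb_ms: int
    p95_ttfb_ms: int
    p99_ttfb_ms: int
    p95_total_ms: int


# Values copied from the tier table, converted to milliseconds.
TIER_TARGETS = {
    "T-I": TierTargets(500, 1_500, 3_000, 8_000),
    "T-W": TierTargets(1_000, 3_000, 6_000, 20_000),
    "T-D": TierTargets(2_000, 5_000, 10_000, 60_000),
    "T-B": TierTargets(10_000, 30_000, 60_000, 300_000),
}


def breached_targets(tier: str, observed_ms: dict[str, float]) -> list[str]:
    """Return the metrics whose observed value is not strictly below the tier target."""
    targets = TIER_TARGETS[tier]
    limits = {
        "p50_ttfb": targets.p50_ttfb_ms,
        "p95_ttfb": targets.p95_ttfb_ms,
        "p99_ttfb": targets.p99_ttfb_ms,
        "p95_total": targets.p95_total_ms,
    }
    # Metrics missing from observed_ms are skipped rather than treated as breaches.
    return [
        name
        for name, limit in limits.items()
        if name in observed_ms and observed_ms[name] >= limit
    ]
```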

Skill-to-tier assignments

Skill                            | Tier | Rationale
──────────────────────────────────────────────────────────────────────────────────────────────
router.*                         | T-I  | Must not add noticeable latency to user experience
efirm-conflict-check             | T-W  | Triggered by matter creation; partner expects fast result
efirm-client-update-email-draft  | T-W  | Lawyer requesting a draft; will review before sending
efirm-deadline-tracker           | T-W  | Matter dashboard; should load within a workflow step
efirm-engagement-letter-draft    | T-D  | Full document; streaming acceptable
efirm-fee-quote-builder          | T-D  | Multi-section document output
efirm-matter-creation-flow       | T-W  | Orchestration step; sub-skills have own SLOs
efirm-finance-* (dashboards)     | T-D  | Report generation
Eval runs                        | T-B  | Background quality evaluation
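
Because some rows use wildcards (`router.*`, `efirm-finance-*`), a lookup needs pattern matching rather than a plain dictionary. One possible sketch uses the standard library's `fnmatch`; the `SKILL_TIERS` list and `tier_for` function are hypothetical, and raising on unmapped skills is a design choice made here so that a new skill cannot ship without a tier.

```python
from fnmatch import fnmatchcase

# (pattern, tier) pairs copied from the table; the first matching pattern wins.
# "Eval runs" is a job type rather than a skill ID, so it is left out here.
SKILL_TIERS = [
    ("router.*", "T-I"),
    ("efirm-conflict-check", "T-W"),
    ("efirm-client-update-email-draft", "T-W"),
    ("efirm-deadline-tracker", "T-W"),
    ("efirm-engagement-letter-draft", "T-D"),
    ("efirm-fee-quote-builder", "T-D"),
    ("efirm-matter-creation-flow", "T-W"),
    ("efirm-finance-*", "T-D"),
]


def tier_for(skill_id: str) -> str:
    """Resolve a skill ID to its latency tier, or raise if it has no assignment."""
    for pattern, tier in SKILL_TIERS:
        if fnmatchcase(skill_id, pattern):
            return tier
    raise LookupError(f"no latency tier assigned for skill {skill_id!r}")
```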

Measurement approach

Instrumentation

At every skill invocation, emit:

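# Assumes `langfuse` and `metrics` are already-initialized clients and that the
# request handler has measured every value referenced below.
langfuse.span(
    name=f"skill.{skill_id}",
    metadata={
        "skill_id": skill_id,
        "tier": skill_tier,
        "model_id": model_id,
        "cache_hit_l1": cache_hit_l1,    # bool
        "cache_hit_l2": cache_hit_l2,    # bool
        "cache_hit_l3": cache_hit_l3,    # bool
        "input_tokens": input_tokens,    # int
        "output_tokens": output_tokens,  # int
    },
    start_time=request_start,
    end_time=response_complete,
    level="DEFAULT" if within_slo else "WARNING",  # surface SLO breaches in the trace
)

# Also emit a structured metrics event
metrics.histogram(
    "skill.latency.ttfb",
    value=ttfb_ms,
    tags={
        "skill_id": skill_id,
        "tier": skill_tier,
        "model_id": model_id,
        "cache_hit_l2": str(cache_hit_l2),
    },
)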
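
The instrumentation above needs TTFB and total time as inputs. A minimal sketch of how they might be captured around a streamed response follows; `TimedStream` is a hypothetical helper, and it starts its clock in-process rather than at the API gateway, so it will under-report by whatever time the request spent upstream unless `request_start` is passed in.

```python
import time
from collections.abc import Iterable, Iterator
from typing import Optional


class TimedStream:
    """Wraps a chunk iterator and records TTFB and total latency in milliseconds."""

    def __init__(self, chunks: Iterable[str], request_start: Optional[float] = None):
        self._chunks = iter(chunks)
        # request_start must come from time.perf_counter() if supplied.
        self._start = time.perf_counter() if request_start is None else request_start
        self.ttfb_ms: Optional[float] = None
        self.total_ms: Optional[float] = None

    def __iter__(self) -> Iterator[str]:
        for chunk in self._chunks:
            if self.ttfb_ms is None:
                # First chunk is ready to be sent: this approximates first byte.
                self.ttfb_ms = (time.perf_counter() - self._start) * 1000
            yield chunk
        # Reached only when the stream is fully consumed.
        self.total_ms = (time.perf_counter() - self._start) * 1000
```

Wrapping the model's chunk iterator in `TimedStream` and iterating it as usual leaves `ttfb_ms` and `total_ms` populated once the stream is exhausted; if the client disconnects early, `total_ms` stays `None`.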

Percentile computation

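Compute p50/p95/p99 per skill_id over rolling 1-hour and 24-hour windows. Use a histogram with power-of-two buckets for efficient percentile estimation. The estimate is only as precise as the bucket it falls in, so with power-of-two edges a reported percentile can be up to 2x above the true value.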
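
To make the bucketing concrete, here is a minimal sketch of a power-of-two histogram that reports the upper edge of the bucket containing a percentile. `Pow2Histogram` and its method names are hypothetical, windowing is left out, and a production system would more likely rely on the histogram type of its metrics backend.

```python
import math


class Pow2Histogram:
    """Bucket i counts latencies in (2**(i-1), 2**i] ms; bucket 0 counts latencies <= 1 ms."""

    def __init__(self, max_exponent: int = 20):
        # The last bucket also absorbs anything above 2**max_exponent ms.
        self.counts = [0] * (max_exponent + 1)
        self.total = 0

    def record(self, latency_ms: float) -> None:
        if latency_ms <= 1:
            exponent = 0
        else:
            # frexp gives latency_ms == mantissa * 2**exp with 0.5 <= mantissa < 1.
            mantissa, exp = math.frexp(latency_ms)
            exponent = exp - 1 if mantissa == 0.5 else exp
        self.counts[min(exponent, len(self.counts) - 1)] += 1
        self.total += 1

    def percentile_upper_bound(self, pct: float) -> float:
        """Upper edge (ms) of the bucket that contains the pct-th percentile."""
        if self.total == 0:
            raise ValueError("no samples")
        needed = max(1, math.ceil(pct * self.total / 100))
        seen = 0
        for exponent, count in enumerate(self.counts):
            seen += count
            if seen >= needed:
                return float(2 ** exponent)
        return float(2 ** (len(self.counts) - 1))
```

Note the trade-off: a true p95 of 1.1s and one of 2.0s both report as 2048 ms, which straddles the 1.5s T-I target. Finer buckets, or edges placed on the SLO thresholds themselves, would avoid that ambiguity at the cost of more counters.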

SLO alerting

Alert                        | Condition                                 | Severity
─────────────────────────────────────────────────────────────────────────────────────────────────────────────
TTFB SLO breach (T-I)        | p95_ttfb > 1.5s for 5 consecutive minutes | HIGH: wake on-call
TTFB SLO breach (T-W/T-D)    | p95_ttfb > 3x target for 10 min           | MEDIUM: notify eng channel
Error rate spike             | Error rate > 1% over 5 min                | HIGH
p99 outlier                  | p99_ttfb > 4x p50_ttfb                    | LOW: investigate cache or model issue
Total response time exceeded | p95_total > 2x target for 5 min           | MEDIUM

Alerts feed into the [[eng-fallback-model-cascade]] kill-switch: if a skill's p95 TTFB exceeds the T-W threshold and the primary model is at fault, automatically cascade to the secondary model.
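
The "for N consecutive minutes" conditions can be evaluated with a small sliding window. The sketch below assumes that one p95 TTFB value is computed per minute and fed to the detector in order; `SustainedBreachDetector` and `is_p99_outlier` are hypothetical names, not part of any alerting product.

```python
from collections import deque


class SustainedBreachDetector:
    """Fires once a per-minute value has exceeded the threshold for N consecutive minutes."""

    def __init__(self, threshold_ms: float, consecutive_minutes: int):
        self.threshold_ms = threshold_ms
        # Holds one True/False breach flag per minute, newest last.
        self.window: deque = deque(maxlen=consecutive_minutes)

    def observe(self, p95_ttfb_ms: float) -> bool:
        self.window.append(p95_ttfb_ms > self.threshold_ms)
        return len(self.window) == self.window.maxlen and all(self.window)


def is_p99_outlier(p50_ttfb_ms: float, p99_ttfb_ms: float) -> bool:
    """The LOW-severity check from the table: p99 more than 4x p50."""
    return p99_ttfb_ms > 4 * p50_ttfb_ms
```

Under these assumptions, the T-I alert would be `SustainedBreachDetector(1_500, 5)`, and the T-W alert would be `SustainedBreachDetector(3 * 3_000, 10)`, since its condition is 3x the 3s target held for 10 minutes.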

Common latency root causes

Root cause                  | Symptom                                            | Resolution
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Cache miss at L2 (skills)   | p95 TTFB spike after skill deployment              | Warm the cache by sending a low-stakes prefill request after deployment
Large matter context (L3)   | p95 TTFB proportional to matter age/size           | Implement matter-context summarization; truncate least-relevant documents
Model overload / 529        | p99 spike; fallback_used rate rises                | Cascade to secondary model; alert Anthropic
Long output generation      | p95_total high despite low TTFB                    | Expected for T-D skills; verify streaming is enabled so user sees progress
Context assembly bottleneck | High latency on context.assembly span in Langfuse  | Optimize KB retrieval; profile the assembly code
Cold start (serverless)     | p99 outlier on first request after idle            | Keep-alive ping or minimum instance count

SLO reporting

Produce a weekly SLO report:

SLO REPORT — [Period]

Skill                            | Tier | p50 TTFB | p95 TTFB | SLO Target | Status
─────────────────────────────────────────────────────────────────────────────────
efirm-conflict-check             | T-W  | 820ms    | 2.1s     | 3.0s       | ✓ OK
efirm-engagement-letter-draft    | T-D  | 1.4s     | 4.8s     | 5.0s       | ✓ OK
efirm-fee-quote-builder          | T-D  | 1.9s     | 6.2s     | 5.0s       | ✗ BREACH
router.*                         | T-I  | 210ms    | 480ms    | 1.5s       | ✓ OK

SLO breaches this period: 1 (efirm-fee-quote-builder)
Root cause: Large output + context explosion on complex matters
Remediation: Streaming enabled [date]; p95 expected to improve next period
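
Rows in the layout above could be generated with a small formatter. This is only a sketch: `format_ms` and `report_row` are hypothetical helpers, inputs are assumed to be in milliseconds, and the status column compares p95 TTFB against the tier's p95 TTFB target, as the sample report appears to do.

```python
def format_ms(ms: float) -> str:
    """Render milliseconds as '820ms' below one second, otherwise as '2.1s'."""
    return f"{ms / 1000:.1f}s" if ms >= 1000 else f"{ms:.0f}ms"


def report_row(skill: str, tier: str, p50_ms: float, p95_ms: float, target_ms: float) -> str:
    """Build one report line using the column widths of the sample report."""
    status = "✓ OK" if p95_ms < target_ms else "✗ BREACH"
    return (
        f"{skill:<32} | {tier:<4} | {format_ms(p50_ms):<8} | "
        f"{format_ms(p95_ms):<8} | {format_ms(target_ms):<10} | {status}"
    )
```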

Related skills

  • [[eng-langfuse-trace-inspector]]
  • [[eng-fallback-model-cascade]]
  • [[eng-cost-per-message-tracker]]
  • [[eng-context-cache-key-design]]