eng-feature-flag-rollout-skills
name: eng-feature-flag-rollout-skills
description: Use when managing the rollout of new or updated legal AI skills using feature flags — controlling which organizations, users, or tenant tiers have access to a skill, enabling gradual rollout, A/B testing skill versions, and instant kill-switches for problematic skills. Engineering skill critical for safe deployment of legal AI capabilities without firm-wide disruption.
license: MIT
metadata:
id: eng.feature-flag-rollout-skills
category: eng
jurisdictions: [multi]
priority: P2
intent: [feature-flags, rollout, A/B-test, gradual-release, kill-switch]
related:
- eng-fallback-model-cascade
- eng-langfuse-eval-runner
- eng-audit-log-schema
- eng-latency-slo-by-skill
source: Louis — HAQQ Legal AI (github.com/sboghossian/mini-claude-for-legal)
version: "1.0"
Feature Flag Rollout — Skills
What it does
Feature flags for skills control which skills are available to which users, at which versions, under what conditions. In a legal AI product, skills are the primary feature surface — a new skill for drafting a Saudi employment contract, an updated conflict-check algorithm, or an experimental WIP report format. Feature flags allow:
- Safe rollout: ship a new skill to 5% of users before all users.
- Org-scoped enablement: a skill that requires firm-specific configuration (e.g., billing-system integration) can be enabled only for that firm.
- Version A/B testing: run two skill versions in parallel; evaluate quality via [[eng-langfuse-eval-runner]] before full promotion.
- Instant kill-switch: if a skill produces harmful or inaccurate legal output in production, disable it for all users in under 60 seconds.
- Tier-gated access: reserve advanced skills for pro/enterprise tier users.
Flag schema
Each skill flag is an entry in the flag configuration store:
{
  "flag_id": "skill:efirm-conflict-check",
  "skill_id": "efirm-conflict-check",
  "skill_version": "1.1",
  "enabled_global": true,
  "rollout_percentage": 100,
  "overrides": [
    {
      "condition": "org_id == 'org_haqq'",
      "skill_version": "1.2-beta",
      "rollout_percentage": 100,
      "note": "HAQQ is piloting v1.2-beta"
    },
    {
      "condition": "user_tier == 'free'",
      "enabled": false,
      "note": "Conflict check not available on free tier"
    }
  ],
  "kill_switch": false,
  "created_at": "ISO-8601",
  "updated_at": "ISO-8601",
  "owner": "eng-team | product-team",
  "review_date": "ISO-8601"
}
Evaluation logic
For every incoming request, the skill router evaluates flags in order:
1. If kill_switch == true: skill unavailable → return graceful error
2. If org_id matches an override: apply override (version + rollout%)
3. If user_tier matches an override: apply override
4. Else: apply global rollout_percentage (hash(user_id + flag_id) % 100 < rollout_percentage)
5. If skill available: load skill_version
Hash-based rollout ensures stability — the same user always gets the same variant within a rollout, avoiding inconsistent experiences across sessions.
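The evaluation order above can be sketched in Python as follows. This is a minimal illustration, not the router's actual code: the source does not name a hash function, so SHA-256 is an assumption (Python's built-in `hash()` is salted per process and would not be stable across sessions). The helper names `bucket`, `find_override`, and `resolve_skill_version` are hypothetical, and conditions are assumed to always take the simple `field == 'value'` form shown in the schema. `enabled_global` is not consulted because the listed steps do not mention it.

```python
import hashlib
import re
from typing import Optional

# Assumption: override conditions are always of the form  field == 'value'.
_CONDITION = re.compile(r"(\w+) == '([^']*)'")


def bucket(user_id: str, flag_id: str) -> int:
    """Map a (user, flag) pair to a stable bucket in [0, 100)."""
    digest = hashlib.sha256((user_id + flag_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def find_override(flag: dict, field: str, value: str) -> Optional[dict]:
    """Return the first override whose condition is  field == 'value'."""
    for override in flag.get("overrides", []):
        parsed = _CONDITION.fullmatch(override["condition"])
        if parsed and parsed.groups() == (field, value):
            return override
    return None


def resolve_skill_version(flag: dict, user_id: str, org_id: str,
                          user_tier: str) -> Optional[str]:
    """Return the skill version to load, or None if the skill is unavailable."""
    # Step 1: the kill-switch overrides everything else.
    if flag.get("kill_switch"):
        return None
    # Steps 2-3: an org override takes precedence over a tier override.
    override = find_override(flag, "org_id", org_id)
    if override is None:
        override = find_override(flag, "user_tier", user_tier)
    # Override keys (version, rollout %, enabled) shadow the global ones.
    settings = {**flag, **(override or {})}
    if settings.get("enabled") is False:
        return None
    # Step 4: percentage rollout on a stable hash bucket.
    if bucket(user_id, flag["flag_id"]) < settings["rollout_percentage"]:
        # Step 5: the skill is available, so report the version to load.
        return settings["skill_version"]
    return None
```

Because the bucket depends only on `user_id` and `flag_id`, raising `rollout_percentage` from 5 to 20 keeps every user who was already in the 5% cohort inside the 20% cohort.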
Rollout stages for a new skill
| Stage | Rollout % | Duration | Promotion criteria |
|---|---|---|---|
| Internal only | 0% global; 100% for eng/product org_ids | 1 week | No errors; latency within SLO |
| Alpha | 5% of opt-in users | 1 week | Quality score ≥ target per Langfuse eval |
| Beta | 20% | 1–2 weeks | Feedback positive; no critical flags |
| GA | 100% | Permanent | — |
For P0 skills (conflict check, engagement letter), the rollout stages are mandatory and cannot be skipped. For P2 skills, a single internal-only → GA progression is acceptable if quality is confirmed.
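The stage rules above could be enforced with a small guard like the one below. It is a sketch under assumptions: the stage identifiers, the function name `allowed_next_stages`, and the `quality_confirmed` parameter are hypothetical, and any priority other than P2 is treated as strictly sequential because the source only describes P0 and P2.

```python
from typing import List

# Stage order taken from the rollout table; identifiers are illustrative.
STAGES = ["internal", "alpha", "beta", "ga"]


def allowed_next_stages(priority: str, current: str,
                        quality_confirmed: bool = False) -> List[str]:
    """List the stages a skill may be promoted to from its current stage."""
    if current == "ga":
        return []  # GA is the final stage.
    # Every skill may always move to the next stage in sequence.
    options = [STAGES[STAGES.index(current) + 1]]
    # Only P2 skills may jump from internal-only straight to GA,
    # and only once quality has been confirmed.
    if priority == "P2" and current == "internal" and quality_confirmed:
        options.append("ga")
    return options
```

For example, `allowed_next_stages("P0", "internal", quality_confirmed=True)` offers only the alpha stage, while the same call with `"P2"` also offers GA.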
Version-parallel A/B testing
To compare two skill versions:
{
  "flag_id": "skill:efirm-engagement-letter-draft",
  "experiment": {
    "enabled": true,
    "variant_a": {"skill_version": "1.0", "traffic_pct": 50},
    "variant_b": {"skill_version": "1.1-test", "traffic_pct": 50},
    "eval_metric": "langfuse_score:quality",
    "min_samples": 100,
    "auto_promote_threshold": 0.05
  }
}
Route users stably (same variant across sessions) using hash(user_id + experiment_id) % 100.
Evaluate using [[eng-langfuse-eval-runner]]: if variant B shows a statistically significant improvement (p < 0.05) after 100+ samples, auto-promote to 100% or alert the product team for manual promotion.
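A possible shape for the variant assignment and the promotion check is sketched below. Several points are assumptions, not facts from the source: SHA-256 as the hash, `min_samples` applied per variant, `auto_promote_threshold` read as the p-value cut-off, and a one-sided z-test on mean scores (normal approximation) as the significance test. The source does not say which test [[eng-langfuse-eval-runner]] uses, and the function names here are hypothetical.

```python
import hashlib
from statistics import NormalDist, mean, variance
from typing import Sequence


def assign_variant(user_id: str, experiment_id: str, experiment: dict) -> str:
    """Return the skill version for this user, stable across sessions."""
    digest = hashlib.sha256((user_id + experiment_id).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    # Buckets below variant A's traffic share go to A, the rest to B.
    if bucket < experiment["variant_a"]["traffic_pct"]:
        return experiment["variant_a"]["skill_version"]
    return experiment["variant_b"]["skill_version"]


def should_promote_b(scores_a: Sequence[float], scores_b: Sequence[float],
                     experiment: dict) -> bool:
    """True if variant B's mean score is significantly higher than A's."""
    n_a, n_b = len(scores_a), len(scores_b)
    # Assumption: min_samples applies to each variant separately.
    if min(n_a, n_b) < experiment["min_samples"]:
        return False
    std_err = (variance(scores_a) / n_a + variance(scores_b) / n_b) ** 0.5
    if std_err == 0:
        return False  # Degenerate data: leave the decision to manual review.
    z = (mean(scores_b) - mean(scores_a)) / std_err
    p_value = 1.0 - NormalDist().cdf(z)  # one-sided: is B better than A?
    return p_value < experiment["auto_promote_threshold"]
```

In this sketch the `flag_id` can stand in for `experiment_id`, since the example configuration does not define a separate experiment identifier.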
Kill-switch procedure
When a skill must be immediately disabled:
- Set kill_switch: true in the flag store (change propagates in < 10s with a cache-busting mechanism).
- All in-flight requests complete (no mid-stream interruption).
- New requests for that skill receive: "This feature is temporarily unavailable. Please try again later or contact support."
- Incident is logged with: who triggered the kill-switch, timestamp, reason.
- Engineering investigates root cause.
- To re-enable: set kill_switch: false after fix is deployed and verified.
Never turn a kill-switch off without a deployed and verified fix: the problem that triggered it still applies.
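The procedure can be illustrated with an in-memory flag store. This is only a sketch: the dictionary-backed store, the incident log list, and the function names are hypothetical, and cache propagation to running routers is not modelled.

```python
from datetime import datetime, timezone

UNAVAILABLE_MESSAGE = ("This feature is temporarily unavailable. "
                       "Please try again later or contact support.")


def _now() -> str:
    """Current UTC time as an ISO-8601 string."""
    return datetime.now(timezone.utc).isoformat()


def trigger_kill_switch(flags: dict, incident_log: list, flag_id: str,
                        triggered_by: str, reason: str) -> None:
    """Disable the skill and record who did it, when, and why."""
    flags[flag_id]["kill_switch"] = True
    flags[flag_id]["updated_at"] = _now()
    incident_log.append({
        "flag_id": flag_id,
        "triggered_by": triggered_by,
        "timestamp": _now(),
        "reason": reason,
    })


def re_enable(flags: dict, flag_id: str, fix_verified: bool) -> None:
    """Turn the kill-switch off, but only once a fix has been verified."""
    if not fix_verified:
        raise RuntimeError("refusing to re-enable: no verified fix")
    flags[flag_id]["kill_switch"] = False
    flags[flag_id]["updated_at"] = _now()


def handle_new_request(flags: dict, flag_id: str) -> str:
    """New requests get the graceful message while the switch is on."""
    if flags[flag_id]["kill_switch"]:
        return UNAVAILABLE_MESSAGE
    return "ok"
```

Requiring an explicit `fix_verified` argument makes the "no re-enable without a fix" rule hard to bypass by accident.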
Tier-gated skills
Map skill IDs to minimum tiers in configuration:
{
  "efirm-conflict-check": "pro",
  "efirm-engagement-letter-draft": "pro",
  "efirm-fee-quote-builder": "pro",
  "efirm-client-update-email-draft": "basic",
  "efirm-deadline-tracker": "basic"
}
When a free-tier user attempts to invoke a pro-tier skill, the router returns a paywall response with an upgrade CTA, not an error.
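A tier check along these lines would produce the paywall response. The tier ordering (free < basic < pro < enterprise), the response fields, and the choice to leave unmapped skills ungated are assumptions made for this sketch. Only the skill-to-tier map comes from the configuration above.

```python
# Assumed ordering of tiers, lowest to highest.
TIER_RANK = {"free": 0, "basic": 1, "pro": 2, "enterprise": 3}

# Minimum tier per skill, copied from the configuration above.
MIN_TIER = {
    "efirm-conflict-check": "pro",
    "efirm-engagement-letter-draft": "pro",
    "efirm-fee-quote-builder": "pro",
    "efirm-client-update-email-draft": "basic",
    "efirm-deadline-tracker": "basic",
}


def check_tier(skill_id: str, user_tier: str) -> dict:
    """Return an 'allowed' or 'paywall' response for this skill and tier."""
    required = MIN_TIER.get(skill_id)
    # Assumption: a skill missing from the map is not tier-gated.
    if required is None or TIER_RANK[user_tier] >= TIER_RANK[required]:
        return {"status": "allowed"}
    # A paywall is a normal response with an upgrade CTA, not an error.
    return {
        "status": "paywall",
        "required_tier": required,
        "cta": f"Upgrade to the {required} tier to use this skill.",
    }
```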
Audit trail
All flag changes (create, update, kill_switch toggle) are recorded in the audit log ([[eng-audit-log-schema]]) with:
- Admin user ID who made the change
- Before and after state
- Timestamp
- Reason (free text)
Flag changes to P0 skills require a second admin approval before taking effect.
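The audit record and the two-person rule for P0 skills might look like the sketch below. The record field names and both function names are hypothetical. The real record layout is defined by [[eng-audit-log-schema]], which this example does not reproduce.

```python
import copy
from datetime import datetime, timezone


def propose_change(flag: dict, changes: dict, admin_id: str, reason: str,
                   priority: str) -> dict:
    """Build an audit record; apply the change at once unless the skill is P0."""
    record = {
        "admin_id": admin_id,
        "before": copy.deepcopy(flag),
        "after": {**copy.deepcopy(flag), **changes},
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "reason": reason,
        "approved_by": None,
        "applied": False,
    }
    if priority != "P0":
        flag.update(changes)
        record["applied"] = True
    return record


def approve(flag: dict, record: dict, approver_id: str) -> None:
    """Apply a pending P0 change once a second, different admin approves."""
    if approver_id == record["admin_id"]:
        raise PermissionError("second approval must come from a different admin")
    if not record["applied"]:
        flag.update(record["after"])
        record["approved_by"] = approver_id
        record["applied"] = True
```

Keeping the full before and after state in the record means a change can be reviewed, or reverted, without consulting the flag store's history.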
Related skills
- [[eng-fallback-model-cascade]]
- [[eng-langfuse-eval-runner]]
- [[eng-audit-log-schema]]
- [[eng-latency-slo-by-skill]]