Serverless Rate Limits and Fairness: Internal Handbook
Last updated: June 1, 2026
This handbook covers:
- how serverless limits work today
- how fairness and the regulator interact
- how to diagnose 429 vs 503
- how to communicate the Tier 1-5 retirement correctly
---
## 1) Current product
### Dynamic limits are the default
Serverless rate limits are dynamic:
- per organization
- per model
- based on recent successful usage plus live capacity conditions
This replaces the old customer-facing framing built on static tiers.
### Tier labels are not the support model anymore
Do not frame normal serverless troubleshooting as "upgrade to Tier X."
Correct framing:
- check live headers
- shape traffic (steady vs bursty)
- use Batch for asynchronous high-volume workloads
- use Dedicated when throughput must be guaranteed
---
## 2) Request outcomes: what they mean
### 429 Too Many Requests
The request or token pace exceeded the dynamic allowance. Typical payload types:
- dynamic_request_limited
- dynamic_token_limited
### 503 Service Unavailable
Can mean the request was within the dynamic allowance, but capacity could not serve it at that moment.
### Practical takeaway
429 and 503 are not interchangeable.
Inspect headers, payload type, and telemetry before deciding on a root cause.
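As a first-pass illustration of the distinction above, the sketch below labels a single response from its status code and payload type. It is a minimal sketch, not the production classifier: the function name `first_pass_label` is hypothetical, and how the payload type is pulled out of the error body is assumed to happen upstream. The label is only a starting point and does not replace the header and telemetry checks.

```python
from typing import Optional

# Payload types named in this handbook for dynamic 429 responses.
DYNAMIC_429_TYPES = {"dynamic_request_limited", "dynamic_token_limited"}


def first_pass_label(status: int, payload_type: Optional[str] = None) -> str:
    """Return a coarse triage label for one response (hypothetical helper)."""
    if status == 429:
        if payload_type in DYNAMIC_429_TYPES:
            return f"429: dynamic allowance exceeded ({payload_type})"
        # Unknown or missing type: do not guess, go read the body.
        return "429: unrecognized payload type, inspect body"
    if status == 503:
        # May be a capacity condition even when the org is under its allowance.
        return "503: possible capacity condition, not necessarily over allowance"
    if 200 <= status < 300:
        return "success"
    return f"other ({status}): not a rate-limit signal by itself"
```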
---
## 3) Rate-limit headers (source of truth)
Check these on every failing and successful sample:
- x-ratelimit-limit
- x-ratelimit-remaining
- x-ratelimit-reset
- x-tokenlimit-limit
- x-tokenlimit-remaining
- x-ratelimit-limit-dynamic
- x-ratelimit-remaining-dynamic
- x-tokenlimit-limit-dynamic
- x-tokenlimit-remaining-dynamic
Notes:
- Header sets can vary across requests/time.
- Do not debug from one stale screenshot.
- Use the exact headers from the request IDs in question.
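To compare a failing sample against a successful one, it can help to pull only the headers listed above and diff them. The following is a minimal sketch that assumes the response headers are already available as a plain mapping (for example, copied from a log line or an HTTP client); the helper names are hypothetical. Headers missing from a response are reported as `None`, since header sets can vary.

```python
from typing import Dict, Mapping, Optional, Tuple

RATE_LIMIT_HEADERS = (
    "x-ratelimit-limit",
    "x-ratelimit-remaining",
    "x-ratelimit-reset",
    "x-tokenlimit-limit",
    "x-tokenlimit-remaining",
    "x-ratelimit-limit-dynamic",
    "x-ratelimit-remaining-dynamic",
    "x-tokenlimit-limit-dynamic",
    "x-tokenlimit-remaining-dynamic",
)


def extract_rate_limit_headers(headers: Mapping[str, str]) -> Dict[str, Optional[str]]:
    """Pick the rate-limit headers out of a response, case-insensitively."""
    lowered = {key.lower(): value for key, value in headers.items()}
    return {name: lowered.get(name) for name in RATE_LIMIT_HEADERS}


def diff_rate_limit_headers(
    failing: Mapping[str, str], successful: Mapping[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return only the headers whose values differ: name -> (failing, successful)."""
    a = extract_rate_limit_headers(failing)
    b = extract_rate_limit_headers(successful)
    return {name: (a[name], b[name]) for name in RATE_LIMIT_HEADERS if a[name] != b[name]}
```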
---
## 4) Internal control layers (triage model)
Serverless behavior is best interpreted through three layers:
1. Static limiter (`enableRateLimiting`)
Often not the primary limiter for serverless.
2. Earned/dynamic limiter (`enableEarnedRateLimiting`)
Fairness-informed dynamic allowance.
3. Regulator
Queue/scheduling admission (`ALLOWED`, `DENIED`, `EXPIRED`) and timing behavior.
One ticket can show mixed effects across these layers.
---
## 5) Config parameters to audit during escalation
When analyzing model config in infra YAML, inspect:
### Fairness block
- fairness.enabled
- fairness.enforced
- fairness.utilization_threshold
- fairness.user_share_multiplier
### Regulator block
- regulator.enabled
- regulator.max_concurrency
- regulator.version
- regulator.in_queue_ttl
- regulator.score_adjustment_floor
### Org override block (critical)
- maxQpsUserOverrides[*].organizationId
- maxQpsUserOverrides[*].rate_limit_only
Risk pattern:
- an override entry without `rate_limit_only: true` can create priority behavior that bypasses intended fairness enforcement.
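The risk pattern above can be checked mechanically once the model config has been loaded. This is a minimal sketch under stated assumptions: the YAML has already been parsed into a dict, `maxQpsUserOverrides` sits at the top level of that dict (the real nesting may differ), and the function name is hypothetical.

```python
from typing import Any, Dict, List, Optional


def overrides_missing_rate_limit_only(model_config: Dict[str, Any]) -> List[Optional[str]]:
    """List organizationIds whose override lacks an explicit rate_limit_only: true.

    Assumes the override list is at the top level of the parsed config; adjust
    the lookup if the actual YAML nests it elsewhere.
    """
    flagged = []
    for entry in model_config.get("maxQpsUserOverrides") or []:
        # Treat anything other than a literal True (missing, false, "true") as a risk.
        if entry.get("rate_limit_only") is not True:
            flagged.append(entry.get("organizationId"))
    return flagged
```

An empty result means every override explicitly sets the flag; any returned ID is worth raising during escalation.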
---
## 6) Standard diagnostic flow (internal)
For any "rate limit / fairness / capacity" case:
1. Collect request IDs + UTC time window.
2. Compare headers from failing and successful requests.
3. Inspect 429 payload type.
4. Run check.py for path-level status/timing composition.
5. Run perf.py:
- model mode for regression and outliers
- org mode for fairness share and cross-org effects
6. Verify engine real-time utilization before claiming sustained overload.
7. Classify root cause:
- user pacing/burst pattern
- dynamic fairness pressure
- regulator queue behavior
- true platform capacity incident
- mixed
---
## 7) Live validation snapshot (May 11, 2026)
Model tested:
- Qwen/Qwen2.5-7B-Instruct-Turbo
Live API checks:
- successful chat calls returned x-ratelimit-* headers
- dynamic headers (`x-ratelimit-limit-dynamic`, `x-ratelimit-remaining-dynamic`) were observed
- a short 20-request burst returned 200 for every request (no 429 in that sample window)
Internal analysis runs:
- python3 check.py "Qwen/Qwen2.5-7B-Instruct-Turbo" --timespan "last 1h"
- python3 perf.py "Qwen/Qwen2.5-7B-Instruct-Turbo" --timespan "last 1h"
Artifacts:
- temp/serverless-owner/rate_limit_check_qwen2.5_1h.md
- temp/serverless-owner/rate_limit_perf_qwen2.5_1h.md
Observed highlights:
- high error share in that window was mostly 402 (billing/account state), not classic 429 saturation
- earned-limit markers existed in telemetry while model/user hard-limit flags were zero
- per-pod outliers existed without broad regulator-denied pattern
Interpretation:
- this is a strong example of why "high errors == rate limit issue" is often wrong.
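One way to avoid that mistake is to look at the status-code composition before labeling a window as rate limited. The sketch below assumes a flat list of HTTP status codes exported from telemetry for the window in question; it does not reproduce the output format of `check.py`, and both helper names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def status_shares(status_codes: Iterable[int]) -> Dict[int, float]:
    """Return each status code's share of all requests in the window."""
    counts = Counter(status_codes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {code: n / total for code, n in counts.most_common()}


def dominant_error(status_codes: Iterable[int]) -> Optional[int]:
    """Return the most frequent status code >= 400, or None if there are no errors."""
    errors = [code for code in status_codes if code >= 400]
    if not errors:
        return None
    return Counter(errors).most_common(1)[0][0]
```

If the dominant error is something other than 429 (for example 402), the window is not primarily a rate-limit story, even when the overall error share is high.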
---
## 8) Communication guidance (internal support)
When customers ask "what tier am I on?" or "how do I get Tier 5?":
Use:
- limits are dynamic per model/org
- sustained successful usage increases effective headroom
- bursty traffic is more likely to hit throttling
- guaranteed throughput requires the dedicated capacity path
Avoid:
- promising tier upgrades as the default answer
- quoting fixed universal RPM/TPM without current headers
Suggested short internal-safe wording:
"Serverless now uses dynamic per-model rate limits rather than fixed customer-facing tiers. The most accurate limits are in your API response headers. If you need guaranteed throughput, we can route you to dedicated capacity options."
---
## 9) Common mistakes to avoid
- treating all 503 as user over-limit
- assuming one header snapshot is globally valid
- skipping 429 payload subtype checks
- claiming sustained overload without utilization verification
- ignoring fairness override config during escalation
---
## 10) References