Serverless Rate Limits and Fairness: Internal Handbook

Last updated: June 1, 2026

This handbook covers:

- how serverless limits work today

- how fairness and the regulator interact

- how to diagnose 429 vs 503

- how to communicate the Tier 1-5 retirement correctly

---

## 1) Current product

Dynamic limits are the default. Serverless rate limits are computed:

- per organization

- per model

- based on recent successful usage plus live capacity conditions

This replaces the old customer-facing framing of static tiers (Tier 1-5).

### Tier labels are no longer the support model

Do not frame normal serverless troubleshooting as "upgrade to Tier X."

Correct framing:

- check live headers

- shape traffic (steady vs bursty)

- use Batch for async high-volume

- use Dedicated for guaranteed throughput

---

## 2) Request outcomes: what they mean

### 429 Too Many Requests

The request or token pace exceeded the dynamic allowance. Typical payload types:

- dynamic_request_limited

- dynamic_token_limited

### 503 Service Unavailable

Can mean the request was within the dynamic allowance, but capacity could not serve it at that moment.

### Practical takeaway

429 and 503 are not interchangeable.

Inspect headers, payload type, and telemetry before deciding on a root cause.
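The distinction above can be sketched as a first-pass labeling helper. This is a minimal illustration, not an internal tool: the function name `first_pass_label` and the label strings are hypothetical, and only the two payload types listed in this section are treated as known. The result is a starting hypothesis; headers and telemetry still decide the root cause.

```python
from typing import Optional

# Payload types named in this handbook; anything else is treated as unknown.
KNOWN_429_TYPES = {"dynamic_request_limited", "dynamic_token_limited"}


def first_pass_label(status: int, payload_type: Optional[str] = None) -> str:
    """Return a first-pass triage label (hypothetical helper, not a verdict)."""
    if status == 429:
        if payload_type in KNOWN_429_TYPES:
            return f"over dynamic allowance ({payload_type})"
        return "429 with unrecognized payload type: inspect payload"
    if status == 503:
        # A 503 does not by itself show that the caller exceeded its allowance.
        return "possible capacity shortfall: check telemetry"
    if 200 <= status < 300:
        return "success"
    return f"other status {status}: not a rate-limit signal by itself"
```

For example, `first_pass_label(503)` never returns an "over allowance" label, which keeps a 503 from being reported to the customer as a pacing problem before telemetry is checked.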

---

## 3) Rate-limit headers (source of truth)

Check these on every failing and successful sample:

- x-ratelimit-limit

- x-ratelimit-remaining

- x-ratelimit-reset

- x-tokenlimit-limit

- x-tokenlimit-remaining

- x-ratelimit-limit-dynamic

- x-ratelimit-remaining-dynamic

- x-tokenlimit-limit-dynamic

- x-tokenlimit-remaining-dynamic

Notes:

- Header sets can vary across requests/time.

- Do not debug from one stale screenshot.

- Use the exact headers from the request IDs in question.
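As a rough sketch of comparing headers between a failing and a successful request, the helper below pulls the headers listed above out of two response header mappings and reports only the ones that differ. The function names are hypothetical, and the header values in the usage note are made up for illustration. A header missing from a response is reported as `None`, since header sets can vary.

```python
from typing import Dict, Mapping, Optional, Tuple

# Header names as listed in this section.
RATE_LIMIT_HEADERS = (
    "x-ratelimit-limit",
    "x-ratelimit-remaining",
    "x-ratelimit-reset",
    "x-tokenlimit-limit",
    "x-tokenlimit-remaining",
    "x-ratelimit-limit-dynamic",
    "x-ratelimit-remaining-dynamic",
    "x-tokenlimit-limit-dynamic",
    "x-tokenlimit-remaining-dynamic",
)


def extract_rate_limit_headers(headers: Mapping[str, str]) -> Dict[str, Optional[str]]:
    """Return the rate-limit headers, matching names case-insensitively."""
    lowered = {key.lower(): value for key, value in headers.items()}
    return {name: lowered.get(name) for name in RATE_LIMIT_HEADERS}


def diff_rate_limit_headers(
    failing: Mapping[str, str], successful: Mapping[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {header: (failing_value, successful_value)} for differing headers."""
    a = extract_rate_limit_headers(failing)
    b = extract_rate_limit_headers(successful)
    return {name: (a[name], b[name]) for name in RATE_LIMIT_HEADERS if a[name] != b[name]}
```

Calling `diff_rate_limit_headers({"X-RateLimit-Remaining": "0"}, {"x-ratelimit-remaining": "7"})` returns only the `x-ratelimit-remaining` entry, which makes it easier to see what changed between the two request IDs.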

---

## 4) Internal control layers (triage model)

Serverless behavior is best interpreted through three layers:

1. Static limiter (`enableRateLimiting`)

Often not the primary limiter for serverless.

2. Earned/dynamic limiter (`enableEarnedRateLimiting`)

Fairness-informed dynamic allowance.

3. Regulator

Queue/scheduling admission (`ALLOWED`, `DENIED`, `EXPIRED`) and timing behavior.

One ticket can show mixed effects across these layers.

---

## 5) Config parameters to audit during escalation

When analyzing model config in infra YAML, inspect:

### Fairness block

- fairness.enabled

- fairness.enforced

- fairness.utilization_threshold

- fairness.user_share_multiplier

### Regulator block

- regulator.enabled

- regulator.max_concurrency

- regulator.version

- regulator.in_queue_ttl

- regulator.score_adjustment_floor

### Org override block (critical)

- maxQpsUserOverrides[*].organizationId

- maxQpsUserOverrides[*].rate_limit_only

Risk pattern:

- an override entry without `rate_limit_only: true` can create priority behavior that bypasses intended fairness enforcement.
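The risk pattern above can be checked mechanically. The sketch below assumes the model config has already been parsed from YAML into a Python dict and that `maxQpsUserOverrides` is a top-level list of entries; the real nesting in the infra YAML may differ, and the function name is hypothetical. It flags every override whose `rate_limit_only` is missing or not `true`.

```python
from typing import Any, Dict, List


def overrides_missing_rate_limit_only(config: Dict[str, Any]) -> List[Any]:
    """Return organizationId values for overrides lacking rate_limit_only: true.

    Assumes `maxQpsUserOverrides` is a top-level list in the parsed config;
    adjust the lookup if the real YAML nests it elsewhere.
    """
    flagged = []
    for entry in config.get("maxQpsUserOverrides") or []:
        # Treat a missing key and an explicit false the same way: both need review.
        if entry.get("rate_limit_only") is not True:
            flagged.append(entry.get("organizationId"))
    return flagged
```

An empty result means no override matched the risk pattern under this assumed layout; it does not rule out other fairness or regulator misconfiguration.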

---

## 6) Standard diagnostic flow (internal)

For any "rate limit / fairness / capacity" case:

1. Collect request IDs + UTC time window.

2. Compare headers from failing and successful requests.

3. Inspect 429 payload type.

4. Run check.py for path-level status/timing composition.

5. Run perf.py:

- model mode for regression and outliers

- org mode for fairness share and cross-org effects

6. Verify engine real-time utilization before claiming sustained overload.

7. Classify root cause:

- user pacing/burst pattern

- dynamic fairness pressure

- regulator queue behavior

- true platform capacity incident

- mixed

---

## 7) Live validation snapshot (May 11, 2026)

Model tested:

- Qwen/Qwen2.5-7B-Instruct-Turbo

Live API checks:

- successful chat calls returned x-ratelimit-* headers

- dynamic headers (`x-ratelimit-limit-dynamic`, `x-ratelimit-remaining-dynamic`) were observed

- a short 20-request burst returned all 200 (no 429 in that sample window)

Internal analysis runs:

- python3 check.py "Qwen/Qwen2.5-7B-Instruct-Turbo" --timespan "last 1h"

- python3 perf.py "Qwen/Qwen2.5-7B-Instruct-Turbo" --timespan "last 1h"

Artifacts:

- temp/serverless-owner/rate_limit_check_qwen2.5_1h.md

- temp/serverless-owner/rate_limit_perf_qwen2.5_1h.md

Observed highlights:

- high error share in that window was mostly 402 (billing/account state), not classic 429 saturation

- earned-limit markers existed in telemetry while model/user hard-limit flags were zero

- per-pod outliers existed without broad regulator-denied pattern

Interpretation:

- this is a strong example of why "high errors == rate limit issue" is often wrong.
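To make the point concrete, the sketch below breaks the errors in a sample of status codes down by code, so a window dominated by 402 is not mistaken for 429 saturation. It is an illustration only: the function name is hypothetical, it does not reproduce what `check.py` or `perf.py` compute, and the numbers in the usage note are invented rather than taken from the May 11 run.

```python
from collections import Counter
from typing import Dict, Iterable


def error_composition(statuses: Iterable[int]) -> Dict[int, float]:
    """Return each error status code's share of all errors (codes >= 400)."""
    errors = [status for status in statuses if status >= 400]
    if not errors:
        return {}
    counts = Counter(errors)
    return {code: count / len(errors) for code, count in counts.items()}
```

With an invented sample of six 200s, three 402s, and one 429, `error_composition` returns `{402: 0.75, 429: 0.25}`: the error rate is high, but most of it is billing/account state rather than throttling.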

---

## 8) Communication guidance (internal support)

When customers ask "what tier am I on?" or "how do I get Tier 5?":

Use:

- limits are dynamic per model/org

- sustained successful usage increases effective headroom

- bursty traffic is more likely to hit throttling

- guaranteed throughput requires the dedicated capacity path

Avoid:

- promising tier upgrades as default

- quoting fixed universal RPM/TPM without current headers

Suggested short internal-safe wording:

"Serverless now uses dynamic per-model rate limits rather than fixed customer-facing tiers. The most accurate limits are in your API response headers. If you need guaranteed throughput, we can route you to dedicated capacity options."

---

## 9) Common mistakes to avoid

- treating every 503 as the user being over limit

- assuming one header snapshot is globally valid

- skipping 429 payload subtype checks

- claiming sustained overload without utilization verification

- ignoring fairness override config during escalation

---

## 10) References