Serverless Rate Limits and Fairness: Internal Handbook

Last updated: June 1, 2026

This handbook covers:

- how serverless limits work today

- how fairness and the regulator interact

- how to diagnose 429 vs 503

- how to communicate the Tier 1-5 retirement correctly

---

## 1) Current product

Dynamic limits are the default. Serverless rate limits are computed:

- per organization

- per model

- based on recent successful usage plus live capacity conditions

This replaces the old customer-facing static tier framing. Tier labels are no longer the support model, so do not frame normal serverless troubleshooting as "upgrade to Tier X."

Correct framing:

- check live headers

- shape traffic (prefer steady over bursty)

- use Batch for async high-volume

- use Dedicated for guaranteed throughput
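To make "shape traffic" concrete, here is a minimal client-side pacing sketch. It spreads a batch of requests evenly over time instead of sending them in one burst. The function name and `target_rps` are hypothetical; `target_rps` is a value the client chooses, not a published limit.

```python
from typing import List


def paced_offsets(n_requests: int, target_rps: float) -> List[float]:
    """Return send-time offsets in seconds that space requests evenly.

    target_rps is a client-chosen pace (an assumption for illustration),
    not a limit taken from the platform.
    """
    if target_rps <= 0:
        raise ValueError("target_rps must be positive")
    interval = 1.0 / target_rps
    return [i * interval for i in range(n_requests)]


# Example: 4 requests at 2 requests/second are sent 0.5 s apart.
offsets = paced_offsets(4, 2.0)
```

A caller would sleep until each offset before sending the corresponding request, then adjust the pace based on the live headers.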

---

## 2) Request outcomes: what they mean

### 429 Too Many Requests

The request or token pace exceeded the dynamic allowance. Typical payload types:

- dynamic_request_limited

- dynamic_token_limited

### 503 Service Unavailable

Can mean the request was within the dynamic allowance, but capacity could not serve it at that moment.

### Practical takeaway

429 and 503 are not interchangeable.

Inspect headers, payload type, and telemetry before deciding on a root cause.
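The distinction can be sketched as a first-pass labeling helper. This is an illustration only: the function name and label strings are hypothetical, and the result is a starting point for triage, not a root cause. The two payload types are the ones listed in this section.

```python
from typing import Optional


def first_pass_triage(status: int, payload_type: Optional[str] = None) -> str:
    """Return a first-pass label for a response; not a root-cause verdict."""
    if status == 429:
        if payload_type == "dynamic_request_limited":
            return "429: request pace exceeded dynamic allowance"
        if payload_type == "dynamic_token_limited":
            return "429: token pace exceeded dynamic allowance"
        # Any other payload type needs manual inspection.
        return "429: unrecognized payload type, inspect payload"
    if status == 503:
        # A 503 does not by itself imply the caller was over the allowance.
        return "503: possible capacity issue, check telemetry before concluding"
    return f"{status}: outside 429/503 triage"
```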

---

## 3) Rate-limit headers (source of truth)

Check these on every failing and successful sample:

- x-ratelimit-limit

- x-ratelimit-remaining

- x-ratelimit-reset

- x-tokenlimit-limit

- x-tokenlimit-remaining

- x-ratelimit-limit-dynamic

- x-ratelimit-remaining-dynamic

- x-tokenlimit-limit-dynamic

- x-tokenlimit-remaining-dynamic

Notes:

- Header sets can vary across requests/time.

- Do not debug from one stale screenshot.

- Use the exact headers from the request IDs in question.
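As a sketch of how to compare headers between a failing and a successful request, the helpers below pull out whichever of the listed headers are present and report the ones that differ. The function names are hypothetical, and the input is assumed to be a plain mapping of header name to value. Because header sets can vary, missing headers are tolerated rather than treated as errors.

```python
from typing import Dict, Mapping, Optional, Tuple

# Header names as listed in this section.
RATE_LIMIT_HEADERS = (
    "x-ratelimit-limit",
    "x-ratelimit-remaining",
    "x-ratelimit-reset",
    "x-tokenlimit-limit",
    "x-tokenlimit-remaining",
    "x-ratelimit-limit-dynamic",
    "x-ratelimit-remaining-dynamic",
    "x-tokenlimit-limit-dynamic",
    "x-tokenlimit-remaining-dynamic",
)


def rate_limit_snapshot(headers: Mapping[str, str]) -> Dict[str, str]:
    """Keep only the rate-limit headers that are present (case-insensitive)."""
    lowered = {name.lower(): value for name, value in headers.items()}
    return {name: lowered[name] for name in RATE_LIMIT_HEADERS if name in lowered}


def diff_snapshots(
    failing: Mapping[str, str], successful: Mapping[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Map each differing header to its (failing, successful) values."""
    names = sorted(set(failing) | set(successful))
    return {
        name: (failing.get(name), successful.get(name))
        for name in names
        if failing.get(name) != successful.get(name)
    }
```

A value of `None` in the diff means the header was absent from that response, which is worth noting in the ticket.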

---

## 4) Internal control layers (triage model)

Serverless behavior is best interpreted through three layers:

1. Static limiter (`enableRateLimiting`)

Often not the primary limiter for serverless.

2. Earned/dynamic limiter (`enableEarnedRateLimiting`)

Fairness-informed dynamic allowance.

3. Regulator

Queue/scheduling admission (`ALLOWED`, `DENIED`, `EXPIRED`) and timing behavior.

One ticket can show mixed effects across these layers.

---

## 5) Config parameters to audit during escalation

When analyzing model config in infra YAML, inspect:

### Fairness block

- fairness.enabled

- fairness.enforced

- fairness.utilization_threshold

- fairness.user_share_multiplier

### Regulator block

- regulator.enabled

- regulator.max_concurrency

- regulator.version

- regulator.in_queue_ttl

- regulator.score_adjustment_floor

### Org override block (critical)

- maxQpsUserOverrides[*].organizationId

- maxQpsUserOverrides[*].rate_limit_only

Risk pattern:

- an override entry without rate_limit_only: true can create priority behavior that bypasses intended fairness enforcement.
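The override risk pattern lends itself to a mechanical check. The sketch below assumes the model config has already been parsed from YAML into a dict, and that `maxQpsUserOverrides` is a top-level list of entries with `organizationId` and `rate_limit_only` keys. That nesting is an assumption for illustration; adjust the lookup to the real config layout.

```python
from typing import Any, List, Mapping


def overrides_missing_rate_limit_only(config: Mapping[str, Any]) -> List[str]:
    """Return organizationIds whose override is not rate_limit_only: true.

    Assumes config["maxQpsUserOverrides"] is a list of dicts (hypothetical
    layout); entries with the flag absent or false are both flagged.
    """
    flagged = []
    for override in config.get("maxQpsUserOverrides") or []:
        if override.get("rate_limit_only") is not True:
            flagged.append(str(override.get("organizationId", "<missing organizationId>")))
    return flagged
```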

---

## 6) Standard diagnostic flow (internal)

For any "rate limit / fairness / capacity" case:

1. Collect request IDs + UTC time window.

2. Compare headers from failing and successful requests.

3. Inspect 429 payload type.

4. Run check.py for path-level status/timing composition.

5. Run perf.py:

   - model mode for regression and outliers

   - org mode for fairness share and cross-org effects

6. Verify engine real-time utilization before claiming sustained overload.

7. Classify root cause:

   - user pacing/burst pattern

   - dynamic fairness pressure

   - regulator queue behavior

   - true platform capacity incident

   - mixed

---

## 7) Live validation snapshot (May 11, 2026)

Model tested:

- Qwen/Qwen2.5-7B-Instruct-Turbo

Live API checks:

- successful chat calls returned x-ratelimit-* headers

- dynamic headers (`x-ratelimit-limit-dynamic`, `x-ratelimit-remaining-dynamic`) were observed

- a short 20-request burst returned 200 for every request (no 429 in that sample window)

Internal analysis runs:

- python3 check.py "Qwen/Qwen2.5-7B-Instruct-Turbo" --timespan "last 1h"

- python3 perf.py "Qwen/Qwen2.5-7B-Instruct-Turbo" --timespan "last 1h"

Artifacts:

- temp/serverless-owner/rate_limit_check_qwen2.5_1h.md

- temp/serverless-owner/rate_limit_perf_qwen2.5_1h.md

Observed highlights:

- high error share in that window was mostly 402 (billing/account state), not classic 429 saturation

- earned-limit markers existed in telemetry while model/user hard-limit flags were zero

- per-pod outliers existed without broad regulator-denied pattern

Interpretation:

- this is a strong example of why "high errors == rate limit issue" is often wrong.
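A quick way to check the "high errors == rate limit issue" assumption is to break the error share down by status code before drawing conclusions. The sketch below is illustrative: the function name is hypothetical and the sample statuses are made up, not taken from the May 11 artifacts.

```python
from collections import Counter
from typing import Dict, Iterable


def error_share_by_status(statuses: Iterable[int]) -> Dict[int, float]:
    """Return each non-2xx status code's share of all error responses."""
    errors = Counter(s for s in statuses if not 200 <= s < 300)
    total = sum(errors.values())
    if total == 0:
        return {}
    return {code: count / total for code, count in errors.items()}


# Made-up sample: most errors are 402, so rate limiting is not the main story.
shares = error_share_by_status([200, 402, 402, 402, 429])
```

If 429 is a small share of errors, look at billing/account state or per-pod outliers before escalating as a rate-limit case.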

---

## 8) Communication guidance (internal support)

When customers ask "what tier am I on?" or "how do I get Tier 5?":

Use:

- limits are dynamic per model/org

- sustained successful usage increases effective headroom

- bursty traffic is more likely to hit throttling

- guaranteed throughput requires the dedicated capacity path

Avoid:

- promising tier upgrades as the default answer

- quoting fixed universal RPM/TPM without current headers

Suggested short internal-safe wording:

"Serverless now uses dynamic per-model rate limits rather than fixed customer-facing tiers. The most accurate limits are in your API response headers. If you need guaranteed throughput, we can route you to dedicated capacity options."

---

## 9) Common mistakes to avoid

- treating every 503 as a user over-limit

- assuming one header snapshot is globally valid

- skipping 429 payload subtype checks

- claiming sustained overload without utilization verification

- ignoring fairness override config during escalation

---

## 10) References