Why am I getting 429 on Serverless, and how long does it last?

Last updated: May 11, 2026

If you see HTTP 429, your request or token rate is temporarily above the model's effective limit for your account at that moment.

What 429 means

- 429 = rate limiting (too many requests/tokens in the current window)

- It is different from:

  - 503 = service/model temporarily unavailable

  - 500 = internal server error

  - 402 = billing/account state issue

There is no single fixed cooldown for all serverless 429s.

Recovery time depends on:

- how bursty traffic is

- how far traffic is above the current limit

- model demand/capacity at that time

- whether retries use proper backoff

In many cases, 429s clear quickly once traffic is smoothed and retries are spaced out.
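
One way to smooth traffic is to pace requests on the client side. The sketch below is a minimal illustration, not an official client feature: `Pacer` and its `max_per_second` parameter are hypothetical names, and the rate you pick should come from the limits you actually observe for your account.

```python
import time


class Pacer:
    """Spaces calls at least 1 / max_per_second seconds apart (hypothetical helper)."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        if max_per_second <= 0:
            raise ValueError("max_per_second must be positive")
        self.min_interval = 1.0 / max_per_second
        self._clock = clock    # injectable so the pacer can be tested without waiting
        self._sleep = sleep
        self._next_slot = None  # earliest time the next request may start

    def wait(self):
        """Block until the next slot is free and return the seconds slept."""
        now = self._clock()
        if self._next_slot is None or now >= self._next_slot:
            delay = 0.0
            start = now
        else:
            delay = self._next_slot - now
            start = self._next_slot
            self._sleep(delay)
        # Reserve the following slot one interval after this request starts.
        self._next_slot = start + self.min_interval
        return delay
```

Call `pacer.wait()` immediately before each request. This version is single-threaded; if several threads or workers share one limit, the slot bookkeeping would need a lock or a shared limiter.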

Use response headers as the source of truth for each request:

- x-ratelimit-*

- x-tokenlimit-*

- dynamic variants when present

These values can change over time based on usage and capacity.
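
To inspect these headers in code, you can filter a response's headers by prefix. This is a small sketch: `rate_limit_headers` is a hypothetical helper, and it assumes your HTTP client exposes response headers as a dict-like mapping of names to values. It matches only the two prefixes listed above; if dynamic variants use a different prefix on your responses, add that prefix to the tuple.

```python
def rate_limit_headers(headers):
    """Return the x-ratelimit-* and x-tokenlimit-* headers with lowercased names."""
    prefixes = ("x-ratelimit-", "x-tokenlimit-")
    return {
        name.lower(): value
        for name, value in headers.items()
        if name.lower().startswith(prefixes)  # header names are case-insensitive
    }
```

Logging the result for both successful and 429 responses shows how the limits move over time.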

1. Smooth request bursts (avoid sudden concurrency spikes).

2. Use exponential backoff + jitter on retries.

3. Respect the Retry-After header when present.

4. Monitor x-ratelimit-* and x-tokenlimit-* on both success and 429 responses.

5. For steady, guaranteed throughput, consider Dedicated Endpoints.
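
Steps 2 and 3 can be combined into one delay calculation. The sketch below is illustrative only: `retry_delay` is a hypothetical helper, and the 1-second base and 60-second cap are assumed defaults rather than recommended values. It uses the "full jitter" variant of exponential backoff and prefers Retry-After when the header carries a number of seconds.

```python
import random


def retry_delay(attempt, retry_after=None, base=1.0, cap=60.0, rng=random.random):
    """Seconds to wait before retry number `attempt` (0-based) after a 429."""
    if retry_after is not None:
        try:
            # Retry-After given as a number of seconds: use the server's value.
            return max(0.0, float(retry_after))
        except (TypeError, ValueError):
            pass  # not numeric (for example an HTTP date); fall back to backoff
    # Exponential growth, capped: base, 2*base, 4*base, ... up to cap.
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: a random fraction of the ceiling (rng() is in [0, 1)).
    return rng() * ceiling
```

In a retry loop, sleep for `retry_delay(attempt, response.headers.get("Retry-After"))` before resending, and stop after a bounded number of attempts so a sustained 429 does not retry forever.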
