GPT, embeddings, and image generation APIs
Why OpenAI returns 429
OpenAI enforces strict upstream quotas. When your integration exceeds permitted throughput, the API responds with HTTP 429 and rate-limit headers that signal how long to wait before retrying.
- Burst traffic from batch inference jobs exhausts token limits mid-request.
- Parallel agent loops hammer chat/completions without backoff.
- Missing retry-after handling causes cascading 429 storms.
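The failure modes above share one remedy: wait before retrying, and prefer the server's own hint when it provides one. The following is a minimal sketch of that idea, not an official client. `send` is a hypothetical callable that returns `(status, headers, body)` with lower-cased header names, and the base delay, cap, and attempt count are illustrative assumptions.

```python
import random
import time


def retry_delay(headers, attempt, base=1.0, cap=60.0):
    """Pick a wait in seconds: the server's retry-after if usable, else backoff."""
    raw = headers.get("retry-after")
    if raw is not None:
        try:
            return max(0.0, float(raw))
        except ValueError:
            pass  # not a plain number of seconds; fall back to backoff
    # Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)].
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retry(send, max_attempts=5, sleep=time.sleep):
    """Call send() until it returns a non-429 status or attempts run out.

    `send` is a hypothetical callable returning (status, headers, body).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            sleep(retry_delay(headers, attempt))
    # Still throttled after the final attempt; hand the 429 back to the caller.
    return status, body
```

The jitter matters for parallel workers: if every worker sleeps for the same interval, they all retry at the same instant and trigger the next round of 429s together.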
OpenAI rate-limit headers
Monitor these response headers to detect quota exhaustion before users see failures. Official reference: OpenAI rate limit documentation.
| Header | Meaning |
|---|---|
| x-ratelimit-limit-requests | Maximum number of requests permitted in the current rate-limit window. |
| x-ratelimit-remaining-requests | Number of requests remaining before the limit is enforced. |
| x-ratelimit-reset-requests | Time until the request quota resets (typically expressed as a duration string). |
| x-ratelimit-limit-tokens | Maximum tokens allowed in the current window for token-based models. |
| x-ratelimit-remaining-tokens | Remaining token budget before throttling occurs. |
| retry-after | Seconds to wait before retrying after a 429 response. |
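As one way to act on these headers, the sketch below turns them into a small snapshot and flags when either budget is nearly spent. It assumes the reset value is a duration string built from `h`, `m`, `s`, and `ms` parts (for example `6m0s`); the function names and the 10% threshold are hypothetical choices for illustration.

```python
import re

_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}
_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|h|m|s)")
_WHOLE = re.compile(r"(?:\d+(?:\.\d+)?(?:ms|h|m|s))+")


def parse_duration(text):
    """Convert a duration string such as '6m0s' or '20ms' to seconds."""
    text = text.strip()
    if not _WHOLE.fullmatch(text):
        raise ValueError(f"unrecognized duration: {text!r}")
    return sum(float(n) * _UNIT_SECONDS[u] for n, u in _PART.findall(text))


def quota_snapshot(headers):
    """Summarize request and token budget from response headers."""
    h = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive

    def as_int(name):
        value = h.get(name)
        return int(value) if value is not None else None

    reset = h.get("x-ratelimit-reset-requests")
    return {
        "requests_limit": as_int("x-ratelimit-limit-requests"),
        "requests_remaining": as_int("x-ratelimit-remaining-requests"),
        "tokens_limit": as_int("x-ratelimit-limit-tokens"),
        "tokens_remaining": as_int("x-ratelimit-remaining-tokens"),
        "requests_reset_seconds": parse_duration(reset) if reset else None,
    }


def is_near_limit(snapshot, threshold=0.1):
    """True when remaining requests or tokens fall below `threshold` of the limit."""
    for kind in ("requests", "tokens"):
        remaining = snapshot[f"{kind}_remaining"]
        limit = snapshot[f"{kind}_limit"]
        if remaining is not None and limit and remaining / limit < threshold:
            return True
    return False
```

Checking `is_near_limit` after each successful response lets a worker slow down while requests still succeed, rather than discovering the limit through a 429.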
Default OpenAI limits
- Requests per minute: 500
- Requests per day: 10,000
- Throttle status code: 429
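To make the per-minute figure concrete: 500 requests per minute, spaced evenly, is one request every 60 / 500 = 0.12 seconds. The sketch below paces calls on the client side under that assumption. `Pacer` is a hypothetical helper, not part of any OpenAI SDK or of ThrottleProxy, and it ignores the daily cap and token limits.

```python
import time


class Pacer:
    """Spaces calls evenly so that at most `rpm` of them start per minute."""

    def __init__(self, rpm, clock=time.monotonic, sleep=time.sleep):
        if rpm <= 0:
            raise ValueError("rpm must be positive")
        self.interval = 60.0 / rpm
        self._clock = clock
        self._sleep = sleep
        self._next_at = float("-inf")  # no call reserved yet

    def wait(self):
        """Block until the next slot is free; return the seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_at - now)
        if delay:
            self._sleep(delay)
        # Reserve the slot after this one, measured from whichever is later.
        self._next_at = max(now, self._next_at) + self.interval
        return delay
```

Calling `pacer.wait()` before each request keeps a single process under the ceiling. Several processes sharing one API key would each need a fraction of the budget or a shared limiter.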
Drop-in fix with ThrottleProxy
Route OpenAI traffic through proxy.throttleproxy.com to enforce workspace-level RPM ceilings, queue bursty workloads, and absorb retries before they hit your upstream quota.
curl "https://proxy.throttleproxy.com/proxy?target=https://api.example.com/v1/resource" \
  -H "Authorization: Bearer YOUR_API_KEY"
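For callers working from Python rather than the shell, the same request can be sketched with the standard library. This mirrors only what the curl command shows: the `/proxy` endpoint, a `target` query parameter, and a bearer token. The helper name is hypothetical, and percent-encoding the target URL is an assumption about what the proxy accepts, since the curl example passes it unencoded.

```python
import urllib.parse
import urllib.request

# Endpoint taken from the curl command above.
PROXY_ENDPOINT = "https://proxy.throttleproxy.com/proxy"


def build_proxy_request(target_url, api_key):
    """Build a GET request that asks the proxy to forward to `target_url`."""
    # Assumption: the proxy decodes a percent-encoded `target` parameter.
    query = urllib.parse.urlencode({"target": target_url})
    return urllib.request.Request(
        f"{PROXY_ENDPOINT}?{query}",
        headers={"Authorization": f"Bearer {api_key}"},
    )


# Placeholder target and key, as in the curl command.
request = build_proxy_request("https://api.example.com/v1/resource", "YOUR_API_KEY")
# To send it: urllib.request.urlopen(request, timeout=30)
```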
ThrottleProxy queues eligible bursts, enforces workspace RPM ceilings, and surfaces privacy-safe request activity so teams can understand pressure before increasing traffic to OpenAI.