Your Queue Is Not a Rate Limiter: The Architecture Multi-Tenant Integrations Actually Need

A shared worker queue feels sufficient until one tenant floods it, retries pile up, and a vendor starts returning 429s. The fix is a scheduler that owns per-tenant budgets, retries, and idempotency instead of hoping the queue will do policy for you.

Your queue is doing the wrong job

A very normal SaaS mistake goes like this:

You have a multi-tenant app. Customers connect Salesforce, HubSpot, Stripe, Slack, whatever. An API request creates some sync work, you push jobs into a shared queue, and a fleet of workers calls the vendor API.

It looks clean. It is also the start of a very predictable incident.

One tenant imports 400k records. Another has a bad config that makes every call return 429. Retries start stacking. Workers stay busy on the loud tenant. Quiet tenants, who did nothing wrong, suddenly get "sync delayed" banners and support tickets.

The mistake is subtle: you treated the queue as both a backlog buffer and the policy engine for fairness, rate limits, and retries.

Queues are good at holding work. They are not automatically good at deciding which tenant should spend the next unit of downstream API budget.

The failure mode shows up late

This architecture usually survives staging and small traffic just fine.

It fails when volume becomes uneven.

What actually goes wrong

A shared queue hides tenant boundaries, so the noisiest tenant tends to dominate worker time.
Retries compete with first attempts for the same worker pool.
Vendor throttling is rarely global. It is often scoped by account, app installation, region, or endpoint. Your workers usually do not model that scope explicitly.
A queue-level rate limit helps cap total dispatch, but it does not tell you which tenant should consume the next token.
If your outbound call is not idempotent, retried jobs can duplicate side effects.
Observability gets blurry because you only see queue depth and worker CPU, not "tenant A has consumed 90% of the Stripe budget for connected account X."

This is why newer queue features are helpful but not magic. Amazon SQS fair queues explicitly target noisy-neighbor dwell time in multi-tenant queues, which is good. But even AWS documents that fair queues do not enforce per-tenant consumption limits. They improve fairness of delivery, not your business-specific rate-limit policy.

Similarly, Cloud Tasks queue rate limits cap dispatch for a queue, including retries. Useful, but still queue-scoped. If tenant 12 should get 2 requests per second and tenant 847 should get 20, the queue does not know that unless you model it.

The better architecture: separate policy from execution

The fix is boring, which is why it works.

Keep the queue, but demote it. The queue is transportation. A scheduler owns policy.

The shape I prefer

1. The request path records intent in your database.
2. An outbox or event stream publishes "sync requested" or "invoice export requested."
3. A scheduler groups pending work by the real throttling key.
4. The scheduler grants execution leases only when a tenant or integration budget allows it.
5. Workers do one small unit of work, report the result, and never invent retry policy on their own.
6. Long waits, cooldowns, and multi-step retries live in durable execution, not in in-memory timers inside workers.
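Steps 3 to 5 can be sketched roughly as follows. This is a minimal in-memory illustration, not a production design: the names `Job`, `Lease`, and `LeaseScheduler` are hypothetical, the budget check is stubbed as an injected callable, and a real scheduler would keep this state in a durable store.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict, List


@dataclass(frozen=True)
class Job:
    job_id: str
    throttle_key: str  # hypothetical format, e.g. "tenant-12:stripe"


@dataclass(frozen=True)
class Lease:
    job: Job
    expires_at: float


class LeaseScheduler:
    """Groups pending jobs by throttling key and grants leases when budget allows."""

    def __init__(self, budget_allows: Callable[[str, float], bool],
                 lease_seconds: float = 30.0) -> None:
        self._budget_allows = budget_allows  # assumed policy hook: (key, now) -> bool
        self._lease_seconds = lease_seconds
        self._pending: Dict[str, Deque[Job]] = defaultdict(deque)
        self._leases: Dict[str, Lease] = {}

    def submit(self, job: Job) -> None:
        self._pending[job.throttle_key].append(job)

    def grant(self, now: float) -> List[Lease]:
        """Grant at most one lease per key, and only for keys with budget."""
        granted = []
        for key, queue in self._pending.items():
            if queue and self._budget_allows(key, now):
                lease = Lease(queue.popleft(), now + self._lease_seconds)
                self._leases[lease.job.job_id] = lease
                granted.append(lease)
        return granted

    def report(self, job_id: str, ok: bool) -> None:
        """Workers only report outcomes; requeueing is the scheduler's decision."""
        lease = self._leases.pop(job_id)
        if not ok:
            # Simplification: a real scheduler would requeue with a backoff delay.
            self._pending[lease.job.throttle_key].appendleft(lease.job)

    def reclaim_expired(self, now: float) -> int:
        """Return jobs to pending when a worker never reported back."""
        expired = [l for l in self._leases.values() if l.expires_at <= now]
        for lease in expired:
            self.report(lease.job.job_id, ok=False)
        return len(expired)
```

The lease expiry is what lets the system recover from a worker that dies mid-job: the job goes back to pending instead of disappearing.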

That throttling key matters a lot. It is usually not just tenant_id.

It might be:

tenant_id + vendor
tenant_id + vendor_account_id
region + table-style resource scopes, the same kind of nuance AWS calls out in its adaptive retry mode docs
tenant_id + endpoint family, if one API has separate budgets for search vs. writes

Once you have the right key, you can keep a token bucket or credit counter per key. AWS uses token buckets in SDK retry behavior for a reason: they are simple, effective, and easy to reason about under overload. But the important architectural move is not "use token bucket." It is "put budget ownership in one place."
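To make "budget ownership in one place" concrete, here is a minimal sketch of a per-key token bucket registry. The names `ThrottleKey`, `TokenBucket`, and `BudgetRegistry` are illustrative assumptions, the key shape (tenant, vendor, vendor account) is just one of the options listed above, and the clock is passed in explicitly so the behavior is easy to test.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Assumed key shape: (tenant_id, vendor, vendor_account_id).
ThrottleKey = Tuple[str, str, str]


@dataclass
class TokenBucket:
    rate_per_sec: float  # refill rate
    capacity: float      # maximum burst size
    tokens: float = field(init=False)
    updated_at: float = 0.0

    def __post_init__(self) -> None:
        self.tokens = self.capacity  # start full

    def try_acquire(self, now: float, cost: float = 1.0) -> bool:
        # Refill based on elapsed time, never above capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_sec)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class BudgetRegistry:
    """Single owner of downstream budgets, one bucket per throttling key."""

    def __init__(self, default_rate: float, default_capacity: float) -> None:
        self._default = (default_rate, default_capacity)
        self._limits: Dict[ThrottleKey, Tuple[float, float]] = {}
        self._buckets: Dict[ThrottleKey, TokenBucket] = {}

    def set_limit(self, key: ThrottleKey, rate: float, capacity: float) -> None:
        self._limits[key] = (rate, capacity)
        self._buckets.pop(key, None)  # rebuild the bucket with the new limits

    def try_spend(self, key: ThrottleKey, now: float) -> bool:
        bucket = self._buckets.get(key)
        if bucket is None:
            rate, capacity = self._limits.get(key, self._default)
            bucket = TokenBucket(rate, capacity, updated_at=now)
            self._buckets[key] = bucket
        return bucket.try_acquire(now)
```

With a default of 2 requests per second and an override of 20 for one key, a small tenant and a large tenant draw from separate budgets, so exhausting one never blocks the other. In a multi-process deployment this state would have to live in a shared store rather than in memory.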

What the scheduler is responsible for

A real scheduler in this setup does four jobs.

1. Fairness

It decides which tenant gets the next execution slot. That can be weighted fair sharing, round-robin across active tenants, or priority classes if enterprise customers buy different guarantees.
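As a rough illustration of the simplest of those options, round-robin across active tenants, the sketch below keeps one FIFO queue per tenant and rotates between them. `RoundRobinPicker` is a hypothetical name; weighted sharing or priority classes would need a different structure.

```python
from collections import OrderedDict, deque
from typing import Any, Deque, Optional, Tuple


class RoundRobinPicker:
    """Rotates across tenants that currently have pending work."""

    def __init__(self) -> None:
        # Insertion order doubles as the rotation order.
        self._queues: "OrderedDict[str, Deque[Any]]" = OrderedDict()

    def push(self, tenant_id: str, job: Any) -> None:
        self._queues.setdefault(tenant_id, deque()).append(job)

    def pop_next(self) -> Optional[Tuple[str, Any]]:
        if not self._queues:
            return None
        tenant_id, queue = next(iter(self._queues.items()))
        job = queue.popleft()
        # Remove the tenant, then re-add it at the back if it still has work.
        del self._queues[tenant_id]
        if queue:
            self._queues[tenant_id] = queue
        return tenant_id, job


picker = RoundRobinPicker()
for n in range(3):
    picker.push("loud", f"loud-{n}")
picker.push("quiet-a", "a-0")
picker.push("quiet-b", "b-0")

order = []
while (item := picker.pop_next()) is not None:
    order.append(item[1])
# order == ["loud-0", "a-0", "b-0", "loud-1", "loud-2"]
```

The loud tenant still gets all of its work done, but the quiet tenants wait behind one of its jobs instead of all of them.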

2. Rate-limit compliance

It knows the vendor budget and spend rate for each throttling key. If tenant A is cooling down after a 429, tenant B should keep flowing.
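One way to keep a cooldown scoped to a single key is a small tracker like the one below. It is a sketch under assumptions: `CooldownTracker` is a hypothetical name, the string key format is made up, and `retry_after` stands for a wait in seconds that many vendors send in a Retry-After header on 429 responses.

```python
from typing import Dict, Iterable, List, Optional


class CooldownTracker:
    """Tracks per-key cooldowns so one throttled key does not pause the rest."""

    def __init__(self, default_cooldown: float = 5.0) -> None:
        self._default = default_cooldown
        self._until: Dict[str, float] = {}

    def record_429(self, key: str, now: float,
                   retry_after: Optional[float] = None) -> None:
        wait = self._default if retry_after is None else retry_after
        # Never shorten a cooldown that is already in effect.
        self._until[key] = max(self._until.get(key, 0.0), now + wait)

    def is_cooling(self, key: str, now: float) -> bool:
        return now < self._until.get(key, 0.0)

    def runnable(self, keys: Iterable[str], now: float) -> List[str]:
        return [k for k in keys if not self.is_cooling(k, now)]


tracker = CooldownTracker(default_cooldown=5.0)
tracker.record_429("tenant-a:stripe", now=100.0, retry_after=30.0)
ready = tracker.runnable(["tenant-a:stripe", "tenant-b:stripe"], now=101.0)
# ready == ["tenant-b:stripe"]
```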

3. Retry orchestration

Retries should not be "throw it back in the queue and pray."

Use capped exponential backoff with jitter. AWS has been saying this for years in the Amazon Builders' Library, and they are right. More importantly, retries should re-enter the scheduler as delayed work, not bypass it.
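A minimal sketch of capped exponential backoff with "full jitter" (a uniformly random delay between zero and the capped exponential ceiling) might look like this. The function names, the default base, cap, and attempt limit are all assumptions chosen for illustration. The second function returns a due time rather than sleeping, which is what lets the retry re-enter the scheduler as delayed work.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Delay in seconds for a zero-based retry attempt, with full jitter."""
    source = rng if rng is not None else random
    # Clamp the exponent so very large attempt counts cannot overflow.
    ceiling = min(cap, base * (2 ** min(attempt, 30)))
    return source.uniform(0.0, ceiling)


def next_retry_at(now: float, attempt: int, max_attempts: int = 6,
                  rng: Optional[random.Random] = None) -> Optional[float]:
    """Due time for the next try, or None when the job should fail terminally."""
    if attempt >= max_attempts:
        return None
    return now + backoff_delay(attempt, rng=rng)
```

The scheduler would store the returned due time with the job and only consider it for a lease once that time has passed and the key has budget.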

4. Idempotency and result recording

If the downstream API supports idempotency keys, use them. If it does not, create your own dedupe contract around the side effect. AWS's article on making retries safe with idempotent APIs is still the right mental model here.

A worker should be able to crash after sending the request but before saving success, and your system should still converge without double-applying side effects.
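The crash scenario above can be simulated with a small sketch. Everything here is hypothetical: `FakeVendor` stands in for a vendor API that honors idempotency keys, `ResultStore` stands in for your database, and the key is derived deterministically from the logical operation so a retry reuses it.

```python
import hashlib
from typing import Dict, Optional


def idempotency_key(tenant_id: str, operation: str, record_id: str) -> str:
    """Same logical operation always yields the same key, across retries."""
    raw = f"{tenant_id}|{operation}|{record_id}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


class FakeVendor:
    """Stand-in for a vendor API that dedupes on an idempotency key."""

    def __init__(self) -> None:
        self.side_effects = 0
        self._applied: Dict[str, dict] = {}

    def create_invoice(self, key: str, payload: dict) -> dict:
        if key in self._applied:
            return self._applied[key]  # replay: no new side effect
        self.side_effects += 1
        result = {"invoice_id": f"inv_{self.side_effects}", **payload}
        self._applied[key] = result
        return result


class ResultStore:
    """Stand-in for durable result recording."""

    def __init__(self) -> None:
        self._results: Dict[str, dict] = {}

    def get(self, key: str) -> Optional[dict]:
        return self._results.get(key)

    def save(self, key: str, result: dict) -> None:
        self._results[key] = result


def run_job(vendor: FakeVendor, store: ResultStore, tenant_id: str,
            record_id: str, payload: dict, crash_before_save: bool = False) -> dict:
    key = idempotency_key(tenant_id, "create_invoice", record_id)
    recorded = store.get(key)
    if recorded is not None:
        return recorded  # already completed on an earlier attempt
    result = vendor.create_invoice(key, payload)
    if crash_before_save:
        raise RuntimeError("simulated crash after send, before save")
    store.save(key, result)
    return result
```

Running the job once with `crash_before_save=True` and then again normally leaves exactly one invoice at the vendor, because the retry carries the same key. When the vendor has no idempotency support, the dedupe check has to move to your side, which is harder to make airtight.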

Where edge and serverless fit

Edge and serverless are fine on the ingestion side.

Accept the request at the edge. Authenticate it there. Maybe do lightweight validation there. But do not let an edge function become the source of truth for rate-limit state, retry timers, or cross-tenant fairness.

Those concerns need durable coordination. That usually means a regional database plus workers, or a workflow engine. If your retry gaps are seconds to hours, a durable execution system like Temporal is often cleaner than hand-rolling timers, visibility timeouts, dead-letter queues, and "resume from step 4" logic across five services.

Not every team needs Temporal. But every team with long-running, failure-prone integrations needs the properties Temporal is selling: persisted workflow state, retries, timers, and visibility.

The observability you actually need

If all you watch is queue depth, you will learn about these failures too late.

Track these instead:

per-tenant pending jobs
per-tenant dwell time
per-tenant success, retry, and terminal failure rates
vendor 429 rate by throttling key
time spent in cooldown
scheduler grant rate versus worker completion rate
idempotency conflicts or dedupe hits

This is the difference between "the queue is backed up" and "one tenant's Shopify write budget is exhausted, but all other tenants are healthy."

The second statement is the operationally useful one.
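As a rough sketch of what recording a few of these signals could look like, the class below counts outcomes per throttling key and dwell time per tenant in memory. `IntegrationMetrics`, the outcome labels, and the key format are assumptions; in practice these would be labeled metrics in whatever monitoring system you already run.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, List


class IntegrationMetrics:
    """In-memory counters keyed by throttling key and tenant."""

    def __init__(self) -> None:
        self.outcomes: DefaultDict[str, Counter] = defaultdict(Counter)
        self.dwell: DefaultDict[str, List[float]] = defaultdict(list)

    def record_outcome(self, throttle_key: str, outcome: str) -> None:
        # Hypothetical outcome labels: "success", "retry", "throttled_429", "failed".
        self.outcomes[throttle_key][outcome] += 1

    def record_dwell(self, tenant_id: str, enqueued_at: float,
                     started_at: float) -> None:
        self.dwell[tenant_id].append(started_at - enqueued_at)

    def rate_of(self, throttle_key: str, outcome: str) -> float:
        counts = self.outcomes.get(throttle_key)
        total = sum(counts.values()) if counts else 0
        return counts[outcome] / total if total else 0.0

    def max_dwell(self, tenant_id: str) -> float:
        return max(self.dwell.get(tenant_id, []), default=0.0)


metrics = IntegrationMetrics()
for _ in range(3):
    metrics.record_outcome("tenant-a:shopify:writes", "throttled_429")
metrics.record_outcome("tenant-a:shopify:writes", "success")
metrics.record_outcome("tenant-b:shopify:writes", "success")
# metrics.rate_of("tenant-a:shopify:writes", "throttled_429") is 0.75
# metrics.rate_of("tenant-b:shopify:writes", "throttled_429") is 0.0
```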

Tradeoffs

This architecture is better, not free.

You are adding a scheduler service or at least scheduler logic backed by durable state.
You now have to model vendor-specific throttling scopes instead of pretending they are all the same.
Fairness policy becomes product policy. Someone has to decide whether enterprise tenants get bigger buckets or separate lanes.
You may trade a little peak throughput for predictability and blast-radius control.

That last trade is usually worth it. Customers do not care that one giant tenant can consume every worker at maximum efficiency. They care that their own sync finished on time.

The rule of thumb

If your system calls third-party APIs on behalf of many tenants, do not let a shared queue decide who gets to spend downstream budget.

Use the queue to hold work.

Use a scheduler to decide when work is allowed to run.

That boundary sounds small, but it is the difference between a platform that degrades gracefully and one that turns every noisy neighbor into a company-wide incident.