
Stop Treating Webhooks Like Durable Events

A practical architecture for inbound webhooks: durable intake, idempotent workers, replay tooling, and why your callback should not run business logic inline.

Tags: webhooks, queues, idempotency, system-design, observability, event-driven

One of the most common backend mistakes is also one of the most boring: a team receives a webhook, does the real business logic inside the HTTP handler, returns 200, and calls it architecture.

It works right up until it doesn’t.

Then you get the fun version of distributed systems: duplicate shipments, subscriptions stuck in pending, emails sent twice, and support tickets that say "Stripe says paid, your app says unpaid." The uncomfortable truth is that a webhook is just an inbound HTTP callback. It is not a durable event log, not an ordered stream, and definitely not a transaction boundary.

Stripe says this pretty plainly: deliveries can be retried, duplicate events can happen, and event ordering is not guaranteed. Svix makes the same point: even if a sender emits events in order, your own handler latency can reorder processing in practice.

If your code assumes otherwise, production will correct you.

The realistic mistake

The usual shape looks like this:

  • Verify the webhook signature

  • Parse the payload

  • Update your database

  • Call two internal services

  • Send an email

  • Maybe publish a Kafka or SQS event

  • Return 200

That seems efficient. It is actually a pile of hidden coupling.

You’ve tied your external API boundary to internal side effects, queue availability, downstream latency, and deploy timing. If anything in that chain is slow or flaky, the sender retries. Now your handler runs twice. If the first attempt partially succeeded, the second attempt compounds the damage.

Worse, if you update your database and then publish an internal event in a separate step, you’ve recreated the classic dual-write problem. AWS’s transactional outbox guidance exists because this breaks constantly: database commit succeeds, event publish fails, downstream systems never hear about it.

The second bad assumption is ordering. Stripe explicitly recommends fetching missing objects from the API if events arrive out of order. That’s your hint that the webhook payload is not the source of truth. It is a change notification.

The architecture that survives reality

Split webhook handling into two systems: ingest and process.

The intake path should do very little:

  • Verify signature

  • Extract provider, provider_account_id, event_id, tenant_id, event_type, received_at

  • Persist the raw payload durably in an inbox table or append-only store

  • Enqueue lightweight work

  • Return 2xx fast

That durable write is the real boundary. Not your controller code.

A minimal inbox table is usually enough:

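create table webhook_inbox (
  id bigserial primary key,
  provider text not null,              -- e.g. 'stripe'
  provider_account_id text not null,   -- sender-side account that emitted the event
  tenant_id text not null,             -- your tenant that owns the event
  event_id text not null,              -- provider-assigned event identifier
  event_type text not null,
  payload jsonb not null,              -- event body (jsonb normalizes key order and whitespace)
  status text not null default 'received',
  received_at timestamptz not null default now(),
  processed_at timestamptz,
  -- dedupe key: scoped to provider and account, not event_id alone
  unique (provider, provider_account_id, event_id)
);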

That unique key matters. In multi-tenant systems, dedupe keys must include the provider account or tenant routing context, not just event_id. Otherwise you eventually build a cross-tenant bug with better branding.
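Here is a minimal sketch of the intake path, assuming an HMAC-SHA256 signature over the raw body and SQLite standing in for Postgres so it runs anywhere. The names `SECRET`, `open_inbox`, `sign`, and `intake`, the `account` payload field, and the in-memory list used as a queue are all hypothetical. Real providers define their own signing schemes, which usually include a timestamp.

```python
import hashlib
import hmac
import json
import sqlite3

SECRET = b"whsec_example"  # hypothetical shared secret, not a real provider value


def open_inbox() -> sqlite3.Connection:
    """Create an in-memory inbox that mirrors the table above in SQLite types."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        create table webhook_inbox (
          id integer primary key,
          provider text not null,
          provider_account_id text not null,
          tenant_id text not null,
          event_id text not null,
          event_type text not null,
          payload text not null,
          status text not null default 'received',
          received_at text not null default current_timestamp,
          processed_at text,
          unique (provider, provider_account_id, event_id)
        )
        """
    )
    return conn


def sign(body: bytes) -> str:
    """Simplified signature: hex HMAC-SHA256 of the raw request body."""
    return hmac.new(SECRET, body, hashlib.sha256).hexdigest()


def intake(conn, provider, tenant_id, body: bytes, signature: str, queue: list) -> int:
    """Verify, persist, enqueue. Returns the HTTP status to send back."""
    if not hmac.compare_digest(sign(body).encode(), signature.encode()):
        return 400
    event = json.loads(body)
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            """
            insert or ignore into webhook_inbox
              (provider, provider_account_id, tenant_id, event_id, event_type, payload)
            values (?, ?, ?, ?, ?, ?)
            """,
            (provider, event["account"], tenant_id, event["id"], event["type"],
             body.decode()),
        )
    # rowcount is 0 when the unique key already existed, so duplicates
    # are acknowledged without being enqueued a second time.
    if cur.rowcount == 1:
        queue.append(cur.lastrowid)
    return 200
```

A duplicate delivery still gets a 200, because the sender only needs to know the event is safely stored. Since the enqueue happens after the commit, a crash between the two leaves a row in received with no queue message; a periodic sweep for stale received rows is one way to cover that gap.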

The worker path does the real work:

  • Load the inbox record

  • If needed, fetch the latest provider object by ID

  • Apply an idempotent state transition in your DB transaction

  • Mark the inbox row processed or failed

  • Emit internal domain events through an outbox written in the same transaction, not a direct broker publish

This is where the idempotent consumer pattern stops being architecture-book advice and starts being payroll advice. If a message queue or webhook sender is at-least-once, your consumer has to be safe when invoked more than once.

And yes, your queue is almost certainly at-least-once too. SQS says so explicitly for standard queues. Visibility timeouts and DLQs help with recovery, but they do not magically create exactly-once processing.
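The worker steps can be sketched as a single transaction that claims the inbox row, applies the state change, and writes the outbox record together. This is an illustration under assumptions, not a reference implementation: the `orders` and `outbox` tables, the `order_id` payload field, and the `order.paid` topic are hypothetical, and SQLite stands in for Postgres.

```python
import json
import sqlite3


def setup() -> sqlite3.Connection:
    """Hypothetical trimmed-down schema: inbox, one business table, outbox."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        create table webhook_inbox (
          id integer primary key, event_type text not null,
          payload text not null, status text not null default 'received',
          processed_at text);
        create table orders (id text primary key, status text not null);
        create table outbox (
          id integer primary key, inbox_id integer not null unique,
          topic text not null, body text not null);
        """
    )
    return conn


def process(conn: sqlite3.Connection, inbox_id: int) -> bool:
    """Process one inbox row. Returns False if it was already handled."""
    with conn:  # one transaction: commit everything or roll everything back
        # The conditional update is the idempotency guard. A second
        # invocation for the same row matches zero rows and does nothing.
        claimed = conn.execute(
            "update webhook_inbox set status = 'processed', "
            "processed_at = current_timestamp "
            "where id = ? and status = 'received'",
            (inbox_id,),
        ).rowcount
        if claimed == 0:
            return False
        event_type, payload = conn.execute(
            "select event_type, payload from webhook_inbox where id = ?",
            (inbox_id,),
        ).fetchone()
        order_id = json.loads(payload)["order_id"]
        if event_type == "invoice.paid":
            conn.execute(
                "insert or replace into orders (id, status) values (?, 'paid')",
                (order_id,),
            )
            # The outbox row commits with the state change, so there is
            # no window where one exists without the other.
            conn.execute(
                "insert into outbox (inbox_id, topic, body) values (?, ?, ?)",
                (inbox_id, "order.paid", json.dumps({"order_id": order_id})),
            )
    return True
```

Because the claim and the work share one transaction, the row moves from received to processed atomically and no intermediate processing state is persisted in this sketch. If the handler raises, the rollback leaves the row in received for the next attempt. A separate relay would read the outbox table and publish to the broker.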

What “idempotent” actually means here

A lot of teams say “our handler is idempotent” when they really mean “we hope duplicate requests are rare.” That is not the same thing.

In webhook systems, idempotency usually means:

  • Reprocessing the same event does not produce another business mutation

  • Side effects are keyed, recorded, or converted to upserts

  • State transitions are monotonic where possible

  • Older events do not overwrite newer state

For example, don’t let invoice.payment_failed move an order backward if you already applied invoice.paid after fetching the latest invoice state. Compare timestamps or version counters. Better yet, treat the webhook as a prompt to reconcile current truth from the provider API for critical objects.
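One way to keep older events from overwriting newer state is a guarded update that only applies when the incoming version is strictly newer. The sketch below assumes a hypothetical `invoices` table and a `provider_updated_at` integer, which could be an event timestamp or an object version depending on what the provider exposes.

```python
import sqlite3


def setup() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table invoices ("
        "id text primary key, status text not null, "
        "provider_updated_at integer not null)"
    )
    return conn


def apply_invoice_state(conn, invoice_id: str, status: str,
                        provider_updated_at: int) -> bool:
    """Apply a state only if it is newer than what is stored.

    Returns True when the row changed, False when the event was stale
    or a repeat of one already applied.
    """
    with conn:
        # Make sure a row exists so the guarded update has something to match.
        conn.execute(
            "insert or ignore into invoices (id, status, provider_updated_at) "
            "values (?, 'unknown', -1)",
            (invoice_id,),
        )
        cur = conn.execute(
            "update invoices set status = ?, provider_updated_at = ? "
            "where id = ? and provider_updated_at < ?",
            (status, provider_updated_at, invoice_id, provider_updated_at),
        )
    return cur.rowcount == 1
```

The strict comparison makes a replay of the same event a no-op. It also drops two different events that carry the same timestamp, so a coarse clock is a reason to prefer a real version counter or a fetch of the current object.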

That last part buys consistency at the cost of API calls, so rate limits matter. If you fetch full provider state on every event, you can create your own secondary outage by hammering the upstream API. So be selective:

  • Fetch for money, entitlements, and irreversible actions

  • Trust payloads for low-risk analytics or notifications

  • Rate-limit worker concurrency per provider account

  • Add jittered retries for provider API calls
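The last two points can be combined in a small helper. This is a sketch assuming threads as the worker model, a limit of two concurrent calls per account, and exponential backoff with full jitter; `account_slot`, `backoff_delays`, and `fetch_with_retry` are hypothetical names, and `fetch` is whatever function calls the provider API.

```python
import random
import threading
import time
from collections import defaultdict

# One semaphore per provider account; the limit of 2 is an arbitrary example.
_limits = defaultdict(lambda: threading.Semaphore(2))
_limits_lock = threading.Lock()


def account_slot(account_id: str) -> threading.Semaphore:
    """Return the concurrency limiter for one provider account."""
    with _limits_lock:
        return _limits[account_id]


def backoff_delays(count: int, base: float = 0.5, cap: float = 30.0,
                   rng=random.random) -> list:
    """Full jitter: each delay is uniform in [0, min(cap, base * 2**n))."""
    return [rng() * min(cap, base * 2 ** n) for n in range(count)]


def fetch_with_retry(account_id: str, fetch, attempts: int = 5,
                     sleep=time.sleep):
    """Call fetch() under the account limit, retrying on ConnectionError."""
    delays = backoff_delays(attempts - 1)
    for attempt in range(attempts):
        with account_slot(account_id):
            try:
                return fetch()
            except ConnectionError:
                if attempt == attempts - 1:
                    raise  # out of attempts, let the caller defer or fail
        # Sleep outside the semaphore so a backing-off worker does not
        # hold a slot that another worker could use.
        sleep(delays[attempt])
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. In a multi-process deployment the semaphore would need to be replaced by something shared, such as a rate limiter backed by the database or a cache.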

If a flow becomes multi-step, long-running, or needs timers and human approval, stop stretching queue consumers into homemade orchestration. Use durable execution tooling there. The queue is for delivery and buffering, not for pretending your business process is a finite state machine made of retries.

The observability most teams skip

The painful part of webhook failures is not delivery failure. It’s delivery success with business failure.

A 200 only means your endpoint acknowledged receipt. It does not mean the customer got access, the invoice applied, or the internal event propagated.

You want first-class states like:

  • received
  • processing
  • processed
  • deferred
  • failed
  • dead_lettered
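Making the states first-class usually means rejecting transitions that should never happen. The table below is one possible set of rules, not something the states themselves dictate; `Status`, `ALLOWED`, and `transition` are hypothetical names.

```python
from enum import Enum


class Status(str, Enum):
    RECEIVED = "received"
    PROCESSING = "processing"
    PROCESSED = "processed"
    DEFERRED = "deferred"
    FAILED = "failed"
    DEAD_LETTERED = "dead_lettered"


# Assumed transition rules for illustration; adjust to your own workflow.
ALLOWED = {
    Status.RECEIVED: {Status.PROCESSING},
    Status.PROCESSING: {Status.PROCESSED, Status.DEFERRED, Status.FAILED},
    Status.DEFERRED: {Status.PROCESSING},
    Status.FAILED: {Status.RECEIVED, Status.DEAD_LETTERED},  # retry or give up
    Status.DEAD_LETTERED: {Status.RECEIVED},  # manual replay only
    Status.PROCESSED: set(),  # terminal in this sketch
}


def transition(current: Status, target: Status) -> Status:
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Because `Status` subclasses `str`, the values can be written to and read from a text column without conversion code, for example `Status("failed")`.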

Then measure:

  • Ack latency

  • Queue lag

  • Duplicate rate

  • Processing latency

  • DLQ depth

  • Replay count

  • Per-tenant failure rate

  • Reconciliation drift
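Several of these can be derived straight from the inbox table. The sketch below computes queue lag and per-tenant failure rate, assuming a simplified hypothetical schema where `received_at` is stored as epoch seconds; other metrics, such as duplicate rate, need counters recorded at intake.

```python
import sqlite3


def setup() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table webhook_inbox ("
        "id integer primary key, tenant_id text not null, "
        "status text not null, received_at integer not null)"
    )
    return conn


def queue_lag_seconds(conn: sqlite3.Connection, now: int) -> int:
    """Age of the oldest event still waiting for a worker, in seconds."""
    oldest = conn.execute(
        "select min(received_at) from webhook_inbox "
        "where status in ('received', 'deferred')"
    ).fetchone()[0]
    return 0 if oldest is None else now - oldest


def failure_rate_by_tenant(conn: sqlite3.Connection) -> dict:
    """Fraction of each tenant's events that ended in failed or dead_lettered."""
    rows = conn.execute(
        "select tenant_id, "
        "avg(case when status in ('failed', 'dead_lettered') then 1.0 else 0.0 end) "
        "from webhook_inbox group by tenant_id"
    )
    return dict(rows)
```

Exporting these as gauges on a schedule is usually enough to alert on a stuck worker or a single tenant whose events keep failing while the endpoint keeps returning 200.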

Also build replay on day one. Stripe supports manual resends and, in live mode, retries undelivered events for up to three days, but your system still needs internal replay and auditability for when *your* worker failed after intake. Their docs on undelivered events are worth reading because they push you toward explicit processing status, not blind re-execution.
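Internal replay can start as a small function that resets selected rows and records who did it. This is a minimal sketch: the `replay_count` column, the `replay_audit` table, and the `actor` argument are hypothetical additions, and an in-memory list stands in for the real queue.

```python
import sqlite3


def setup() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        create table webhook_inbox (
          id integer primary key, tenant_id text not null,
          status text not null, replay_count integer not null default 0);
        create table replay_audit (
          id integer primary key, inbox_id integer not null,
          actor text not null,
          replayed_at text not null default current_timestamp);
        """
    )
    return conn


def replay(conn, queue: list, actor: str,
           statuses=("failed", "dead_lettered"), tenant_id=None) -> list:
    """Reset matching rows to received, audit the action, and re-enqueue."""
    placeholders = ",".join("?" for _ in statuses)
    sql = f"select id from webhook_inbox where status in ({placeholders})"
    params = list(statuses)
    if tenant_id is not None:
        sql += " and tenant_id = ?"
        params.append(tenant_id)
    sql += " order by id"
    with conn:  # the reset and the audit rows commit together
        ids = [row[0] for row in conn.execute(sql, params)]
        for inbox_id in ids:
            conn.execute(
                "update webhook_inbox set status = 'received', "
                "replay_count = replay_count + 1 where id = ?",
                (inbox_id,),
            )
            conn.execute(
                "insert into replay_audit (inbox_id, actor) values (?, ?)",
                (inbox_id, actor),
            )
    queue.extend(ids)  # enqueue only after the reset is durable
    return ids
```

Replay is only safe because the workers are idempotent. By default this sketch leaves processed rows alone, so replaying an already-applied event has to be an explicit choice through the `statuses` argument.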

The tradeoffs

This architecture is better, not free.

  • You add storage, a queue, workers, and ops overhead

  • You accept eventual consistency instead of synchronous illusion

  • You need replay tooling and retention policies for raw payloads

  • You need to think about tenant routing, schema evolution, and PII handling

  • You may need reconciliation jobs for high-value domains

That is still cheaper than debugging "paid but not provisioned" from logs that only show 200 OK.

The opinionated rule is simple: a webhook should enter your system through a narrow, durable boundary, then get processed asynchronously by idempotent workers with observable state. If you mutate business state directly in the request handler, you are not building an event-driven system. You are just hiding distributed systems failure modes inside a controller.

That approach always looks simple right before it starts charging interest.
