Stop Treating Webhooks Like Durable Events
A practical architecture for inbound webhooks: durable intake, idempotent workers, replay tooling, and why your callback should not run business logic inline.
One of the most common backend mistakes is also one of the most boring: a team receives a webhook, does the real business logic inside the HTTP handler, returns 200, and calls it architecture.
It works right up until it doesn’t.
Then you get the fun version of distributed systems: duplicate shipments, subscriptions stuck in pending, emails sent twice, and support tickets that say "Stripe says paid, your app says unpaid." The uncomfortable truth is that a webhook is just an inbound HTTP callback. It is not a durable event log, not an ordered stream, and definitely not a transaction boundary.
Stripe says this pretty plainly: deliveries can be retried, duplicate events can happen, and event ordering is not guaranteed. Svix makes the same point: even if a sender emits events in order, your own handler latency can reorder processing in practice.
If your code assumes otherwise, production will correct you.
The realistic mistake
The usual shape looks like this:
Verify the webhook signature
Parse the payload
Update your database
Call two internal services
Send an email
Maybe publish a Kafka or SQS event
Return 200
That seems efficient. It is actually a pile of hidden coupling.
You’ve tied your external API boundary to internal side effects, queue availability, downstream latency, and deploy timing. If anything in that chain is slow or flaky, the sender retries. Now your handler runs twice. If the first attempt partially succeeded, the second attempt compounds the damage.
Worse, if you update your database and then publish an internal event in a separate step, you’ve recreated the classic dual-write problem. AWS’s transactional outbox guidance exists because this breaks constantly: database commit succeeds, event publish fails, downstream systems never hear about it.
The second bad assumption is ordering. Stripe explicitly recommends fetching missing objects from the API if events arrive out of order. That’s your hint that the webhook payload is not the source of truth. It is a change notification.
The architecture that survives reality
Split webhook handling into two systems: ingest and process.
The intake path should do very little:
Verify signature
Extract provider, account_id, event_id, tenant_id, type, received_at
Persist the raw payload durably in an inbox table or append-only store
Enqueue lightweight work
Return 2xx fast
That durable write is the real boundary. Not your controller code.
A minimal inbox table is usually enough:
create table webhook_inbox (
  id bigserial primary key,
  provider text not null,
  provider_account_id text not null,
  tenant_id text not null,
  event_id text not null,
  event_type text not null,
  payload jsonb not null,
  status text not null default 'received',
  received_at timestamptz not null default now(),
  processed_at timestamptz,
  unique (provider, provider_account_id, event_id)
);
That unique key matters. In multi-tenant systems, dedupe keys must include the provider account or tenant routing context, not just event_id. Otherwise you eventually build a cross-tenant bug with better branding.
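Here is a minimal sketch of that intake path, under loud assumptions: in-memory SQLite stands in for the Postgres inbox, the signature check is a generic HMAC-SHA256 over the raw body (real providers define their own scheme and header format), the `account` payload field is hypothetical, and `enqueue` is whatever hands an inbox id to your queue. It is not any provider's official verification code.

```python
import hashlib
import hmac
import json
import sqlite3
from typing import Callable


def open_inbox() -> sqlite3.Connection:
    # In-memory SQLite stands in for the Postgres inbox table (assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        create table webhook_inbox (
            id integer primary key,
            provider text not null,
            provider_account_id text not null,
            tenant_id text not null,
            event_id text not null,
            event_type text not null,
            payload text not null,
            status text not null default 'received',
            unique (provider, provider_account_id, event_id)
        )
        """
    )
    return conn


def signature_ok(secret: bytes, body: bytes, signature_hex: str) -> bool:
    # Generic HMAC-SHA256 over the raw body; a stand-in for the provider's scheme.
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)


def ingest(
    conn: sqlite3.Connection,
    secret: bytes,
    provider: str,
    tenant_id: str,
    body: bytes,
    signature_hex: str,
    enqueue: Callable[[int], None],
) -> int:
    """Return the HTTP status to send back. Runs no business logic."""
    if not signature_ok(secret, body, signature_hex):
        return 400
    event = json.loads(body)
    key = (provider, event["account"], event["id"])  # "account" is a hypothetical field
    with conn:  # commit the durable write before acknowledging
        conn.execute(
            "insert or ignore into webhook_inbox "
            "(provider, provider_account_id, event_id, tenant_id, event_type, payload) "
            "values (?, ?, ?, ?, ?, ?)",
            key + (tenant_id, event["type"], body.decode("utf-8")),
        )
    row = conn.execute(
        "select id from webhook_inbox "
        "where provider = ? and provider_account_id = ? and event_id = ?",
        key,
    ).fetchone()
    enqueue(row[0])  # duplicates enqueue again; the worker must tolerate that
    return 200
```

A redelivered event hits the unique key, inserts nothing, and still gets a 200. Enqueueing on every accepted delivery is a deliberate choice in this sketch: it keeps intake simple and pushes dedupe to the worker, which has to be idempotent anyway.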
The worker path does the real work:
Load the inbox record
If needed, fetch the latest provider object by ID
Apply an idempotent state transition in your DB transaction
Mark the inbox row processed or failed
Emit internal domain events through an outbox, not a direct broker publish
This is where the idempotent consumer pattern stops being architecture-book advice and starts being payroll advice. If a message queue or webhook sender is at-least-once, your consumer has to be safe when invoked more than once.
And yes, your queue is almost certainly at-least-once too. SQS says so explicitly for standard queues. Visibility timeouts and DLQs help with recovery, but they do not magically create exactly-once processing.
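The worker steps above can be sketched as a single transaction: claim the inbox row, apply the state change, write the outbox row, mark the row processed. This is an illustration under assumptions, not a reference implementation: SQLite stands in for Postgres, and the `orders` and `outbox` tables, the `order_id` payload field, and the `order.paid` topic are all hypothetical.

```python
import json
import sqlite3


def open_db() -> sqlite3.Connection:
    # Simplified, hypothetical schema: inbox, one business table, one outbox.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        create table webhook_inbox (
            id integer primary key,
            event_id text not null unique,
            event_type text not null,
            payload text not null,
            status text not null default 'received'
        );
        create table orders (
            order_id text primary key,
            paid integer not null default 0
        );
        create table outbox (
            id integer primary key,
            dedupe_key text not null unique,
            topic text not null,
            body text not null
        );
        """
    )
    return conn


def process(conn: sqlite3.Connection, inbox_id: int) -> bool:
    """Return True if this call did the work, False if it was a duplicate no-op."""
    with conn:  # everything below commits or rolls back together
        claimed = conn.execute(
            "update webhook_inbox set status = 'processing' "
            "where id = ? and status = 'received'",
            (inbox_id,),
        )
        if claimed.rowcount == 0:
            return False  # already handled, or never received
        event_id, event_type, payload = conn.execute(
            "select event_id, event_type, payload from webhook_inbox where id = ?",
            (inbox_id,),
        ).fetchone()
        data = json.loads(payload)
        if event_type == "invoice.paid":
            # Upsert-style mutation: running it twice leaves the same state.
            conn.execute(
                "insert or ignore into orders (order_id) values (?)",
                (data["order_id"],),
            )
            conn.execute(
                "update orders set paid = 1 where order_id = ?",
                (data["order_id"],),
            )
            # Outbox row lands in the same transaction as the state change.
            conn.execute(
                "insert or ignore into outbox (dedupe_key, topic, body) values (?, ?, ?)",
                (event_id, "order.paid", payload),
            )
        conn.execute(
            "update webhook_inbox set status = 'processed' where id = ?",
            (inbox_id,),
        )
    return True
```

Because the outbox insert shares a transaction with the business update, there is no window where the database committed but the event was lost. A separate relay process (not shown) would read the outbox and publish to the broker.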
What “idempotent” actually means here
A lot of teams say “our handler is idempotent” when they really mean “we hope duplicate requests are rare.” That is not the same thing.
In webhook systems, idempotency usually means:
Reprocessing the same event does not produce another business mutation
Side effects are keyed, recorded, or converted to upserts
State transitions are monotonic where possible
Older events do not overwrite newer state
For example, don’t let invoice.payment_failed move an order backward if you already applied invoice.paid after fetching the latest invoice state. Compare timestamps or version counters. Better yet, treat the webhook as a prompt to reconcile current truth from the provider API for critical objects.
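A minimal sketch of the "older events do not overwrite newer state" rule, assuming each update carries a comparable version. The `InvoiceState` type and its `version` field are hypothetical; in practice the version might be a provider timestamp or your own counter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InvoiceState:
    status: str
    version: int  # hypothetical: provider timestamp or a monotonic counter


def apply_update(current: InvoiceState, incoming: InvoiceState) -> InvoiceState:
    # Keep what we have unless the incoming state is strictly newer.
    # Equal versions are treated as duplicates and ignored.
    if incoming.version <= current.version:
        return current
    return incoming
```

With this guard, a late invoice.payment_failed at version 3 cannot undo an invoice.paid already applied at version 5. The same comparison can live in a conditional update in the database so the check and the write are atomic.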
That last part matters for consistency and rate limits. If you fetch full provider state on every event, you can create your own secondary outage by hammering the upstream API. So be selective:
Fetch for money, entitlements, and irreversible actions
Trust payloads for low-risk analytics or notifications
Rate-limit worker concurrency per provider account
Add jittered retries for provider API calls
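For the jittered retries, one common shape is exponential backoff with full jitter, where each wait is a random value between zero and a capped exponential bound. The sketch below is one way to do it, not the only one; the function name and defaults are assumptions, and real code should catch only errors known to be transient.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    attempts: int = 5,
    base: float = 0.5,
    cap: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call fn, retrying on failure with full-jitter exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for n in range(attempts):
        try:
            return fn()
        except Exception:  # narrow this to transient errors in real code
            if n == attempts - 1:
                raise  # out of attempts: surface the last error
            # Full jitter: wait a random time in [0, min(cap, base * 2**n)).
            sleep(rng() * min(cap, base * (2 ** n)))
    raise AssertionError("unreachable")
```

The `sleep` and `rng` parameters exist only so the behavior can be tested without real waiting. Jitter matters because many workers retrying on the same schedule will hit the provider API in synchronized waves.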
If a flow becomes multi-step, long-running, or needs timers and human approval, stop stretching queue consumers into homemade orchestration. Use durable execution tooling there. The queue is for delivery and buffering, not for pretending your business process is a finite state machine made of retries.
The observability most teams skip
The painful part of webhook failures is not delivery failure. It’s delivery success with business failure.
A 200 only means your endpoint acknowledged receipt. It does not mean the customer got access, the invoice applied, or the internal event propagated.
You want first-class states like:
received
processing
processed
deferred
failed
dead_lettered
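States only help if transitions between them are enforced. The table below is one possible set of allowed transitions, stated as an assumption; the article names the states but not the edges, so adjust them to your own flow.

```python
from enum import Enum


class Status(str, Enum):
    RECEIVED = "received"
    PROCESSING = "processing"
    PROCESSED = "processed"
    DEFERRED = "deferred"
    FAILED = "failed"
    DEAD_LETTERED = "dead_lettered"


# Hypothetical transition table: which states may follow which.
ALLOWED = {
    Status.RECEIVED: {Status.PROCESSING},
    Status.PROCESSING: {Status.PROCESSED, Status.DEFERRED, Status.FAILED},
    Status.DEFERRED: {Status.PROCESSING},
    Status.FAILED: {Status.PROCESSING, Status.DEAD_LETTERED},
    Status.DEAD_LETTERED: {Status.RECEIVED},  # only via explicit replay
    Status.PROCESSED: set(),  # terminal
}


def transition(current: Status, new: Status) -> Status:
    # Reject any move the table does not list, so bad updates fail loudly.
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {new.value}")
    return new
```

Making `processed` terminal means a stray duplicate cannot drag a finished event back into the pipeline without someone choosing to replay it.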
Then measure:
Ack latency
Queue lag
Duplicate rate
Processing latency
DLQ depth
Replay count
Per-tenant failure rate
Reconciliation drift
Also build replay on day one. Stripe supports manual resends and retries undelivered events for up to three days in many cases, but your system still needs internal replay and auditability for when *your* worker failed after intake. Their docs on undelivered events are worth reading because they push you toward explicit processing status, not blind re-execution.
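Internal replay can start very small. The sketch below resets only rows that are currently failed or dead_lettered and counts each replay for auditability. It assumes a simplified SQLite inbox, and the `replay_count` column is a hypothetical addition to the schema.

```python
import sqlite3
from typing import Iterable


def open_inbox() -> sqlite3.Connection:
    # Simplified inbox; replay_count is a hypothetical audit column.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        create table webhook_inbox (
            id integer primary key,
            event_id text not null unique,
            status text not null default 'received',
            replay_count integer not null default 0
        )
        """
    )
    return conn


def replay_failed(conn: sqlite3.Connection, event_ids: Iterable[str]) -> int:
    """Re-queue failed or dead-lettered rows; return how many were reset."""
    reset = 0
    with conn:
        for event_id in event_ids:
            cur = conn.execute(
                "update webhook_inbox "
                "set status = 'received', replay_count = replay_count + 1 "
                "where event_id = ? and status in ('failed', 'dead_lettered')",
                (event_id,),
            )
            reset += cur.rowcount  # 0 for processed, in-flight, or unknown events
    return reset
```

The status filter is what keeps replay from becoming blind re-execution: processed rows are left alone, and the worker's idempotency covers anything that slips through.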
The tradeoffs
This architecture is better, not free.
You add storage, a queue, workers, and ops overhead
You accept eventual consistency instead of synchronous illusion
You need replay tooling and retention policies for raw payloads
You need to think about tenant routing, schema evolution, and PII handling
You may need reconciliation jobs for high-value domains
That is still cheaper than debugging "paid but not provisioned" from logs that only show 200 OK.
The opinionated rule is simple: a webhook should enter your system through a narrow, durable boundary, then get processed asynchronously by idempotent workers with observable state. If you mutate business state directly in the request handler, you are not building an event-driven system. You are just hiding distributed systems failure modes inside a controller.
That approach always looks simple right before it starts charging interest.