
Your Database Committed. Your Event Bus Didn't. Here's the Fix.

The dual-write problem silently drops events in event-driven systems when your database write succeeds but your event bus publish fails. Learn how the outbox pattern makes both writes atomic and what operational tradeoffs you are signing up for.


The Two-Line Bug Hiding in Every Event-Driven Service

Here's a pattern that ships to production all the time:

```typescript
const order = await db.orders.create({ userId, items, total });
await eventBus.publish('order.created', { orderId: order.id, userId, total });
```

Happy-path tests pass every time. But this is a silent consistency bug sitting directly under your request handler.

What happens when:

  • The event bus goes down for 30 seconds during a deploy?

  • Your process gets OOM-killed between those two lines?

  • A network blip hits after the DB write but before the publish?

You end up with an order row in your database that your fulfillment service, invoice service, and analytics pipeline will never know about. The customer got charged. Nothing ships.

This is the **dual-write problem** — two separate systems, two separate failure domains, zero atomicity between them.

Why the Naive Fix Doesn't Work

Wrapping it in try/catch and retrying doesn't help:

```typescript
try {
  const order = await db.orders.create({ userId, items, total });
  await eventBus.publish('order.created', { orderId: order.id });
} catch (err) {
  // Did the DB write succeed? Did the publish succeed?
  // Retrying creates a duplicate order if the DB write already committed.
}
```

You can't safely retry the whole block because you don't know which operation failed. If the DB write succeeded but the publish failed, a retry creates a duplicate order. If the DB write failed, there's nothing to publish. You're guessing either way.

The root cause: databases and message buses don't share a transaction coordinator. Unless both systems take part in a distributed two-phase commit, which most event buses don't support, nothing makes both writes atomic. Every system you reach outside a DB transaction is a separate failure domain, and failures in that domain are invisible to your DB.

```mermaid
sequenceDiagram
    participant App
    participant DB
    participant Bus as Event Bus

    App->>DB: INSERT order
    DB-->>App: OK (committed)
    App->>Bus: publish order.created
    Bus--xApp: TIMEOUT
    Note over App,Bus: DB has the row. Bus never got the event.
```

The Fix: Write to an Outbox, Not Directly to the Bus

The outbox pattern keeps both writes inside a single database transaction. You write to your normal table **and** an outbox table atomically. A separate background process reads from the outbox and publishes to the event bus. If publishing succeeds, the row gets marked sent. If it fails, the poller retries on the next pass.

```sql
CREATE TABLE outbox (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  event_type TEXT NOT NULL,
  payload JSONB NOT NULL,
  created_at TIMESTAMPTZ DEFAULT now(),
  sent_at TIMESTAMPTZ
);
```

```typescript
await db.transaction(async (tx) => {
  const order = await tx.orders.create({ userId, items, total });
  await tx.outbox.create({
    eventType: 'order.created',
    payload: { orderId: order.id, userId, total },
  });
  // Both commit together or both roll back together.
});
```
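To make the "both or neither" guarantee concrete, here is a small in-memory sketch. `InMemoryDb`, `createOrderWithOutbox`, and the `failBeforeOutbox` flag are hypothetical stand-ins for illustration only; a real database gets the same behavior from its own transaction machinery rather than from array snapshots.

```typescript
// Hypothetical in-memory stand-in for a transactional database.
interface DemoOrder { id: number; userId: string; total: number }
interface DemoOutboxRow { id: number; eventType: string; payload: unknown; sentAt: Date | null }

class InMemoryDb {
  orders: DemoOrder[] = [];
  outbox: DemoOutboxRow[] = [];

  // Runs fn against the tables and restores both tables if fn throws.
  transaction<T>(fn: (tx: InMemoryDb) => T): T {
    const ordersSnapshot = [...this.orders];
    const outboxSnapshot = [...this.outbox];
    try {
      return fn(this);
    } catch (err) {
      this.orders = ordersSnapshot;
      this.outbox = outboxSnapshot;
      throw err;
    }
  }
}

// Writes the order row and the outbox row in one transaction.
// failBeforeOutbox simulates a crash between the two writes.
function createOrderWithOutbox(
  db: InMemoryDb,
  userId: string,
  total: number,
  failBeforeOutbox = false,
): DemoOrder {
  return db.transaction((tx) => {
    const order: DemoOrder = { id: tx.orders.length + 1, userId, total };
    tx.orders.push(order);
    if (failBeforeOutbox) {
      throw new Error('simulated failure before the outbox write');
    }
    tx.outbox.push({
      id: tx.outbox.length + 1,
      eventType: 'order.created',
      payload: { orderId: order.id, userId, total },
      sentAt: null,
    });
    return order;
  });
}
```

A failed call leaves neither row behind, so there is never an order without a matching pending event.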

A background worker polls the outbox and delivers pending events:

```typescript
// Fetch the oldest unsent rows first.
const pending = await db.outbox.findMany({
  where: { sentAt: null },
  orderBy: { createdAt: 'asc' },
  take: 100,
});

for (const row of pending) {
  await eventBus.publish(row.eventType, row.payload);
  // A crash between the publish and this update means the row is published again on restart.
  await db.outbox.update({
    where: { id: row.id },
    data: { sentAt: new Date() },
  });
}
```

Your write path is now atomic. The DB either commits both the order row and the outbox row, or it commits neither. The event bus failure is completely decoupled from your request path — the request finishes fast, the event delivers eventually.
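The worker loop can be hardened a little. The sketch below is one possible shape, not a definitive implementation: `OutboxStore`, `EventPublisher`, `drainOnce`, and `startOutboxWorker` are hypothetical names, and it assumes a single worker instance. It stops a batch at the first failed publish so later events don't jump ahead, and it attaches the outbox row id as `eventId` so consumers have a stable key to deduplicate on.

```typescript
// Hypothetical minimal interfaces; adapt them to your ORM and bus client.
interface PendingEvent {
  id: string;
  eventType: string;
  payload: Record<string, unknown>;
}

interface OutboxStore {
  fetchPending(limit: number): Promise<PendingEvent[]>; // unsent rows, oldest first
  markSent(id: string): Promise<void>;
}

interface EventPublisher {
  publish(eventType: string, payload: Record<string, unknown>): Promise<void>;
}

// Publishes one batch. Returns how many rows were delivered and marked sent.
async function drainOnce(
  store: OutboxStore,
  bus: EventPublisher,
  batchSize = 100,
): Promise<number> {
  const pending = await store.fetchPending(batchSize);
  let delivered = 0;
  for (const row of pending) {
    try {
      await bus.publish(row.eventType, { ...row.payload, eventId: row.id });
      await store.markSent(row.id);
      delivered++;
    } catch (err) {
      // The row stays unsent and is retried on the next pass.
      console.error(`outbox: failed to deliver ${row.id}, will retry`, err);
      break;
    }
  }
  return delivered;
}

// Runs drainOnce on a fixed interval. Call the returned function to stop.
function startOutboxWorker(
  store: OutboxStore,
  bus: EventPublisher,
  intervalMs = 1000,
): () => void {
  let stopped = false;
  const loop = async (): Promise<void> => {
    while (!stopped) {
      try {
        await drainOnce(store, bus);
      } catch (err) {
        console.error('outbox: poll failed', err);
      }
      await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
    }
  };
  void loop();
  return () => {
    stopped = true;
  };
}
```

If you run more than one worker, each row can be picked up twice. In Postgres, selecting the batch with `FOR UPDATE SKIP LOCKED` inside a transaction is a common way to avoid that, but it is outside the scope of this sketch.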

```mermaid
flowchart LR
    App([App]) -->|transaction| DB[(Postgres)]
    DB -->|writes| OB[(outbox)]
    OB -->|poll| W[Outbox Worker]
    W -->|publish| EB([Event Bus])
    W -->|mark sent| OB
```

Polling vs. Change Data Capture

You have two options for driving the outbox worker:

**Polling**: A worker queries WHERE sent_at IS NULL on a fixed interval, typically 500ms to 5 seconds. Simple to build, zero additional infrastructure, and it adds up to one poll interval of latency. Works with any relational database and is the right starting point for almost every team.

**Change Data Capture (CDC)**: Tools like Debezium or Postgres logical replication stream changes from the outbox table in near real-time. Lower latency (sub-second), much higher operational complexity. Worth considering if downstream services need events quickly and you have the platform maturity to run CDC in production.

Start with polling. CDC is an optimization you reach for when latency becomes a measured problem, not a hypothetical one.

The Tradeoffs You're Actually Taking On

**At-least-once delivery is now guaranteed — and required.** If the worker publishes an event and crashes before marking it sent, it publishes again on restart. That means duplicates are a normal operational condition, not a bug. Your consumers need to handle them. Use an eventId field or the natural resource ID to deduplicate on the consumer side. This is non-negotiable; skip it and you'll get duplicate orders processed in production.
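Consumer-side deduplication can be sketched like this. It assumes each event carries an `eventId` (for example the outbox row id) and uses an in-memory `Set` purely for illustration; `DeliveredEvent` and `IdempotentConsumer` are hypothetical names. In a real consumer the processed ids would more likely live in a table with a unique constraint, written in the same transaction as the handler's side effects.

```typescript
// Shape of a delivered event; eventId is assumed to be unique per outbox row.
interface DeliveredEvent {
  eventId: string;
  eventType: string;
  payload: Record<string, unknown>;
}

// Wraps a handler so that redelivered events are acknowledged but not re-processed.
class IdempotentConsumer {
  private readonly processed = new Set<string>();
  private readonly handler: (event: DeliveredEvent) => void;

  constructor(handler: (event: DeliveredEvent) => void) {
    this.handler = handler;
  }

  // Returns true if the handler ran, false if the event was a duplicate.
  handle(event: DeliveredEvent): boolean {
    if (this.processed.has(event.eventId)) {
      return false;
    }
    this.handler(event);
    // Record the id only after the handler succeeds, so a failed attempt is retried.
    this.processed.add(event.eventId);
    return true;
  }
}
```

With this wrapper, a duplicate delivery of the same `eventId` returns `false` and the handler's side effect happens once.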

**You've added a table and a process to operate.** The outbox table needs an index on sent_at or created_at, and you need periodic cleanup of old sent rows (a simple cron or a TTL-based DELETE). The polling worker is a new service boundary that can fall behind or get stuck. That's real operational overhead — not insurmountable, but don't pretend it's free.
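As a rough sketch of that housekeeping for Postgres: a partial index keeps the worker's query cheap because only unsent rows are indexed, and a parameterized DELETE trims old sent rows. The index name `outbox_unsent_idx`, the `buildCleanupQuery` helper, and the retention window are assumptions for illustration. The `{ text, values }` shape is the query-config form node-postgres accepts; adapt it to whatever client you use.

```typescript
// Partial index: only rows with sent_at IS NULL are indexed,
// so the index stays small as sent rows accumulate.
const CREATE_UNSENT_INDEX = `
  CREATE INDEX IF NOT EXISTS outbox_unsent_idx
    ON outbox (created_at)
    WHERE sent_at IS NULL;
`;

// Builds a parameterized DELETE for sent rows older than the retention window.
function buildCleanupQuery(retentionDays: number): { text: string; values: number[] } {
  if (!Number.isInteger(retentionDays) || retentionDays < 1) {
    throw new RangeError('retentionDays must be a positive integer');
  }
  return {
    text: `
      DELETE FROM outbox
      WHERE sent_at IS NOT NULL
        AND sent_at < now() - make_interval(days => $1::int)
    `,
    values: [retentionDays],
  };
}
```

Running the cleanup from a daily cron with something like `buildCleanupQuery(7)` is usually enough. On a very busy table you may want to delete in smaller batches.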

**This doesn't solve cross-database writes.** If you're writing to two separate databases in the same request, the outbox only gives you atomicity within one of them. Cross-service consistency across multiple data stores is a saga problem — different, harder, and requires explicit compensation logic.

**Events can be delayed by up to one poll cycle.** With polling, an event waits for the worker's next pass, and longer if the bus is down or the worker has fallen behind. If a downstream service needs the event within milliseconds of the commit, polling won't cut it. But honestly, if you need synchronous guaranteed delivery, you probably haven't actually decoupled your services; you've just moved the coupling into the event contract.

The Observability Signal You Should Not Skip

Add a metric for outbox queue depth:

sql SELECT COUNT(*) FROM outbox WHERE sent_at IS NULL;

If that number grows and doesn't shrink, your worker is stuck or your event bus is down. Alert on it — something like "outbox depth has exceeded 500 rows for more than 2 minutes" is a reliable early warning. A growing outbox depth is a far better signal for a broken event pipeline than trying to correlate log lines across three services after an incident is already underway. Log the worker's publish latency (time from created_at to sent_at) as a histogram too; a sudden spike tells you when the bus is slow before consumers start complaining.
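If your monitoring stack doesn't express "above N for longer than T" directly, the rule is small enough to sketch by hand. `DepthSample` and `outboxDepthAlert` are hypothetical names, and the defaults mirror the example thresholds above (500 rows, 2 minutes); treat them as starting points rather than recommendations.

```typescript
interface DepthSample {
  takenAt: number; // epoch milliseconds
  depth: number; // result of SELECT COUNT(*) FROM outbox WHERE sent_at IS NULL
}

// Returns true when the depth has stayed above the threshold, without dipping
// back to or below it, for longer than windowMs. Samples must be sorted oldest first.
function outboxDepthAlert(
  samples: DepthSample[],
  now: number,
  threshold = 500,
  windowMs = 2 * 60 * 1000,
): boolean {
  let breachStart: number | null = null;
  for (const sample of samples) {
    if (sample.depth > threshold) {
      // Remember when the current uninterrupted breach began.
      if (breachStart === null) breachStart = sample.takenAt;
    } else {
      // Any sample at or below the threshold resets the breach.
      breachStart = null;
    }
  }
  return breachStart !== null && now - breachStart > windowMs;
}
```

A single dip below the threshold resets the timer, which keeps a briefly recovering worker from paging anyone.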

The Short Version

If your code does "write to DB, then write to queue" sequentially, one of them can fail silently and leave your system in a split state with no clean way to reconcile. The outbox pattern fixes this by making both writes atomic inside the DB, then delivering to the bus asynchronously. You trade a little latency and some operational complexity for real consistency guarantees. For any event-driven system where missed events cause actual user pain — orders not shipped, invoices not sent, analytics not recorded — that tradeoff is worth making. The code is straightforward; the hard part is accepting that at-least-once delivery is now your baseline and designing your consumers accordingly.