
Your Write Path Should Not Also Be Your Cache Invalidation Pipeline

A practical architecture for cache invalidation that avoids dual writes, survives retries, and keeps multi-tenant systems observable and boring.

Tags: caching, system-design, event-driven, observability, queues, multi-tenant

The mistake: treating cache invalidation like a side effect you can "just do after commit"

A very normal architecture mistake looks like this:

  • API handler writes the source-of-truth row

  • API handler publishes an event

  • API handler invalidates Redis keys or CDN tags

  • API handler maybe pokes a search index too

  • everyone assumes the system is now consistent

It works in staging. It even works in production for a while. Then one day the database commit succeeds, the event publish times out, the function retries, Redis gets half-invalidated, and one tenant keeps seeing old data for twenty minutes while another tenant gets the fresh version immediately.

That is not a cache bug. That is a boundary bug.

You turned one request into a distributed transaction without admitting it.

The core problem is the classic dual-write failure mode. AWS calls this out directly in its transactional outbox pattern: if the database write succeeds and the event notification fails, downstream systems never hear about the change. If the reverse happens, you can publish a change that never committed.

Caches make this worse because they make the failure look random. The primary database is correct, but every derived system gets to be wrong in its own special way.

The better architecture

The fix is boring on purpose:

  • keep the request path responsible for one durable state transition

  • write an outbox record in the same transaction as the primary write

  • let workers process committed events asynchronously

  • make every downstream mutation idempotent

  • measure freshness explicitly instead of pretending consistency is free

If you are on Postgres or MySQL, the outbox table pattern is still one of the cleanest answers. Debezium's outbox event router exists for exactly this reason: capture committed outbox rows and route them as events without asking your request handler to talk to the broker directly.

The request path does one transaction:

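BEGIN;

-- Bump the version together with the state change so consumers can order events.
UPDATE account
SET plan = 'pro', version = version + 1
WHERE tenant_id = $1 AND account_id = $2;

-- Same transaction: the outbox row commits or rolls back with the update.
-- $3 = event_id, $4 = the account's post-update version, $5 = event payload.
INSERT INTO outbox_events (
  event_id,
  tenant_id,
  aggregate_type,
  aggregate_id,
  aggregate_version,
  event_type,
  payload,
  created_at
) VALUES (
  $3, $1, 'account', $2, $4, 'account.plan_changed', $5, NOW()
);

COMMIT;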

That boundary matters. The API's job is to durably accept the command. It is not the API's job to synchronously drag every cache, read model, and edge layer into agreement before returning.
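To make the boundary concrete, here is a minimal sketch of both halves: the request path writing state plus outbox row in one transaction, and a relay that hands committed rows to a publisher. It uses Python's built-in sqlite3 so it runs anywhere, and it polls the table rather than using change data capture like Debezium. The table layout, `change_plan`, `relay_once`, and the `publish` callback are all assumptions for illustration, not a prescribed API.

```python
import json
import sqlite3
import uuid


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical account + outbox schema."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE account (
            tenant_id TEXT, account_id TEXT, plan TEXT, version INTEGER,
            PRIMARY KEY (tenant_id, account_id)
        );
        CREATE TABLE outbox_events (
            event_id TEXT PRIMARY KEY, tenant_id TEXT, aggregate_id TEXT,
            aggregate_version INTEGER, event_type TEXT, payload TEXT,
            published_at TEXT  -- NULL until the relay has handed the row off
        );
    """)
    return db


def change_plan(db: sqlite3.Connection, tenant_id: str, account_id: str, plan: str) -> None:
    """Request path: one transaction covering the state change and its outbox row."""
    with db:  # commits on success, rolls back if anything below raises
        cur = db.execute(
            "UPDATE account SET plan = ?, version = version + 1 "
            "WHERE tenant_id = ? AND account_id = ?",
            (plan, tenant_id, account_id),
        )
        if cur.rowcount != 1:
            raise LookupError("account not found")
        # Read the version back inside the same transaction so the event
        # carries exactly the version that was committed.
        (version,) = db.execute(
            "SELECT version FROM account WHERE tenant_id = ? AND account_id = ?",
            (tenant_id, account_id),
        ).fetchone()
        db.execute(
            "INSERT INTO outbox_events VALUES (?, ?, ?, ?, ?, ?, NULL)",
            (str(uuid.uuid4()), tenant_id, account_id, version,
             "account.plan_changed", json.dumps({"plan": plan})),
        )


def relay_once(db: sqlite3.Connection, publish, limit: int = 100) -> int:
    """Relay: hand unpublished rows to `publish`, then mark them as published."""
    rows = db.execute(
        "SELECT event_id, tenant_id, aggregate_id, aggregate_version, event_type, payload "
        "FROM outbox_events WHERE published_at IS NULL ORDER BY rowid LIMIT ?",
        (limit,),
    ).fetchall()
    for row in rows:
        # If publish raises, the row stays unpublished and is retried next pass.
        # A crash after publish but before the UPDATE causes a duplicate,
        # so delivery is at-least-once and consumers must be idempotent.
        publish(row)
        with db:
            db.execute(
                "UPDATE outbox_events SET published_at = datetime('now') WHERE event_id = ?",
                (row[0],),
            )
    return len(rows)
```

The request handler only ever calls something like `change_plan`. The relay runs on its own schedule, and nothing in the request path knows whether a broker, Redis, or a CDN even exists.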

Why this architecture holds up better

It makes consistency honest

After commit, your system is not strongly consistent across every derived store. It is eventually consistent, but intentionally so. That is much healthier than accidental eventual consistency hidden behind a 200 OK.

The outbox record is the contract: "this change committed; now propagate it." That is a real state machine, not wishful thinking.

It gives retries somewhere safe to live

Queue consumers can retry without re-running the original request handler. That matters a lot in serverless and edge-heavy systems, where the request lifetime is short and partial failures are common.

Queue semantics are still not magic. Cloudflare Queues documents default retries, delayed reprocessing, and DLQs, and it also points out an easy footgun: if one message in a batch fails, the whole batch can be retried unless you acknowledge messages explicitly. That is exactly why workers must be idempotent.

If a worker calls an external API, pass an idempotency key when the provider supports it. Stripe's idempotent requests are the model here: same key, same effective result. If your worker retries after a timeout, you want the provider to dedupe for you.
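A rough sketch of what that looks like in a consumer: acknowledge each message individually, request a retry only for the one that failed, and reuse the event ID as the idempotency key for the downstream call. The `Message` class here is a stand-in written for this example, not any queue provider's real type, and the in-memory `processed` set stands in for a durable dedupe store.

```python
from dataclasses import dataclass


@dataclass
class Message:
    """Hypothetical queue message with per-message ack/retry flags."""
    event_id: str
    body: dict
    acked: bool = False
    retry_requested: bool = False

    def ack(self) -> None:
        self.acked = True

    def retry(self) -> None:
        self.retry_requested = True


def handle_batch(messages, apply, processed: set) -> None:
    """Process a batch so that one failure does not force a redo of the rest.

    `apply(body, idempotency_key=...)` is an assumed callback that performs
    the downstream mutation (cache purge, vendor API call, and so on).
    """
    for msg in messages:
        if msg.event_id in processed:
            msg.ack()  # duplicate delivery: already applied, just acknowledge
            continue
        try:
            # The event ID doubles as the idempotency key, so a retry after a
            # timeout can be deduplicated by the receiving side as well.
            apply(msg.body, idempotency_key=msg.event_id)
        except Exception:
            msg.retry()  # only this message is redelivered
            continue
        processed.add(msg.event_id)
        msg.ack()
```

With this shape, a failure in the middle of a batch costs one redelivery instead of replaying every message that had already succeeded.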

It keeps tenancy attached to the data

A surprising number of cache incidents are not invalidation failures. They are key design failures.

If your event payload does not carry tenant_id, aggregate_id, and a monotonic version, you are already making the worker guess. Guessing is how one tenant's invalidation can touch another tenant's cache namespace.

Every downstream key should be tenant-scoped. Every event should be tenant-scoped. Every replay should be safe to run twice.

A practical pattern is:

  • cache keys include tenant_id

  • events include event_id, tenant_id, aggregate_id, aggregate_version

  • workers only apply updates when the incoming version is newer than the cached version

  • dedupe tables track processed event_ids when needed

That is much more reliable than DEL user:123 and a prayer.
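As an illustration, the key scheme and the version check fit in a few lines. This is a sketch against a plain dict; the key format and the `apply_event` name are assumptions, and in a shared cache the compare-and-write step would need to be atomic.

```python
def cache_key(tenant_id: str, aggregate_type: str, aggregate_id: str) -> str:
    """Build a key that cannot collide across tenants (format is illustrative)."""
    return f"tenant:{tenant_id}:{aggregate_type}:{aggregate_id}"


def apply_event(cache: dict, event: dict) -> bool:
    """Apply an event only if it is newer than what the cache already holds.

    Returns True when the cache changed. Duplicates and out-of-order
    deliveries return False and leave the cache untouched, which is what
    makes replaying the same event twice safe.
    """
    key = cache_key(event["tenant_id"], event["aggregate_type"], event["aggregate_id"])
    current = cache.get(key)
    if current is not None and current["version"] >= event["aggregate_version"]:
        return False  # stale or duplicate event
    cache[key] = {"version": event["aggregate_version"], "value": event["payload"]}
    return True
```

Because the worker never has to infer the tenant or the ordering, an event for account 123 in one tenant has no way to reach the entry for account 123 in another.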

It makes observability useful instead of decorative

If you cannot answer "how stale can this read be right now?" you do not have observability. You have dashboards.

At minimum, instrument:

  • outbox lag: oldest unprocessed committed event age

  • queue lag: enqueue-to-consume latency

  • retry rate and DLQ volume

  • dedupe hit rate

  • per-tenant stale-read age for critical entities
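The first and last of those metrics can be derived from the outbox itself. A minimal sketch, assuming each outbox row is available as a dict with `tenant_id`, `created_at`, and a `published_at` that is `None` until the row has been processed (those field names are assumptions):

```python
from datetime import datetime, timedelta, timezone


def outbox_lag_seconds(events, now: datetime) -> float:
    """Age in seconds of the oldest committed event not yet processed."""
    pending = [e["created_at"] for e in events if e["published_at"] is None]
    if not pending:
        return 0.0
    return (now - min(pending)).total_seconds()


def lag_by_tenant(events, now: datetime) -> dict:
    """Same measurement per tenant: an upper bound on how stale each tenant's reads can be."""
    oldest = {}
    for e in events:
        if e["published_at"] is not None:
            continue
        tenant = e["tenant_id"]
        if tenant not in oldest or e["created_at"] < oldest[tenant]:
            oldest[tenant] = e["created_at"]
    return {tenant: (now - ts).total_seconds() for tenant, ts in oldest.items()}
```

Emitting the per-tenant number as a gauge is what lets you answer the twenty-minutes-stale question for one tenant without averaging it away across all of them.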

Also propagate trace context across the message boundary. OpenTelemetry's messaging semantic conventions are very explicit: producers should attach message creation context so consumer traces can be correlated back to the original operation. If you skip that, your async pipeline becomes a blame tunnel.
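To show the idea without pulling in a tracing SDK, here is a hand-rolled sketch that carries a W3C `traceparent` value in message attributes and continues the same trace on the consumer side. In a real system an OpenTelemetry propagator would normally do this for you; the function names and the `traceparent` attribute placement below are assumptions for illustration, and continuing the trace as a child is only one of the ways a consumer can correlate back to the producer.

```python
import re
import secrets

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Start a new sampled trace with random trace and span IDs."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def attach_trace(message_attrs: dict, traceparent: str) -> dict:
    """Producer side: copy the attributes and add the creation context."""
    return {**message_attrs, "traceparent": traceparent}


def child_traceparent(incoming: str) -> str:
    """Consumer side: keep the trace ID, mint a new span ID for the processing step.

    Falls back to a fresh trace when the incoming value is missing or malformed.
    """
    match = TRACEPARENT_RE.match(incoming or "")
    if not match:
        return new_traceparent()
    trace_id, _parent_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

The point is that the trace ID written when the outbox row was created survives the queue hop, so a slow invalidation can be tied back to the request that caused it.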

Where durable execution fits

Sometimes a simple queue consumer is enough. Sometimes cache repair fans out into multi-step work:

  • invalidate Redis

  • purge CDN tags

  • rebuild a projection

  • call a rate-limited vendor API

  • wait for a callback or a cooldown window

Once that process spans minutes or hours, stop inventing workflow logic out of retries and visibility timeouts. Use durable execution.

The current AWS Durable Execution idempotency guidance says the quiet part out loud: replay and retry can run the same operation more than once, so side effects must be matched to the right execution semantics. That is the right mental model. Durable execution is not about fancy orchestration diagrams. It is about making long-lived recovery behavior explicit.
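The underlying mechanism can be sketched in a few lines: record each completed step in a journal, and on replay skip whatever the journal already contains. This is a toy model of the idea, not the API of AWS or any other durable execution product; `run_step`, `run_workflow`, and the dict-based journal are assumptions, and a real journal would be persisted.

```python
def run_step(journal: dict, name: str, fn):
    """Run `fn` once per workflow: return the recorded result if the step already completed."""
    if name in journal:
        return journal[name]
    result = fn()
    # A crash between fn() returning and this write would re-run the step on
    # replay, which is why each step's side effect still needs to be idempotent.
    journal[name] = result
    return result


def run_workflow(journal: dict, steps) -> dict:
    """Execute named steps in order; safe to call again with the same journal after a failure."""
    for name, fn in steps:
        run_step(journal, name, fn)
    return journal
```

If the CDN purge times out on the first attempt, calling `run_workflow` again with the same journal skips the Redis invalidation that already succeeded and resumes at the purge.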

The tradeoffs

This architecture is better, not free.

  • You add components: outbox, relay, queue, workers, DLQ, metrics.

  • You accept a freshness window instead of pretending every read is instantly current.

  • You need event schemas and versioning discipline.

  • Ordering becomes a real concern quickly. AWS explicitly notes duplicates and ordering concerns in outbox pipelines, especially with at-least-once delivery.

  • Your platform team needs tooling for replay, backfill, and poison-message handling.

Still, those are good costs. They are visible costs.

The worse architecture is the one that looks simpler because all the complexity is hiding inside request retries, random stale reads, and on-call folklore.

The rule I would actually enforce

If a write must update caches, indexes, flags, search, or edge state, the request handler should only commit primary state plus a durable propagation record.

Everything else belongs behind an async boundary with idempotent workers, tenant-safe keys, retry policy, and traceable lag.

That sounds slower. In practice it is usually faster where it counts: fewer user-visible inconsistencies, fewer impossible incidents, and far less time spent pretending cache invalidation is a synchronous API concern.