Your Write Path Should Not Also Be Your Cache Invalidation Pipeline
A practical architecture for cache invalidation that avoids dual writes, survives retries, and keeps multi-tenant systems observable and boring.
The mistake: treating cache invalidation like a side effect you can "just do after commit"
A very normal architecture mistake looks like this:
API handler writes the source-of-truth row
API handler publishes an event
API handler invalidates Redis keys or CDN tags
API handler maybe pokes a search index too
everyone assumes the system is now consistent
It works in staging. It even works in production for a while. Then one day the database commit succeeds, the event publish times out, the function retries, Redis gets half-invalidated, and one tenant keeps seeing old data for twenty minutes while another tenant gets the fresh version immediately.
That is not a cache bug. That is a boundary bug.
You turned one request into a distributed transaction without admitting it.
The core problem is the classic dual-write failure mode. AWS calls this out directly in its transactional outbox pattern: if the database write succeeds and the event notification fails, downstream systems never hear about the change. If the reverse happens, you can publish a change that never committed.
Caches make this worse because they make the failure look random. The primary database is correct, but every derived system gets to be wrong in its own special way.
The better architecture
The fix is boring on purpose:
keep the request path responsible for one durable state transition
write an outbox record in the same transaction as the primary write
let workers process committed events asynchronously
make every downstream mutation idempotent
measure freshness explicitly instead of pretending consistency is free
If you are on Postgres or MySQL, the outbox table pattern is still one of the cleanest answers. Debezium's outbox event router exists for exactly this reason: capture committed outbox rows and route them as events without asking your request handler to talk to the broker directly.
The request path does one transaction:
BEGIN;
-- Primary write: the change and the version bump happen together.
UPDATE account
SET plan = 'pro', version = version + 1
WHERE tenant_id = $1 AND account_id = $2;
-- Outbox record in the same transaction.
-- $4 is expected to carry the new aggregate version written by the UPDATE above.
INSERT INTO outbox_events (
    event_id,
    tenant_id,
    aggregate_type,
    aggregate_id,
    aggregate_version,
    event_type,
    payload,
    created_at
) VALUES (
    $3, $1, 'account', $2, $4, 'account.plan_changed', $5, NOW()
);
COMMIT;
That boundary matters. The API's job is to durably accept the command. It is not the API's job to synchronously drag every cache, read model, and edge layer into agreement before returning.
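The same single-transaction write can be sketched from application code. This is a minimal illustration, not a reference implementation: it uses Python's built-in sqlite3 so it runs anywhere, and the `change_plan` helper, the `SCHEMA` string, and the `processed_at` column are assumptions added for the example. It reads the new version back inside the transaction so the outbox row always carries the version the update actually produced.

```python
import json
import sqlite3
import uuid

# Assumed schema for the sketch; processed_at is an addition used by a relay.
SCHEMA = """
CREATE TABLE account (
    tenant_id TEXT NOT NULL,
    account_id TEXT NOT NULL,
    plan TEXT NOT NULL,
    version INTEGER NOT NULL,
    PRIMARY KEY (tenant_id, account_id)
);
CREATE TABLE outbox_events (
    event_id TEXT PRIMARY KEY,
    tenant_id TEXT NOT NULL,
    aggregate_type TEXT NOT NULL,
    aggregate_id TEXT NOT NULL,
    aggregate_version INTEGER NOT NULL,
    event_type TEXT NOT NULL,
    payload TEXT NOT NULL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    processed_at TEXT
);
"""


def change_plan(conn, tenant_id, account_id, plan):
    """Update the account and record the outbox event in one transaction."""
    with conn:  # commits on success, rolls back if anything raises
        cur = conn.execute(
            "UPDATE account SET plan = ?, version = version + 1 "
            "WHERE tenant_id = ? AND account_id = ?",
            (plan, tenant_id, account_id),
        )
        if cur.rowcount != 1:
            raise LookupError("account not found")
        # Read back the version written by the UPDATE, inside the same transaction.
        (version,) = conn.execute(
            "SELECT version FROM account WHERE tenant_id = ? AND account_id = ?",
            (tenant_id, account_id),
        ).fetchone()
        event_id = str(uuid.uuid4())
        conn.execute(
            "INSERT INTO outbox_events (event_id, tenant_id, aggregate_type, "
            "aggregate_id, aggregate_version, event_type, payload) "
            "VALUES (?, ?, 'account', ?, ?, 'account.plan_changed', ?)",
            (event_id, tenant_id, account_id, version, json.dumps({"plan": plan})),
        )
    return event_id, version
```

If the account does not exist, the exception rolls the transaction back and no outbox row is written, so downstream systems never hear about a change that did not commit.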
Why this architecture holds up better
It makes consistency honest
After commit, your system is not strongly consistent across every derived store. It is eventually consistent, but intentionally so. That is much healthier than accidental eventual consistency hidden behind a 200 OK.
The outbox record is the contract: "this change committed; now propagate it." That is a real state machine, not wishful thinking.
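One way to picture that state machine is a polling relay that moves rows from "committed" to "published". The sketch below is an assumption-heavy illustration: the `seq` ordering column, the `processed_at` marker, and the `publish` callback are all hypothetical, and a log-based tool such as Debezium would replace this loop entirely by reading committed rows from the database log.

```python
import sqlite3

# Assumed minimal outbox shape; seq gives a stable publish order.
OUTBOX_SCHEMA = """
CREATE TABLE outbox_events (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    event_id TEXT NOT NULL UNIQUE,
    tenant_id TEXT NOT NULL,
    payload TEXT NOT NULL,
    processed_at TEXT
);
"""


def relay_once(conn, publish, batch_size=100):
    """Publish unprocessed outbox rows in order; return how many were sent."""
    rows = conn.execute(
        "SELECT seq, event_id, tenant_id, payload FROM outbox_events "
        "WHERE processed_at IS NULL ORDER BY seq LIMIT ?",
        (batch_size,),
    ).fetchall()
    sent = 0
    for seq, event_id, tenant_id, payload in rows:
        try:
            publish({"event_id": event_id, "tenant_id": tenant_id, "payload": payload})
        except Exception:
            break  # stop at the first failure so later events do not overtake it
        with conn:  # mark as processed only after the publish succeeded
            conn.execute(
                "UPDATE outbox_events SET processed_at = CURRENT_TIMESTAMP "
                "WHERE seq = ?",
                (seq,),
            )
        sent += 1
    return sent
```

A crash between the publish and the update means the event is published again on the next poll. That is at-least-once delivery, which is why the consumers have to tolerate duplicates.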
It gives retries somewhere safe to live
Queue consumers can retry without re-running the original request handler. That matters a lot in serverless and edge-heavy systems, where the request lifetime is short and partial failures are common.
Queue semantics are still not magic. Cloudflare Queues documents default retries, delayed reprocessing, and DLQs, and it also points out an easy footgun: if one message in a batch fails, the whole batch can be retried unless you acknowledge messages explicitly. That is exactly why workers must be idempotent.
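The batch footgun is easier to see in code. The sketch below uses a hypothetical `Message` class as a stand-in for whatever per-message ack and retry calls your queue exposes; it is not the Cloudflare Queues API. The point is only that each message is settled individually, so one failure does not drag the whole batch back through the pipeline.

```python
from dataclasses import dataclass


@dataclass
class Message:
    """Hypothetical stand-in for a queue message with per-message ack/retry."""
    body: dict
    status: str = "pending"

    def ack(self):
        self.status = "acked"

    def retry(self):
        self.status = "retry"


def handle_batch(messages, apply):
    """Settle every message on its own outcome instead of failing the batch."""
    for msg in messages:
        try:
            apply(msg.body)
        except Exception:
            msg.retry()  # only this message is redelivered
        else:
            msg.ack()  # successful work is not repeated
```

Even with explicit acknowledgement, a message can still be delivered twice (for example, if the worker dies after `apply` but before `ack`), so `apply` itself still has to be idempotent.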
If a worker calls an external API, pass an idempotency key when the provider supports it. Stripe's idempotent requests are the model here: same key, same effective result. If your worker retries after a timeout, you want the provider to dedupe for you.
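For the key itself, one reasonable approach is to derive it deterministically from the event, so a retry of the same step always sends the same key. This is a sketch under assumptions: the namespace URL, the `step` naming, and both helper functions are made up for illustration. The `Idempotency-Key` header name follows Stripe's convention; other providers may use a different header or field.

```python
import uuid

# Assumed namespace; any fixed UUID works as long as it never changes.
KEY_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/outbox")


def idempotency_key(event_id, step):
    """Same event and step always yield the same key, across retries and hosts."""
    return str(uuid.uuid5(KEY_NAMESPACE, f"{event_id}:{step}"))


def build_request_headers(event_id, step):
    """Headers for an outbound provider call made on behalf of one event step."""
    return {"Idempotency-Key": idempotency_key(event_id, step)}
```

Including the step name matters when one event triggers several external calls: each call gets its own stable key, and none of them collide.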
It keeps tenancy attached to the data
A surprising number of cache incidents are not invalidation failures. They are key design failures.
If your event payload does not carry tenant_id, aggregate_id, and a monotonic version, you are already making the worker guess. Guessing is how one tenant's invalidation can touch another tenant's cache namespace.
Every downstream key should be tenant-scoped. Every event should be tenant-scoped. Every replay should be safe to run twice.
A practical pattern is:
cache keys include tenant_id
events include event_id, tenant_id, aggregate_id, aggregate_version
workers only apply updates when the incoming version is newer than the cached version
dedupe tables track processed event_ids when needed
That is much more reliable than DEL user:123 and a prayer.
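The pattern above fits in a few lines. This is a sketch, not a Redis client: the cache is a plain dict, the dedupe table is a set, and the key layout and the `apply_event` return values are assumptions chosen for the example.

```python
def cache_key(tenant_id, aggregate_type, aggregate_id):
    """Tenant-scoped key: one tenant's events can never address another's entries."""
    return f"tenant:{tenant_id}:{aggregate_type}:{aggregate_id}"


def apply_event(cache, seen_event_ids, event):
    """Apply an event only if it is new and carries a newer version."""
    if event["event_id"] in seen_event_ids:
        return "duplicate"  # redelivery of an event we already handled
    seen_event_ids.add(event["event_id"])

    key = cache_key(event["tenant_id"], event["aggregate_type"], event["aggregate_id"])
    current = cache.get(key)
    if current is not None and current["version"] >= event["aggregate_version"]:
        return "stale"  # out-of-order delivery; the cache is already ahead

    cache[key] = {"version": event["aggregate_version"], "value": event["payload"]}
    return "applied"
```

The version check is what makes replays and out-of-order delivery safe. The dedupe set is an optimization on top of it, which is why the list says "when needed".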
It makes observability useful instead of decorative
If you cannot answer "how stale can this read be right now?" you do not have observability. You have dashboards.
At minimum, instrument:
outbox lag: oldest unprocessed committed event age
queue lag: enqueue-to-consume latency
retry rate and DLQ volume
dedupe hit rate
per-tenant stale-read age for critical entities
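Outbox lag is the cheapest of these to compute. As a rough sketch, assuming you can fetch `(tenant_id, created_at)` pairs for unprocessed outbox rows, the per-tenant lag is the age of the oldest pending event. The function name and input shape are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def outbox_lag_seconds(pending, now):
    """Return {tenant_id: age in seconds of that tenant's oldest pending event}."""
    oldest = {}
    for tenant_id, created_at in pending:
        if tenant_id not in oldest or created_at < oldest[tenant_id]:
            oldest[tenant_id] = created_at
    return {tenant: (now - ts).total_seconds() for tenant, ts in oldest.items()}


# Example input: two pending events for t1, one for t2.
now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
pending = [
    ("t1", now - timedelta(seconds=30)),
    ("t1", now - timedelta(seconds=90)),
    ("t2", now - timedelta(seconds=5)),
]
lag = outbox_lag_seconds(pending, now)
```

Reporting the value per tenant rather than as one global number is what lets you spot the case where a single tenant is stuck while everyone else looks healthy.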
Also propagate trace context across the message boundary. OpenTelemetry's messaging semantic conventions are very explicit: producers should attach message creation context so consumer traces can be correlated back to the original operation. If you skip that, your async pipeline becomes a blame tunnel.
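To make the idea concrete, here is a hand-rolled sketch that carries a W3C Trace Context `traceparent` value in message headers. It only illustrates the shape of the data crossing the boundary. The message layout and helper names are assumptions, and a real system would normally rely on an OpenTelemetry SDK propagator rather than code like this.

```python
import re
import secrets

# traceparent format: version "00", 32-hex trace id, 16-hex parent id, 2-hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent():
    """Create a sampled traceparent value with random trace and span ids."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def attach_trace_context(message, traceparent):
    """Producer side: return a copy of the message with the context in its headers."""
    headers = dict(message.get("headers", {}))
    headers["traceparent"] = traceparent
    return {**message, "headers": headers}


def extract_trace_id(message):
    """Consumer side: return the trace id, or None if the header is missing or invalid."""
    value = message.get("headers", {}).get("traceparent", "")
    match = TRACEPARENT_RE.match(value)
    return match.group(1) if match else None
```

Storing the same value on the outbox row at write time lets the relay forward it unchanged, so the consumer's trace links back to the original request rather than to the relay's polling loop.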
Where durable execution fits
Sometimes a simple queue consumer is enough. Sometimes cache repair fans out into multi-step work:
invalidate Redis
purge CDN tags
rebuild a projection
call a rate-limited vendor API
wait for a callback or a cooldown window
Once that process spans minutes or hours, stop inventing workflow logic out of retries and visibility timeouts. Use durable execution.
The current AWS Durable Execution idempotency guidance says the quiet part out loud: replay and retry can run the same operation more than once, so side effects must be matched to the right execution semantics. That is the right mental model. Durable execution is not about fancy orchestration diagrams. It is about making long-lived recovery behavior explicit.
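The replay behavior can be illustrated with a toy journal. This is not how any particular durable execution product works internally; `run_workflow`, the dict-based journal, and the step names are all hypothetical. It shows the core idea only: completed steps are recorded, and a replay skips them instead of repeating their side effects.

```python
def run_workflow(execution_id, steps, journal):
    """Run named steps in order, skipping any already recorded for this execution."""
    done = journal.setdefault(execution_id, {})
    for name, fn in steps:
        if name in done:
            continue  # replay: the step already completed, reuse its recorded result
        result = fn()  # may raise; the journal then keeps only the earlier steps
        done[name] = result
    return done
```

There is still a gap between a step's side effect and the journal write. If the process dies in that gap, the step runs again on replay, so each step needs to be idempotent on its own. That is the same point the guidance makes about matching side effects to execution semantics.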
The tradeoffs
This architecture is better, not free.
You add components: outbox, relay, queue, workers, DLQ, metrics.
You accept a freshness window instead of pretending every read is instantly current.
You need event schemas and versioning discipline.
Ordering gets real fast. AWS explicitly notes duplicates and ordering concerns in outbox pipelines, especially with at-least-once delivery.
Your platform team needs tooling for replay, backfill, and poison-message handling.
Still, those are good costs. They are visible costs.
The worse architecture is the one that looks simpler because all the complexity is hiding inside request retries, random stale reads, and on-call folklore.
The rule I would actually enforce
If a write must update caches, indexes, flags, search, or edge state, the request handler should only commit primary state plus a durable propagation record.
Everything else belongs behind an async boundary with idempotent workers, tenant-safe keys, retry policy, and traceable lag.
That sounds slower. In practice it is usually faster where it counts: fewer user-visible inconsistencies, fewer impossible incidents, and far less time spent pretending cache invalidation is a synchronous API concern.