Your Redis Cache Has a Multi-Tenant Data Leak You Haven't Found Yet
A realistic look at the multi-tenant cache isolation bugs most SaaS teams ship without noticing, and how to fix them with tenant-namespaced keys, event-driven invalidation, and probabilistic stampede prevention.
Most SaaS engineers add Redis early and never question it again. It works in demos. Then you onboard your second customer and the clock starts ticking on a data isolation bug you won't find until that customer reports it — loudly, to your CEO.
The Mistake Everyone Makes
Here's the pattern I see constantly. You have a multi-tenant API. You add Redis caching. You key responses by something like `user:${userId}:profile` or, worse, just `${resourceId}`. Works in development, works in staging, ships to production.
The problem: your key strategy was built for a single-tenant mental model. The moment two tenants can share a resourceId namespace — which happens with numeric auto-increment IDs, or any time tenant IDs aren't embedded in every key — you have a latent data leak.
```js
// The mistake: no tenant context in the key
const cached = await redis.get(`invoice:${invoiceId}`);
```
If tenant A's invoice 1042 gets cached and tenant B requests their own invoice 1042, they might get tenant A's cached response. It's not every system — but any system where IDs aren't globally unique across all tenants is at risk.
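To make the collision concrete, here is a minimal sketch that uses an in-memory `Map` as a stand-in for Redis. The `fakeDb` contents, the `getInvoiceUnsafe` helper, and the tenant names are all hypothetical; only the key shape mirrors the snippet above.

```js
// In-memory stand-ins (assumptions for illustration, not a real Redis client).
const cache = new Map();
const fakeDb = {
  'tenant-a': { 1042: { id: 1042, customer: 'Acme', total: 900 } },
  'tenant-b': { 1042: { id: 1042, customer: 'Globex', total: 120 } },
};

// Buggy read path: the cache key ignores which tenant is asking.
function getInvoiceUnsafe(tenantId, invoiceId) {
  const key = `invoice:${invoiceId}`;
  if (cache.has(key)) return cache.get(key); // may belong to another tenant
  const row = fakeDb[tenantId][invoiceId];
  cache.set(key, row);
  return row;
}

// Tenant A reads first: a miss, so A's invoice is cached under "invoice:1042".
const first = getInvoiceUnsafe('tenant-a', 1042);
// Tenant B asks for its own invoice 1042 and gets whatever is cached under that key.
const second = getInvoiceUnsafe('tenant-b', 1042);
```

In this sketch `second` is tenant A's row, because nothing in the key distinguishes the two tenants.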
Add to that the noisy-neighbor problem: tenant A generates 10x the cache traffic, fills your LRU pool, and evicts tenant B's data. Tenant B then hammers your database. Now you have a cascade failure that gets filed as a "database performance issue" and never traced back to the cache.
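The eviction effect is easy to reproduce in miniature. The sketch below uses a tiny LRU built on `Map` insertion order; it is a hypothetical stand-in, since real Redis uses approximate LRU sampling rather than an exact ordering, and the capacity and key names are made up for the demonstration.

```js
// Tiny exact LRU built on Map insertion order (illustration only).
class TinyLru {
  constructor(capacity) {
    this.capacity = capacity;
    this.entries = new Map();
  }
  get(key) {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key);
    this.entries.delete(key); // re-insert to mark as most recently used
    this.entries.set(key, value);
    return value;
  }
  set(key, value) {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.capacity) {
      const oldest = this.entries.keys().next().value; // least recently used
      this.entries.delete(oldest);
    }
  }
}

const lru = new TinyLru(100);
// Tenant B caches 10 entries...
for (let i = 0; i < 10; i++) lru.set(`tenant-b:invoice:${i}`, { i });
// ...then tenant A writes 100 entries into the same pool.
for (let i = 0; i < 100; i++) lru.set(`tenant-a:invoice:${i}`, { i });

// Count how many of tenant B's entries are still cached.
const tenantBSurvivors = [...lru.entries.keys()]
  .filter((key) => key.startsWith('tenant-b:')).length;
```

With a shared pool, tenant A's burst pushes out every one of tenant B's entries, even though tenant B did nothing wrong.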
The Stampede Nobody Talks About
There's a second failure mode that hits at scale: cache stampede on cold start.
When a popular key expires — or gets evicted under memory pressure — every concurrent in-flight request misses simultaneously and fires a database query. For a busy endpoint, that's hundreds of queries in under a second.
Key expires → 400 concurrent DB queries → CPU spikes → p99 blows past SLA → alerts fire
You added caching to protect the database. The cache itself becomes the failure vector. DDIA calls this a thundering herd, and it's more common than the graphs suggest because the spike is sharp enough to miss in most monitoring aggregations.
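A small simulation shows why the miss is synchronized. Everything here is a stand-in chosen for the sketch: `loadFromDb` fakes a 20 ms query, the cache is a plain `Map`, and `simulateStampede` is a hypothetical helper.

```js
// Hypothetical slow loader standing in for a database query.
let dbQueries = 0;
async function loadFromDb(key) {
  dbQueries += 1;
  await new Promise((resolve) => setTimeout(resolve, 20)); // simulated query latency
  return `value-for-${key}`;
}

const cache = new Map();

// Naive read-through: every caller that sees a miss queries the database.
async function naiveGet(key) {
  if (cache.has(key)) return cache.get(key);
  const value = await loadFromDb(key);
  cache.set(key, value);
  return value;
}

// Fire `concurrency` requests at a cold key and report how many DB queries resulted.
async function simulateStampede(concurrency) {
  dbQueries = 0;
  cache.clear();
  const requests = Array.from({ length: concurrency }, () => naiveGet('popular-key'));
  await Promise.all(requests);
  return dbQueries;
}
```

Because every request checks the cache before the first query has returned, `simulateStampede(400)` issues one database query per request rather than one in total.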
A Better Architecture
Fix both problems with three changes: tenant-namespaced keys, probabilistic early expiry, and event-driven invalidation.
**Namespace every key by tenant**
```js
// Always embed tenant context
const key = `t:${tenantId}:invoice:${invoiceId}`;
const cached = await redis.get(key);
```
Prefix every key with `t:{tenantId}`. Non-negotiable. Enforce this in your cache client layer, a thin wrapper around your Redis client, so individual engineers can't accidentally bypass it by calling Redis directly.
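One way such a wrapper might look is sketched below. The `TenantCache` class and `createFakeClient` helper are hypothetical, and the client shape (async `get`, `set`, `del` taking a key) is an assumption rather than the API of any particular Redis library, so adapt the calls to the client you actually use.

```js
// Thin tenant-scoped wrapper. `client` is any object exposing async
// get/set/del(key) methods; that shape is an assumption for this sketch.
class TenantCache {
  constructor(client, tenantId) {
    if (typeof tenantId !== 'string' || tenantId.length === 0) {
      throw new Error('TenantCache requires a non-empty tenantId');
    }
    if (tenantId.includes(':')) {
      // Prevents an ID like "a:invoice" from overlapping tenant "a"'s keys.
      throw new Error('tenantId must not contain ":"');
    }
    this.client = client;
    this.prefix = `t:${tenantId}:`;
  }
  key(suffix) { return this.prefix + suffix; }
  get(suffix) { return this.client.get(this.key(suffix)); }
  set(suffix, value) { return this.client.set(this.key(suffix), value); }
  del(suffix) { return this.client.del(this.key(suffix)); }
}

// In-memory fake client for local experiments and tests (hypothetical).
function createFakeClient() {
  const store = new Map();
  return {
    store,
    async get(key) { return store.has(key) ? store.get(key) : null; },
    async set(key, value) { store.set(key, value); return 'OK'; },
    async del(key) { return store.delete(key) ? 1 : 0; },
  };
}
```

Handlers would then receive a `TenantCache` built from the authenticated request's tenant and never touch the raw client, which keeps the prefix out of individual call sites.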
**Probabilistic early expiry to kill stampedes**
Instead of all requests missing at TTL=0, start recomputing before expiry using controlled probability. The formula from the XFetch paper (Vattani et al., 2015) handles this cleanly:
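```js
// remainingTtl is in seconds; recomputeMs is the measured recompute time in milliseconds.
function shouldRecompute(remainingTtl, recomputeMs, beta = 1.0) {
  const delta = recomputeMs / 1000;
  // XFetch: now - delta * beta * ln(rand) >= expiry, rearranged so that expiry - now = remainingTtl.
  return -delta * beta * Math.log(Math.random()) >= remainingTtl;
}
```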
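Each request makes this check independently, and the odds of an early recompute climb as expiry approaches. In practice a request or two refreshes the value shortly before it expires while every other concurrent request keeps serving the still-valid cached copy, so the synchronized miss rarely happens. `beta` controls aggressiveness: values above 1 recompute earlier, values below 1 cut it closer.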
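A fuller sketch of the read path, written in the paper's form with an absolute expiry timestamp, might look like this. The names `shouldRefreshEarly` and `xfetch`, the `Map` used as the store, and the injectable `rand` and `nowFn` parameters are assumptions made for the example so the logic can be exercised without a Redis server.

```js
// XFetch-style check: now - delta * beta * ln(rand()) >= expiry.
// All times are in seconds. `rand` is injectable so the check can be tested.
function shouldRefreshEarly(nowSec, expirySec, deltaSec, beta = 1.0, rand = Math.random) {
  if (nowSec >= expirySec) return true; // already expired, always refresh
  // ln(rand) is <= 0, so subtracting it pushes "now" forward by a random amount.
  return nowSec - deltaSec * beta * Math.log(rand()) >= expirySec;
}

// Read-through fetch that stores each value with its recompute cost and expiry.
// `store` is a Map stand-in for Redis; `recompute` is a caller-supplied async loader.
async function xfetch(store, key, ttlSec, recompute, beta = 1.0, nowFn = () => Date.now() / 1000) {
  const entry = store.get(key);
  if (!entry || shouldRefreshEarly(nowFn(), entry.expiry, entry.delta, beta)) {
    const started = nowFn();
    const value = await recompute();
    const delta = nowFn() - started; // measured recompute time, in seconds
    store.set(key, { value, delta, expiry: nowFn() + ttlSec });
    return value;
  }
  return entry.value;
}
```

Storing the measured `delta` next to the value means the early-refresh window tracks real recompute cost instead of a hand-picked constant.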
**Event-driven invalidation instead of TTL guesswork**
TTLs are a blunt instrument. You pick 5 minutes because you don't know how stale is acceptable. Better: publish a cache invalidation event whenever the underlying data changes.
```mermaid
flowchart LR
  API -->|read| Cache
  Cache -->|miss| DB
  DB -->|populate| Cache
  API -->|write| DB
  DB -->|emit| Bus
  Bus -->|trigger| Worker
  Worker -->|invalidate| Cache
```
Your write path publishes an event. A lightweight invalidation worker listens and deletes the relevant keys. Cache stays fresh without TTL guesswork, and you can make the staleness window as tight as your event delivery latency.
```js
// On write
await db.update('invoices', { id: invoiceId, ...patch });
await eventBus.publish('invoice.updated', { tenantId, invoiceId });

// Invalidation worker
eventBus.subscribe('invoice.updated', async ({ tenantId, invoiceId }) => {
  await redis.del(`t:${tenantId}:invoice:${invoiceId}`);
});
```
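To experiment with this flow locally, the bus can be stubbed in process. The `InProcessBus` class and `startInvalidationWorker` function below are hypothetical helpers that only mimic the `publish`/`subscribe` shape used above; a real deployment would sit on a durable broker, and `cache` is assumed to expose an async `del(key)` method.

```js
// Minimal in-process event bus with the publish/subscribe shape used above.
// A stand-in for local experiments, not a replacement for a durable broker.
class InProcessBus {
  constructor() { this.handlers = new Map(); }
  subscribe(topic, handler) {
    const list = this.handlers.get(topic) || [];
    list.push(handler);
    this.handlers.set(topic, list);
  }
  async publish(topic, payload) {
    const list = this.handlers.get(topic) || [];
    // Await every handler so callers (and tests) can observe completion.
    await Promise.all(list.map((handler) => handler(payload)));
  }
}

// Registers the invalidation handler on the bus.
function startInvalidationWorker(bus, cache) {
  bus.subscribe('invoice.updated', async ({ tenantId, invoiceId }) => {
    // Deleting a key is idempotent, so a redelivered event is harmless.
    await cache.del(`t:${tenantId}:invoice:${invoiceId}`);
  });
}
```

Because the delete is idempotent, the worker tolerates at-least-once delivery without extra bookkeeping.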
Tradeoffs You Need to Own
**Event-driven invalidation adds moving parts.** You now have an event bus, a worker process, and an invalidation lag window. During that lag, reads return stale data. For most product features this is fine. For financial audit trails, GDPR deletion flows, or anything where stale data carries legal weight, you might need synchronous invalidation on the write path and accept the latency cost instead.
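For those stricter cases, a synchronous variant could be as small as the sketch below. `updateInvoiceSync` is a hypothetical function, and `db` and `cache` are assumed async interfaces with the same `update` and `del` shapes as the earlier snippets.

```js
// Write path with synchronous invalidation: the cache key is deleted before
// the call returns. `db` and `cache` are hypothetical async interfaces.
async function updateInvoiceSync(db, cache, tenantId, invoiceId, patch) {
  await db.update('invoices', { id: invoiceId, ...patch });
  // Delete only after the write succeeds. If the delete throws, the error
  // reaches the caller instead of leaving a stale entry unnoticed.
  await cache.del(`t:${tenantId}:invoice:${invoiceId}`);
}
```

The cost is one extra round trip on every write, plus a write path that now fails when the cache is unreachable.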
**Probabilistic early expiry needs calibration.** The recompute-time estimate (`recomputeMs` in the function above, which becomes `delta` in seconds) must match actual population cost. Underestimate it and you still stampede. Overestimate it and you recompute constantly, burning database capacity that the cache was supposed to save. Instrument recompute duration in production and tune from real numbers, not estimates.
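**Per-tenant namespacing costs memory.** Every key carries a few extra prefix bytes, and entries that used to collide under one key now exist once per tenant. If your Redis is already near capacity, you need to size up or accept higher eviction rates. Namespacing also does not fix noisy neighbors by itself: `maxmemory` applies to the whole instance and Redis has no built-in per-prefix memory limit, so hard per-tenant caps mean separate instances or application-level quotas, and both add operational surface area. Know your memory budget before rolling this out; a surprise OOM during peak traffic is worse than the original noisy-neighbor problem.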
**Eviction policy matters more than you think.** `allkeys-lru` is the right default for most caches: it evicts under memory pressure without causing hard failures. `noeviction` rejects writes with an error once `maxmemory` is reached, so cache population starts failing and, unless you handle that error, a cache miss becomes a 500. `volatile-lru` only evicts keys that have a TTL, which surprises you when some keys don't have one: if nothing is evictable, it fails writes just like `noeviction`. Set the policy explicitly in your Redis config and document it; never rely on whatever the upstream managed service default happens to be.
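As a starting point, the relevant `redis.conf` directives look like the fragment below. The `2gb` figure is a placeholder rather than a recommendation, and managed services typically expose these settings through their own configuration interface instead of a file you edit.

```
# redis.conf (placeholder values; size maxmemory to your own budget)
maxmemory 2gb
maxmemory-policy allkeys-lru
```

Running `CONFIG GET maxmemory-policy` against the live instance shows which policy is actually in effect.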
What Good Observability Looks Like
Cache bugs are silent until they're catastrophic. Four metrics to add before you think you need them:
**Hit rate per tenant** — a drop from 80% to 20% signals an isolation or eviction problem for that tenant specifically
**Key eviction rate** — spikes correlate with noisy-neighbor evictions knocking out other tenants' data
**Invalidation lag** — time between a write event being published and the cache key being deleted; a growing lag means your worker is falling behind
**Stampede recovery time** — how long p99 latency takes to normalize after a cold start; this is your real SLA floor, not the warm-cache p99
Without per-tenant cache metrics, you're operating blind. Your aggregate hit rate can look healthy at 78% while one enterprise tenant has been missing the cache on nearly every request for six hours.
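A minimal sketch of per-tenant hit-rate tracking follows. `TenantCacheStats` is a hypothetical in-process counter; a real system would more likely export the same counts to its metrics backend with the tenant as a label, keeping label cardinality in mind.

```js
// Per-tenant hit/miss counters kept in process memory (illustration only).
class TenantCacheStats {
  constructor() { this.counts = new Map(); }
  record(tenantId, hit) {
    const c = this.counts.get(tenantId) || { hits: 0, misses: 0 };
    if (hit) c.hits += 1; else c.misses += 1;
    this.counts.set(tenantId, c);
  }
  // Hit rate for one tenant, or null if no traffic has been seen yet.
  hitRate(tenantId) {
    const c = this.counts.get(tenantId);
    if (!c || c.hits + c.misses === 0) return null;
    return c.hits / (c.hits + c.misses);
  }
  // Hit rate across all tenants combined.
  aggregateHitRate() {
    let hits = 0;
    let total = 0;
    for (const c of this.counts.values()) {
      hits += c.hits;
      total += c.hits + c.misses;
    }
    return total === 0 ? null : hits / total;
  }
}
```

Call `record(tenantId, hit)` from the cache wrapper's read path and alert on `hitRate` per tenant; a large tenant with a high hit rate can otherwise mask a small tenant whose hit rate has collapsed.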
The Bottom Line
A flat, TTL-only Redis cache is a single-tenant tool in a multi-tenant costume. Tenant-namespaced keys aren't an optimization — they're the baseline for data isolation in any SaaS system. Event-driven invalidation replaces TTL guesswork with explicit consistency boundaries. Probabilistic early expiry kills stampedes before they cascade into your database.
The extra complexity is real and you'll feel it during your first incident with the invalidation worker. But silent cross-tenant data leaks are also complex — just the kind that controls you instead of the other way around.