Stop Building Workflow Engines With Redis and Cron Jobs

You think you have a queue-based system. What you actually have is a workflow engine with no visible state machine, no audit trail, and partial failures that surface as customer support tickets six months later. Here's why durable execution fixes what Redis and cron jobs can't.

You have a user onboarding flow. When someone signs up, you:

1. Send a welcome email
2. Wait 24 hours; if they haven't activated, send a nudge
3. If they're still inactive after 7 days, flag the account for review

It fits on a whiteboard. The implementation takes a week. Then, six months later, a customer reports they never got their reminder. You dig in. The Redis key is gone. The cron job ran but threw a connection pool error it silently swallowed. Three engineers debate what the "correct" state of this user should be. You fix it with a manual script, merge a hotfix, and go home vaguely uneasy.

This is the pattern I see everywhere. **You built a workflow engine on top of queues and Redis. It just doesn't know it's a workflow engine.**

Why the queue-plus-Redis approach quietly accumulates debt

When you build a multi-step workflow with queues and cron jobs, you end up with state spread across multiple systems:

- Your database holds the canonical user record
- Redis holds ephemeral in-flight state (timers, step flags, intermediate results)
- Your queue holds work that hasn't run yet

```mermaid
flowchart LR
    Signup --> SendWelcome
    SendWelcome --> SaveState[Save to Redis]
    SaveState --> Cron[Cron Checks Redis]
    Cron -->|key expired| Stuck[User Lost]
    Cron -->|24h elapsed| Reminder[Send Reminder]
    Reminder --> SetFlag[Update DB]
    SetFlag --> Cron2[Next Cron Step]
```

Every arrow in that diagram is a potential inconsistency. If SaveState fails after SendWelcome succeeds, the user gets a welcome email but no timer is ever recorded, so the reminder never goes out. If the cron job crashes between reading Redis and writing the DB, you've got phantom state: Redis says "done," the database says "not done." Which one is right?
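To make the phantom-state case concrete, here's a minimal sketch. It uses in-memory Maps as stand-ins for Redis and the database, and a hypothetical `runReminderCron` handler that marks the step done in Redis before it writes the DB. The key format and function names are made up for illustration.

```typescript
// In-memory stand-ins for Redis and the database (hypothetical, illustration only).
const redis = new Map<string, string>();
const db = new Map<string, { reminderSent: boolean }>();

// Hypothetical cron handler: marks the step done in Redis, then updates the DB.
// `crashBetweenWrites` simulates the process dying between the two writes.
function runReminderCron(userId: string, crashBetweenWrites: boolean): void {
  redis.set(`onboarding:${userId}:reminder`, "done");
  if (crashBetweenWrites) {
    throw new Error("connection pool exhausted");
  }
  db.set(userId, { reminderSent: true });
}

db.set("4521", { reminderSent: false });
try {
  runReminderCron("4521", true);
} catch {
  // Swallowed silently, the same way the cron job in the story did.
}

const redisSaysDone = redis.get("onboarding:4521:reminder") === "done";
const dbSaysDone = db.get("4521")?.reminderSent === true;
console.log({ redisSaysDone, dbSaysDone }); // { redisSaysDone: true, dbSaysDone: false }
```

Nothing in either store tells you which write was supposed to win. That knowledge lives only in the head of whoever wrote the handler.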

The worst part isn't the bugs. It's the **lack of observability**. You can look at job logs and see that a job ran. You cannot answer "what is the complete history of this specific user's onboarding workflow?" without joining logs across three systems and doing mental archaeology.

The deeper problem: state lives nowhere

The root issue is that your workflow's state machine is **implicit**. It's encoded in which queues have which messages, which Redis keys exist with which TTLs, and which cron expressions fire when. There's no single place in your codebase that says "user 4521 is currently in the waiting-for-activation phase of onboarding."

This makes everything harder:

- **Debugging**: every incident is a multi-system forensic investigation
- **Recovery**: partial failures need hand-written one-off migration scripts
- **Testing**: you can't unit-test a state machine that doesn't exist as a state machine
- **Evolving the workflow**: adding a new step means touching the queue config, the Redis key schema, and the cron logic separately, and hoping nothing is in-flight when you deploy

This is the failure mode that Designing Data-Intensive Applications flags over and over: when the source of truth is ambiguous across systems, you can't reason about correctness under partial failure. The queue-plus-Redis pattern encodes your business logic in infrastructure state rather than in code.
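For contrast, here's roughly what the onboarding state machine looks like once it's written down explicitly. This is a sketch, not a recommendation to hand-roll one: the state and event names are hypothetical, picked to match the three-step flow from the start of the post.

```typescript
// Hypothetical state and event names, chosen for illustration.
type OnboardingState =
  | "welcome_sent"
  | "reminder_sent"
  | "activated"
  | "flagged_for_review";

type OnboardingEvent = "user_activated" | "day_elapsed" | "week_elapsed";

// Transition table: for each state, which events move it somewhere else.
// Events not listed for a state leave it unchanged.
const transitions: Record<
  OnboardingState,
  Partial<Record<OnboardingEvent, OnboardingState>>
> = {
  welcome_sent: {
    user_activated: "activated",
    day_elapsed: "reminder_sent",
    week_elapsed: "flagged_for_review",
  },
  reminder_sent: {
    user_activated: "activated",
    week_elapsed: "flagged_for_review",
  },
  activated: {}, // terminal in this sketch
  flagged_for_review: {}, // terminal in this sketch
};

// Pure function: no queues, Redis keys, or cron expressions involved.
function nextState(state: OnboardingState, event: OnboardingEvent): OnboardingState {
  return transitions[state][event] ?? state;
}

const afterDay = nextState("welcome_sent", "day_elapsed");
const afterWeek = nextState(afterDay, "week_elapsed");
console.log(afterDay, afterWeek); // reminder_sent flagged_for_review
```

Because `nextState` is a pure function, "what happens if the week timer fires after activation?" is a one-line unit test instead of a staging experiment.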

The better architecture: durable execution

Durable execution engines — Temporal, Inngest, Trigger.dev, Cloudflare Workflows — flip this around. You write your workflow as ordinary async code, and the runtime handles persistence, retries, timeouts, and replay transparently.

```typescript
// Inngest
// Assumes `inngest` (the Inngest client), `email`, and `db` are defined elsewhere in your app.
export const onboardingFlow = inngest.createFunction(
  { id: 'user-onboarding' },
  { event: 'user/signup' },
  async ({ event, step }) => {
    // Step 1: send the welcome email
    await step.run('send-welcome', () =>
      email.send(event.data.userId, 'welcome')
    );

    // Suspend the workflow for 24 hours
    await step.sleep('wait-24h', '24 hours');

    // Step 2: nudge the user if they still haven't activated
    const user = await step.run('check-activation', () =>
      db.users.findById(event.data.userId)
    );

    if (!user.activatedAt) {
      await step.run('send-reminder', () =>
        email.send(event.data.userId, 'reminder')
      );
    }

    await step.sleep('wait-7d', '7 days');

    // Step 3: flag the account if it's still inactive
    const userAfterWeek = await step.run('check-week', () =>
      db.users.findById(event.data.userId)
    );

    if (!userAfterWeek.activatedAt) {
      await step.run('flag-inactive', () =>
        db.users.update(event.data.userId, { flaggedForReview: true })
      );
    }
  }
);
```

That's the entire workflow. No Redis keys. No cron jobs. No implicit state machine scattered across your infrastructure.

The result of each step.run() call is durably persisted once the step completes. If your server crashes mid-workflow, the engine resumes from the last successful checkpoint and does not re-execute steps that already completed. step.sleep('wait-24h', '24 hours') doesn't block a thread or schedule a cron expression; it suspends the workflow and resumes it when the time comes.
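If the replay behavior feels like magic, a toy version helps. The sketch below is not how any particular engine is implemented. It's a synchronous, in-memory illustration of the core idea: record each step's result under its ID, and on re-run return the recorded result instead of executing the step again. `makeStep`, `Journal`, and the simulated crash are all made up for this example.

```typescript
// Toy journal: step id -> recorded result. A real engine stores this durably;
// here it's just an in-memory Map (assumption for illustration).
type Journal = Map<string, unknown>;

function makeStep(journal: Journal) {
  return {
    // Return the recorded result if the step already completed; otherwise
    // execute it and record the result before moving on.
    run<T>(id: string, fn: () => T): T {
      if (journal.has(id)) return journal.get(id) as T;
      const result = fn();
      journal.set(id, result);
      return result;
    },
  };
}

const emailsSent: string[] = [];
let crashOnce = true;

function onboarding(step: ReturnType<typeof makeStep>): void {
  step.run("send-welcome", () => {
    emailsSent.push("welcome");
    return true;
  });
  step.run("send-reminder", () => {
    if (crashOnce) {
      crashOnce = false;
      throw new Error("simulated crash");
    }
    emailsSent.push("reminder");
    return true;
  });
}

const journal: Journal = new Map();
try {
  onboarding(makeStep(journal)); // first attempt dies inside the second step
} catch {
  // The journal survives the crash; only "send-welcome" was recorded.
}
onboarding(makeStep(journal)); // re-run: "send-welcome" is skipped, not re-executed

console.log(emailsSent); // [ 'welcome', 'reminder' ]
```

The welcome email goes out exactly once even though the function body ran twice. This is also why step IDs need to be stable: they're the lookup key into the journal.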

What you actually get

**A free audit trail**: Every step that ran, what it returned, and when — visible in the dashboard. "Why didn't user 4521 get their reminder?" becomes a two-second lookup instead of a thirty-minute investigation across three log sources.

**Reliable partial retries**: Each step retries independently with configurable backoff. A flaky email API on step 1 doesn't force you to re-run the entire workflow from scratch.

**Cancellation and replay**: You can cancel a running workflow, replay it from a specific step, or pause it mid-execution — all things that are genuinely hard to build on top of raw queues without a lot of custom bookkeeping.

**The code is the documentation**: New engineers can read the function and understand exactly what the workflow does. There's no mental model to reconstruct from scattered cron files and Redis key naming conventions.
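To ground the per-step retry point, here's a small sketch of retrying one step with exponential backoff. The `RetryPolicy` shape and function names are hypothetical, not any engine's actual configuration API, and delays are reported through a callback rather than slept so the example finishes instantly.

```typescript
// Hypothetical retry policy; real engines expose their own configuration options.
interface RetryPolicy {
  maxAttempts: number;
  baseDelayMs: number;
  factor: number;
}

// Delay before retry number `retry` (1-based): base * factor^(retry - 1).
function backoffDelayMs(policy: RetryPolicy, retry: number): number {
  return policy.baseDelayMs * Math.pow(policy.factor, retry - 1);
}

// Runs a single step with retries. Delays are passed to `onWait` instead of
// being slept, so the sketch runs instantly.
function runStepWithRetry<T>(
  fn: () => T,
  policy: RetryPolicy,
  onWait: (delayMs: number) => void,
): T {
  let lastError: unknown;
  for (let attempt = 1; attempt <= policy.maxAttempts; attempt++) {
    try {
      return fn();
    } catch (err) {
      lastError = err;
      if (attempt < policy.maxAttempts) onWait(backoffDelayMs(policy, attempt));
    }
  }
  throw lastError;
}

// A flaky email call that fails twice, then succeeds.
let calls = 0;
const waits: number[] = [];
const result = runStepWithRetry(
  () => {
    calls++;
    if (calls < 3) throw new Error("email API timeout");
    return "sent";
  },
  { maxAttempts: 5, baseDelayMs: 1000, factor: 2 },
  (ms) => waits.push(ms),
);

console.log(result, calls, waits); // sent 3 [ 1000, 2000 ]
```

The retry loop wraps one step, not the whole workflow, so steps that already succeeded are never part of the blast radius.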

The tradeoffs

This isn't free, and pretending otherwise would be dishonest.

**Operational dependency**: Temporal is powerful but heavy to self-host: it needs a persistence store such as Cassandra, MySQL, or PostgreSQL and has real infrastructure requirements. Inngest and Trigger.dev are much simpler to start with but are typically consumed as hosted cloud products, so you're trading one dependency (Redis + cron) for another (third-party execution platform). Know what you're signing up for.

**Step granularity is a real design decision**: Every step.run() boundary is a serialization and checkpoint point. Too coarse and you lose granular retries. Too fine and you have overhead on every trivial operation. Getting the right level of step decomposition takes some practice — you'll probably get it wrong the first time and have to refactor.

**Not worth it for simple jobs**: A single background task that sends one email doesn't need a workflow engine. This pattern pays off when you have multi-step processes with waits, conditionals, and error recovery paths — basically any time the question "what state is this workflow in right now?" is something you'd have to think about.

**Local dev quality varies**: Inngest ships a solid local dev server that mirrors production. Temporal's local setup has traditionally meant Docker Compose and a moderate amount of patience, though the Temporal CLI's built-in dev server (temporal server start-dev) lightens that considerably. This isn't a dealbreaker but it does matter for daily developer experience.
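One practical consequence of the serialization point under step granularity: whatever a step returns has to survive a round trip through storage. The sketch below assumes the engine persists step results as JSON, which is common but worth checking for your engine of choice. The `checkpoint` helper is a stand-in for that boundary, not a real API.

```typescript
// Simulates a checkpoint boundary by round-tripping a step result through JSON.
// Assumption: the engine persists step results as JSON.
function checkpoint<T>(value: T): unknown {
  return JSON.parse(JSON.stringify(value));
}

const stepResult = {
  userId: "4521",
  activatedAt: new Date("2024-01-01T00:00:00.000Z"),
};
const restored = checkpoint(stepResult) as { userId: string; activatedAt: unknown };

console.log(stepResult.activatedAt instanceof Date); // true
console.log(typeof restored.activatedAt); // string
```

A truthiness check like `!user.activatedAt` still works on the restored value, but calling a Date method on it would not. Class instances, Maps, and functions have the same problem, so it's usually safest to return plain data from steps.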

The tell

The signal that you need durable execution isn't complexity in the code itself. It's when you start writing custom logic to reconcile partial failures across multiple systems — "if this Redis key is set but that DB row isn't, then..." At that point you're already building a workflow engine. The only question is whether you want a purpose-built one with a proper state model, or a handcrafted one that six engineers will eventually be afraid to touch because nobody fully understands it anymore.

The Redis-and-cron approach works, mostly, until it silently doesn't. Durable execution makes failure visible, inspectable, and recoverable instead of invisible and archaeological.