The Job Queue Is Not a Workflow Engine

Most engineers reach for a job queue when they need async work done — but queues track one bit of state: pending or done. When a task spans multiple steps with real side effects, that's not enough. This article shows why multi-step jobs need durable execution, how checkpoint-based engines eliminate the partial-state trap, and where the tradeoffs land.

The pattern is familiar: something needs to happen asynchronously, so you push a job onto a queue. A worker picks it up, does the thing, marks it done. Simple, scalable, battle-tested.

Until the "thing" isn't one step — it's five.

The Fire-and-Forget Trap

Provisioning a new user means: charge the card, create the account row, assign a default team, seed initial permissions, send the welcome email. Five side effects. Five separate failure modes. And your job queue has exactly one bit of state for all of them: pending or done.

When step 3 crashes, the job goes back to pending. Steps 1 and 2 already happened. The queue doesn't know. The retry runs from scratch.

That's not a bug in your code — it's a mismatch between the tool and the problem.

What a Queue Actually Guarantees

Job queues are built around one promise: a message is delivered to a worker at least once. That's the whole contract. The queue doesn't track "step 2 of 5 completed." It tracks "the message exists and hasn't been acknowledged."

When a worker crashes mid-job, the message is never acknowledged. Once the visibility timeout (or your queue's equivalent) expires, the message surfaces again and another worker picks it up from step 1. You've now charged the card twice, or created two accounts, or landed in some hybrid state nobody designed for.
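To make the redelivery problem concrete, here is a minimal in-memory sketch. Everything in it is a hypothetical stand-in (`deliverUntilAcked`, `provisionUser`, the `sideEffects` log); it does not use any real queue library and only models the "redeliver until acknowledged" contract described above.

```typescript
// Hypothetical in-memory stand-ins; no real queue library is involved.
type Job = { id: string; acked: boolean };

const sideEffects: string[] = []; // every external action that actually happened
let crashOnce = true;             // simulate a single worker crash at step 3

function provisionUser(job: Job): void {
  sideEffects.push(`charge-card:${job.id}`);
  sideEffects.push(`create-account:${job.id}`);
  if (crashOnce) {
    crashOnce = false;
    throw new Error("worker crashed during assign-team");
  }
  sideEffects.push(`assign-team:${job.id}`);
}

// At-least-once delivery: redeliver until the worker acknowledges the message.
function deliverUntilAcked(job: Job, handler: (job: Job) => void): number {
  let deliveries = 0;
  while (!job.acked) {
    deliveries += 1;
    try {
      handler(job);
      job.acked = true; // ack only after the handler returns
    } catch {
      // No ack: the message becomes visible again and is redelivered.
    }
  }
  return deliveries;
}

const deliveries = deliverUntilAcked({ id: "job-482", acked: false }, provisionUser);
const charges = sideEffects.filter((e) => e.startsWith("charge-card")).length;
console.log({ deliveries, charges });
```

The job is delivered twice and the card is charged twice. The queue behaved correctly on both deliveries; it simply had no record that the first two steps had already run.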

The classic fix is manual idempotency: before each side effect, check if it already happened. "Look up the existing charge before creating one." This works at small scale, but it turns your worker into a hand-rolled distributed state machine. Every branch condition, parallel step, or human-approval gate means another IF EXISTS check and another status column to maintain.
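A sketch of what that check-before-act style tends to look like, assuming a hypothetical `ProvisioningRow` with one flag per side effect. The row shape, the `calls` counter, and the `failAt` switch are illustrative only and stand in for real database reads and writes.

```typescript
// Hypothetical row shape: one status flag per side effect.
type ProvisioningRow = {
  userId: string;
  chargeId: string | null;
  accountCreated: boolean;
  teamAssigned: boolean;
};

// Counts how often each side effect really ran.
const calls = { charge: 0, createAccount: 0, assignTeam: 0 };

// The worker re-checks the row before every side effect.
function provision(row: ProvisioningRow, failAt?: "assignTeam"): void {
  if (row.chargeId === null) {
    calls.charge += 1;
    row.chargeId = `ch_${row.userId}`;
  }
  if (!row.accountCreated) {
    calls.createAccount += 1;
    row.accountCreated = true;
  }
  if (!row.teamAssigned) {
    if (failAt === "assignTeam") throw new Error("crash before assign-team");
    calls.assignTeam += 1;
    row.teamAssigned = true;
  }
}

const row: ProvisioningRow = {
  userId: "u1",
  chargeId: null,
  accountCreated: false,
  teamAssigned: false,
};

try {
  provision(row, "assignTeam"); // first delivery crashes at step 3
} catch {
  // The message is redelivered.
}
provision(row); // the retry skips every step whose flag is already set
console.log(calls);
```

Each side effect runs once, but only because the worker now carries three guards and three pieces of state for three steps. The sketch also glosses over the gap between performing an effect and recording its flag, which a real implementation has to close itself.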

The queue did exactly what it was supposed to. The problem is you never gave it a concept of "steps."

Durable Execution: The Model That Actually Fits

Durable execution shifts the state model. Instead of tracking whether a job is pending or done, the engine treats the *workflow* as the unit of state. Each step is checkpointed after it completes. If the process crashes between step 2 and step 3, the next run skips the first two and resumes at step 3.

The code looks almost like a normal async function:

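import { WorkflowEntrypoint, WorkflowEvent, WorkflowStep } from 'cloudflare:workers';

// stripe, db, and mailer are assumed to be clients initialized elsewhere,
// as are the Env and Params types.
export class UserProvisioning extends WorkflowEntrypoint<Env, Params> {
  async run(event: WorkflowEvent<Params>, step: WorkflowStep) {
    // Each step.do() result is persisted; a replay returns the stored value.
    const charge = await step.do('charge-card', () =>
      stripe.charges.create({ amount: event.payload.amount })
    );

    // Later steps can depend on earlier results, such as charge.id.
    await step.do('create-account', () =>
      db.users.create({ userId: event.payload.userId, chargeId: charge.id })
    );

    await step.do('send-welcome', () =>
      mailer.send({ to: event.payload.email, template: 'welcome' })
    );
  }
}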

Each step.do() call serializes its result before continuing, so step results need to be serializable. If the worker restarts, the workflow replays: charge-card is skipped because a stored result already exists. The step callback's return value is the checkpoint.
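The replay mechanics fit in a few lines. The following is a toy sketch, not how any particular engine is implemented: `makeStep`, `CheckpointStore`, and `runWorkflow` are hypothetical names, and the in-memory `Map` stands in for durable storage.

```typescript
// A toy checkpoint store; a real engine persists this to durable storage.
type CheckpointStore = Map<string, string>; // step name -> JSON-serialized result

function makeStep(store: CheckpointStore) {
  return {
    async do<T>(name: string, fn: () => Promise<T>): Promise<T> {
      const saved = store.get(name);
      if (saved !== undefined) {
        // Replay: return the stored result without calling fn again.
        return (JSON.parse(saved) as { value: T }).value;
      }
      const result = await fn();
      // Checkpoint the result before the workflow moves on.
      store.set(name, JSON.stringify({ value: result }));
      return result;
    },
  };
}

async function runWorkflow(
  store: CheckpointStore,
  executed: string[],
  crashAfterCharge: boolean
): Promise<void> {
  const step = makeStep(store);
  const charge = await step.do("charge-card", async () => {
    executed.push("charge-card");
    return { id: "ch_123" };
  });
  if (crashAfterCharge) throw new Error("worker restarted");
  await step.do("create-account", async () => {
    executed.push(`create-account:${charge.id}`);
    return { ok: true };
  });
}

async function demo(): Promise<string[]> {
  const store: CheckpointStore = new Map();
  const executed: string[] = []; // step bodies that actually ran
  await runWorkflow(store, executed, true).catch(() => undefined); // first attempt crashes
  await runWorkflow(store, executed, false); // second attempt replays from the store
  return executed;
}

demo().then((steps) => console.log(steps));
```

Across both attempts, the charge-card body runs once and create-account runs once. The second attempt still calls step.do('charge-card', ...), but it receives the stored result, including charge.id, instead of running the callback.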

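Temporal, AWS Step Functions, and Cloudflare Workflows all implement some version of this. The checkpoint store is what queues lack. It turns "at-least-once delivery" into progress that is recorded once per step: a completed step is never re-executed, although a step that crashes before its checkpoint is written can still run again.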

Tradeoffs Worth Knowing

**Checkpoint overhead adds latency.** Every step serializes output and writes it to durable storage before moving on. For high-throughput, single-step tasks such as image compression, notification fanout, or log aggregation, that overhead buys nothing, because there is no partial progress to protect. Job queues win there.

**Idempotency is still on you.** Durable execution handles replay inside the workflow, but it can't deduplicate calls to external systems. If a network timeout fires after Stripe processed the request, you still need idempotency keys on the Stripe call. step.do() re-runs the function if the checkpoint wasn't written before the crash — the downstream API sees that request twice.
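A small sketch of why the key matters. `FakePaymentApi` is a hypothetical stand-in that deduplicates on an idempotency key, loosely modeled on how payment providers treat repeated keys; it is not the Stripe client. With stripe-node, the key is passed as a request option (`idempotencyKey`) alongside the call's parameters.

```typescript
// Hypothetical payment API that deduplicates on an idempotency key.
type Charge = { id: string; amount: number };

class FakePaymentApi {
  requestsSeen = 0;
  private byKey = new Map<string, Charge>();

  createCharge(amount: number, idempotencyKey: string): Charge {
    this.requestsSeen += 1;
    const existing = this.byKey.get(idempotencyKey);
    if (existing) return existing; // same key: return the original charge
    const charge: Charge = { id: `ch_${this.byKey.size + 1}`, amount };
    this.byKey.set(idempotencyKey, charge);
    return charge;
  }

  get chargesCreated(): number {
    return this.byKey.size;
  }
}

const api = new FakePaymentApi();

// Derive the key from stable workflow data so a re-run sends the same key.
const workflowId = "provision-user-42";
const key = `${workflowId}:charge-card`;

const first = api.createCharge(4900, key);  // succeeded, but the checkpoint write was lost
const second = api.createCharge(4900, key); // the step re-runs after the restart

console.log(api.requestsSeen, api.chargesCreated, first.id === second.id);
```

The API sees two requests but creates one charge, and both calls return the same charge ID. The key has to come from something stable across replays, such as the workflow ID plus the step name; a random value generated inside the step would defeat the purpose.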

**Debugging gets better, but different.** With a job queue, understanding "what happened to job 482" means correlating log lines across workers. With durable execution, the workflow history *is* the audit trail — every step has a timestamp, input, output, and retry count. Postmortems are faster. But your existing dashboards don't know what a workflow ID is, so you're rebuilding your observability queries around a new primitive.

**Vendor surface area is real.** Temporal needs a server (or Temporal Cloud). Cloudflare Workflows is Workers-native. Step Functions Standard workflows bill per state transition. Rolling your own on top of a Postgres table is possible but tedious to get right under concurrent load: you'll end up hand-building leases, row locking, and retry bookkeeping that the engines already provide.

The Diagnostic

The signal that you've outgrown a job queue: you added an in_progress_since column and a sweeper cronjob to reset stuck jobs. You've reinvented workflow checkpointing — badly, in SQL — because the queue had no way to express partial progress.
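For reference, the sweeper usually looks something like this. It is a sketch over an in-memory array rather than a real table; `JobRow` and `sweepStuckJobs` are hypothetical names, and the five-minute timeout is an arbitrary choice for the example.

```typescript
// Hypothetical job row; in practice this is a table with the same columns.
type JobRow = {
  id: number;
  status: "pending" | "in_progress" | "done";
  in_progress_since: number | null; // epoch milliseconds
};

// Reset jobs that have been in_progress for longer than timeoutMs.
function sweepStuckJobs(jobs: JobRow[], now: number, timeoutMs: number): number[] {
  const reset: number[] = [];
  for (const job of jobs) {
    if (
      job.status === "in_progress" &&
      job.in_progress_since !== null &&
      now - job.in_progress_since > timeoutMs
    ) {
      job.status = "pending"; // the retry starts again from step 1
      job.in_progress_since = null;
      reset.push(job.id);
    }
  }
  return reset;
}

const now = 1_000_000;
const jobs: JobRow[] = [
  { id: 481, status: "in_progress", in_progress_since: now - 60_000 },
  { id: 482, status: "in_progress", in_progress_since: now - 900_000 },
  { id: 483, status: "done", in_progress_since: null },
];

const resetIds = sweepStuckJobs(jobs, now, 300_000);
console.log(resetIds);
```

Only job 482 is reset. The sweeper can tell that the job is stuck, but not which step it reached, so the retry has no option except starting over.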

That's the moment to reach for durable execution instead of adding another status enum value.

The rule:

**Job queue**: Single step, high throughput, tolerable duplicates. Background compute, thumbnail generation, push notifications.
**Durable execution**: Multiple steps, business-critical, low tolerance for partial state. Onboarding flows, payment sequences, anything a customer emails you about when it silently fails.

Neither replaces the other. The mistake is using fire-and-forget for workflows that, by nature, aren't.