
Workflows Need Recovery Contracts

Cloudflare's rollback-handler update is a useful reminder: business automations need explicit recovery paths, not just retry buttons.

#automation #reliability #operations

Cloudflare's June 23 Workflows changelog entry is a small platform update with a big product lesson. Rollback handlers now receive the original step context for the step being rolled back: the step name, attempt count, retry configuration, timeout configuration, and the defaults that were applied.

That sounds like an implementation detail until you map it to real business workflows. Quote requests, booking flows, lead routing, content publishing, invoice generation, stock sync, enrichment jobs, and support automations all have the same uncomfortable property: the failure is rarely isolated to one function call. By the time something breaks, a previous step may already have written a record, sent an email, charged a card, reserved inventory, or updated a CRM.

The useful signal in the update

Cloudflare describes Workflows as durable, multi-step applications on Workers that retry automatically and persist state. The new rollback context makes recovery logic less blind. A rollback handler can know which step it is rolling back, how many times that step ran, and which retry or timeout rules applied.

That matters because failed automations do not all need the same response.

Failure type | Example | Recovery path
transient read failure | product API timeout before any write | retry automatically
partial write | CRM lead created, enrichment failed | keep lead, mark enrichment pending
external side effect | email sent, booking failed | do not resend blindly; reconcile first
irreversible action | payment, cancellation, public publish | stop and require human review
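
The table above can be turned into a small lookup so the workflow, not a person, picks the recovery path. This is a minimal sketch: the `FailureType` and `RecoveryAction` enums and the `recovery_for` helper are names I made up for illustration, not part of Cloudflare Workflows.

```python
from enum import Enum


class FailureType(Enum):
    TRANSIENT_READ = "transient read failure"
    PARTIAL_WRITE = "partial write"
    EXTERNAL_SIDE_EFFECT = "external side effect"
    IRREVERSIBLE_ACTION = "irreversible action"


class RecoveryAction(Enum):
    RETRY = "retry automatically"
    KEEP_AND_MARK_PENDING = "keep the record, mark the failed part pending"
    RECONCILE_FIRST = "reconcile before repeating anything"
    HUMAN_REVIEW = "stop and require human review"


# Hypothetical mapping that mirrors the table row by row.
RECOVERY_PATHS = {
    FailureType.TRANSIENT_READ: RecoveryAction.RETRY,
    FailureType.PARTIAL_WRITE: RecoveryAction.KEEP_AND_MARK_PENDING,
    FailureType.EXTERNAL_SIDE_EFFECT: RecoveryAction.RECONCILE_FIRST,
    FailureType.IRREVERSIBLE_ACTION: RecoveryAction.HUMAN_REVIEW,
}


def recovery_for(failure):
    """Return the recovery path; anything unclassified goes to a human."""
    return RECOVERY_PATHS.get(failure, RecoveryAction.HUMAN_REVIEW)
```

Defaulting to human review for anything unclassified is a deliberate choice in this sketch: an unknown failure is the last thing that should be retried blindly.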

The platform feature is a reminder that "retry the job" is not an operating model. It is only safe when the workflow knows which work already happened.

The mistake I see in small business automations

Most internal automations start as linear scripts:

  1. get the form submission;
  2. enrich the business;
  3. score the lead;
  4. write to the CRM;
  5. notify sales;
  6. send a customer email.

That sequence is easy to build and hard to recover. If step four succeeds and step five fails, should the system create another CRM record on retry? If the customer email was sent but the notification failed, should the retry send a second email? If enrichment timed out, should the lead wait, or should sales receive it with a visible "enrichment pending" state?

Those answers should not live in someone's memory. They should be part of the workflow contract.

A recovery contract checklist

For each automation that touches customers, money, inventory, publishing, or sales operations, document five things.

What is the durable record?

Every workflow needs one durable record that says what the system believes happened. That may be a database row, CRM object, job table, order record, or content run.

It should carry stable identifiers:

  • workflow id;
  • source event id;
  • customer or account id;
  • external object ids created by each step;
  • current state;
  • last successful step;
  • next safe action.

Without this, support has to reconstruct the incident from logs and vibes.
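
As a rough illustration, the identifiers listed above fit in one small record type. The `WorkflowRecord` class, its field names, and the `mark_step_done` method are assumptions for this sketch; in practice the record would live in a database row or CRM object rather than in memory.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class WorkflowRecord:
    workflow_id: str
    source_event_id: str
    account_id: str
    state: str = "received"
    last_successful_step: Optional[str] = None
    next_safe_action: str = "start"
    # External object ids, keyed by the step that created them.
    external_ids: Dict[str, str] = field(default_factory=dict)

    def mark_step_done(self, step, next_safe_action, external_id=None):
        """Record what happened and what is safe to do next."""
        if external_id is not None:
            self.external_ids[step] = external_id
        self.last_successful_step = step
        self.state = f"{step}_done"
        self.next_safe_action = next_safe_action


record = WorkflowRecord("wf_1", "form_123", "acct_9")
record.mark_step_done("create_crm_lead", "notify_sales", external_id="lead_42")
```

After the call, the record alone tells support that the lead exists as `lead_42` and that notifying sales is the next safe action.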

Which steps are idempotent?

Idempotent means a step can run twice without creating duplicate side effects. Reads are usually safe. Writes are safe only when they use stable keys.

Good patterns:

  • create CRM leads with an idempotency key derived from the form submission id;
  • upsert enrichment data by domain or lead id;
  • store email provider message ids after sending;
  • reserve inventory with a reservation id;
  • publish content with a unique slug and run id.

Bad patterns:

  • "create a new record" on every retry;
  • "send email" without recording the provider id;
  • "push latest content" without checking whether the target slug already exists;
  • "charge card" without an idempotency key.
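
Here is a minimal sketch of the first good pattern: a key derived from the form submission id, and a write that treats a repeated key as "already done". `FakeCrm` is an in-memory stand-in I invented for the example; a real CRM or payment API would need its own idempotency mechanism, and the `idempotency_key` helper is likewise hypothetical.

```python
import hashlib


def idempotency_key(step_name, source_event_id):
    """Same step + same source event always yields the same key."""
    raw = f"{step_name}:{source_event_id}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


class FakeCrm:
    """In-memory stand-in for a CRM that honours idempotency keys."""

    def __init__(self):
        self._leads_by_key = {}

    def create_lead(self, key, email):
        # A repeated key returns the existing lead instead of a duplicate.
        if key not in self._leads_by_key:
            lead_id = f"lead_{len(self._leads_by_key) + 1}"
            self._leads_by_key[key] = {"id": lead_id, "email": email}
        return self._leads_by_key[key]["id"]

    def lead_count(self):
        return len(self._leads_by_key)


crm = FakeCrm()
key = idempotency_key("create_crm_lead", "form_123")
first = crm.create_lead(key, "owner@example.com")
second = crm.create_lead(key, "owner@example.com")  # simulated retry
```

Both calls return the same lead id, and the CRM holds one lead, which is the property a retry depends on.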

What does rollback actually mean?

Rollback is not always undo. Sometimes undo is impossible or undesirable.

For a local-service booking funnel, rollback might mean releasing a reserved appointment slot. For ecommerce, it might mean cancelling an unpaid draft order. For a publishing pipeline, it might mean reverting a commit or marking a post as a draft. For a lead pipeline, it might mean leaving the CRM record in place but adding an error status so a human can complete the task.

The important part is to name the recovery action instead of treating every failure as a generic exception.
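
One way to name recovery actions is a registry that maps each step to its own compensation. This is a sketch under assumptions: the step names, the dict-shaped record, and the `roll_back` helper are all illustrative.

```python
def release_slot(record):
    # Booking funnel: undo is possible, so free the reserved slot.
    record["slot_reserved"] = False
    return "slot_released"


def flag_lead_for_review(record):
    # Lead pipeline: keep the CRM record, only change its status.
    record["status"] = "error_needs_review"
    return "flagged_for_review"


# Hypothetical registry: every step with side effects names its recovery action.
ROLLBACK_ACTIONS = {
    "reserve_slot": release_slot,
    "create_crm_lead": flag_lead_for_review,
}


def roll_back(step_name, record):
    action = ROLLBACK_ACTIONS.get(step_name)
    if action is None:
        # No named recovery action: do nothing automatically.
        return "no_rollback_defined"
    return action(record)
```

Note that only one of the two actions is a true undo. The other leaves the data in place and changes its status, which is still a named, deliberate recovery.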

When should humans be pulled in?

Automation should not pretend every recovery can be automatic. Define escalation conditions:

  • external side effect happened but confirmation is missing;
  • retry limit exceeded after a write;
  • two systems disagree about state;
  • customer-facing communication may have been duplicated;
  • a publish action succeeded in the source repo but not on the live site;
  • credentials, permissions, or secrets failed during the run.

The escalation should include enough context for a human to decide quickly: the ids, the last successful step, what was attempted, what is safe to retry, and what should not be repeated.
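
Some of these conditions can be checked mechanically. The sketch below covers four of the six; the `RunStatus` fields and the `escalation_reasons` helper are assumptions about what a workflow might track, not an existing API.

```python
from dataclasses import dataclass


@dataclass
class RunStatus:
    attempts: int
    max_attempts: int
    wrote_external_data: bool    # some step already performed a write
    side_effect_confirmed: bool  # the external system acknowledged it
    systems_agree: bool          # e.g. CRM and job table show the same state
    credentials_ok: bool


def escalation_reasons(status):
    """Return every reason this run needs a human; empty means none."""
    reasons = []
    if status.wrote_external_data and not status.side_effect_confirmed:
        reasons.append("external side effect happened but confirmation is missing")
    if status.wrote_external_data and status.attempts >= status.max_attempts:
        reasons.append("retry limit exceeded after a write")
    if not status.systems_agree:
        reasons.append("two systems disagree about state")
    if not status.credentials_ok:
        reasons.append("credentials, permissions, or secrets failed during the run")
    return reasons
```

Returning a list rather than a boolean means the escalation message can tell the human why they were pulled in.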

What will the customer or operator see?

A failed workflow should not leave people staring at a spinner or a silent inbox.

For user-facing flows, show a state that matches reality:

  • "We received your request and are checking one detail."
  • "Your booking is pending confirmation."
  • "Payment did not complete; no charge was made."
  • "Your report is queued; we will email it when ready."

For operator-facing flows, show the recovery state:

  • lead_created_enrichment_pending;
  • booking_reserved_notification_failed;
  • post_pushed_live_verification_failed;
  • invoice_created_email_failed.

Those states are less elegant than a green success toast, but they are much easier to operate.
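
A small lookup can keep the two views in sync. The operator state names come from the list above; pairing them with specific customer messages is my assumption for this sketch, as is the `customer_message` helper.

```python
OPERATOR_STATES = {
    "lead_created_enrichment_pending",
    "booking_reserved_notification_failed",
    "post_pushed_live_verification_failed",
    "invoice_created_email_failed",
}

# Assumed pairing: only states the customer can see get a message.
CUSTOMER_MESSAGES = {
    "lead_created_enrichment_pending":
        "We received your request and are checking one detail.",
    "booking_reserved_notification_failed":
        "Your booking is pending confirmation.",
}


def customer_message(state):
    """Return the customer-facing text for a recovery state, or None."""
    if state not in OPERATOR_STATES:
        raise ValueError(f"unknown recovery state: {state}")
    return CUSTOMER_MESSAGES.get(state)
```

Rejecting unknown states is intentional here: a typo in a state name should fail loudly instead of showing the customer nothing.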

Where this matters most

The recovery contract is especially important for automations that feel boring until they fail.

Lead and quote funnels

A quote form that enriches a business, scores the lead, creates a CRM deal, and sends an email should be able to resume from the CRM deal id. It should not create duplicate deals because enrichment timed out.

Booking workflows

A booking flow must distinguish between "slot checked", "slot reserved", "booking confirmed", and "notification sent". If the confirmation email fails, the booking may still be valid.

Content publishing

A content job has two separate acceptance criteria: source publication and live-site publication. A pushed commit is not proof that the public page updated. The recovery state should say whether the repo push, webhook, build, cache revalidation, and live URL verification succeeded.
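
As a sketch, the recovery state for a publish job can be computed from per-stage results. The stage names follow the sentence above; the `publish_state` helper and the shape of the results dict are assumptions.

```python
# Stages in the order they must succeed.
PUBLISH_STAGES = [
    "repo_push",
    "webhook",
    "build",
    "cache_revalidation",
    "live_url_verification",
]


def publish_state(results):
    """Name the first stage that has not succeeded, in pipeline order."""
    for stage in PUBLISH_STAGES:
        # A missing entry counts as "not yet succeeded".
        if not results.get(stage, False):
            return f"blocked_at_{stage}"
    return "published_and_verified"
```

A job with a successful push, webhook, and build but no verified live URL is still not "published" under this definition.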

AI agents and tool use

Agentic workflows add another layer. If an agent can call tools, the workflow needs to know which tool calls were offered, approved, started, completed, failed, or rolled back. The transcript is useful context, but it is not the recovery ledger.
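
A recovery ledger for tool calls can be as small as a state machine that refuses impossible transitions. The state names come from the paragraph above; the allowed transitions and the `ToolCallLedger` class are assumptions for illustration.

```python
# Assumed transitions; None means the call has not been seen yet.
ALLOWED_TRANSITIONS = {
    None: {"offered"},
    "offered": {"approved"},
    "approved": {"started"},
    "started": {"completed", "failed"},
    "completed": {"rolled_back"},
    "failed": {"rolled_back"},
    "rolled_back": set(),
}


class ToolCallLedger:
    def __init__(self):
        self._state = {}
        self.history = []  # append-only list of (call_id, state)

    def record(self, call_id, new_state):
        current = self._state.get(call_id)
        if new_state not in ALLOWED_TRANSITIONS[current]:
            raise ValueError(f"{call_id}: cannot move from {current} to {new_state}")
        self._state[call_id] = new_state
        self.history.append((call_id, new_state))

    def unfinished(self):
        """Calls that were approved or started but never reached an outcome."""
        return sorted(
            call_id for call_id, state in self._state.items()
            if state in {"approved", "started"}
        )


ledger = ToolCallLedger()
for state in ("offered", "approved", "started"):
    ledger.record("call_1", state)
```

After a crash, `ledger.unfinished()` lists the calls whose outcome is unknown, which are exactly the ones that need reconciling before any retry.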

A simple implementation pattern

You do not need a large orchestration platform to adopt the discipline.

Start with a workflow table:

id
kind
source_event_id
state
last_successful_step
external_ids
attempt_count
next_retry_at
requires_human_review
error_summary
created_at
updated_at
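
For a concrete starting point, here is that table as a SQLite schema using Python's standard `sqlite3` module. The column types, the UNIQUE constraint on `source_event_id`, and the `open_db` and `start_workflow` helpers are my assumptions; adapt them to whatever database you already run.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS workflows (
    id TEXT PRIMARY KEY,
    kind TEXT NOT NULL,
    source_event_id TEXT NOT NULL UNIQUE,
    state TEXT NOT NULL,
    last_successful_step TEXT,
    external_ids TEXT NOT NULL DEFAULT '{}',
    attempt_count INTEGER NOT NULL DEFAULT 0,
    next_retry_at TEXT,
    requires_human_review INTEGER NOT NULL DEFAULT 0,
    error_summary TEXT,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
)
"""


def open_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def start_workflow(conn, workflow_id, kind, source_event_id):
    """Create the row once; a redelivered event returns the existing id."""
    conn.execute(
        "INSERT OR IGNORE INTO workflows (id, kind, source_event_id, state) "
        "VALUES (?, ?, ?, 'received')",
        (workflow_id, kind, source_event_id),
    )
    conn.commit()
    row = conn.execute(
        "SELECT id FROM workflows WHERE source_event_id = ?",
        (source_event_id,),
    ).fetchone()
    return row[0]
```

The UNIQUE constraint does the real work: if the same form submission is delivered twice, the second insert is ignored and the caller gets the original workflow id back.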

Then define each step with four fields:

step_name
idempotency_key
on_success
on_failure

For write steps, store the external id before moving on. For notification steps, store the provider message id. For publish steps, store both the source commit and live verification result.
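
Put together, a runner that resumes after the last successful step might look like the sketch below. The `Step` class adds a `run` callable to the four fields above so the example can execute; that field, the dict-shaped record, and `run_workflow` are all assumptions, and a real version would persist the record after every step.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    step_name: str
    idempotency_key: str
    run: Callable[[str], Optional[str]]  # receives the key, returns an external id
    on_success: str                      # state to record when the step succeeds
    on_failure: str                      # state to record when the step fails


def run_workflow(steps, record):
    """Run steps in order, skipping everything up to last_successful_step."""
    names = [step.step_name for step in steps]
    last = record.get("last_successful_step")
    start = names.index(last) + 1 if last in names else 0
    for step in steps[start:]:
        try:
            external_id = step.run(step.idempotency_key)
        except Exception as exc:
            record["state"] = step.on_failure
            record["error_summary"] = str(exc)
            return record
        # Store the external id before moving on to the next step.
        if external_id is not None:
            record["external_ids"][step.step_name] = external_id
        record["last_successful_step"] = step.step_name
        record["state"] = step.on_success
        record["error_summary"] = None
    return record
```

If the notification step fails, a second call to `run_workflow` with the same record skips the CRM write and retries only the notification.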

That gives you a system that can answer the two questions that matter during an incident: what happened, and what is safe to do next?

The practical takeaway

Cloudflare's rollback-context update is not only a Workflows feature. It is a useful design prompt for every automation that has side effects.

Retries make a system persistent. Recovery contracts make it trustworthy.

Before adding another AI tool, webhook, enrichment provider, or publishing job, write down the recovery contract. Decide what can safely retry, what should roll back, what must be reconciled, and when a human should take over. That is the difference between automation that merely runs and automation a business can actually depend on.
