Security Scans Need Pipeline Design

Cloudflare scaled Security Insights 10x without new hardware. The builder lesson is to treat audits and scanners as pipelines, not cron jobs.

#security #reliability #cloudflare

Cloudflare's latest Security Insights scaling write-up is worth reading even if you do not use Cloudflare. The practical lesson is not "Kafka is fast" or "Go workers scale." It is that security checks, website audits, and configuration scanners become infrastructure the moment users expect them to be fresh.

Cloudflare says Security Insights went from roughly 10 scans per second to more than 120 scans per second, enabled automatic scanning for all free accounts and zones, and increased scan frequency across plans. The interesting part is that the team did it largely by changing pipeline behaviour: how work was batched, consumed, deduplicated, written, and rate limited.

Freshness is part of the product

Cloudflare describes Security Insights as a system that scans accounts, zones, and DNS records to surface security risks and misconfigurations. Its Security Insights docs list checks across account settings, DNS records, TLS, Access, WAF, API Shield, Bot Management, client-side security, Turnstile, and Zero Trust.

That scope makes freshness matter. A dangling DNS record, missing HSTS policy, exposed origin, or newly discovered API endpoint is not a static report-card item. It is a state that can appear after a deploy, a domain change, a vendor migration, or a rushed production fix.

Before the scaling work, Cloudflare says scans could run every one to two weeks, newly introduced risks could remain undetected for up to two weeks, and many free accounts had to opt in. After the change, Cloudflare reports scans every seven days for Free, every three days for Pro and Business, and daily for Enterprise.

That is the product shift: the same underlying scanner feels very different when it becomes automatic and frequent. A weekly-or-better loop can influence behaviour. A report that might be two weeks stale is mostly archaeology.
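
The reported cadence is simple enough to write down as configuration. A minimal sketch, assuming a scheduler that only needs to know whether a scan is due: the intervals are the ones Cloudflare reports, while `Plan`, `SCAN_INTERVAL_DAYS`, and `isScanDue` are names I made up for illustration and say nothing about how Cloudflare's scheduler is built.

```typescript
// Plan names and intervals follow the cadence reported in the write-up.
// The types and function names are hypothetical.
type Plan = "free" | "pro" | "business" | "enterprise";

const DAY_MS = 24 * 60 * 60 * 1000;

const SCAN_INTERVAL_DAYS: Record<Plan, number> = {
  free: 7,
  pro: 3,
  business: 3,
  enterprise: 1,
};

// True when the zone has never been scanned, or when at least one full
// interval has passed since the previous scan.
function isScanDue(plan: Plan, lastScanAt: Date | null, now: Date): boolean {
  if (lastScanAt === null) return true;
  const elapsedMs = now.getTime() - lastScanAt.getTime();
  return elapsedMs >= SCAN_INTERVAL_DAYS[plan] * DAY_MS;
}
```

Writing the target down this way makes the freshness promise testable, which is the point: a number in a config file is easier to hold yourself to than a cron expression nobody reads.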

The pipeline was the bottleneck

The architecture Cloudflare describes is familiar: a scheduler publishes scan work to Apache Kafka, specialised Go checker services consume the work, and an internal API persists findings to Postgres. Apache's own Kafka documentation frames Kafka as an event streaming platform for publishing, subscribing to, storing, and processing streams of events. That is exactly the kind of substrate teams reach for when background work outgrows a simple queue.

But Cloudflare's bottlenecks were also familiar:

The easy answer would have been to add more hardware or more Kafka partitions. Cloudflare's useful answer was more surgical. The team batched messages, processed them concurrently inside checkers, moved slow scans away from fast scan paths, changed database writes to bulk operations, and adjusted deployment topology so write-heavy internal traffic did not compete with user-facing API traffic.

That is the lesson for smaller teams too. If your scanner is slow, the bottleneck is often not the check itself. It is the work envelope around the check: scheduling, fan-out, retries, writes, deduplication, and the API surface where results land.
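
Batching plus bounded concurrency is the part of that work envelope most teams can copy first. The sketch below is my own illustration, not Cloudflare's checker code: `toBatches`, `mapWithConcurrency`, and `processInBatches` are hypothetical helpers, the batch size of 100 and concurrency of 8 are arbitrary, and error handling is left out to keep it short.

```typescript
// Split a list of messages into fixed-size batches.
function toBatches<T>(items: readonly T[], batchSize: number): T[][] {
  if (batchSize < 1) throw new Error("batchSize must be at least 1");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Run `worker` over every item with at most `limit` calls in flight.
// Results are returned in input order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T) => Promise<R>
): Promise<R[]> {
  if (limit < 1) throw new Error("limit must be at least 1");
  const results = new Array<R>(items.length);
  let next = 0;

  // Each drain loop claims the next unprocessed index until none are left.
  async function drain(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T);
    }
  }

  const loopCount = Math.min(limit, items.length);
  const loops: Promise<void>[] = [];
  for (let i = 0; i < loopCount; i++) loops.push(drain());
  await Promise.all(loops);
  return results;
}

// Hypothetical driver: one batch at a time, bounded concurrency inside it.
async function processInBatches<T>(
  messages: readonly T[],
  checkOne: (message: T) => Promise<void>
): Promise<number> {
  let processed = 0;
  for (const batch of toBatches(messages, 100)) {
    await mapWithConcurrency(batch, 8, checkOne);
    processed += batch.length;
  }
  return processed;
}
```

The concurrency limit is the backpressure knob. Raising it speeds up a batch only until whatever sits downstream (an API, a database, a rate limit) becomes the new bottleneck.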

Audit systems need backpressure by design

Every audit-like product has the same failure mode: you add more checks, more users, more assets, and more automatic triggers, then the system spends more time managing the backlog than producing timely insight.

Cloudflare's article calls out the head-of-line blocking problem explicitly. Some scan messages were much slower than others, and in a partitioned event stream that can delay later work. Their mitigation was to separate slower paths and keep fast checks moving.
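
Separating the paths can start as nothing more than a routing decision made before work is enqueued. This is a rough sketch under my own assumptions: the check names, the `FAST_KINDS` set, and the two-lane split are hypothetical, and Cloudflare's post does not say its routing looks like this.

```typescript
type Lane = "fast" | "slow";

interface ScanTask {
  siteId: string;
  kind: string; // hypothetical check name, e.g. "dns" or "crawl"
}

// Assumed cost classes. Anything not listed is treated as slow, so a new
// and unmeasured check cannot stall the fast lane by accident.
const FAST_KINDS = new Set<string>(["dns", "tls", "security-headers"]);

function laneFor(task: ScanTask): Lane {
  return FAST_KINDS.has(task.kind) ? "fast" : "slow";
}

// Split tasks into per-lane lists, preserving order within each lane.
// In a real system each list would feed its own queue or topic.
function routeTasks(tasks: readonly ScanTask[]): Record<Lane, ScanTask[]> {
  const lanes: Record<Lane, ScanTask[]> = { fast: [], slow: [] };
  for (const task of tasks) {
    lanes[laneFor(task)].push(task);
  }
  return lanes;
}
```

Defaulting unknown work to the slow lane is a deliberate choice here: the fast lane stays fast only if a check has to earn its place in it.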

That maps directly to product-led scanners and website-audit tools. A Lighthouse run, DNS check, security-header check, sitemap crawl, Core Web Vitals lookup, structured-data parse, and screenshot capture do not have the same cost profile. If they all share one undifferentiated queue, the slowest class of work sets the user's perceived freshness.

A better design separates work by cost and urgency:

  1. Run cheap checks often.
  2. Put expensive browser or crawl work in its own lane.
  3. Make retries bounded and visible.
  4. Store partial results instead of waiting for one giant all-or-nothing report.
  5. Show the scan age per finding, not just the report date.

That last point is underrated. Users do not only need to know what is wrong. They need to know whether the system looked recently enough for the warning to be trusted.
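
Per-finding scan age is cheap to carry if each finding stores its own timestamp. A small sketch, assuming a hypothetical `Finding` shape with a `checkedAt` field; the age buckets and the "may be out of date" wording are illustrative choices, not a standard.

```typescript
interface Finding {
  checkId: string;
  message: string;
  checkedAt: Date; // when this specific check last ran, not when the report was built
}

const HOUR_MS = 60 * 60 * 1000;

// Human-readable age: minutes under an hour, hours under two days, then days.
function formatScanAge(checkedAt: Date, now: Date): string {
  const minutes = Math.max(0, Math.floor((now.getTime() - checkedAt.getTime()) / 60000));
  if (minutes < 60) return `${minutes}m ago`;
  const hours = Math.floor(minutes / 60);
  if (hours < 48) return `${hours}h ago`;
  return `${Math.floor(hours / 24)}d ago`;
}

// A finding is stale once it is older than the freshness target.
function isStale(finding: Finding, now: Date, maxAgeHours: number): boolean {
  return now.getTime() - finding.checkedAt.getTime() > maxAgeHours * HOUR_MS;
}

// Render the finding with its age, flagging it when past the target.
function describeFinding(finding: Finding, now: Date, maxAgeHours: number): string {
  const age = formatScanAge(finding.checkedAt, now);
  const suffix = isStale(finding, now, maxAgeHours) ? ", may be out of date" : "";
  return `${finding.message} (checked ${age}${suffix})`;
}
```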

Bulk writes are a product feature

One of the less glamorous parts of Cloudflare's post is the Postgres write path. That is exactly why it matters. Scanners usually produce many small findings, and the naive path is to insert or update each one through the same API shape used by interactive users.

Cloudflare moved toward bulk ingestion for checker output. PostgreSQL's COPY documentation describes COPY as a way to move data between tables and files or client streams, and it is a reminder that databases often have different paths for operational writes and bulk loading.

You do not need Cloudflare scale to apply the principle. If a system produces findings in batches, design a batch write path early. It gives you room to deduplicate, compress, validate, and commit results as one unit. It also prevents internal worker traffic from looking exactly like user traffic to the rest of your app.
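
Here is one way a batch write path might look at small scale. To be clear about what this is: it is not COPY, and I am not claiming it is Cloudflare's approach. It swaps in a multi-row `INSERT ... ON CONFLICT DO UPDATE`, which is often enough for a small product. The `findings` table, its columns, and the unique constraint on `(site_id, check_id)` are all assumptions for the example. The function only builds the statement and its parameters, so it runs without a database.

```typescript
interface FindingRow {
  siteId: string;
  checkId: string;
  severity: string;
  seenAt: string; // ISO timestamp
}

// Keep the last row per (siteId, checkId). PostgreSQL rejects an
// INSERT ... ON CONFLICT DO UPDATE that would touch the same row twice
// in one statement, so duplicates have to go before the write.
function dedupeFindings(rows: readonly FindingRow[]): FindingRow[] {
  const byKey = new Map<string, FindingRow>();
  for (const row of rows) {
    byKey.set(JSON.stringify([row.siteId, row.checkId]), row);
  }
  return Array.from(byKey.values());
}

// Build one parameterized multi-row upsert for the whole batch.
// Assumes a hypothetical table with a unique constraint on (site_id, check_id).
function buildBulkUpsert(rows: readonly FindingRow[]): { text: string; values: string[] } {
  const unique = dedupeFindings(rows);
  if (unique.length === 0) throw new Error("nothing to write");

  const values: string[] = [];
  const tuples = unique.map((row) => {
    values.push(row.siteId, row.checkId, row.severity, row.seenAt);
    const n = values.length; // index of the last placeholder in this tuple
    return `($${n - 3}, $${n - 2}, $${n - 1}, $${n})`;
  });

  const text =
    "INSERT INTO findings (site_id, check_id, severity, last_seen_at) VALUES " +
    tuples.join(", ") +
    " ON CONFLICT (site_id, check_id) DO UPDATE SET " +
    "severity = EXCLUDED.severity, last_seen_at = EXCLUDED.last_seen_at";

  return { text, values };
}
```

The returned `text` and `values` are shaped for a client that accepts a query string plus a parameter array. Keep batches bounded, since one statement per scan run is the goal, not one statement per day.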

For a small product, this can be as simple as:

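// Record the run first so every finding can be tied back to it.
await saveScanRun({ siteId, startedAt, checks });
// Idempotent upsert: retrying this call must not create duplicates.
await upsertFindings(siteId, findings);
// Mark completion explicitly so unfinished runs can be detected as stale.
await markScanComplete(siteId, finishedAt);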

The shape matters more than the syntax. Results should be tied to a scan run, findings should be upserted idempotently, and completion should be explicit. That makes retries safe and stale runs easier to detect.
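
To make the retry-safety claim concrete, here is an in-memory sketch of those three steps. Everything in it is hypothetical: the `InMemoryScanStore` class, the explicit `runId` argument (the snippet above keys on `siteId` alone), and the `staleRuns` helper. A real version would sit on a database, but the idempotency rules would be the same.

```typescript
interface ScanRun {
  runId: string;
  siteId: string;
  startedAt: number; // epoch milliseconds
  finishedAt: number | null;
}

interface StoredFinding {
  siteId: string;
  checkId: string;
  detail: string;
  runId: string; // the run that last observed this finding
}

class InMemoryScanStore {
  private runs = new Map<string, ScanRun>();
  private findings = new Map<string, StoredFinding>();

  // Creating the same run twice is a no-op, so a retried worker is harmless.
  saveScanRun(runId: string, siteId: string, startedAt: number): void {
    if (!this.runs.has(runId)) {
      this.runs.set(runId, { runId, siteId, startedAt, finishedAt: null });
    }
  }

  // Keyed by (siteId, checkId): replaying a batch overwrites, never duplicates.
  upsertFindings(
    runId: string,
    siteId: string,
    items: readonly { checkId: string; detail: string }[]
  ): void {
    for (const item of items) {
      const key = JSON.stringify([siteId, item.checkId]);
      this.findings.set(key, { siteId, checkId: item.checkId, detail: item.detail, runId });
    }
  }

  // Completion is explicit and recorded only once.
  markScanComplete(runId: string, finishedAt: number): void {
    const run = this.runs.get(runId);
    if (run && run.finishedAt === null) run.finishedAt = finishedAt;
  }

  // Runs that started more than maxRunMs ago and never completed.
  staleRuns(now: number, maxRunMs: number): ScanRun[] {
    return Array.from(this.runs.values()).filter(
      (run) => run.finishedAt === null && now - run.startedAt > maxRunMs
    );
  }

  findingCount(): number {
    return this.findings.size;
  }
}
```

Running the same three calls twice leaves the store in the same state as running them once, which is the property a retrying worker depends on.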

What builders should copy

The Cloudflare post is an infrastructure story, but the builder takeaway is practical. If your product audits websites, APIs, repositories, cloud configs, SEO, accessibility, or security posture, write down your freshness target before you write more checks.

A useful checklist:

Those are product questions as much as infrastructure questions. A security warning that arrives too late trains users to ignore the scanner. A fast scanner with clear freshness metadata can become part of their operating rhythm.

My take

The most useful part of Cloudflare's scaling story is that it makes scan freshness visible as a design constraint. The team did not just make a background job faster. It changed who gets scanned automatically, how often findings are refreshed, and how much delay the product tolerates.

That matters beyond Cloudflare. AI-assisted website audits, browser-grade fetchers, SEO monitors, dependency scanners, and agent-run code review all have the same shape: collect signals, process them through uneven work, persist findings, and turn them into advice.

If you are building one of those systems, do not wait until the backlog is millions of events deep to treat it like a pipeline. Separate fast and slow work, make writes idempotent, preserve freshness metadata, and design for partial success. The scan is not the product. The timely, trusted finding is the product.
