The Free Buffet for Bots

AI crawlers strip-mine your content for free, and most sites never see them arrive.

The crack

Publishers spend years and real money building content. AI crawlers read all of it, at scale, to train models and to answer questions on someone else's site. The publisher gets no signal, no say, and no return.

The accepted wisdom is that this is simply how the open web works. You put content up, bots take it, and a robots.txt file politely asks them not to. That file is an honor system, and honor is in short supply.

Why it persists

Most origins cannot tell a person from a crawler at request time, so they treat all traffic the same. By the time anyone analyzes the server logs, the content has already been copied. Without a control point at the edge, there is nowhere to enforce a policy.

The fix on Cloudflare

Cloudflare sits in front of the origin, so it can classify the caller before the origin ever responds. Known AI crawlers can be identified by signature, such as the token they send in the User-Agent header, and then handled by policy: allow them, charge them, or block them.

This turns a silent leak into a decision. A site owner gets to choose the relationship with each crawler instead of discovering it after the fact.
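
As a rough illustration of "identify by signature, then handle by policy", here is a minimal sketch in TypeScript. The User-Agent tokens (GPTBot, ClaudeBot, CCBot) are real crawler names, but the rule table, the `classify` function, and the policy assigned to each crawler are assumptions made up for this example, not Cloudflare's actual detection logic.

```typescript
// Hypothetical rule table: the tokens are real crawler names, but the
// policy attached to each one is an illustrative choice by the site owner.
type Policy = "allow" | "charge" | "block";

interface CrawlerRule {
  token: string;    // substring to look for in the User-Agent header
  operator: string; // who runs the crawler
  policy: Policy;   // what the site owner decided for it
}

const RULES: CrawlerRule[] = [
  { token: "GPTBot", operator: "OpenAI", policy: "charge" },
  { token: "ClaudeBot", operator: "Anthropic", policy: "allow" },
  { token: "CCBot", operator: "Common Crawl", policy: "block" },
];

interface Verdict {
  isAiCrawler: boolean;
  operator: string | null;
  policy: Policy;
}

// Match the User-Agent against the table, case-insensitively.
// Anything that matches no rule is treated as ordinary traffic and allowed.
function classify(userAgent: string): Verdict {
  const ua = userAgent.toLowerCase();
  const rule = RULES.find((r) => ua.includes(r.token.toLowerCase()));
  if (!rule) {
    return { isAiCrawler: false, operator: null, policy: "allow" };
  }
  return { isAiCrawler: true, operator: rule.operator, policy: rule.policy };
}
```

The point of keeping the table separate from the matching code is that the relationship with each crawler becomes a one-line edit. A User-Agent string is self-reported, so a sketch like this only covers crawlers that identify themselves honestly.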

How I built the demo

A Pages Function inspects the User-Agent header and returns a classification: is this an AI crawler, who operates it, and what policy applies. The demo can simulate different crawlers so you can watch the verdict change.

An enforce flag makes it real rather than descriptive. With it set, the endpoint returns the actual policy response, including a genuine HTTP 402 for a crawler that should be paying. The interviewer sees enforcement, not a slide about enforcement.
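
The shape of such a function might look like the sketch below. It is not the demo's actual source: the `simulate` and `enforce` query parameters, the policy table, and the choice of 403 for a blocked crawler are all assumptions for illustration. Only the `onRequestGet` export convention and the 402 status come from Pages Functions and HTTP themselves.

```typescript
// Sketch of a Pages Function. The query parameter names and the policy
// table below are hypothetical, chosen only to illustrate the idea.
type Policy = "allow" | "charge" | "block";

const CRAWLERS: Record<string, { operator: string; policy: Policy }> = {
  gptbot: { operator: "OpenAI", policy: "charge" },
  claudebot: { operator: "Anthropic", policy: "allow" },
  ccbot: { operator: "Common Crawl", policy: "block" },
};

// Status code returned for each policy when enforcement is on.
const STATUS: Record<Policy, number> = { allow: 200, charge: 402, block: 403 };

export function onRequestGet(context: { request: Request }): Response {
  const url = new URL(context.request.url);

  // `simulate` lets the demo impersonate a crawler without changing the
  // real header; otherwise fall back to the caller's own User-Agent.
  const ua = (
    url.searchParams.get("simulate") ??
    context.request.headers.get("User-Agent") ??
    ""
  ).toLowerCase();

  const token = Object.keys(CRAWLERS).find((t) => ua.includes(t));
  const match = token ? CRAWLERS[token] : undefined;
  const policy: Policy = match?.policy ?? "allow";
  const verdict = {
    isAiCrawler: match !== undefined,
    operator: match?.operator ?? null,
    policy,
  };

  // Descriptive mode always answers 200 with the verdict as JSON.
  // With enforce=1 the status code itself carries the policy.
  const enforce = url.searchParams.get("enforce") === "1";
  const status = enforce ? STATUS[policy] : 200;

  return new Response(JSON.stringify(verdict), {
    status,
    headers: { "Content-Type": "application/json" },
  });
}
```

Under these assumptions, a request with `?simulate=GPTBot` returns 200 and a JSON verdict, while the same request with `&enforce=1` returns 402 with the same body, so the difference between describing and enforcing is visible in the status line.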