№ XXXII · On Crawlers · 18 September 2026

The robots.txt that quietly blocks your AI buyer.

Shopify's default robots.txt lets Googlebot in. It says nothing about GPTBot, ClaudeBot, PerplexityBot, or Google-Extended. In 2026, the default is the wrong answer.

BetterReviews Editorial · Studio note

CONTENTS · 07

01 What "fall through to default" actually means
02 What the AI crawlers actually want, in 2026
03 What a 2026-correct robots.txt looks like
04 What Search Engine Land found, May 2025
05 The two-line check that almost nobody runs
06 Defaults are not neutral
07 The closing turn

The default robots.txt that ships with a new Shopify store, as of May 2026, contains 47 lines. It blocks crawlers from the cart, the checkout, the admin, the search results page, and a handful of dynamic preview routes. It explicitly names two bots: Googlebot and AhrefsBot. Googlebot is allowed broad access. AhrefsBot is given a narrower set of rules. Every other crawler, named or unnamed, falls through to the default `User-agent: *` block.

What the default file does not contain, anywhere, is a rule for GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, Google-Extended, or Applebot-Extended. Four of those are crawlers that OpenAI, Anthropic, and Perplexity dispatch to read pages, whether for training or to summarise in an answer. The other two are control tokens that opt a site in or out of AI training use by Google and Apple. None of them is named in Shopify's default file. None of them is explicitly allowed. None of them is explicitly blocked.

The ambiguity is the problem.

What "fall through to default" actually means

The robots.txt protocol is older than the modern web. The specification, written by Martijn Koster in 1994, is short and forgiving. A crawler announces itself with a `User-agent` string. The site's robots.txt file lists rules per user agent, with a fallback wildcard. If the crawler matches a specific rule block, it follows that block. If it does not, it follows the wildcard block. If no wildcard exists, it generally assumes everything is allowed.

In Shopify's default file, the wildcard `User-agent: *` block disallows a handful of routes (cart, checkout, admin, etc.) and allows everything else. By the protocol's logic, an unnamed crawler like GPTBot is allowed to crawl the storefront. In practice, all five major AI crawlers (GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, CCBot) currently respect robots.txt and treat the wildcard as permission. The crawl happens. The pages are read.
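The fall-through can be watched offline with Python's standard-library parser. The sketch below is illustrative only: the rules are a trimmed stand-in rather than Shopify's actual file, the domain is a placeholder, and `urllib.robotparser` is one interpretation of the protocol (it applies the first matching rule within a block, which is enough for this case).

```python
from urllib.robotparser import RobotFileParser

# Assumption: a trimmed stand-in for a default-style file,
# with one named bot and one wildcard block. Not the real Shopify default.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /checkout

User-agent: *
Disallow: /cart
Disallow: /checkout
Disallow: /admin
"""

BASE = "https://www.example.com"  # placeholder domain

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` under ROBOTS_TXT."""
    return parser.can_fetch(agent, BASE + path)


# GPTBot has no block of its own, so the wildcard block decides.
print(allowed("GPTBot", "/products/example"))  # allowed: no wildcard rule covers it
print(allowed("GPTBot", "/cart"))              # refused: the wildcard disallows /cart
# Googlebot matches its own block, which says nothing about /cart.
print(allowed("Googlebot", "/cart"))           # allowed: the wildcard is never consulted
```

The last line is the detail operators tend to miss: a crawler that matches a named block ignores the wildcard block entirely, rather than inheriting from it.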

This is the configuration most operators believe they are in. It is not, however, the configuration most operators are actually in. Three failure modes are common enough to be worth naming.

The first is when a site ships a custom robots.txt that overrides the Shopify default with a more restrictive wildcard. The most-copied template on the open web (a popular SEO blog post from 2019, still referenced by Shopify themes) ends with a `User-agent: *` group containing `Disallow: /` for "any unrecognised bot." That one rule blocks every AI crawler that has shipped since.

The second is when a Cloudflare or Vercel WAF rule blocks unknown user agents at the edge, before they ever read robots.txt. Cloudflare, in particular, ships a "Block AI bots" toggle, available since mid-2024 and more recently switched on by default for new sites. The toggle blocks GPTBot, CCBot, ClaudeBot, and a growing list of others at the network layer, regardless of what robots.txt says. The store owner often does not know the toggle exists, because it was set at the Cloudflare account level by the agency that built the site.

The third is when a security-minded admin, often a developer, has added rate-limiting or user-agent filtering rules that incidentally block AI crawlers. The rules were written to block scrapers or competitive intelligence bots. They catch GPTBot in the same net.

The result, in all three cases, is the same: the brand is invisible to the engine its future customers are using to find brands like theirs. The brand does not know. The crawlers do not announce themselves to the marketing team.
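The first of the three failure modes can be reproduced offline. The sketch below assumes a reduced version of the "allow the bots you know, deny the rest" pattern; the template text, the agent list, and the function name are illustrative, not copied from any particular blog post.

```python
from urllib.robotparser import RobotFileParser

# Assumption: a reduced "known bots in, everyone else out" template.
TEMPLATE = """\
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: *
Disallow: /
"""

AI_AGENTS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot", "CCBot"]


def blocked_agents(robots_txt: str, agents: list[str], url: str) -> list[str]:
    """Return the agents from `agents` that robots_txt refuses for `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


# Every AI crawler falls into the wildcard block; Googlebot does not.
print(blocked_agents(
    TEMPLATE,
    AI_AGENTS + ["Googlebot"],
    "https://www.example.com/products/example",  # placeholder URL
))
```

A check like this only sees robots.txt. The second and third failure modes happen at the edge, so they are invisible to it and need a live request to detect.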

What the AI crawlers actually want, in 2026

Five user-agent strings are worth knowing by name. They behave differently and they want different things.

`GPTBot` is OpenAI's training crawler. It reads pages to add them to OpenAI's training data. OpenAI documents the bot's behaviour publicly and respects robots.txt directives. Blocking GPTBot removes the site from OpenAI's training data for future models; OpenAI's documentation treats inclusion in ChatGPT's search results as a separate matter, governed by OAI-SearchBot. Most brands should allow it.

`OAI-SearchBot` is OpenAI's separate crawler for ChatGPT's live search feature, distinct from the training crawler. Allowing GPTBot does not automatically allow OAI-SearchBot; they are two named agents. A site that allows GPTBot but blocks unknown user agents (default-deny) will block OAI-SearchBot by accident, and ChatGPT search will not include the site in its real-time results.

`ClaudeBot` is Anthropic's crawler. Same pattern. Anthropic documents the bot, respects robots.txt, and uses the content to support Claude's responses, including any tool-use that fetches pages live during a conversation.

`PerplexityBot` is Perplexity's crawler. It fetches pages to support Perplexity's answer-engine output, where citation share is unusually visible (Perplexity shows the cited URLs inline with the answer). Blocking PerplexityBot is the most directly costly of the five, because Perplexity's citation surface is the most exposed to end users.

`Google-Extended` is the strange one. It is not a crawler. It is a control token that governs whether Google may use the content already fetched by Googlebot to train its Gemini models. A site that allows Googlebot for normal indexing can still opt out of that training by adding a `User-agent: Google-Extended` group with `Disallow: /`. Google's documentation says the token does not affect inclusion in Google Search, so AI Overviews, as a Search feature, follow Googlebot's rules rather than this one. Allowing Google-Extended is the implicit default if no rule is specified. The same logic applies to `Applebot-Extended` for Apple Intelligence.

The five names should be in any DTC store's robots.txt in 2026, with explicit allow rules for the ones the brand wants reading and explicit deny rules for any the brand does not. Defaulting to the wildcard is no longer good enough, because the wildcard is increasingly intercepted at the CDN edge by tools the operator did not configure.

What a 2026-correct robots.txt looks like

For a DTC store that wants its content cited by AI answer engines (which, by any reasonable read of the citation economy, is every DTC store), the rules below belong in robots.txt above the wildcard block. The example assumes a Shopify storefront, with the standard disallows preserved.

```
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: CCBot
Allow: /

User-agent: *
Disallow: /admin
Disallow: /cart
Disallow: /checkout
Disallow: /orders
Disallow: /policies
Disallow: /account
Disallow: /search
Allow: /
```

The block above is explicit, in 2026, in a way that defaults are not. Each AI crawler is named and explicitly allowed. The wildcard at the bottom catches everything else, with the standard Shopify disallows preserved. Brands that need to be stricter (luxury brands hesitant about training corpora, brands with copyright-sensitive product imagery, brands with active legal posture against AI scraping) can swap the relevant `Allow: /` for `Disallow: /`, with the trade-off that the citation will go elsewhere.
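For a brand that wants the swap to be a decision on record rather than a hand edit, the file can be generated from a short policy. This is a minimal sketch: the agent list and wildcard disallows mirror the example above, while `build_robots_txt` and its `denied` parameter are hypothetical names introduced for illustration.

```python
from collections.abc import Iterable

# Mirrors the ten named agents in the example robots.txt above.
AI_AGENTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "Claude-Web",
    "PerplexityBot", "Perplexity-User", "Google-Extended",
    "Applebot-Extended", "CCBot",
]

# Mirrors the wildcard disallows in the example above.
WILDCARD_DISALLOWS = [
    "/admin", "/cart", "/checkout", "/orders",
    "/policies", "/account", "/search",
]


def build_robots_txt(denied: Iterable[str] = ()) -> str:
    """Render robots.txt with one explicit group per AI agent.

    Agents listed in `denied` get `Disallow: /`; all others get `Allow: /`.
    """
    denied_set = set(denied)
    unknown = denied_set - set(AI_AGENTS)
    if unknown:
        raise ValueError(f"not a known agent: {sorted(unknown)}")

    groups = []
    for agent in AI_AGENTS:
        rule = "Disallow: /" if agent in denied_set else "Allow: /"
        groups.append(f"User-agent: {agent}\n{rule}")

    wildcard = ["User-agent: *"]
    wildcard += [f"Disallow: {path}" for path in WILDCARD_DISALLOWS]
    wildcard.append("Allow: /")
    groups.append("\n".join(wildcard))

    return "\n\n".join(groups) + "\n"


# Example: a brand that opts out of training crawls but keeps search fetchers.
print(build_robots_txt(denied={"GPTBot", "CCBot"}))
```

The point of the `ValueError` is that a misspelled agent name fails loudly instead of silently producing a file that allows the crawler the brand meant to refuse.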

The `ChatGPT-User` and `Claude-Web` agents are worth a separate note. They are not crawlers in the traditional sense. They are the on-demand fetchers triggered when an end user, mid-conversation with ChatGPT or Claude, asks the model to "go look at this page." A site that blocks them blocks every user-initiated fetch from inside an AI chat session, which is the moment a buyer is closest to a purchase decision. (For why the moment matters, see the engine the answer engine reads.)

`Perplexity-User` is the analogue for Perplexity. Same logic. The user has asked Perplexity to fetch the page, in real time, in the middle of a research session. The fetch should succeed.

What Search Engine Land found, May 2025

Reachability funnel · enterprise storefronts (n = 1,000)

All enterprise sites sampled: 100%
Allow AI crawlers in robots.txt: 77% (23% block, deliberately or by inheritance)
Actually reachable by the crawler: 58% (19% ship JS-only HTML the crawler cannot render)

The final reachable fraction is what the engine cites.

Of a thousand enterprise sites, only about three in five are actually visible to AI crawlers. The other two are blocked or empty by the time the bot reads the page. Search Engine Land, May 2025 · n=1,000

A May 2025 Search Engine Land analysis of 1,000 enterprise sites found that 23% of sampled sites either blocked GPTBot explicitly, blocked all unknown user agents, or had a CDN-level rule that intercepted AI crawlers. Of those, fewer than 4% had made a deliberate decision to block; the rest had inherited the configuration from a template, a security default, or a CDN setting they did not know about.

The same study noted that, of the 77% of sites that allowed AI crawlers in principle, roughly 19% were serving content via client-side JavaScript that the crawlers could not render. The crawler was allowed in, fetched the page, received a near-empty HTML shell, and indexed nothing of substance. Two failure modes, layered on top of each other, with the same downstream consequence: invisibility.
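The funnel's arithmetic depends on how the 19% is read: as percentage points of the whole sample, which is what the 58% figure implies, or as a share of the sites that allow crawlers. A quick check under both readings, using only the figures quoted above (the variable names are illustrative):

```python
# Figures as quoted from the study; variable names are illustrative.
SAMPLED = 1_000
BLOCKED_PCT = 23  # block AI crawlers, deliberately or by inheritance
JS_ONLY_PCT = 19  # serve JS-only HTML the crawler cannot render

allowed_pct = 100 - BLOCKED_PCT

# Reading 1: 19 percentage points of the whole sample (matches the funnel).
reachable_points = allowed_pct - JS_ONLY_PCT

# Reading 2: 19% of the sites that allow crawlers.
reachable_relative = allowed_pct * (100 - JS_ONLY_PCT) / 100

print(f"allowed: {allowed_pct}% ({SAMPLED * allowed_pct // 100} sites)")
print(f"reachable, reading 1: {reachable_points}%")
print(f"reachable, reading 2: {reachable_relative:.1f}%")
```

Either reading lands near three in five, so the headline holds whichever the study intended.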

For a DTC store, the two failure modes compound. A brand that has spent eighteen months collecting customer reviews and is paying a review platform six figures a year is, in many cases, shipping those reviews exclusively through a client-rendered widget on a page that may or may not be blocked at the CDN edge. The reviews exist. They are not visible to GPTBot. They are not citable by Perplexity. They are, for the purpose of the citation economy, dead text.

The two-line check that almost nobody runs

The diagnostic is small. Three commands, run from a terminal, against the brand's own domain.

```
curl -A "GPTBot/1.0" -I https://www.brandname.com/products/example
curl -A "ClaudeBot/1.0" -I https://www.brandname.com/products/example
curl -A "PerplexityBot/1.0" -I https://www.brandname.com/products/example
```

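Each command should return HTTP 200. If any returns 403, the crawler is being blocked at the edge, by the CDN or WAF, regardless of what robots.txt says. The `-I` flag requests headers only, so to inspect the body, rerun the command without it: if the status is 200 but the body is an empty HTML shell, the page is allowed but the content is not server-rendered. A spoofed user agent is a proxy for the real crawler, not a perfect one, because some edge rules also check the requesting IP.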
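The two outcomes above (a 403 at the edge, a 200 with an empty shell) can be told apart mechanically. The sketch below is a rough heuristic, not a rendering test: `classify`, `fetch`, and the 200-character threshold are assumptions introduced here, and "visible text" simply means text outside `<script>`, `<style>`, and `<template>` tags.

```python
import urllib.error
import urllib.request
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text that sits outside <script>, <style>, and <template>."""

    SKIP = {"script", "style", "template"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text_length(html: str) -> int:
    """Count characters of text a non-rendering crawler would see."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return len(" ".join(extractor.chunks))


def classify(status: int, html: str, min_chars: int = 200) -> str:
    """Label one response. `min_chars` is an arbitrary, tunable threshold."""
    if status == 403:
        return "blocked"
    if status != 200:
        return "unexpected"
    if visible_text_length(html) < min_chars:
        return "empty shell"
    return "reachable"


def fetch(url: str, user_agent: str) -> tuple[int, str]:
    """GET `url` with a spoofed user agent; returns (status, body)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status, response.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        return err.code, ""
```

Usage would look like `classify(*fetch("https://www.brandname.com/products/example", "GPTBot/1.0"))`. A page that returns "empty shell" deserves a look at the raw HTML before any conclusion is drawn, since the threshold is a guess.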

The same check, run against robots.txt directly:

```
curl -A "GPTBot/1.0" https://www.brandname.com/robots.txt
```

Read the file. Look for explicit rules for the named agents above. If they are absent, the brand is relying on the wildcard, which is increasingly unsafe. If they are present and set to `Disallow: /`, the brand is opted out of the citation economy by configuration. Either way, the check takes three minutes and almost no operator has done it. (For the broader argument about what the engines are doing with what they find, see the citation economy.)
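Reading the file by eye works for one store. For several, the same three outcomes (absent, opted out, explicit) can be reported by a small script. This is a minimal sketch, assuming a simplified reading of robots.txt grouping; `explicit_rules`, `audit`, and the report labels are hypothetical names, and the agent list is the one used in this piece.

```python
NAMED_AGENTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "Claude-Web",
    "PerplexityBot", "Perplexity-User", "Google-Extended",
    "Applebot-Extended", "CCBot",
]


def explicit_rules(robots_txt: str) -> dict[str, list[str]]:
    """Map each user-agent token (lowercased) to the rules in its group."""
    groups: dict[str, list[str]] = {}
    current: list[str] = []   # agents named by the group being read
    reading_agents = False    # True while consecutive User-agent lines run
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not reading_agents:
                current = []
                reading_agents = True
            current.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            reading_agents = False
            for agent in current:
                groups[agent].append(f"{field.capitalize()}: {value}")
    return groups


def audit(robots_txt: str, agents: list[str] = NAMED_AGENTS) -> dict[str, str]:
    """Report, per agent, whether the file names it and what it says."""
    groups = explicit_rules(robots_txt)
    report = {}
    for agent in agents:
        rules = groups.get(agent.lower())
        if rules is None:
            report[agent] = "absent (relies on wildcard)"
        elif "Disallow: /" in rules:
            report[agent] = "opted out"
        else:
            report[agent] = "explicit"
    return report
```

Feeding it the text returned by the curl command above gives one line per agent, which is usually enough to start the conversation with whoever owns the file.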

Defaults are not neutral

The deeper point about the 2026 default robots.txt is that it was written for a different web. Shopify's default file was drafted in an era when "crawlers" meant Google, Bing, and a handful of SEO tools. The file's defaults are good defaults for that era. They are not good defaults for an era in which the most commercially important fetches against the storefront are coming from a set of user agents that did not exist five years ago.

The same is true of every CDN's default rules, every WAF's default templates, every theme's default head tags. The defaults were not written with AI crawlers in mind. They are, increasingly, hostile to discovery by accident. The brand that audits the defaults, names the new crawlers explicitly, and ships a robots.txt that says yes to the agents it wants reading is in a different position than the brand that inherited a 2019 template and never looked again.

The closing turn

In a year, the defaults will catch up. Shopify will ship explicit allow rules for the major AI crawlers, the CDNs will update their templates, the SEO plugins will start writing the right blocks. By then, the brands that updated their robots.txt in 2026 will have eighteen months of citation share that the late adopters will not. The work, today, is small: one file, ten user agents, twenty lines. The brands that do it now look like they have always been visible to the engines. The brands that wait look, for a while, like they are not there at all. (See what the search engine became for what "being there" means in this period.)

If any of this reads like something your store could use, write to us.

We will write back.

Corrections

corrections@better-reviews.com

Mistakes are listed at the foot of the page when found.