
Robots.txt and Your AI Buyer: Are You Blocking the Crawlers That Cite You?

Some stores quietly block the very bots that feed answer engines. How to audit robots.txt for AI crawlers without opening the floodgates.

Updated 2026-06-01 · 7 min

Why would my robots.txt block AI crawlers at all?

Rarely on purpose. Most blocks arrive by inheritance: a security plugin, a CDN preset, a hosting default, or a copied template that added Disallow rules for AI bots during the 2023 wave of "keep my content out of training" advice. The rule sits in the file for years and nobody revisits it.

The intent back then was to keep your content out of model training. The side effect now is that the same rule can keep your reviews out of the live citations that answer engines surface to a buyer mid-purchase. Training access and answer access are not the same thing, but a single Disallow line often blocks both.

Which AI crawler user-agents should I look for?

Different engines send different bots, and blocking one does not block the others. Open your robots.txt and scan the User-agent lines for the named AI crawlers, then read the Disallow rules attached to each block.

A rule under one user-agent applies only to that agent, so a file can welcome Googlebot while quietly shutting out the bot that feeds a different answer engine.

  • GPTBot and OAI-SearchBot, both from OpenAI: GPTBot collects training data, while OAI-SearchBot feeds ChatGPT search.
  • ClaudeBot and Claude-User, used by Anthropic.
  • PerplexityBot, which fetches pages Perplexity cites.
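  • Google-Extended, a control token rather than a separate crawler: it governs whether Google may use your content for Gemini training and grounding. AI Overviews are part of Search and follow ordinary Googlebot rules.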
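
A quick way to see per-agent scoping in practice is Python's standard-library robots.txt parser. The sketch below is illustrative only: the sample file, the product URL, and the helper name access_by_agent are assumptions made up for the example, and the standard parser's matching is simpler than what each real crawler implements.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: welcomes Googlebot, blocks two AI crawlers by name.
SAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Disallow: /
"""

AGENTS = [
    "Googlebot", "GPTBot", "OAI-SearchBot", "ClaudeBot",
    "Claude-User", "PerplexityBot", "Google-Extended",
]


def access_by_agent(robots_txt, url, agents):
    """Return {agent: True/False} for whether each agent may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# Illustrative URL; substitute one of your own product or review pages.
report = access_by_agent(
    SAMPLE_ROBOTS_TXT, "https://yourstore.com/products/example", AGENTS
)
for agent, allowed in report.items():
    print(f"{agent:16} {'allowed' if allowed else 'BLOCKED'}")
```

With this sample, GPTBot and PerplexityBot come back blocked, while the agents that no group names are allowed. That also shows how a file can block a training crawler and leave an answer-facing one such as OAI-SearchBot untouched.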

Does blocking a crawler really mean I lose the citation?

Often, yes. Answer engines lean on crawlers they control to read the live page, and robots.txt is the gate that crawler obeys. If the gate is shut, the engine has no readable copy of your reviews to quote, and you fall out of the answer for any buyer who never types your brand name.

The honest caveat: robots.txt is a request, not a wall. Compliant crawlers honour it, and the major AI bots generally do, so a Disallow reliably costs you with them. It will not stop an actor that ignores the file, and it is not a security control. The point here is narrower: if you want to be cited, do not block the bots that do the citing.

How do I audit robots.txt without opening everything up?

You are not choosing between total lockdown and total exposure. The audit is about removing the accidental blocks on the few crawlers that feed buyer-facing answers, while keeping any blocks you set deliberately.

Work through it in order, and change one thing at a time so you can see what each edit does.

  • Read yourstore.com/robots.txt and list every User-agent block and its Disallow rules.
  • Flag any Disallow that targets GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, or Google-Extended.
  • Decide per bot: a deliberate block stays; an inherited one you never chose comes out.
  • Confirm your review pages and product URLs are not caught by a broad Disallow path.
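
The first, second, and fourth steps above can be roughly sketched in Python. This is a minimal sketch under stated assumptions, not a full robots.txt implementation: the helper names, the sample file, and the test paths are all hypothetical, and the prefix check ignores Allow overrides and wildcard patterns. The per-bot decision in the third step stays with you.

```python
# Hypothetical audit helpers; names and sample data are illustrative only.
AI_AGENTS = {"gptbot", "oai-searchbot", "claudebot", "claude-user",
             "perplexitybot", "google-extended"}


def parse_groups(robots_txt):
    """Split robots.txt text into (user_agents, disallow_paths) groups."""
    groups = []
    agents, disallows, seen_rule = [], [], False
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:  # a User-agent line after rules starts a new group
                groups.append((agents, disallows))
                agents, disallows, seen_rule = [], [], False
            agents.append(value)
        elif field in ("allow", "disallow") and agents:
            seen_rule = True
            if field == "disallow" and value:  # empty Disallow blocks nothing
                disallows.append(value)
    if agents:
        groups.append((agents, disallows))
    return groups


def flag_ai_blocks(robots_txt):
    """List (agent, disallow_paths) for named AI crawlers with Disallow rules."""
    return [
        (agent, paths)
        for agents, paths in parse_groups(robots_txt)
        for agent in agents
        if agent.lower() in AI_AGENTS and paths
    ]


def caught_by_wildcard(robots_txt, url_paths):
    """Return url_paths matched by a Disallow prefix in a `User-agent: *` group.

    Simplification: plain prefix matching, no Allow overrides or wildcards.
    """
    wildcard_paths = [
        path
        for agents, paths in parse_groups(robots_txt)
        if "*" in agents
        for path in paths
    ]
    return [u for u in url_paths if any(u.startswith(p) for p in wildcard_paths)]


SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /cart
Disallow: /pages/

User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /

User-agent: Googlebot
Disallow:
"""

flagged = flag_ai_blocks(SAMPLE_ROBOTS_TXT)
caught = caught_by_wildcard(SAMPLE_ROBOTS_TXT, ["/pages/reviews", "/products/example"])
print(flagged)
print(caught)
```

To run this against a live store, fetch the text of yourstore.com/robots.txt first (for example with urllib.request) and pass it in place of the sample. Keeping the helpers text-only makes it easy to rerun them after each single edit.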

I unblocked the crawlers. Why are my reviews still not cited?

Because access is necessary, not sufficient. Letting the crawler in only matters if there is something readable for it to take once it arrives. A store can welcome every AI bot and still go uncited because its reviews render inside a JavaScript widget the crawler sees as an empty container.

Most review apps were built for the on-page shopper and stop at the widget, which leaves the reviews unreadable to the very crawlers you just unblocked. Getting your existing reviews readable, corroborated, and cited in search and AI is the gap BetterReviews is built to close. Robots.txt opens the door; readable review HTML is what the engine actually carries back.
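
One rough way to test readability is to check whether a known review sentence appears in the raw HTML, before any JavaScript runs. The sketch below assumes a crawler that does not execute scripts, which is an assumption about crawler behaviour rather than a documented guarantee. The two sample pages, the review text, and the helper names are invented for illustration.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text(html):
    """Return the page text a non-JavaScript reader would see, whitespace-normalised."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(" ".join(collector.chunks).split())


def review_is_readable(html, review_snippet):
    """True if the snippet appears in the raw HTML text (case-insensitive)."""
    return review_snippet.lower() in visible_text(html).lower()


# Hypothetical page where reviews only exist inside a script-driven widget.
WIDGET_ONLY = (
    '<html><body><h1>Trail Mug</h1><div id="reviews-widget"></div>'
    '<script>loadReviews({"text": "Keeps coffee hot for hours"})</script>'
    "</body></html>"
)
# Hypothetical page where the same review is present in the HTML itself.
SERVER_RENDERED = (
    '<html><body><h1>Trail Mug</h1><div id="reviews">'
    "<p>Keeps coffee hot for hours.</p></div></body></html>"
)

print(review_is_readable(WIDGET_ONLY, "keeps coffee hot"))
print(review_is_readable(SERVER_RENDERED, "keeps coffee hot"))
```

The widget-only page fails the check even though the review text is technically in the response, because it sits inside a script rather than in readable markup.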

How is this different from a noindex tag or an llms.txt file?

They control different things. Robots.txt governs whether a crawler may fetch a page at all. A noindex tag lets the crawler fetch but asks the engine not to index, so it is a softer signal and only seen after the fetch is allowed. An llms.txt file is a separate proposal that points models at preferred content and has no power to block anyone.

For AI citation, robots.txt is the load-bearing one. If it disallows the crawler, nothing downstream matters, because the engine never gets to read the page, let alone weigh a noindex tag or an llms.txt hint.
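
The ordering can be sketched as a small decision function: the robots.txt check comes first, and a noindex tag is only ever inspected if the fetch is allowed. This is a simplified model for illustration, not how any particular engine is implemented. The function name crawl_outcome, the outcome labels, and the sample inputs are all assumptions.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class _RobotsMetaFinder(HTMLParser):
    """Sets `noindex` if a <meta name="robots"> tag contains "noindex"."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        is_robots_meta = attr.get("name", "").lower() == "robots"
        if is_robots_meta and "noindex" in attr.get("content", "").lower():
            self.noindex = True


def crawl_outcome(robots_txt, agent, url, html):
    """Model the order of checks: robots.txt first, then the page's noindex tag."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(agent, url):
        return "blocked"  # page never fetched, so any noindex tag is never seen
    finder = _RobotsMetaFinder()
    finder.feed(html)
    return "noindex" if finder.noindex else "indexable"


# Hypothetical inputs for illustration.
BLOCKING = "User-agent: PerplexityBot\nDisallow: /\n"
OPEN = "User-agent: *\nDisallow:\n"
NOINDEX_PAGE = '<html><head><meta name="robots" content="noindex"></head><body>Reviews</body></html>'
PLAIN_PAGE = "<html><head></head><body>Reviews</body></html>"
URL = "https://yourstore.com/products/example"

print(crawl_outcome(BLOCKING, "PerplexityBot", URL, NOINDEX_PAGE))
print(crawl_outcome(OPEN, "PerplexityBot", URL, NOINDEX_PAGE))
print(crawl_outcome(OPEN, "PerplexityBot", URL, PLAIN_PAGE))
```

An llms.txt file does not appear in this flow at all, which matches its role as a hint rather than a gate.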

What this adds up to

Robots.txt is a five-minute check with outsized stakes. A single inherited Disallow can remove your store from AI answers without anyone noticing, and the fix is usually deleting a line you never meant to add. Audit the AI crawler user-agents, keep the blocks you chose, drop the ones you inherited, then make sure the reviews behind the door are actually readable once the crawler walks through it.

  • Yes: answer engines reach your reviews through crawlers that robots.txt can allow or block (AEO research synthesis, 2025).
  • Block = invisible: blocking AI crawlers can mean exclusion from AI-answer citation (AEO research synthesis, 2025).
  • ~5 min: auditing robots.txt user-agent rules is a quick, high-leverage check (AEO research synthesis, 2025).

Common questions

Where do I find my robots.txt file?

It lives at the root of your domain: type yourstore.com/robots.txt into a browser. If the page is blank or returns a 404, you have no rules in place, which means you are not blocking AI crawlers, though it also means you are relying entirely on platform defaults.

Will unblocking AI crawlers let my content be used for model training?

Possibly, and that is a real trade-off to weigh. Some crawlers fetch for live answers, some for training, and a few user-agents do both. If training use concerns you, allow the answer-facing bots (such as OAI-SearchBot and PerplexityBot) while keeping training-oriented rules, rather than blocking everything and losing citations too.

Does robots.txt actually stop a crawler, or is it just a suggestion?

It is a request that compliant crawlers honour, and the major AI bots generally do. That makes it reliable for shaping who cites you, but it is not a security boundary. An actor that ignores the file can still read the page, so never treat robots.txt as access control for anything sensitive.

If I allow the crawlers, will my reviews get cited automatically?

No. Allowing the crawler only removes the block; the engine still needs readable review text to quote. If your reviews render inside a JavaScript widget the crawler reads as empty, you can be fully unblocked and still uncited. Access and readability are two separate jobs.