Field Guide

AI Crawler Checker: Reads Your Live robots.txt, Bot by Bot

By Jerome Bilaos · Technical Web Architect · Updated June 2026

I built this tool to answer one narrow, mechanical question: when an AI crawler reads your robots.txt, is it told it can come in, or not? Not "are you visible to AI" — that is a bigger question with a dozen moving parts. Just access. The first gate. If you fail it, nothing downstream matters.

Here is exactly what happens when you paste a domain.

What it checks

The tool fetches your live robots.txt from the root of the domain you give it, parses it, and then evaluates thirteen named AI user-agents against the rules it finds. The list is the one that actually matters in 2026:

GPTBot and OAI-SearchBot (OpenAI), plus ChatGPT-User for in-session browsing
ClaudeBot, anthropic-ai, and Claude-Web (Anthropic)
PerplexityBot (Perplexity)
Google-Extended (Google's AI training/Gemini opt-out token, separate from Googlebot)
Applebot-Extended (Apple), Amazonbot (Amazon), CCBot (Common Crawl), Bytespider (ByteDance, TikTok's parent company), Meta-ExternalAgent (Meta)

For each one it returns allowed, blocked, or partial, and a count of how many are open versus closed.
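For a rough feel of the open-versus-closed count, here is a minimal sketch using Python's standard-library urllib.robotparser. It is not the tool's own code. The function name root_access, the sample file, and the example.com default are made up for illustration. The standard-library parser only answers yes or no for a single URL (here, the site root), so it cannot produce the "partial" verdict, and its handling of Allow/Disallow ordering inside a group may differ from other parsers.

```python
from urllib.robotparser import RobotFileParser

# The thirteen user-agent tokens named in the list above.
AI_BOTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "anthropic-ai", "Claude-Web",
    "PerplexityBot", "Google-Extended", "Applebot-Extended",
    "Amazonbot", "CCBot", "Bytespider", "Meta-ExternalAgent",
]


def root_access(robots_txt: str, domain: str = "example.com") -> dict[str, bool]:
    """Map each bot token to whether it may fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse text directly, no network call
    root = f"https://{domain}/"
    return {bot: parser.can_fetch(bot, root) for bot in AI_BOTS}


# Hypothetical robots.txt: open by default, two bots blocked by name.
sample = """\
User-agent: *
Disallow:

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
"""

result = root_access(sample)
allowed = sum(result.values())
print(f"{allowed} allowed, {len(result) - allowed} blocked")
```

With this sample file the two named bots come back closed and the other eleven fall through to the open catch-all group.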

Why this is different from eyeballing your robots.txt yourself

You can open yourdomain.com/robots.txt in a browser. The problem is that reading it correctly is harder than it looks, and most "is my site blocked" checkers cut corners in ways that produce wrong answers.

This tool follows the real precedence rule a compliant crawler uses. It looks for a group that names the bot's own user-agent first. Only if there isn't one does it fall back to the catch-all User-agent: * group. That distinction trips people up constantly. A site can have a wide-open * group and a buried User-agent: GPTBot block three lines down — and GPTBot obeys the specific block, not the generous default. A naïve checker that only reads * would tell you everything is fine. This one tells you GPTBot is blocked, because it is.
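The precedence rule is easier to see in code. This is a minimal sketch under simplifying assumptions, not the tool's source: the helper names parse_groups and rules_for are hypothetical, user-agent tokens are matched by exact case-insensitive name, and only User-agent, Allow, and Disallow lines are read.

```python
Rule = tuple[str, str]            # (field, path), e.g. ("disallow", "/")
Group = tuple[list[str], list[Rule]]


def parse_groups(robots_txt: str) -> list[Group]:
    """Split robots.txt text into (user-agent tokens, rules) groups."""
    groups: list[Group] = []
    agents: list[str] = []
    rules: list[Rule] = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:                         # rules seen: previous group is done
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif field in ("allow", "disallow") and agents:
            rules.append((field, value))
    if agents:
        groups.append((agents, rules))
    return groups


def rules_for(bot: str, groups: list[Group]) -> list[Rule]:
    """Use the groups that name the bot; only if none exist, fall back to '*'."""
    token = bot.lower()
    named = [rules for agents, rules in groups if token in agents]
    chosen = named if named else [rules for agents, rules in groups if "*" in agents]
    return [rule for rules in chosen for rule in rules]


# Hypothetical file: generous default, specific block for GPTBot.
sample = """\
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /
"""

groups = parse_groups(sample)
print(rules_for("GPTBot", groups))     # the GPTBot group's own rules
print(rules_for("ClaudeBot", groups))  # no named group, so the '*' rules
```

A checker that only looked up the "*" group would return the Allow rule for both bots, which is exactly the wrong answer for GPTBot.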

It also handles the cases that matter:

An empty Disallow: means "allow everything" — not a block. The tool treats it correctly.
A Disallow: / with an explicit Allow: / override is reported as partial, not a hard block, because that override genuinely changes the outcome.
Some paths disallowed but not the root is partial — usually an admin folder, often exactly what you want.
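Those three cases, plus the plain hard block, can be sketched as a small classifier. This is an illustration under assumptions, not the tool's actual logic: the function name classify and the (field, path) rule representation are invented here, any non-empty Allow line next to Disallow: / is treated as an override, and wildcard patterns such as /* are not interpreted.

```python
def classify(rules: list[tuple[str, str]]) -> str:
    """Return 'allowed', 'blocked', or 'partial' for one bot's rule list.

    Each rule is a (field, path) pair such as ("disallow", "/private/").
    """
    # Empty paths are dropped: "Disallow:" with no value blocks nothing.
    disallows = [path for field, path in rules if field == "disallow" and path]
    allows = [path for field, path in rules if field == "allow" and path]

    if not disallows:
        return "allowed"                  # no rules, or only empty Disallow lines
    if "/" in disallows:
        # Assumption: any explicit Allow alongside Disallow: / counts as an override.
        return "partial" if allows else "blocked"
    return "partial"                      # some paths closed, root still open


print(classify([("disallow", "")]))                   # allowed
print(classify([("disallow", "/"), ("allow", "/")]))  # partial
print(classify([("disallow", "/wp-admin/")]))         # partial
print(classify([("disallow", "/")]))                  # blocked
```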

The honesty I built in

The single most common mistake in this category of tool is reporting "blocked" when nothing is actually blocked. So the default is deliberately conservative: if robots.txt is missing (the request returns a 404) or the file is empty, every bot is reported as allowed, because with no rules in place that is the truth. No rules means open. The tool will never invent a block to scare you.
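The conservative default can be sketched like this. The function names fetch_status_and_body and rules_source and the two source labels are hypothetical, and this is not the tool's code. The paragraph above only covers a missing or empty file, so the sketch raises on any other HTTP status rather than guess what the tool does with it.

```python
import urllib.error
import urllib.request


def fetch_status_and_body(domain: str, timeout: float = 10.0) -> tuple[int, str]:
    """Fetch https://<domain>/robots.txt and return (HTTP status, body text)."""
    url = f"https://{domain}/robots.txt"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response.status, body
    except urllib.error.HTTPError as err:
        return err.code, ""  # 4xx/5xx responses: keep the status, drop the body


def rules_source(status: int, body: str) -> tuple[str, str]:
    """Return (rules text, source label) using the conservative default."""
    if status == 404 or (status == 200 and not body.strip()):
        return "", "no-file-found default"  # no rules means every bot is allowed
    if status == 200:
        return body, "live robots.txt"
    # Assumption: other statuses are outside what the article describes.
    raise ValueError(f"unhandled HTTP status {status}")


print(rules_source(404, ""))
print(rules_source(200, "User-agent: *\nDisallow: /")[1])
```

Keeping the decision in rules_source, separate from the network call, makes the default easy to check without fetching anything.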

That cuts both ways. It also won't pretend to know things it can't see. Which brings me to the limits.

What it does not do — and you should know this

This tool checks access only. It reads the rulebook at your front door. It does not:

Execute a crawl. It doesn't fetch your pages, so it can't tell you whether a bot that is allowed in finds anything worth quoting.
Measure whether AI will actually cite you. Being allowed in is necessary, not sufficient. Thin pages, content hidden behind JavaScript, and weak authority all keep you out of answers even with a perfectly open robots.txt.
Check firewall or WAF-level blocks. Some security plugins and CDNs ban AI user-agents at the network layer, before robots.txt is ever consulted. That kind of block won't show up here, because it isn't in the file.
Crawl your whole site. It reads one file: the root robots.txt. That file governs every path on that host, so one fetch is enough for access, but it answers the access question only. A subdomain serves its own robots.txt and has to be checked separately.

I'd rather tell you that plainly than imply the tool does more than it does.

What you'll see when you run it

Paste a domain, hit check, and within a second or two you get a list: each bot name, its user-agent token, and a coloured pill — green allowed, red blocked, amber partial. Above the list, a one-line summary: "yourdomain.com — 11 AI bots allowed, 2 blocked." Below it, a note telling you whether the result came from a live robots.txt or from the no-file-found default.

The result you want is mostly green. The result worth acting on is a red pill next to a bot you care about — or, worse, a Disallow: / under User-agent: * that's quietly blocking everything, Googlebot included.

Who should run this

Anyone who has recently rebuilt a site, migrated platforms, or installed a security plugin — those are the three moments an accidental block sneaks in. Also anyone who has heard "we should let the AI crawlers in" and wants to confirm, in ten seconds, whether they already have.

Run the AI Crawler Checker to confirm. If it comes back clean, good: the door's open. If you want to know what happens after the bots walk through the door (whether your pages are readable, quotable, and trusted across the whole site), that's the full AUDXY website audit.