What we can and can't verify

iscrawlable runs a public, unauthenticated crawler-readiness check across robots.txt, HTTP responses, indexability headers, sitemap, llms.txt, and WAF/CDN signals. A public scan can answer some questions with high confidence. Others need a connected scan with read-only access to your CDN. This page lays out which is which.

Public scan vs connected scan

A public scan probes your site the way an external crawler would. It cannot see configuration that lives behind your CDN dashboard — for example, the Block AI Bots toggle in Cloudflare's AI Crawl Control panel, or custom WAF rules that match on attributes a public probe cannot reproduce.

A connected scan asks for a read-only API token from your CDN provider so we can read those settings directly. We use the token only to read configuration, never to change it. The connected scan is a Pro feature.

User-agent simulation, not source IP

Our public scan sends requests with the published user agent strings of major AI crawlers. We do not originate from the IP ranges those crawlers actually use, and we do not impersonate verified-bot identities. Sites that gate access by source IP or by verified-bot signature may treat our probe differently from a real crawler. That is a known limit of any public crawler-readiness check.
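As a rough illustration of what user-agent simulation covers, the sketch below evaluates a robots.txt body per crawler token and builds a probe request that carries a crawler-style User-Agent header. It is not our scanner's implementation. The function names, the sample robots.txt, and the ExampleBot user agent string are all hypothetical, and the token list is illustrative rather than complete.

```python
from urllib.request import Request
from urllib.robotparser import RobotFileParser

# robots.txt product tokens for a few AI crawlers (illustrative, not exhaustive).
CRAWLER_TOKENS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]


def robots_verdicts(robots_txt: str, url: str) -> dict[str, bool]:
    """Return, per crawler token, whether robots.txt allows fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {token: parser.can_fetch(token, url) for token in CRAWLER_TOKENS}


def build_probe(url: str, user_agent: str) -> Request:
    """Build (but do not send) a GET request carrying a crawler user agent.

    Only the header is simulated. The request would still leave from the
    scanner's own IP address, not from the crawler's published ranges.
    """
    return Request(url, headers={"User-Agent": user_agent}, method="GET")


# Hypothetical robots.txt: GPTBot is disallowed, everyone else is allowed.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

verdicts = robots_verdicts(SAMPLE_ROBOTS, "https://example.com/docs")

# Placeholder user agent string; a real probe would use the published one.
probe = build_probe(
    "https://example.com/docs",
    "ExampleBot/1.0 (+https://example.com/bot)",
)
```

Nothing in this sketch changes where the request comes from, which is why a site that gates on source IP or verified-bot signature can answer the probe differently from the real crawler.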

Verified bot IP limits

OpenAI, Anthropic, Perplexity, and Google publish IP ranges for some of their crawlers. We compare the public response a site returns to our probe against the documented behavior of those crawlers, but because our requests do not originate from those ranges, we cannot exercise IP-based allow-lists from outside the perimeter. If your access policy depends on IP attestation, a connected scan is the only way to verify the rule end-to-end.
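To see why an IP-based allow-list behaves differently for our probe, here is a minimal sketch of the kind of membership test such a rule performs. The CIDR blocks are reserved documentation ranges used as placeholders, not any vendor's real published ranges, and the function name is hypothetical.

```python
import ipaddress


def ip_in_ranges(ip: str, cidrs: list[str]) -> bool:
    """Return True if ip falls inside any of the given CIDR blocks."""
    address = ipaddress.ip_address(ip)
    # An IPv4 address is never "in" an IPv6 network (and vice versa),
    # so mixed lists are safe to test against.
    return any(address in ipaddress.ip_network(cidr) for cidr in cidrs)


# Placeholder ranges (documentation blocks), standing in for a crawler's
# published IP list.
PUBLISHED_RANGES = ["192.0.2.0/24", "2001:db8::/32"]

crawler_ip_allowed = ip_in_ranges("192.0.2.17", PUBLISHED_RANGES)
probe_ip_allowed = ip_in_ranges("198.51.100.9", PUBLISHED_RANGES)
```

In this sketch the address inside the listed range passes and the outside address does not, regardless of which user agent string either request carries.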

Cloudflare connected scan

If you connect Cloudflare with a read-only API token, we can additionally inspect:

Block AI Bots toggle state
AI Crawl Control rule set
Managed robots.txt overrides
Bot Fight Mode setting
Custom WAF rules that match AI crawler user agents

Even with a Cloudflare token, we still cannot see rules at the origin server level (nginx, Apache, or application code) or policies at any other layer in front of Cloudflare. We also do not modify any settings: this scan is read-only by contract.

Perplexity declared-crawler caveat

We only check declared Perplexity user agents and public access signals. We cannot verify stealth, third-party, or undeclared crawlers from a public scan. Perplexity's user-triggered agent (Perplexity-User) is shown as auxiliary context only and never counts against the main pass/fail.

What each result status means

Pass: Allowed by the public checks we can verify.
Fail: Blocked or disallowed by at least one public signal.
Warning: Mixed signals — one layer looks open, another looks restricted or ambiguous.
Unknown: We could not verify this from public checks.
Needs connected scan: Public checks are not enough. Connect Cloudflare read-only access to inspect WAF and AI Crawl Control settings.

Pass means crawlers appear allowed by public checks. It does not guarantee citation in ChatGPT, Claude, Perplexity, or Google AI results.
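One way to picture how per-layer signals could roll up into these statuses is the sketch below. It is an assumption-laden illustration, not our scoring logic: the Signal names, the resolve_status function, the cdn_config_required flag, and the precedence order (a block outranks everything else) are all hypothetical.

```python
from enum import Enum


class Signal(Enum):
    OPEN = "open"              # layer looks open to the crawler
    BLOCKED = "blocked"        # layer blocks or disallows the crawler
    AMBIGUOUS = "ambiguous"    # layer gives an unclear answer
    UNVERIFIED = "unverified"  # layer could not be checked publicly


def resolve_status(signals: dict[str, Signal], cdn_config_required: bool = False) -> str:
    """Collapse per-layer signals into one result status.

    Assumed precedence: Fail > Needs connected scan > Warning > Pass > Unknown.
    """
    values = set(signals.values())
    if Signal.BLOCKED in values:
        return "Fail"
    if cdn_config_required:
        return "Needs connected scan"
    if Signal.OPEN in values and Signal.AMBIGUOUS in values:
        return "Warning"
    if values == {Signal.OPEN}:
        return "Pass"
    return "Unknown"


all_open = resolve_status({"robots.txt": Signal.OPEN, "http": Signal.OPEN})
one_block = resolve_status({"robots.txt": Signal.OPEN, "http": Signal.BLOCKED})
mixed = resolve_status({"robots.txt": Signal.OPEN, "waf": Signal.AMBIGUOUS})
unverified = resolve_status({"robots.txt": Signal.UNVERIFIED})
behind_cdn = resolve_status({"robots.txt": Signal.OPEN}, cdn_config_required=True)
```

Under these assumptions a single blocking layer is enough for Fail, and Pass requires every checked layer to look open.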
