Top AI Crawlers in 2026: A Bot-by-Bot Guide

Most of the traffic hitting your site is no longer human. On June 3, 2026, Cloudflare CEO Matthew Prince shared Cloudflare Radar data showing automated traffic at 57.5% of requests for web content, ahead of humans at 42.5%. That is the first time in the history of the web that bots have led, and it came roughly eighteen months ahead of Prince's earlier forecast of the end of 2027. A large and growing share of those bots are AI crawlers, and a list of their names is the easy part. The part that decides whether they help or hurt you is what each one is actually doing on your pages.

This is a current, source-checked guide to the top AI crawlers in 2026. Instead of dumping every user-agent string into one pile, it sorts each bot the way citAEOtion sorts them: by purpose. That is the only grouping that tells you what to do next.

Sort crawlers by purpose, not by name

A name like GPTBot tells you who owns the bot. It does not tell you whether that bot is training a model on your content, deciding whether to cite you in an AI search result, fetching you live to answer one user's question, or just taking your pages with no intent to send anyone back. Those are four different outcomes for your business, and they call for four different responses.

citAEOtion reads your server logs and sorts every crawler into four categories: AI Training, AI Search, AI Assistant, and Data Scraper. The classification comes from what the bots actually did on your site, not from asking a model what it thinks of you. Training crawlers are the leading indicator. When a model trains on your content, the search and assistant citations tend to follow months later. So the bot list below is organized inside those four buckets, because that order is the one that maps to action.
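
As a rough illustration of sorting by purpose, here is a minimal Python sketch that maps a user-agent string to one of the four buckets. It is a name-only first pass, not citAEOtion's classifier, which works from observed behavior. The `PURPOSE_BY_TOKEN` table and the `classify` function are hypothetical names, and the bucket assignments simply mirror where this guide lists each bot.

```python
# Hypothetical purpose map built from the bots named in this guide.
# Meta-ExternalAgent serves mixed purposes; it is filed under AI Search
# here only because that is where this guide places it.
PURPOSE_BY_TOKEN = {
    "GPTBot": "AI Training",
    "ClaudeBot": "AI Training",
    "GoogleOther": "AI Training",
    "Amazonbot": "AI Training",
    "PetalBot": "AI Training",
    "Meta-ExternalAgent": "AI Search",
    "Claude-User": "AI Assistant",
    "PerplexityBot": "AI Assistant",
    "Perplexity-User": "AI Assistant",
}


def classify(user_agent: str) -> str:
    """Return the purpose bucket for a user-agent string, or 'Unclassified'."""
    ua = user_agent.lower()
    for token, purpose in PURPOSE_BY_TOKEN.items():
        # Case-insensitive substring match on the declared bot token.
        if token.lower() in ua:
            return purpose
    return "Unclassified"


# Illustrative user-agent strings, not copied from any vendor's documentation.
print(classify("Mozilla/5.0 (compatible; GPTBot/1.0)"))
print(classify("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Safari/605.1.15"))
```

Anything that spoofs a browser string lands in "Unclassified", which is the limit of matching on names alone.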

AI Training crawlers

These bots pull content into model training pipelines. They tend to consume large volumes of pages and revisit often. If you want a say in whether your content shapes the next model, these are the ones to recognize first. On Vercel's network, GPTBot alone made about 569 million requests in a single month and ClaudeBot about 370 million, so the training tier is where most of the raw crawl pressure lives.

GPTBot (OpenAI)

GPTBot is the primary crawler that feeds web content into OpenAI's models behind ChatGPT and the API. It declares a clear user-agent string, and OpenAI publishes its crawler list so you can allow or block it through robots.txt. The reach behind it is real: OpenAI reported more than 900 million weekly active ChatGPT users as of March 2026. In 2026 crawl-volume rankings, GPTBot sits near the top of the AI pack, trading places with Meta's crawler month to month.

ClaudeBot (Anthropic)

ClaudeBot collects web content for training Anthropic's models, the ones that power Claude. It is a training crawler, not the live-fetch bot many people assume it is. Anthropic now runs separate agents for separate jobs: ClaudeBot for training data, and Claude-User for fetching pages live when someone asks Claude a question. Each has its own user-agent string, so you can make a different call for each. Block ClaudeBot and you opt out of training. Block Claude-User and you can disappear from answers Claude gives real users.
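
To see how that split plays out in a robots.txt file, the sketch below feeds a small example policy to Python's standard `urllib.robotparser` and checks each agent separately. The policy text and the `example.com` URL are assumptions for illustration. Python's parser uses simple substring matching on agent names, so treat it as an approximation of how any given crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# Example policy (an assumption for illustration): opt out of training,
# stay available for live answers.
ROBOTS_TXT = """\
User-agent: ClaudeBot
Disallow: /

User-agent: Claude-User
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

page = "https://example.com/pricing"  # placeholder URL
training_allowed = parser.can_fetch("ClaudeBot", page)
assistant_allowed = parser.can_fetch("Claude-User", page)

print("ClaudeBot may fetch:", training_allowed)     # False under this policy
print("Claude-User may fetch:", assistant_allowed)  # True under this policy
```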

GoogleOther (Google)

GoogleOther is Google's general-purpose crawler for non-search work, including data pulls for AI and internal research. It is separate from standard Googlebot, which handles search indexing. Google also offers Google-Extended, which is not a separate crawler but a robots.txt token that lets you opt out of having your content used to train Gemini. GoogleOther respects robots.txt.

Amazonbot (Amazon)

Amazonbot crawls the web for Amazon's indexing and AI work, including data that can feed model training and product knowledge. It identifies itself with a documented user-agent string and follows robots.txt.

PetalBot (Huawei)

PetalBot serves Huawei's Petal Search and AI features. It crawls both desktop and mobile pages to build its index and feed Huawei Assistant and AI search. It follows robots.txt, though some operators rate-limit it for crawling large sites aggressively.

AI Search crawlers

These crawlers index your pages so an AI engine can decide whether to cite you when it answers a search. They are the bridge between training and live answers. Meta's crawler is the clearest example of how blurry this tier can be.

Meta-ExternalAgent (Meta)

Meta-ExternalAgent serves mixed purposes: collecting training data for Meta's models, including Llama, and retrieving content for Meta AI across Facebook, Instagram, and WhatsApp. It is one of the highest-volume AI crawlers on the open web. In 2026 crawl-volume rankings it has run neck and neck with GPTBot for the number-two spot behind Googlebot, a striking position for a company built on closed platforms. It publishes a user-agent string you can match in your logs.

AI Assistant crawlers

These bots fetch your pages live to answer a user's question in the moment. Blocking them can quietly remove you from assistant replies. This is also the tier where good behavior gets murky.

PerplexityBot and Perplexity-User (Perplexity AI)

Perplexity runs crawlers to power its answer engine and its Comet agentic browser, fetching pages to build cited summaries. The behavior is the controversy. In August 2025, Cloudflare reported that Perplexity continued to access content from sites that had blocked it, rotating IP addresses and impersonating a normal Chrome-on-macOS browser to slip past robots.txt and firewall rules. Cloudflare de-listed it as a verified bot and added rules to block the stealth crawling. So Perplexity wants to be in the assistant tier, but on a site that has told it no, it can behave like a scraper. That gap is exactly the kind of thing a name-only list hides and a real log record exposes.

Data Scrapers and stealth crawlers

This bucket is everything else that takes your content, with attribution optional and rules optional. Two patterns dominate in 2026.

Grok (xAI)

xAI documents user agents such as GrokBot, xAI-Grok, and Grok-DeepSearch, but in practice almost no server traffic shows up using them. Reports indicate Grok rotates residential IP addresses and spoofs ordinary Safari, Chrome, and iPhone user-agent strings, which makes its fetches look like a human visitor and robots.txt close to useless against it. From your side, a Grok fetch often arrives with no AI signal to act on at all.

Agentic browsers

A newer category grew fast through 2025 and 2026: agents that browse on a user's behalf, click, and complete tasks rather than just index. The list includes ChatGPT Agent and ChatGPT Atlas from OpenAI, Claude for Chrome from Anthropic, Microsoft Copilot actions, Google's agent work, and Perplexity Comet. Many of these present generic browser user agents, so they rarely show up as a clean bot string in your logs. Catching them takes pattern analysis, not a robots.txt line.

What a robots.txt list can and cannot do

A maintained robots.txt is still worth keeping. For the well-behaved bots (GPTBot, ClaudeBot, GoogleOther, Amazonbot, PetalBot, and Meta-ExternalAgent), it is a real control. Add the user-agent tokens you want to allow or disallow, and most of those crawlers will honor the file.

The catch is that robots.txt is an honor system, and a growing share of AI traffic does not play along. Perplexity has been caught ignoring it. Grok never really declares itself. Agentic browsers wear a normal browser's clothes. For that traffic, the file does nothing. The only thing that catches it is the record of what hit your server, by IP, by user agent, by page, by timestamp. Two more facts from Vercel's network data make the point: the major AI crawlers do not render JavaScript, and ChatGPT and Claude crawlers spend more than a third of their requests on 404 pages. You learn that kind of thing from logs, never from a model.
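
The 404 figure is the kind of number you can check against your own logs. The following sketch assumes the common "combined" access log format and computes the share of each bot's requests that returned 404. The sample lines are synthetic, the IPs come from documentation-only ranges, and `not_found_share` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Pulls the request path, status code, and user agent out of a
# combined-format access log line.
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

BOT_TOKENS = ("GPTBot", "ClaudeBot")


def not_found_share(lines):
    """Return {bot: fraction of that bot's requests that got a 404}."""
    totals = defaultdict(int)
    misses = defaultdict(int)
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        for token in BOT_TOKENS:
            if token in match["ua"]:
                totals[token] += 1
                if match["status"] == "404":
                    misses[token] += 1
                break
    return {bot: misses[bot] / totals[bot] for bot in totals}


# Synthetic sample data for illustration only.
SAMPLE_LINES = [
    '203.0.113.7 - - [03/Jun/2026:10:00:00 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.7 - - [03/Jun/2026:10:00:02 +0000] "GET /old-page HTTP/1.1" 404 312 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.7 - - [03/Jun/2026:10:00:04 +0000] "GET /blog HTTP/1.1" 200 8100 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.9 - - [03/Jun/2026:10:01:00 +0000] "GET /missing HTTP/1.1" 404 312 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '198.51.100.9 - - [03/Jun/2026:10:01:05 +0000] "GET / HTTP/1.1" 200 9000 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '192.0.2.44 - - [03/Jun/2026:10:02:00 +0000] "GET / HTTP/1.1" 200 9000 "-" "Mozilla/5.0 (Macintosh) Safari/605.1.15"',
]

shares = not_found_share(SAMPLE_LINES)
for bot, share in shares.items():
    print(f"{bot}: {share:.0%} of requests hit a 404")
```

A high share usually points at stale URLs the bot learned somewhere, which redirects can often fix.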

Read the bots instead of guessing

The reason to sort crawlers by purpose is that the sort tells you what to change. Heavy training crawls today are a signal that models are learning from your content, which tends to precede citations. A scraper hammering your pricing page is a cost to throttle. An assistant bot fetching you live is visibility to protect. Lump them into one number and you learn nothing. Separate them and you can see your real position in the AI ecosystem and act on it.

citAEOtion does this from real server logs, not prompt guesses. It names every known AI crawler, sorts each one into AI Training, AI Search, AI Assistant, or Data Scraper, shows page-level hit counts, tracks the trend as the bot mix shifts, and keeps its classifier current as new crawlers appear. It installs as a WordPress plugin in about five minutes. The thesis is simple: the GA of AI. Full data. No BS. The goal is not to rank in a guess. It is to become the answer, measured on evidence instead of vibes.

Knowing the names is the start. Knowing what each bot did on your pages is the whole game. Start tracking your real AI crawler traffic, or see how it works first.

Frequently asked questions

What are the top AI crawlers in 2026?

The most active AI crawlers in 2026 include GPTBot from OpenAI, ClaudeBot from Anthropic, Meta-ExternalAgent from Meta, GoogleOther from Google, Amazonbot, PetalBot, and PerplexityBot. In 2026 crawl-volume rankings, Googlebot leads, with Meta-ExternalAgent and GPTBot trading the next two spots, ahead of others like ClaudeBot and Bytespider.

How do AI crawlers identify themselves?

Most well-behaved AI crawlers send a dedicated user-agent string, and companies like OpenAI, Anthropic, Google, and Meta publish those strings so you can spot them. Others do not cooperate. Grok spoofs ordinary browser and iPhone user agents, and agentic browsers often arrive with generic browser strings, so a user-agent check alone misses them.

Do AI crawlers respect robots.txt?

Some do and some do not. GPTBot, ClaudeBot, GoogleOther, Amazonbot, PetalBot, and Meta-ExternalAgent honor robots.txt directives. Cloudflare reported in August 2025 that Perplexity kept accessing sites that had blocked it, rotating IP addresses and disguising its user agent, and Grok largely sidesteps the file by not declaring itself. For those, robots.txt is not enough on its own.

What is the difference between a training crawler and an assistant crawler?

A training crawler like GPTBot or ClaudeBot pulls your content to help build a model. An assistant crawler, such as Claude-User or Perplexity-User, fetches your page live to answer one user's question right now. Blocking a training crawler opts you out of model training. Blocking an assistant crawler can remove you from the live answers that engine gives users, which is a different and often costlier trade.

How can I track which AI crawlers visit my site?

Start with your server access logs and filter by known AI user-agent strings. For a clearer picture, the citAEOtion WordPress plugin reads real crawler activity and sorts each bot by purpose, so you can see which pages pull the most training crawlers, whether an assistant is fetching you live, and which scrapers are burning your bandwidth, all from evidence rather than guesswork.
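
For the do-it-yourself route, a minimal Python sketch of that filter might look like the following. It assumes combined-format log lines, splits each line on double quotes instead of using a full parser, and counts hits per bot and page. The token list, the sample lines, and the `hits_by_bot_and_page` name are illustrative assumptions.

```python
from collections import Counter

# Declared bot tokens to look for; extend with whichever bots matter to you.
AI_TOKENS = ("GPTBot", "ClaudeBot", "Meta-ExternalAgent", "PerplexityBot")


def hits_by_bot_and_page(lines):
    """Count requests per (bot token, path) from combined-format log lines."""
    hits = Counter()
    for line in lines:
        # Splitting on quotes yields: prefix, request, status/size,
        # referrer, separator, user agent, remainder.
        parts = line.split('"')
        if len(parts) < 6:
            continue
        request = parts[1].split()
        if len(request) < 2:
            continue
        path, user_agent = request[1], parts[5]
        for token in AI_TOKENS:
            if token in user_agent:
                hits[(token, path)] += 1
                break
    return hits


# Synthetic sample data for illustration only.
SAMPLE_LINES = [
    '203.0.113.7 - - [03/Jun/2026:10:00:00 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.7 - - [03/Jun/2026:10:05:00 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.7 - - [03/Jun/2026:10:06:00 +0000] "GET /blog HTTP/1.1" 200 8100 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.9 - - [03/Jun/2026:10:07:00 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
]

hits = hits_by_bot_and_page(SAMPLE_LINES)
for (bot, path), count in hits.most_common():
    print(f"{bot:20s} {path:12s} {count}")
```

This only sees bots that declare themselves. Traffic that spoofs a browser string will not show up in these counts.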