CCBOT
LOW RISK · 🔍 SEARCH & AI CRAWLER
Common Crawl's open web archival bot, source of the largest open dataset of web content
📡 CCBOT USER-AGENT STRING
CCBot/2.0 (https://commoncrawl.org/faq/)
This is the User-Agent header sent by CCBot in HTTP requests. Use this to identify CCBot in your server access logs.
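As an illustration, the following Python sketch counts CCBot requests in an access log. The log path and combined log format are assumptions; adjust both for your server.

import re

# Minimal sketch: count CCBot hits in a combined-format access log.
# The path below is an assumption; point it at your own log file.
LOG_PATH = "/var/log/nginx/access.log"
CCBOT = re.compile(r"CCBot/\d")

hits = 0
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        if CCBOT.search(line):
            hits += 1

print(f"CCBot requests seen: {hits}")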
📋 ABOUT CCBOT
CCBot is the web crawler operated by Common Crawl, a non-profit organization that maintains the largest open repository of web crawl data. Since 2011, Common Crawl has been archiving the web and making the data freely available to researchers, companies, and developers. The Common Crawl corpus contains petabytes of web data spanning billions of pages.
Common Crawl's dataset has become foundational to modern AI. Many of the most prominent large language models — including GPT, Claude, LLaMA, and others — were trained in part on Common Crawl data. This makes CCBot's crawling decisions particularly significant: content crawled by CCBot may end up in the training data of multiple AI systems simultaneously.
NORAD.io tracks CCBot activity to help site operators understand their content's presence in the Common Crawl archive and, by extension, in AI training datasets. Blocking CCBot is one of the most effective single actions a site operator can take to reduce their content's use across multiple AI training pipelines.
🎯 HOW TO DETECT CCBOT
- User-Agent starts with 'CCBot/2.0' (a request-time check is sketched after this list)
- Crawl patterns are periodic: large batches rather than continuous crawling
- Requests come from AWS IP ranges (Common Crawl runs on AWS)
- Does not execute JavaScript or load external resources
- Crawl data is publicly available at commoncrawl.org
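A minimal request-time check might combine the first two signals above. This is a sketch under assumptions: the User-Agent prefix comes from Common Crawl's published string, but the CIDR below is a placeholder; a real check should load Amazon's published ranges from https://ip-ranges.amazonaws.com/ip-ranges.json.

import ipaddress

# Placeholder CIDR for illustration only; not an authoritative AWS list.
AWS_RANGES = [ipaddress.ip_network("52.0.0.0/8")]

def ip_in_aws(ip: str) -> bool:
    """True if the client IP falls inside one of the configured AWS ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in AWS_RANGES)

def looks_like_ccbot(user_agent: str, client_ip: str) -> bool:
    # The UA prefix is the strong signal; the origin network is corroboration.
    return user_agent.startswith("CCBot/") and ip_in_aws(client_ip)

print(looks_like_ccbot("CCBot/2.0 (https://commoncrawl.org/faq/)", "52.12.34.56"))  # True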
🔄 CRAWL BEHAVIOR
CCBot runs large-scale, periodic crawl campaigns, fetching billions of pages per month in batch operations rather than crawling continuously. It crawls at respectful rates, complies with robots.txt, and does not execute JavaScript.
The crawl builds an open, freely available archive of the web. Researchers, startups, and major AI companies use this archive as a foundational dataset, including for training LLMs such as GPT, Claude, and LLaMA.
🤖 ROBOTS.TXT CONFIGURATION
# Common Crawl data is used to train many AI models.
# Blocking CCBot reduces your content's presence in AI training sets.
User-agent: CCBot
Disallow: /
CCBot respects robots.txt directives. Add this to your robots.txt file at the root of your domain.
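To sanity-check the rule, Python's standard-library robotparser can evaluate a robots.txt against CCBot's user-agent token. A minimal sketch, with example.com standing in for your domain:

from urllib.robotparser import RobotFileParser

# Parse the rule locally; set_url() plus read() would fetch a live file instead.
rules = """
User-agent: CCBot
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Prints False: CCBot is barred from the whole site under this rule.
print(parser.can_fetch("CCBot", "https://example.com/any/page"))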
⚠️ RELATED THREATS
- Prompt Injection: Attempts to override bot instructions via malicious content embedded in web pages
- Data Exfiltration: Bots attempting to extract sensitive data from websites, including PII and credentials
- Credential Stuffing: Automated login attempts using leaked credentials from data breaches
- Aggressive Content Scraping: Bots scraping content beyond robots.txt limits and terms of service
PROTECT YOUR WEBSITE
Deploy SiteTrust to monitor and control AI bot access to your site with the Agent Passport Standard.
INSTALL SITETRUST →