CCBOT
LOW RISK · 🔍 SEARCH & AI CRAWLER
Common Crawl's open web archival bot, source of the largest open dataset of web content
📡 CCBOT USER-AGENT STRING
CCBot/2.0 (https://commoncrawl.org/faq/)
This is the User-Agent header sent by CCBot in HTTP requests. Use this to identify CCBot in your server access logs.
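As an illustration, the following Python sketch counts CCBot requests in an access log. The log path and combined log format are assumptions; adjust both for your server.

import re

# Minimal sketch: count CCBot hits in a combined-format access log.
# The path below is an assumption; point it at your own log file.
LOG_PATH = "/var/log/nginx/access.log"
CCBOT = re.compile(r"CCBot/\d")

hits = 0
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        if CCBOT.search(line):
            hits += 1

print(f"CCBot requests seen: {hits}")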
📋 ABOUT CCBOT
CCBot is the web crawler operated by Common Crawl, a non-profit organization that maintains the largest open repository of web crawl data. Since 2011, Common Crawl has been archiving the web and making the data freely available to researchers, companies, and developers. The Common Crawl corpus contains petabytes of web data spanning billions of pages.
Common Crawl's dataset has become foundational to modern AI. Many of the most prominent large language models — including GPT, Claude, LLaMA, and others — were trained in part on Common Crawl data. This makes CCBot's crawling decisions particularly significant: content crawled by CCBot may end up in the training data of multiple AI systems simultaneously.
NORAD.io tracks CCBot activity to help site operators understand their content's presence in the Common Crawl archive and, by extension, in AI training datasets. Blocking CCBot is one of the most effective single actions a site operator can take to reduce their content's use across multiple AI training pipelines.
🎯 HOW TO DETECT CCBOT
- User-Agent starts with 'CCBot/2.0' (a request-time check is sketched after this list)
- Crawl patterns are periodic: large batches rather than continuous crawling
- Requests come from AWS IP ranges (Common Crawl runs on AWS)
- Does not execute JavaScript or load external resources
- Crawl data is publicly available at commoncrawl.org
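A minimal request-time check might combine the first two signals above. This is a sketch under assumptions: the User-Agent prefix comes from Common Crawl's published string, but the CIDR below is a placeholder; a real check should load Amazon's published ranges from https://ip-ranges.amazonaws.com/ip-ranges.json.

import ipaddress

# Placeholder CIDR for illustration only; not an authoritative AWS list.
AWS_RANGES = [ipaddress.ip_network("52.0.0.0/8")]

def ip_in_aws(ip: str) -> bool:
    """True if the client IP falls inside one of the configured AWS ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in AWS_RANGES)

def looks_like_ccbot(user_agent: str, client_ip: str) -> bool:
    # The UA prefix is the strong signal; the origin network is corroboration.
    return user_agent.startswith("CCBot/") and ip_in_aws(client_ip)

print(looks_like_ccbot("CCBot/2.0 (https://commoncrawl.org/faq/)", "52.12.34.56"))  # True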
🔄 CRAWL BEHAVIOR
CCBot runs large-scale, periodic crawl campaigns, fetching billions of pages per month in batch operations rather than crawling continuously. It crawls at respectful rates, complies with robots.txt, and does not execute JavaScript.
The crawl builds an open, freely available archive of the web. Researchers, startups, and major AI companies use this archive as a foundational dataset, including for training LLMs such as GPT, Claude, and LLaMA.
🤖 ROBOTS.TXT CONFIGURATION
# Common Crawl data is used to train many AI models.
# Blocking CCBot reduces your content's presence in AI training sets.
User-agent: CCBot
Disallow: /
CCBot respects robots.txt directives. Add this to your robots.txt file at the root of your domain.
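To sanity-check the rule, Python's standard-library robotparser can evaluate a robots.txt against CCBot's user-agent token. A minimal sketch, with example.com standing in for your domain:

from urllib.robotparser import RobotFileParser

# Parse the rule locally; set_url() plus read() would fetch a live file instead.
rules = """
User-agent: CCBot
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Prints False: CCBot is barred from the whole site under this rule.
print(parser.can_fetch("CCBot", "https://example.com/any/page"))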
⚠️ RELATED THREATS
- Prompt Injection: Attempts to override bot instructions via malicious content embedded in web pages
- Data Exfiltration: Bots attempting to extract sensitive data from websites, including PII and credentials
- Credential Stuffing: Automated login attempts using leaked credentials from data breaches
- Aggressive Content Scraping: Bots scraping content beyond robots.txt limits and terms of service
PROTECT YOUR WEBSITE
Deploy SiteTrust to monitor and control AI bot access to your site with the Agent Passport Standard.
INSTALL SITETRUST →