GPTBOT
LOW RISK · 🔍 SEARCH & AI CRAWLER
OpenAI's web crawler used for training GPT models and improving AI capabilities
📡 GPTBOT USER-AGENT STRING
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.2; +https://openai.com/gptbot
This is the User-Agent header sent by GPTBot in HTTP requests. Use this to identify GPTBot in your server access logs.
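As a starting point, GPTBot visits can be pulled out of a combined-format access log by matching on the final quoted User-Agent field. The sketch below is illustrative (the helper name, regex, and sample log line are not from OpenAI or NORAD.io documentation); adapt the pattern to your own log format.

```python
import re

# Combined log format: client IP is the first field, User-Agent is the
# last double-quoted field on the line.
LOG_RE = re.compile(r'^(?P<ip>\S+).*"(?P<ua>[^"]*)"\s*$')

def gptbot_hits(log_lines):
    """Yield (ip, user_agent) for lines whose User-Agent declares GPTBot."""
    for line in log_lines:
        m = LOG_RE.match(line)
        if m and "GPTBot" in m.group("ua"):
            yield m.group("ip"), m.group("ua")

# Illustrative log line, not a real request.
sample = ('20.15.240.70 - - [01/Jan/2025:00:00:00 +0000] "GET / HTTP/1.1" '
          '200 512 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); '
          'compatible; GPTBot/1.2; +https://openai.com/gptbot"')
print(list(gptbot_hits([sample])))
```

Note that the User-Agent header is trivially spoofable, which is why the IP-range verification described later on this page matters.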
📋 ABOUT GPTBOT
GPTBot is OpenAI's official web crawler, first publicly documented in August 2023. It systematically crawls publicly accessible web pages to collect training data for OpenAI's large language models, including GPT-4, GPT-4o, and future model generations. GPTBot identifies itself clearly in the User-Agent header and operates from a set of published IP ranges, making it straightforward to identify and control.
Unlike OpenAI's ChatGPT-User bot (which fetches pages in real-time during conversations), GPTBot performs batch crawling operations for training data collection. It respects robots.txt directives, and OpenAI provides clear documentation on how to opt out of crawling. The bot does not execute JavaScript, does not render pages, and focuses on extracting text content from HTML pages. It follows links discovered in sitemaps and page content.
NORAD.io tracks GPTBot activity across its global sensor network, providing real-time visibility into crawl frequency, geographic distribution, and behavioral patterns. Many website operators use NORAD to monitor how aggressively GPTBot crawls their content and to enforce access policies through the Agent Passport Standard.
🎯 HOW TO DETECT GPTBOT
- Check for 'GPTBot' in the User-Agent header string
- Verify source IPs against OpenAI's published IP ranges (see the CIDR list below)
- GPTBot does not execute JavaScript, so if your bot detection relies on JS challenges, GPTBot will fail them
- Crawl pattern is typically breadth-first across sitemaps
- Does not load images, CSS, or other static assets
🌐 GPTBOT KNOWN IP RANGES
20.15.240.64/28
20.15.240.80/28
20.15.240.96/28
20.15.240.176/28
20.15.241.0/28
20.15.242.128/28
20.15.242.144/28
40.83.2.64/28

Use these CIDR ranges to verify GPTBot identity at the network level. Always combine with User-Agent verification for accurate detection.
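The combined check can be sketched with Python's stdlib `ipaddress` module. This is a minimal illustration, not NORAD.io's implementation; the function name is made up, and the ranges are hard-coded from the list above, so refresh them against OpenAI's currently published list before relying on them.

```python
import ipaddress

# CIDR ranges copied from the list above; OpenAI may add or retire
# ranges, so treat this as a snapshot, not a source of truth.
GPTBOT_RANGES = [ipaddress.ip_network(c) for c in (
    "20.15.240.64/28", "20.15.240.80/28", "20.15.240.96/28",
    "20.15.240.176/28", "20.15.241.0/28", "20.15.242.128/28",
    "20.15.242.144/28", "40.83.2.64/28",
)]

def is_verified_gptbot(user_agent: str, remote_ip: str) -> bool:
    """Require both the declared identity and a source IP in a published range."""
    if "GPTBot" not in user_agent:
        return False
    addr = ipaddress.ip_address(remote_ip)
    return any(addr in net for net in GPTBOT_RANGES)
```

A request claiming to be GPTBot from an IP outside these ranges (e.g. a spoofed User-Agent) fails the check, while a genuine hit from 20.15.240.64/28 passes.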
🔄 CRAWL BEHAVIOR
Crawls pages at moderate frequency. Respects robots.txt and rate limits. Fetches HTML content primarily. Does not execute JavaScript. Typical crawl intervals of several hours between revisits.
Collects publicly available web content to train and improve OpenAI's GPT language models including GPT-4 and future versions. Data is used for pre-training and fine-tuning.
🤖 ROBOTS.TXT CONFIGURATION
User-agent: GPTBot
Disallow: /private/
Disallow: /api/

# To block completely:
# User-agent: GPTBot
# Disallow: /
GPTBot respects robots.txt directives. Add this to your robots.txt file at the root of your domain.
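Before deploying, you can sanity-check that the rules above do what you expect with Python's stdlib `urllib.robotparser` (a quick verification sketch; the example URLs are hypothetical).

```python
from urllib import robotparser

# The same rules shown in the robots.txt example above.
rules = [
    "User-agent: GPTBot",
    "Disallow: /private/",
    "Disallow: /api/",
]
rp = robotparser.RobotFileParser()
rp.parse(rules)

# /private/ is blocked for GPTBot; everything else remains crawlable.
print(rp.can_fetch("GPTBot", "https://example.com/private/report"))  # False
print(rp.can_fetch("GPTBot", "https://example.com/blog/post"))       # True
```

The same check against "User-agent: GPTBot / Disallow: /" (the full-block variant) would return False for every path.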
⚠️ RELATED THREATS
Prompt Injection: Attempts to override bot instructions via malicious content embedded in web pages
Data Exfiltration: Bots attempting to extract sensitive data from websites, including PII and credentials
Credential Stuffing: Automated login attempts using leaked credentials from data breaches
Aggressive Content Scraping: Bots scraping content beyond robots.txt limits and terms of service
PROTECT YOUR WEBSITE
Deploy SiteTrust to monitor and control AI bot access to your site with the Agent Passport Standard.