GPTBOT

LOW RISK · 🔍 SEARCH & AI CRAWLER

OpenAI's web crawler used for training GPT models and improving AI capabilities

ORGANIZATION: OpenAI
FIRST SEEN: 2023-08
RESPECTS ROBOTS.TXT: ✓ YES
DOCUMENTATION: platform.openai.com

📡 GPTBOT USER-AGENT STRING

Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)

This is the User-Agent header sent by GPTBot in HTTP requests. Use this to identify GPTBot in your server access logs.
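As a minimal sketch of that log check, the snippet below scans access-log lines for the GPTBot token. It assumes the server writes the common "combined" log format; the regular expression, the function name gptbot_hits, and the two sample lines are illustrative and not taken from any real log.

```python
import re
from collections import Counter

# Assumed layout (combined log format):
# ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<ua>[^"]*)"'
)

def gptbot_hits(lines):
    """Count requests per (ip, path) whose User-Agent contains 'GPTBot'."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and "GPTBot" in match.group("ua"):
            hits[(match.group("ip"), match.group("path"))] += 1
    return hits

# Made-up sample lines, for illustration only.
sample = [
    '20.15.240.70 - - [10/Oct/2024:13:55:36 +0000] "GET /blog/post HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; '
    'compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '203.0.113.9 - - [10/Oct/2024:13:55:40 +0000] "GET /style.css HTTP/1.1" '
    '200 900 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

print(gptbot_hits(sample))
```

A User-Agent match alone is easy to spoof, so treat these counts as candidates rather than confirmed GPTBot traffic.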

📋 ABOUT GPTBOT

GPTBot is OpenAI's official web crawler, first publicly documented in August 2023. It systematically crawls publicly accessible web pages to collect training data for OpenAI's large language models, including GPT-4, GPT-4o, and future model generations. GPTBot identifies itself clearly in the User-Agent header and operates from a set of published IP ranges, making it straightforward to identify and control.

Unlike OpenAI's ChatGPT-User bot (which fetches pages in real-time during conversations), GPTBot performs batch crawling operations for training data collection. It respects robots.txt directives, and OpenAI provides clear documentation on how to opt out of crawling. The bot does not execute JavaScript, does not render pages, and focuses on extracting text content from HTML pages. It follows links discovered in sitemaps and page content.

NORAD.io tracks GPTBot activity across its global sensor network, providing real-time visibility into crawl frequency, geographic distribution, and behavioral patterns. Many website operators use NORAD to monitor how aggressively GPTBot crawls their content and to enforce access policies through the Agent Passport Standard.

🎯 HOW TO DETECT GPTBOT

  • Check for 'GPTBot' in the User-Agent header string
  • Verify source IPs against OpenAI's published IP ranges (see the Known IP Ranges section below)
  • GPTBot does not execute JavaScript, so it will fail any JS challenge your bot detection relies on
  • Crawl pattern is typically breadth-first across sitemaps
  • Does not load images, CSS, or other static assets

🌐 GPTBOT KNOWN IP RANGES

20.15.240.64/28
20.15.240.80/28
20.15.240.96/28
20.15.240.176/28
20.15.241.0/28
20.15.242.128/28
20.15.242.144/28
40.83.2.64/28

Use these CIDR ranges to verify GPTBot identity at the network level. Always combine with User-Agent verification for accurate detection.
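One way to combine the two checks is sketched below with Python's standard ipaddress module. The CIDR list is copied from this page and may be out of date; the function names is_gptbot_ip and is_verified_gptbot are hypothetical helpers, not part of any OpenAI or NORAD tooling.

```python
import ipaddress

# Ranges copied from the list on this page. They can change, so refresh
# them from OpenAI's published list before relying on this check.
GPTBOT_RANGES = [ipaddress.ip_network(cidr) for cidr in (
    "20.15.240.64/28",
    "20.15.240.80/28",
    "20.15.240.96/28",
    "20.15.240.176/28",
    "20.15.241.0/28",
    "20.15.242.128/28",
    "20.15.242.144/28",
    "40.83.2.64/28",
)]

def is_gptbot_ip(ip: str) -> bool:
    """Return True if ip parses and falls inside one of the listed ranges."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return False  # not a valid IP address at all
    return any(addr in network for network in GPTBOT_RANGES)

def is_verified_gptbot(ip: str, user_agent: str) -> bool:
    """Require both the GPTBot User-Agent token and a listed source IP."""
    return "GPTBot" in user_agent and is_gptbot_ip(ip)

print(is_verified_gptbot("20.15.240.70", "compatible; GPTBot/1.2"))
print(is_verified_gptbot("203.0.113.9", "compatible; GPTBot/1.2"))
```

Requiring both signals means a request that only claims the GPTBot User-Agent from an unlisted address is rejected.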

🔄 CRAWL BEHAVIOR

GPTBot crawls at a moderate frequency and respects both robots.txt and rate limits. It primarily fetches HTML content and does not execute JavaScript. Revisits to the same page are typically several hours apart.
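To compare this description with what a site actually sees, revisit intervals can be measured from logged requests. The sketch below assumes the GPTBot requests have already been extracted as (path, timestamp) pairs; the function name revisit_gaps_hours and the sample timestamps are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

def revisit_gaps_hours(hits):
    """Map each path to the hours between its consecutive visits."""
    by_path = defaultdict(list)
    for path, timestamp in hits:
        by_path[path].append(timestamp)

    gaps = {}
    for path, stamps in by_path.items():
        stamps.sort()
        # Pair each visit with the next one and convert the gap to hours.
        gaps[path] = [
            (later - earlier).total_seconds() / 3600
            for earlier, later in zip(stamps, stamps[1:])
        ]
    return gaps

# Made-up timestamps, for illustration only.
hits = [
    ("/blog/post", datetime(2024, 1, 1, 0, 0)),
    ("/blog/post", datetime(2024, 1, 1, 6, 0)),
    ("/about", datetime(2024, 1, 1, 1, 0)),
]
print(revisit_gaps_hours(hits))
```

Paths visited only once produce an empty list, so they do not skew any average computed from the gaps.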

PURPOSE

Collects publicly available web content to train and improve OpenAI's GPT language models including GPT-4 and future versions. Data is used for pre-training and fine-tuning.

🤖 ROBOTS.TXT CONFIGURATION

User-agent: GPTBot
Disallow: /private/
Disallow: /api/

# To block completely:
# User-agent: GPTBot
# Disallow: /

GPTBot respects robots.txt directives. Add these rules to the robots.txt file at the root of your domain; the /private/ and /api/ paths are examples to replace with your own.
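Before deploying, the rules can be sanity-checked locally. The sketch below uses Python's standard urllib.robotparser to evaluate the example rules from this page against a few hypothetical paths. It shows how a standards-following parser reads the file and is not a guarantee of how GPTBot itself parses it.

```python
from urllib.robotparser import RobotFileParser

# The example rules from this page, inlined so no network access is needed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /private/
Disallow: /api/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def gptbot_allowed(path: str) -> bool:
    """Return True if the rules above let the GPTBot user agent fetch path."""
    return parser.can_fetch("GPTBot", path)

# Hypothetical paths, for illustration only.
for path in ("/blog/post", "/private/report.html", "/api/v1/users"):
    print(path, gptbot_allowed(path))
```

Because the file names only GPTBot, other crawlers remain unaffected unless they get their own User-agent group.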
