GOOGLE-EXTENDED

LOW RISK | 🔍 SEARCH & AI CRAWLER

Google's AI training crawler token for Gemini, separate from Googlebot search indexing

ORGANIZATION
Google
FIRST SEEN
2023-09
RESPECTS ROBOTS.TXT
✓ YES
DOCUMENTATION
developers.google.com

📡 GOOGLE-EXTENDED USER-AGENT STRING

Mozilla/5.0 (compatible; Google-Extended; +https://developers.google.com/search/docs/crawling-indexing/google-common-crawlers)

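This is the User-Agent string listed for Google-Extended. Note that Google's crawler documentation describes Google-Extended as a robots.txt control token without a separate HTTP request User-Agent string, with crawling done by existing Google user agents, so this string may not appear in your server access logs.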

📋 ABOUT GOOGLE-EXTENDED

Google-Extended is a distinct crawler token introduced by Google in September 2023 to give website owners granular control over how their content is used for AI training. While Googlebot crawls for Google Search indexing, Google-Extended specifically collects data for training Google's Gemini AI models and improving other Google AI products.

The critical distinction is that blocking Google-Extended in robots.txt does not affect your site's Google Search visibility. This separation allows website operators to continue appearing in Google Search results while opting out of having their content used to train Google's AI models. Google-Extended shares the same IP infrastructure as Googlebot, so identification relies on the User-Agent string rather than IP ranges.

NORAD.io tracks Google-Extended separately from Googlebot to give site operators clear visibility into AI training crawl activity versus search indexing. This distinction is essential for content licensing decisions and AI data governance policies.

🎯 HOW TO DETECT GOOGLE-EXTENDED

  • The robots.txt token is 'Google-Extended'; check its robots.txt compliance separately from Googlebot's
  • It shares IP ranges with Googlebot, so IP-based detection alone is insufficient
  • The User-Agent string is the key differentiator between the two
  • Requests may appear in server logs alongside regular Googlebot requests from the same IPs
  • Blocking Google-Extended has no impact on Google Search rankings
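
The User-Agent check described above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive detector: it assumes your server writes the common "combined" log format (User-Agent as the last quoted field) and that the token actually appears in that field. The function name `classify_google_agent` and the sample log line are hypothetical.

```python
import re

# Assumption: "combined" log format, where the User-Agent is the
# last double-quoted field on each line.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')

def classify_google_agent(log_line: str) -> str:
    """Return 'google-extended', 'googlebot', or 'other' for one log line."""
    match = UA_FIELD.search(log_line)
    if match is None:
        return "other"
    user_agent = match.group(1).lower()
    # Check the more specific token first.
    if "google-extended" in user_agent:
        return "google-extended"
    if "googlebot" in user_agent:
        return "googlebot"
    return "other"

# Hypothetical log line built from the User-Agent string shown earlier.
sample = (
    '66.249.64.10 - - [01/Jan/2024:00:00:00 +0000] "GET / HTTP/1.1" 200 512 '
    '"-" "Mozilla/5.0 (compatible; Google-Extended; '
    '+https://developers.google.com/search/docs/crawling-indexing/google-common-crawlers)"'
)
print(classify_google_agent(sample))  # google-extended
```

Because User-Agent headers are trivially spoofed, a match here is only a hint and should be paired with a network-level check.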

🌐 GOOGLE-EXTENDED KNOWN IP RANGES

66.249.64.0/19
64.233.160.0/19

Use these CIDR ranges to verify Google-Extended identity at the network level. Always combine with User-Agent verification for accurate detection.
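
As a rough sketch of combining the two checks, the snippet below uses Python's standard `ipaddress` module. It reads the listing above as two /19 ranges and treats them as illustrative rather than exhaustive; the helper names `in_listed_ranges` and `looks_like_google_extended` are hypothetical.

```python
import ipaddress

# The two CIDR ranges from the listing above. Assumption: this list is
# not exhaustive and may change over time.
LISTED_RANGES = [
    ipaddress.ip_network("66.249.64.0/19"),
    ipaddress.ip_network("64.233.160.0/19"),
]

def in_listed_ranges(ip: str) -> bool:
    """True if ip parses as an address inside one of the listed ranges."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return False  # not a valid IP address
    return any(addr in net for net in LISTED_RANGES)

def looks_like_google_extended(ip: str, user_agent: str) -> bool:
    """Require both the network match and the User-Agent token."""
    return in_listed_ranges(ip) and "google-extended" in user_agent.lower()

print(in_listed_ranges("66.249.66.1"))  # True
print(in_listed_ranges("8.8.8.8"))      # False
```

Requiring both conditions keeps a spoofed User-Agent from an unrelated IP, or ordinary Googlebot traffic from a listed IP, from being counted as Google-Extended.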

🔄 CRAWL BEHAVIOR

Google-Extended shares infrastructure with Googlebot but uses a separate User-Agent token for robots.txt control. Blocking Google-Extended does not affect Google Search indexing. Crawl rates are moderate.

PURPOSE

Collects web content specifically for training Google's Gemini AI models and improving Vertex AI products. Separate from search indexing, giving site owners independent control over AI training use.

🤖 ROBOTS.TXT CONFIGURATION

# Block AI training but keep search indexing:
User-agent: Google-Extended
Disallow: /

# This does NOT affect Googlebot search crawling

Google-Extended respects robots.txt directives. Add this to your robots.txt file at the root of your domain.
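
One way to sanity-check the rule before deploying it is Python's standard `urllib.robotparser`. This is only a local sketch: it shows how Python's parser interprets the rule, which may differ in edge cases from Google's own parser, and `example.com` and the helper name `is_allowed` are placeholders.

```python
from urllib.robotparser import RobotFileParser

# The rule from the configuration above.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /
"""

def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and ask whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

url = "https://example.com/article"  # placeholder URL
print(is_allowed(ROBOTS_TXT, "Google-Extended", url))  # False
print(is_allowed(ROBOTS_TXT, "Googlebot", url))        # True
```

The second result reflects the point made above: a group that names only Google-Extended leaves Googlebot unrestricted.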
