GUIDE · UPDATED FEBRUARY 2026

HOW TO DETECT AI BOTS ON YOUR WEBSITE

AI bots now account for an estimated 5-20% of all web traffic. GPTBot, ClaudeBot, PerplexityBot, and dozens of other AI crawlers visit your site daily for training data collection, real-time search grounding, and content indexing. Here's how to detect, monitor, and manage them.

METHOD 1: USER-AGENT STRING MATCHING

The most straightforward detection method. Most legitimate AI crawlers identify themselves via their User-Agent header. Here are the current (February 2026) User-Agent strings for major AI bots:

BOT                 USER-AGENT STRING        PURPOSE
GPTBot              GPTBot/1.2               OpenAI model training
ChatGPT-User        ChatGPT-User/1.0         ChatGPT live browsing
OAI-SearchBot       OAI-SearchBot/1.0        OpenAI search grounding
ClaudeBot           ClaudeBot/1.0            Anthropic model training
Claude-Web          claude-web               Claude web browsing
PerplexityBot       PerplexityBot/1.0        Perplexity index building
Perplexity-User     Perplexity-User          Real-time search fetch
Google-Extended     Google-Extended          Gemini AI training
Bytespider          Bytespider               ByteDance/TikTok AI
Meta-ExternalAgent  meta-externalagent/1.0   Meta AI training
Cohere-AI           cohere-ai                Cohere model training
CCBot               CCBot/2.0                Common Crawl archive
NGINX DETECTION EXAMPLE
# /etc/nginx/conf.d/ai-bot-detection.conf
# Map AI bot User-Agents to a non-empty value; humans get "" so they are
# never logged to the bot log or rate-limited (an empty key is skipped).
map $http_user_agent $is_ai_bot {
    default         "";
    ~*GPTBot        1;
    ~*ChatGPT-User  1;
    ~*OAI-SearchBot 1;
    ~*ClaudeBot     1;
    ~*claude-web    1;
    ~*PerplexityBot 1;
    ~*Bytespider    1;
    ~*Google-Extended 1;
    ~*CCBot         1;
    ~*meta-externalagent 1;
    ~*cohere-ai     1;
    ~*anthropic-ai  1;
}

# limit_req_zone must be declared at the http level, not inside server {}
limit_req_zone $is_ai_bot zone=ai_bots:10m rate=10r/s;

server {
    # Log AI bots separately; if= skips requests where $is_ai_bot is empty
    access_log /var/log/nginx/ai-bots.log combined if=$is_ai_bot;

    location / {
        # Optional: apply the shared AI-bot rate limit
        limit_req zone=ai_bots burst=20 nodelay;
    }
}
NEXT.JS MIDDLEWARE DETECTION
// middleware.ts
import { NextRequest, NextResponse } from 'next/server'

const AI_BOT_PATTERNS = [
  /GPTBot/i, /ChatGPT-User/i, /OAI-SearchBot/i,
  /ClaudeBot/i, /claude-web/i, /anthropic-ai/i,
  /PerplexityBot/i, /Perplexity-User/i,
  /Google-Extended/i, /Bytespider/i,
  /CCBot/i, /meta-externalagent/i, /cohere-ai/i,
]

export function middleware(request: NextRequest) {
  const ua = request.headers.get('user-agent') || ''
  const matched = AI_BOT_PATTERNS.find(p => p.test(ua))

  if (matched) {
    console.log(`AI bot detected: ${ua.slice(0, 100)}`)

    // Optional: report the detection to NORAD (fire-and-forget)
    fetch('https://api.clawbotden.com/api/v1/public/norad/ingest', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        bot_type: ua.match(matched)?.[0] ?? 'Unknown',
        user_agent: ua.slice(0, 500),
        decision: 'allow',
        page_url: request.nextUrl.href,
      }),
    }).catch(() => {}) // never block the response on reporting failures
  }

  return NextResponse.next()
}

METHOD 2: IP RANGE VERIFICATION

User-Agent strings can be spoofed. For high-confidence detection, verify the source IP against published crawler IP ranges. Several AI companies publish their crawler IPs:

OPENAI (GPTBot)

Published at openai.com/gptbot-ranges.txt

OpenAI publishes a JSON file of IP ranges that GPTBot and ChatGPT-User use. Verify by doing a reverse DNS lookup — legitimate requests resolve to *.openai.com.
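Given a published range list, a request IP can be checked with a plain CIDR match. A minimal TypeScript sketch; the prefixes below are illustrative placeholders, not OpenAI's real ranges, so load the live list from the URL above in production:

```typescript
// Check whether an IPv4 address falls inside any published crawler CIDR range.
function ipv4ToInt(ip: string): number {
  // "52.230.152.10" -> 32-bit unsigned integer
  return ip.split('.').reduce((acc, octet) => ((acc << 8) | parseInt(octet, 10)) >>> 0, 0)
}

function inCidr(ip: string, cidr: string): boolean {
  const [base, bitsStr] = cidr.split('/')
  const bits = parseInt(bitsStr, 10)
  // Network mask: high `bits` bits set (JS shifts are mod 32, so handle /0 specially)
  const mask = bits === 0 ? 0 : (~0 << (32 - bits)) >>> 0
  return ((ipv4ToInt(ip) & mask) >>> 0) === ((ipv4ToInt(base) & mask) >>> 0)
}

// Placeholder prefixes for illustration -- fetch the real list from the
// published ranges file and refresh it periodically.
const GPTBOT_RANGES = ['52.230.152.0/24', '20.15.240.64/28']

function isKnownCrawlerIp(ip: string): boolean {
  return GPTBOT_RANGES.some(cidr => inCidr(ip, cidr))
}

console.log(isKnownCrawlerIp('52.230.152.10')) // true (inside 52.230.152.0/24)
console.log(isKnownCrawlerIp('8.8.8.8'))       // false
```

The same check works for any vendor that publishes ranges; only the source list changes.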

GOOGLE (Googlebot / Google-Extended)

Published at developers.google.com/search/docs/crawling-indexing/verifying-googlebot

Reverse DNS must resolve to *.googlebot.com or *.google.com. Google publishes complete IP ranges as JSON.

MICROSOFT (Bingbot)

Published at bing.com/webmasters/help/how-to-verify-bingbot

Reverse DNS resolves to *.search.msn.com. IP ranges published via Bing Webmaster Tools.

ANTHROPIC (ClaudeBot)

No published IP ranges

Anthropic does not currently publish ClaudeBot IP ranges. Detection relies on User-Agent matching and behavioral analysis.

REVERSE DNS VERIFICATION (BASH)
#!/bin/bash
# Verify whether an IP belongs to a known AI crawler via reverse + forward DNS
IP="${1:-66.249.66.1}"

# Reverse DNS (PTR) lookup; dig returns the hostname with a trailing dot
HOST=$(dig -x "$IP" +short)
echo "Reverse DNS: $HOST"

# Forward-confirm: the claimed hostname must resolve back to the original IP,
# otherwise the PTR record alone can be spoofed
verify_forward() {
    local forward
    forward=$(dig "$HOST" +short | head -n1)
    [[ "$forward" == "$IP" ]]
}

case "$HOST" in
    *.googlebot.com.|*.google.com.)
        verify_forward && echo "✅ Verified Googlebot" || echo "❌ Spoofed Googlebot" ;;
    *.openai.com.)
        verify_forward && echo "✅ Verified OpenAI crawler" || echo "❌ Spoofed OpenAI crawler" ;;
    *.search.msn.com.)
        verify_forward && echo "✅ Verified Bingbot" || echo "❌ Spoofed Bingbot" ;;
    *)
        echo "⚠️ Unknown origin: $HOST" ;;
esac

METHOD 3: BEHAVIORAL FINGERPRINTING

Some AI bots disguise their User-Agent or use headless browsers. Behavioral fingerprinting detects them through JavaScript-based checks:

navigator.webdriver: true for automated browsers (Puppeteer, Playwright, Selenium)

window.chrome === undefined: missing in headless Chrome environments

navigator.plugins.length === 0: real browsers have plugins; headless ones don't

No mouse/touch events: bots don't generate human interaction events

Canvas fingerprint anomalies: headless browsers produce different canvas hashes

WebGL renderer = "SwiftShader": Google's software renderer used in headless Chrome
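Combined, these signals can feed a simple score. A TypeScript sketch (the equal weighting and the 2+ threshold are illustrative choices, and `collectSignals`/`botScore` are names invented here, not an existing API):

```typescript
// Combine headless-browser signals from the checklist above into a score.
interface BotSignals {
  webdriver: boolean    // navigator.webdriver is true in Puppeteer/Playwright/Selenium
  noChrome: boolean     // window.chrome is missing in headless Chrome
  noPlugins: boolean    // navigator.plugins is empty in many headless setups
  swiftShader: boolean  // software WebGL renderer used by headless Chrome
}

// nav/win are typed loosely so the function can also run outside a browser
// in tests; in a real page, pass the actual `navigator` and `window`.
function collectSignals(nav: any, win: any, webglRenderer: string): BotSignals {
  return {
    webdriver: nav.webdriver === true,
    noChrome: typeof win.chrome === 'undefined',
    noPlugins: (nav.plugins?.length ?? 0) === 0,
    swiftShader: webglRenderer.includes('SwiftShader'),
  }
}

function botScore(s: BotSignals): number {
  // One point per positive signal; 2 or more suggests automation
  return [s.webdriver, s.noChrome, s.noPlugins, s.swiftShader].filter(Boolean).length
}
```

In a page you would obtain the renderer string from a WebGL context via gl.getParameter(gl.RENDERER); no single signal is conclusive on its own.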

METHOD 4: AUTOMATED MONITORING WITH NORAD

Instead of building custom detection, use NORAD's three-layer detection system. Install a single script tag or CMS plugin and get automatic detection of 35+ AI bots with real-time alerting and global analytics.

ONE-LINE INSTALLATION

<script src="https://norad.io/site-trust.js" 
  data-site-id="YOUR_SITE_ID" 
  data-mode="monitor" async></script>
35+ AI bots detected · 3 detection layers · <1ms detection latency

FREQUENTLY ASKED QUESTIONS

How many AI bots are currently crawling the web?
As of February 2026, NORAD tracks 35+ distinct AI bots from OpenAI, Anthropic, Google, Meta, ByteDance, Perplexity AI, and others. AI bots account for approximately 5-20% of all website traffic.
Can AI bots spoof their User-Agent?
Yes. Some crawlers use standard browser User-Agent strings. This is why NORAD uses three detection layers: UA matching, IP verification, and behavioral fingerprinting.
Should I block AI crawlers?
Usually no. Blocking AI crawlers removes your content from AI-generated answers (ChatGPT, Perplexity, Claude). Monitor first, then decide. Use NORAD to see exactly who's crawling your site before making blocking decisions.
What's the difference between GPTBot and ChatGPT-User?
GPTBot crawls for model training (autonomous). ChatGPT-User fetches pages in real-time when a user asks ChatGPT to browse a URL (user-triggered). Different purposes, different frequencies.
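That split matters if you decide to block selectively: robots.txt rules target each token separately. A hypothetical robots.txt that opts out of training crawls while still allowing user-triggered fetches (whether a given crawler honors these directives is up to its operator):

```text
# robots.txt
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: ChatGPT-User
Allow: /
```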
How do I detect bots that don't identify themselves?
Behavioral fingerprinting: check navigator.webdriver, plugin count, canvas fingerprint, and interaction events. NORAD's SiteTrust.js automates this.