By several industry estimates, AI bots now account for over 20% of all web traffic. GPTBot, ClaudeBot, PerplexityBot, and dozens of other AI crawlers visit your site daily — for training data collection, real-time search grounding, and content indexing. Here's how to detect, monitor, and manage them.
The most straightforward detection method is User-Agent matching: most legitimate AI crawlers identify themselves via their User-Agent header. Here are the current (February 2026) User-Agent strings for major AI bots:
| BOT | USER-AGENT STRING | PURPOSE |
|---|---|---|
| GPTBot | GPTBot/1.2 | OpenAI model training |
| ChatGPT-User | ChatGPT-User/1.0 | ChatGPT live browsing |
| OAI-SearchBot | OAI-SearchBot/1.0 | OpenAI search grounding |
| ClaudeBot | ClaudeBot/1.0 | Anthropic model training |
| Claude-Web | claude-web | Claude web browsing |
| PerplexityBot | PerplexityBot/1.0 | Perplexity index building |
| Perplexity-User | Perplexity-User | Real-time search fetch |
| Google-Extended | Google-Extended | Gemini AI training |
| Bytespider | Bytespider | ByteDance/TikTok AI |
| Meta-ExternalAgent | meta-externalagent/1.0 | Meta AI training |
| Cohere-AI | cohere-ai | Cohere model training |
| CCBot | CCBot/2.0 | Common Crawl archive |
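The same User-Agent tokens are what you list in robots.txt if you want to opt out of training crawls while still allowing search-oriented fetchers. A sketch (the allow/block split below is one possible policy, not a recommendation — and only well-behaved crawlers honor robots.txt, which is why the detection layers that follow still matter):

```txt
# robots.txt — block training crawlers, allow search/browse fetchers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

# Search-grounding bots left unblocked so content stays citable
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
```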
```nginx
# /etc/nginx/conf.d/ai-bot-detection.conf
map $http_user_agent $is_ai_bot {
    default 0;
    ~*GPTBot 1;
    ~*ChatGPT-User 1;
    ~*OAI-SearchBot 1;
    ~*ClaudeBot 1;
    ~*claude-web 1;
    ~*PerplexityBot 1;
    ~*Bytespider 1;
    ~*Google-Extended 1;
    ~*CCBot 1;
    ~*meta-externalagent 1;
    ~*cohere-ai 1;
    ~*anthropic-ai 1;
}

# Key AI-bot requests by client IP; everyone else gets an empty key (not limited).
map $is_ai_bot $ai_bot_limit_key {
    0 "";
    1 $binary_remote_addr;
}

# limit_req_zone is only valid at http level, so it must live here
# (conf.d files are included in the http context), not inside server {}.
limit_req_zone $ai_bot_limit_key zone=ai_bots:10m rate=10r/s;

server {
    # Log AI bots separately
    access_log /var/log/nginx/ai-bots.log combined if=$is_ai_bot;

    # Optional: rate limit AI bots
    limit_req zone=ai_bots burst=20 nodelay;
}
```

```typescript
// middleware.ts
import { NextRequest, NextResponse } from 'next/server'

const AI_BOT_PATTERNS = [
  /GPTBot/i, /ChatGPT-User/i, /OAI-SearchBot/i,
  /ClaudeBot/i, /claude-web/i, /anthropic-ai/i,
  /PerplexityBot/i, /Perplexity-User/i,
  /Google-Extended/i, /Bytespider/i,
  /CCBot/i, /meta-externalagent/i, /cohere-ai/i,
]

export function middleware(request: NextRequest) {
  const ua = request.headers.get('user-agent') || ''
  const matchedPattern = AI_BOT_PATTERNS.find(p => p.test(ua))

  if (matchedPattern) {
    console.log(`AI Bot detected: ${ua.slice(0, 100)}`)

    // Optional: report to NORAD (fire-and-forget)
    fetch('https://api.clawbotden.com/api/v1/public/norad/ingest', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        bot_type: ua.match(matchedPattern)?.[0] ?? 'Unknown',
        user_agent: ua.slice(0, 500),
        decision: 'allow',
        page_url: request.nextUrl.href,
      }),
    }).catch(() => {})
  }

  return NextResponse.next()
}
```

User-Agent strings can be spoofed. For high-confidence detection, verify the source IP against published crawler IP ranges. Several AI companies publish their crawler IPs:
- **OpenAI** — published at openai.com/gptbot-ranges.txt. OpenAI publishes a JSON file of the IP ranges that GPTBot and ChatGPT-User use. Verify with a reverse DNS lookup — legitimate requests resolve to *.openai.com.
- **Google** — published at developers.google.com/search/docs/crawling-indexing/verifying-googlebot. Reverse DNS must resolve to *.googlebot.com or *.google.com; Google publishes the complete IP ranges as JSON.
- **Bing** — published at bing.com/webmasters/help/how-to-verify-bingbot. Reverse DNS resolves to *.search.msn.com; IP ranges are published via Bing Webmaster Tools.
- **Anthropic** — no published IP ranges. Anthropic does not currently publish ClaudeBot IP ranges, so detection relies on User-Agent matching and behavioral analysis.
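Once a vendor's ranges are fetched, checking an IP against them is plain CIDR matching. A minimal sketch: the helper below takes a list of IPv4 CIDR prefixes (the exact JSON shape and URL vary by vendor, and the sample range used here is illustrative only):

```typescript
// Sketch: check a request IP against a vendor's published IPv4 CIDR ranges.
// The ranges should be fetched from the vendor's published file and
// refreshed periodically; the range in the usage example is illustrative.

function ipToInt(ip: string): number {
  // Convert dotted-quad IPv4 to an unsigned 32-bit integer
  return ip.split('.').reduce(
    (acc, octet) => ((acc << 8) | parseInt(octet, 10)) >>> 0,
    0,
  )
}

function inCidr(ip: string, cidr: string): boolean {
  const [base, bitsStr] = cidr.split('/')
  const bits = parseInt(bitsStr, 10)
  // Mask that keeps the top `bits` bits of the address
  const mask = bits === 0 ? 0 : (~0 << (32 - bits)) >>> 0
  return (ipToInt(ip) & mask) === (ipToInt(base) & mask)
}

function isPublishedCrawlerIp(ip: string, ranges: string[]): boolean {
  return ranges.some(cidr => inCidr(ip, cidr))
}

// Usage:
// isPublishedCrawlerIp('66.249.66.1', ['66.249.64.0/19'])  // in range
// isPublishedCrawlerIp('203.0.113.7', ['66.249.64.0/19'])  // not in range
```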
```bash
#!/bin/bash
# Verify whether an IP belongs to a known AI crawler.
IP="66.249.66.1"

# Reverse DNS lookup
HOST=$(dig -x "$IP" +short)
if [[ -z "$HOST" ]]; then
  echo "⚠️ No PTR record for $IP"
  exit 1
fi
echo "Reverse DNS: $HOST"

# Forward-confirm for every vendor: the PTR record is controlled by whoever
# owns the IP block, so the hostname must resolve back to the original IP.
if ! dig "$HOST" +short | grep -qxF "$IP"; then
  echo "❌ Spoofed: $HOST does not resolve back to $IP"
elif [[ "$HOST" == *"googlebot.com." || "$HOST" == *"google.com." ]]; then
  echo "✅ Verified Googlebot"
elif [[ "$HOST" == *"openai.com." ]]; then
  echo "✅ Verified OpenAI crawler"
elif [[ "$HOST" == *"search.msn.com." ]]; then
  echo "✅ Verified Bingbot"
else
  echo "⚠️ Unknown origin: $HOST"
fi
```

Some AI bots disguise their User-Agent or use headless browsers. Behavioral fingerprinting detects them through JavaScript-based checks:
| SIGNAL | MEANING |
|---|---|
| `navigator.webdriver` | True for automated browsers (Puppeteer, Playwright, Selenium) |
| `window.chrome === undefined` | Missing in headless Chrome environments |
| `navigator.plugins.length === 0` | Real browsers have plugins; headless ones don't |
| No mouse/touch events | Bots don't generate human interaction events |
| Canvas fingerprint anomalies | Headless browsers produce different canvas hashes |
| WebGL renderer = "SwiftShader" | Google's software renderer used in headless Chrome |
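These signals can be combined into a simple client-side score. A minimal sketch with illustrative, untuned weights — the `FingerprintSignals` shape is hypothetical, and in a real page its fields would be populated from `navigator` and from mouse/touch event listeners:

```typescript
// Sketch: combine behavioral signals into a bot-likelihood score.
// Weights and the decision threshold are illustrative assumptions.

interface FingerprintSignals {
  webdriver: boolean        // navigator.webdriver
  hasChromeObject: boolean  // typeof (window as any).chrome !== 'undefined'
  pluginCount: number       // navigator.plugins.length
  sawHumanInput: boolean    // any mouse/touch event observed so far
}

function botScore(s: FingerprintSignals): number {
  let score = 0
  if (s.webdriver) score += 3          // strongest signal: explicit automation flag
  if (!s.hasChromeObject) score += 1   // weak: also true in non-Chrome browsers
  if (s.pluginCount === 0) score += 1  // weak: some privacy setups hide plugins
  if (!s.sawHumanInput) score += 2     // moderate: no interaction after page load
  return score // e.g. treat a score >= 4 as likely automated
}
```

No single check is conclusive on its own (for instance, Firefox also lacks `window.chrome`), which is why the weak signals carry less weight than `navigator.webdriver`.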
Instead of building custom detection, use NORAD's three-layer detection system. Install a single script tag or CMS plugin and get automatic detection of 35+ AI bots with real-time alerting and global analytics.
```html
<script src="https://norad.io/site-trust.js" data-site-id="YOUR_SITE_ID" data-mode="monitor" async></script>
```