SCRAPY
MEDIUM RISK · 📊 SEO & DATA SCRAPER
Python web scraping framework — the most popular open-source scraping tool
📡 SCRAPY USER-AGENT STRING
Scrapy/2.11 (+https://scrapy.org)
This is the User-Agent header sent by Scrapy in HTTP requests. Use this to identify Scrapy in your server access logs.
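A quick way to check access logs is to match the default prefix. This is a minimal sketch (the function name is illustrative); keep in mind that operators can change the User-Agent, so the absence of this prefix proves nothing.

```python
def is_default_scrapy_ua(user_agent: str) -> bool:
    """True if a User-Agent string matches Scrapy's out-of-the-box format."""
    return user_agent.startswith("Scrapy/")

print(is_default_scrapy_ua("Scrapy/2.11 (+https://scrapy.org)"))          # True
print(is_default_scrapy_ua("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))  # False
```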
📋 ABOUT SCRAPY
Scrapy is the most widely-used open-source web scraping framework, written in Python. It provides a complete toolkit for extracting structured data from websites, including built-in support for following links, handling pagination, managing request queues, and exporting data in various formats. Scrapy is used by thousands of companies and individuals for legitimate data collection.
By default, Scrapy identifies itself in the User-Agent header and respects robots.txt directives. However, these settings are easily configurable, and many Scrapy operators customize the User-Agent to mimic regular browsers or other bots. Scrapy does not render JavaScript by default, though integration with Splash or Playwright can add rendering capabilities.
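Both behaviors are controlled by a handful of options in a Scrapy project's settings.py. The settings below are real Scrapy settings; the values shown illustrate how an operator would spoof a browser and stop honoring robots.txt:

```python
# settings.py — identification-related Scrapy settings.
# Out of the box, Scrapy sends "Scrapy/<version> (+https://scrapy.org)", and
# projects generated by `scrapy startproject` set ROBOTSTXT_OBEY = True.

# An operator spoofing a browser only needs two lines:
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
ROBOTSTXT_OBEY = False  # stop honoring robots.txt entirely
```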
NORAD.io monitors Scrapy-based crawling activity and classifies it as medium risk due to the ease with which Scrapy can be configured for aggressive scraping. NORAD's behavioral analysis can identify Scrapy-like crawl patterns even when the User-Agent is spoofed, using signals like request timing, crawl depth patterns, and content access sequences.
🎯 HOW TO DETECT SCRAPY
- Default User-Agent starts with 'Scrapy/' — easily changed by operators
- Respects robots.txt by default, but this is configurable
- Look for systematic crawl patterns: depth-first or breadth-first traversal across site sections
- Default of 16 concurrent requests can create noticeable traffic patterns
- Often runs from cloud servers (AWS, DigitalOcean, etc.)
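When the User-Agent is spoofed, timing heuristics like those described above can still help. Below is a minimal sliding-window sketch; the class name and thresholds are illustrative, not NORAD's actual model.

```python
from collections import deque

class BurstDetector:
    """Flag a client whose request rate within a time window exceeds a threshold."""

    def __init__(self, window_seconds: float = 10.0, max_requests: int = 50):
        self.window = window_seconds
        self.max_requests = max_requests
        self._times: deque = deque()

    def record(self, ts: float) -> bool:
        """Record one request at timestamp ts; return True if the client looks automated."""
        self._times.append(ts)
        # Drop timestamps that have fallen out of the window
        while self._times and ts - self._times[0] > self.window:
            self._times.popleft()
        return len(self._times) > self.max_requests
```

In practice you would keep one detector per client IP (or per session) and combine its verdict with other signals, since a single threshold also catches bursty human traffic.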
🔄 CRAWL BEHAVIOR
Highly configurable crawl behavior. The default configuration is relatively polite, and the optional AutoThrottle extension (disabled by default) adapts request rates to server load; operators can just as easily configure aggressive rates. Supports concurrent requests, crawl depth limits, and custom middleware. Does not render JavaScript by default.
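The crawl-rate knobs mentioned above live in settings.py as well. All of the settings below are real Scrapy settings; the values are illustrative:

```python
# settings.py — crawl-rate controls (values illustrative)
CONCURRENT_REQUESTS = 16               # framework default
DOWNLOAD_DELAY = 0.5                   # seconds between requests to the same site
DEPTH_LIMIT = 5                        # 0 (the default) means no depth limit
AUTOTHROTTLE_ENABLED = True            # AutoThrottle is off by default
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per remote site
```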
General-purpose web scraping framework used for data extraction, research, price monitoring, content aggregation, and competitive intelligence. Used by individuals, companies, and researchers.
🤖 ROBOTS.TXT CONFIGURATION
User-agent: Scrapy
Disallow: /

# Scrapy respects the ROBOTSTXT_OBEY setting (default: True)
# But operators can disable this in their configuration
Scrapy respects robots.txt directives. Add this to your robots.txt file at the root of your domain.
⚠️ RELATED THREATS
- Attempts to override bot instructions via malicious content embedded in web pages
- Data Exfiltration: Bots attempting to extract sensitive data from websites, including PII and credentials
- Credential Stuffing: Automated login attempts using leaked credentials from data breaches
- Aggressive Content Scraping: Bots aggressively scraping content beyond robots.txt limits and ToS
PROTECT YOUR WEBSITE
Deploy SiteTrust to monitor and control AI bot access to your site with the Agent Passport Standard.
INSTALL SITETRUST →