By Mihai Maxim. Last updated on May 13, 2026. 15 min read.

Web Scraping with Scrapy: 2026 Playbook

TL;DR: This is an opinionated, end-to-end guide to web scraping with Scrapy in 2026. You will install Scrapy, prototype selectors in the shell, build a multi-page e-commerce spider, clean items with Item Loaders, persist to a database, harden settings against bans, and bolt on Scrapy-Playwright for JavaScript-rendered pages.

Scrapy has been the backbone of serious Python crawling for over a decade, and despite a wave of newer async libraries it still earns its keep. If you are doing web scraping with Scrapy today, you get an opinionated framework that solves the boring parts (request scheduling, deduplication, retries, item pipelines) so you can focus on the parts that actually break: selectors, anti-bot, and storage.

This guide is structured around the request and response lifecycle rather than a chronological build-up. Every section maps to a Scrapy component you will touch in production, from the engine and downloader middlewares down to Item Loaders and feed exports. We use a single target throughout, the public practice site books.toscrape.com, so every code block fits into one mental model.

By the end you will have a runnable spider that paginates a catalog, validates and cleans items, writes to both JSON Lines and SQLite, retries on 429 storms, and falls back to a real browser when a page needs JavaScript. We will also flag the parts of the framework that newcomers consistently misuse, with copyable fixes.

Why Scrapy Still Owns Production Scraping in 2026

It is tempting to reach for httpx plus selectolax and call it a day. For a one-off script that is the right move. For a crawler that has to run nightly, deduplicate URLs, survive a partial outage, and write to two destinations, you want a framework. Scrapy remains the industry standard for large-scale data extraction at the time of writing, and the reason is simple: it ships with the scheduler, dupe filter, retry middleware, throttle, signals, and feed exports already wired together.

Compared with stitching together requests and BeautifulSoup, Scrapy is opinionated in a useful way. It runs on Twisted's event loop, so a single process can fan out hundreds of concurrent requests without the cognitive overhead of async/await. You do not write the crawl loop. You declare the entry URLs and the parsing logic, and the engine handles the queue. That contract is what makes Scrapy worth the steeper learning curve.

How Web Scraping with Scrapy Works: The Request and Response Lifecycle

Before you write a spider, internalize the lifecycle. A Scrapy run looks like this:

  1. The engine pulls a Request off the scheduler.
  2. The request passes through the downloader middlewares (in priority order). This is where headers get set, cookies attach, proxies rotate, and retries fire.
  3. The downloader issues the HTTP call and returns a Response.
  4. The response goes back through the downloader middlewares on the way in, then through the spider middlewares, then into your spider's callback (usually parse).
  5. Your callback yields either more Request objects (which go back to the scheduler) or Items (which flow into the item pipelines).
  6. Pipelines validate, transform, drop, or persist each item.
  7. Anything that survives is handed to the feed exporter, which writes to disk, S3, or stdout.

Two terms you will see in callbacks: callback is the function Scrapy runs when a request succeeds, and errback is the function it runs when a request fails. Spiders are typically written as Python generators, yielding requests and items lazily so the engine can interleave work.

Knowing this loop is the difference between "my spider works" and "my spider scales". When pages come back empty, the answer is almost always in the downloader middleware layer. When items disappear, the answer is in a pipeline. When pagination dies, it is your callback. Map the symptom to the stage, then fix the right component.
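
To make the loop concrete, here is a toy model of it that uses only the standard library. It is not Scrapy's implementation: crawl, fake_download, drop_book_b, and the in-memory PAGES site are all invented for illustration. It only shows how a generator callback can yield a mix of follow-up requests and items, and how each one is routed.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (titles on the page, next URL or None).
PAGES = {
    "/page-1": (["Book A", "Book B"], "/page-2"),
    "/page-2": (["Book C"], None),
}

def fake_download(url):
    """Stand-in for the downloader: returns a dict instead of a real Response."""
    titles, next_url = PAGES[url]
    return {"url": url, "titles": titles, "next": next_url}

def parse(response):
    """Generator callback: yields items (dicts) and follow-up requests (strings)."""
    for title in response["titles"]:
        yield {"title": title}
    if response["next"]:
        yield response["next"]

def crawl(start_url, callback, pipelines):
    scheduler = deque([start_url])
    seen = {start_url}   # plays the role of the dupe filter
    exported = []        # plays the role of the feed exporter
    while scheduler:
        response = fake_download(scheduler.popleft())
        for result in callback(response):
            if isinstance(result, str):      # a new request goes back to the queue
                if result not in seen:
                    seen.add(result)
                    scheduler.append(result)
            else:                            # an item flows through the pipelines
                for step in pipelines:
                    result = step(result)
                    if result is None:       # a pipeline dropped the item
                        break
                else:
                    exported.append(result)
    return exported

def drop_book_b(item):
    """Toy pipeline step: returning None drops the item."""
    return None if item["title"] == "Book B" else item

items = crawl("/page-1", parse, [drop_book_b])
print(items)
```

In this model a missing item can only come from a pipeline step and a missing page can only come from the callback not yielding its URL. That is the same symptom-to-stage mapping described above.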

A more detailed walkthrough lives in the official Scrapy architecture documentation, which is worth bookmarking.

Installing Scrapy and Bootstrapping a Project

Scrapy targets modern Python 3 (check the official installation guide for the minimum version at the time you install). The docs strongly recommend a dedicated virtual environment so Scrapy's pinned dependencies do not collide with system packages.

python -m venv .venv
source .venv/bin/activate     # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install scrapy
scrapy version

Once scrapy version prints a version string, scaffold a project:

scrapy startproject bookstore
cd bookstore

You now have a project tree that looks the same in every Scrapy codebase on Earth, which is exactly the point. Every time you onboard onto a new Scrapy repo, you already know where the spiders live, where the settings sit, and which file owns the pipelines. That repeatability is half the value of using a framework in the first place. Resist the urge to flatten the layout: downstream tools like scrapyd and scrapy crawl rely on it.

Inside a Scrapy Project: What Each File Does

scrapy startproject produces five files and a folder that you will touch every day.

  • scrapy.cfg is the top-level project config. It names the project and tells Scrapy, and deployment tools such as scrapyd, where to find the settings module.
  • items.py is the schema layer. You define Product, Article, or whatever classes you want, each inheriting from scrapy.Item. Treat this like a dataclass for scraped output.
  • pipelines.py is where extracted items get cleaned, validated, dropped, or written to a database. Each pipeline is a plain class with a process_item method.
  • middlewares.py holds downloader and spider middlewares. This is the file where you rotate user agents, attach proxies, or route requests through a managed scraping API.
  • settings.py is the central configuration object: concurrency, throttling, retries, pipelines, middlewares, and feed exports all live here.
  • spiders/ is the folder where individual spider files live. One spider per target site is a healthy default.

Prototyping Selectors in the Scrapy Shell

The Scrapy shell is the secret weapon nobody mentions enough. Before you write a single line of spider code, open the shell against a real URL and iterate on selectors interactively. It saves hours.

scrapy shell "https://books.toscrape.com/catalogue/page-1.html"

Inside the shell you get a live response object pre-loaded with the page. Three commands matter:

  • fetch("https://example.com") swaps in a new response without leaving the shell.
  • view(response) opens the downloaded HTML in your default browser, which is how you confirm you are working with the same DOM the spider sees, not the rendered one your browser would normally show.
  • response.css(...) and response.xpath(...) let you test selectors against the live response.

Try this against the practice site:

>>> response.css("article.product_pod h3 a::attr(title)").getall()[:3]
['A Light in the Attic', 'Tipping the Velvet', 'Soumission']
>>> response.xpath("//article[@class='product_pod']//p[@class='price_color']/text()").get()
'£51.77'

Iterate until both selectors return clean data. Only then move the expression into your spider. The cost of debugging a broken XPath inside a 5-minute crawl is much higher than the cost of one shell session.

Writing Your First Spider for Web Scraping with Scrapy

Generate a spider stub against your target domain:

scrapy genspider books books.toscrape.com

That creates spiders/books.py. Replace its contents with the spider below. It scrapes the catalog landing page, extracts each book's title, price, and rating, then yields a Python dict per book. We will upgrade it to real Items in a later section.

import scrapy

class BooksSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/catalogue/page-1.html"]

    def parse(self, response):
        for card in response.css("article.product_pod"):
            yield {
                "title": card.css("h3 a::attr(title)").get(),
                "price": card.css("p.price_color::text").get(),
                "rating": card.css("p.star-rating::attr(class)").get(),
                "url": response.urljoin(card.css("h3 a::attr(href)").get()),
            }

Run it from the project root:

scrapy crawl books -o books.jsonl

You should see Scrapy log a request to page 1, twenty items scraped, then a clean shutdown. Open books.jsonl and confirm one JSON object per line.

A few things to notice. start_urls is the entry point: the engine schedules each URL automatically. parse is the default callback. response.urljoin resolves a relative href against the current page so you do not end up with broken links. The rating field still contains noise like "star-rating Three", which is exactly the kind of cleanup Item Loaders will handle later.

Production note: running with -o is fine for a quick test, but never depend on it in a scheduled job. Configure the FEEDS setting in settings.py instead so the output destination, format, and overwrite behavior are version-controlled. We will wire that up alongside a database pipeline in the persistence section. Treat the CLI flag as a development shortcut, not a deployment artifact.

CSS vs XPath: Picking Selectors That Do Not Break

Both selector engines ship with Scrapy and both run against the same parsed tree. Use whichever is shorter and clearer for the job. As a rule of thumb, CSS wins for class-based and structural queries, XPath wins when you need to walk the tree by text content, by sibling, or by ancestor.

# CSS: short, idiomatic, fast to write
response.css("article.product_pod p.price_color::text").get()

# XPath equivalent
response.xpath("//article[@class='product_pod']//p[@class='price_color']/text()").get()

XPath earns its place when CSS cannot express what you need:

# "Find the <td> that follows the <th> whose text is 'Stock'"
response.xpath("//th[normalize-space()='Stock']/following-sibling::td/text()").get()

# "Find all links whose visible text contains 'Next'"
response.xpath("//a[contains(., 'Next')]/@href").getall()

A few habits that keep selectors stable: prefer attribute selectors over fragile positional ones (nth-child(3) will eventually break), normalize whitespace when you compare text (normalize-space()), and use .get() for a single match and .getall() for a list. Never index into the result of .getall() blindly, because an empty match raises IndexError. For a deeper comparison of when each engine is the right pick, our XPath vs CSS selectors guide is a good companion read.

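Production note: when a selector works in your browser's DevTools but returns None in the shell or the spider, the page was probably JavaScript-rendered. Confirm with view(response) before blaming the selector. If it works in the shell but not during the crawl, compare the two responses: the shell and the spider fetch the same raw HTML, so the spider is likely receiving a different page, such as a block or consent page.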

Items and Item Loaders: Reusable Cleaning Patterns

Yielding plain dicts is fine for ten lines of code. At scale you want a typed schema so a typo in a field name fails fast instead of silently producing junk rows. Define an Item in items.py:

import scrapy
from itemloaders.processors import MapCompose, TakeFirst, Join

def to_float(value):
    return float(value.replace("£", "").replace("$", "").strip())

def normalize_rating(value):
    # "star-rating Three" -> "Three"
    parts = value.split()
    return parts[1] if len(parts) > 1 else value

class ProductItem(scrapy.Item):
    title = scrapy.Field(input_processor=MapCompose(str.strip), output_processor=TakeFirst())
    price = scrapy.Field(input_processor=MapCompose(str.strip, to_float), output_processor=TakeFirst())
    rating = scrapy.Field(input_processor=MapCompose(normalize_rating), output_processor=TakeFirst())
    description = scrapy.Field(input_processor=MapCompose(str.strip), output_processor=Join(" "))

MapCompose chains transformers, TakeFirst collapses a list of matches to a single value, and Join glues multiple paragraphs into one. Use a loader in the spider so the spider stays readable:

from scrapy.loader import ItemLoader
from bookstore.items import ProductItem

def parse(self, response):
    for card in response.css("article.product_pod"):
        loader = ItemLoader(item=ProductItem(), selector=card)
        loader.add_css("title", "h3 a::attr(title)")
        loader.add_css("price", "p.price_color::text")
        loader.add_css("rating", "p.star-rating::attr(class)")
        yield loader.load_item()

The win is reuse. Once to_float lives in items.py, every price-bearing item on every spider can call it. Cleaning logic stops being copy-pasted across callbacks.
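
Because the processors are plain functions, they can be checked without running a crawl. The sketch below re-declares the two cleaners from items.py and adds a hypothetical rating_to_int helper, which is not part of the item defined above. It maps the rating word to a number and assumes ratings are always the words One through Five.

```python
def to_float(value):
    """Strip currency symbols and whitespace, then parse as float."""
    return float(value.replace("£", "").replace("$", "").strip())

def normalize_rating(value):
    """ "star-rating Three" -> "Three"; anything else is returned unchanged."""
    parts = value.split()
    return parts[1] if len(parts) > 1 else value

# Hypothetical extension: convert the rating word to an integer.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def rating_to_int(value):
    """Return 1-5 for a known rating word, or None for anything else."""
    return RATING_WORDS.get(normalize_rating(value))

print(to_float(" £51.77 "))
print(normalize_rating("star-rating Three"))
print(rating_to_int("star-rating Three"))
```

Checks like these are cheap to keep in a test file next to items.py, so a change to a cleaner is caught before a crawl produces bad rows.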

Following Pagination and Crawling Multiple Pages

There are two idiomatic ways to crawl multiple pages in Scrapy. Pick based on how predictable the link structure is.

Manual pagination is the right choice when there is a single "next" link to follow. Add this at the end of parse:

next_page = response.css("li.next a::attr(href)").get()
if next_page:
    yield response.follow(next_page, callback=self.parse)

response.follow handles relative URLs and re-uses the same callback, which is exactly what catalog-style pagination needs. The crawl stops naturally when the "next" link disappears on the final page.
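
The relative-URL handling is ordinary URL joining. As a rough illustration using only the standard library (this mirrors the resolution rule, it does not call Scrapy):

```python
from urllib.parse import urljoin

current = "https://books.toscrape.com/catalogue/page-1.html"

# A relative href resolves against the directory of the current page.
next_url = urljoin(current, "page-2.html")
print(next_url)

# A root-relative href replaces the whole path.
root_url = urljoin(current, "/index.html")
print(root_url)
```

This is why the same "next" selector keeps working on every catalog page: the href only has to be valid relative to the page it was found on.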

CrawlSpider is the right choice when you want to sweep an entire site by matching URL patterns. It uses Rule and LinkExtractor to discover and follow links automatically:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BooksCrawl(CrawlSpider):
    name = "books_crawl"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/"]
    rules = (
        Rule(LinkExtractor(restrict_css=".pager a")),  # follow pagination
        Rule(LinkExtractor(restrict_css="h3 a"), callback="parse_book"),
    )

    def parse_book(self, response):
        yield {
            "title": response.css("h1::text").get(),
            "price": response.css("p.price_color::text").get(),
        }

Scrapy's built-in RFPDupeFilter ensures the same URL is not enqueued twice, so you do not need to track visited links yourself. Set DEPTH_LIMIT in settings.py when you crawl a deep site and want a hard stop.

Production note: for sitemap-friendly sites, SitemapSpider is even simpler. It reads /sitemap.xml directly and lets you filter URL patterns with sitemap_rules.

Persisting Results: FEEDS and a Database Pipeline

Web scraping with Scrapy gives you two persistence layers, and you usually want both. The FEEDS setting handles structured exports for free, while a pipeline owns custom destinations like a relational database.

Configure feeds in settings.py. Check the Scrapy feed exports docs for the current syntax, but a modern config looks roughly like this:

FEEDS = {
    "data/books.jsonl": {
        "format": "jsonlines",
        "encoding": "utf-8",
        "overwrite": True,
    },
    "data/books.csv.gz": {
        "format": "csv",
        "postprocessing": ["scrapy.extensions.postprocessing.GzipPlugin"],
    },
}

JSON Lines is the right default: streamable, append-friendly, and easy to load into Pandas or a data warehouse. CSV with gzip is fine for analyst handoff. Both fall over for relational queries, which is where pipelines come in.

A SQLite pipeline that runs after a validator:

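# pipelines.py
import sqlite3
from pathlib import Path

from itemadapter import ItemAdapter

class SqlitePipeline:
    def open_spider(self, spider):
        # sqlite3.connect fails if the folder is missing, so create it first.
        Path("data").mkdir(parents=True, exist_ok=True)
        self.conn = sqlite3.connect("data/books.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS products (title TEXT, price REAL, rating TEXT)"
        )

    def close_spider(self, spider):
        # Inserts are committed once, when the crawl ends.
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        a = ItemAdapter(item)
        # .get() stores NULL for a field the loader did not populate
        # instead of raising KeyError mid-crawl.
        self.conn.execute(
            "INSERT INTO products(title, price, rating) VALUES (?, ?, ?)",
            (a.get("title"), a.get("price"), a.get("rating")),
        )
        return item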

Register it with a priority. Lower numbers run earlier, so a validator at 100 fires before the database writer at 200:

ITEM_PIPELINES = {
    "bookstore.pipelines.PriceRangeValidator": 100,
    "bookstore.pipelines.SqlitePipeline": 200,
}

Now invalid prices get dropped before they ever hit the database.
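
The PriceRangeValidator registered above is not shown in this guide, so here is one possible sketch of it. The bounds MIN_PRICE and MAX_PRICE are made-up values. The ImportError fallback exists only so the snippet runs without Scrapy installed; inside a real project you would import DropItem from scrapy.exceptions directly.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so the sketch runs standalone; not needed in a Scrapy project.
    class DropItem(Exception):
        pass


class PriceRangeValidator:
    # Hypothetical bounds; tune them to the catalog you scrape.
    MIN_PRICE = 0.0
    MAX_PRICE = 1000.0

    def process_item(self, item, spider):
        # .get() works for plain dicts and for scrapy.Item instances alike.
        price = item.get("price")
        if price is None:
            raise DropItem("missing price")
        if not (self.MIN_PRICE < price <= self.MAX_PRICE):
            raise DropItem(f"price out of range: {price}")
        return item
```

Raising DropItem stops the item from reaching any later pipeline, which is what keeps rejected rows out of SqlitePipeline.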

Hardening settings.py: AutoThrottle, Retries, and Caching

The default settings work in development and get you banned in production. The handful below are the ones that matter most. Verify the exact defaults against your installed Scrapy release.

# settings.py
ROBOTSTXT_OBEY = True            # respect the site's policy unless you have a contract
CONCURRENT_REQUESTS = 8          # global cap; lower for fragile sites
CONCURRENT_REQUESTS_PER_DOMAIN = 4
DOWNLOAD_DELAY = 0.5             # base delay; AutoThrottle adjusts dynamically

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0

RETRY_ENABLED = True
RETRY_TIMES = 5
RETRY_HTTP_CODES = [429, 500, 502, 503, 504, 408, 522, 524]

HTTPCACHE_ENABLED = True         # huge time-saver during development
HTTPCACHE_EXPIRATION_SECS = 3600
HTTPCACHE_IGNORE_HTTP_CODES = [429, 500, 502, 503, 504]

AutoThrottle is the killer feature here. Instead of guessing a DOWNLOAD_DELAY, you give it a target concurrency and Scrapy slows down when latency rises. That alone prevents most accidental DDoS situations on slow sites.
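
As a rough mental model, the adjustment rule described in the AutoThrottle documentation can be sketched as a single function. This is a simplification under stated assumptions: next_delay and its parameter names are mine, and the real extension tracks the delay per download slot rather than globally.

```python
def next_delay(prev_delay, latency, target_concurrency,
               min_delay, max_delay, status=200):
    """Simplified AutoThrottle-style delay update for one response."""
    # Delay that would keep roughly target_concurrency requests in flight.
    target = latency / target_concurrency
    # Move halfway toward the target, but never below the target itself.
    new_delay = max(target, (prev_delay + target) / 2.0)
    # Clamp between DOWNLOAD_DELAY and AUTOTHROTTLE_MAX_DELAY.
    new_delay = min(max(min_delay, new_delay), max_delay)
    # Error responses are not allowed to speed the crawl up.
    if status != 200 and new_delay <= prev_delay:
        return prev_delay
    return new_delay

# Latency rose to 4 s: the delay jumps up to the target (4.0 / 2.0).
slow = next_delay(1.0, latency=4.0, target_concurrency=2.0,
                  min_delay=0.5, max_delay=30.0)
# Latency fell to 1 s: the delay eases down gradually.
fast = next_delay(2.0, latency=1.0, target_concurrency=2.0,
                  min_delay=0.5, max_delay=30.0)
print(slow, fast)
```

The asymmetry is the point: the delay rises quickly when the site slows down and falls slowly when it recovers.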

HTTPCACHE_ENABLED is a development quality-of-life setting: while you iterate on selectors, identical requests come back from disk so you stop hammering the target. Turn it off in production.

For real anti-bot pressure, settings alone are not enough, and our guide on why scrapers get blocked covers the deeper patterns. Either way, the next layer is middlewares.

Downloader Middlewares: Headers, Proxies, and Managed APIs

When a site starts returning 403s, the fix is almost always in a downloader middleware. The skeleton is small:

# middlewares.py
import random

class RandomUserAgentMiddleware:
    UAS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ...",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_4) AppleWebKit/605.1.15 ...",
    ]
    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.UAS)

Register it in settings.py. Scrapy ships with a populated default middleware stack (DOWNLOADER_MIDDLEWARES_BASE, with over ten entries enabled out of the box). Priorities are integers: lower numbers sit closer to the engine, so process_request runs in ascending order and process_response in descending order. Community guidance places custom anti-bot middleware before the built-in RetryMiddleware, whose default priority is 550, so retries see your rotated identity.

DOWNLOADER_MIDDLEWARES = {
    "bookstore.middlewares.RandomUserAgentMiddleware": 400,
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,  # disable default
}

For proxy rotation, set request.meta["proxy"] in process_request. Community plugins exist for both rotating proxies and randomized user agents (and for distributed crawling, persistent caching, and monitoring), but check each project's current maintenance status before depending on it in production.

The honest tradeoff: at some point, rolling your own headers, residential IPs, and CAPTCHA solving turns into a side project. That is where a managed Scraper API plugs in cleanly. Implement a middleware that points each request at the API endpoint and adds your API key as a header. Because request.url is read-only, the middleware has to return a copy built with request.replace(url=...) rather than assigning to the attribute. The rest of your spider stays largely the same.
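
A minimal sketch of such a middleware follows. The endpoint, the X-Api-Key header name, the url query parameter, the SCRAPER_API_KEY setting, and the api_wrapped meta flag are all hypothetical placeholders, not any provider's real interface. The class needs no Scrapy import because it only calls methods on the request it is handed.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and header name; substitute your provider's values.
API_ENDPOINT = "https://api.example.com/v1"
API_KEY_HEADER = "X-Api-Key"


def wrap_url(target_url, endpoint=API_ENDPOINT):
    """Return an API URL that carries the real target as a query parameter."""
    return f"{endpoint}?{urlencode({'url': target_url})}"


class ManagedApiMiddleware:
    def __init__(self, api_key):
        self.api_key = api_key

    @classmethod
    def from_crawler(cls, crawler):
        # SCRAPER_API_KEY is a hypothetical setting name.
        return cls(crawler.settings.get("SCRAPER_API_KEY"))

    def process_request(self, request, spider):
        # Requests we already rewrote pass straight through to the downloader.
        if request.meta.get("api_wrapped"):
            return None
        # request.url is read-only, so build a modified copy instead.
        wrapped = request.replace(
            url=wrap_url(request.url),
            meta={**request.meta, "api_wrapped": True},
        )
        wrapped.headers[API_KEY_HEADER] = self.api_key
        return wrapped  # Scrapy reschedules the returned request


print(wrap_url("https://books.toscrape.com/catalogue/page-1.html"))
```

The meta flag matters: a request returned from process_request goes back through the middleware chain, so without the guard it would be wrapped again on every pass. One caveat with this shape is that response.url in your callbacks becomes the API URL, so relative links need to be resolved against the original target instead.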

Scrapy-Playwright: The JavaScript Escape Hatch

Scrapy does not execute JavaScript on its own, so sites built with Angular, React, or any client-side framework return the shell HTML and not the data you can see in the browser. The cleanest fix in 2026 for web scraping with Scrapy on dynamic pages is scrapy-playwright, which swaps the default downloader for a real headless Chromium when you opt in per request.

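Install it with pip install scrapy-playwright, download a browser with playwright install chromium, then register the download handlers. Verify the current handler-registration syntax against the scrapy-playwright README at install time: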

# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

Opt requests in by setting meta:

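import scrapy
from scrapy_playwright.page import PageMethod

def start_requests(self):
    yield scrapy.Request(
        "https://example-spa.com/products",
        meta={
            "playwright": True,
            # scrapy-playwright expects PageMethod objects here;
            # this one waits until the product cards have rendered.
            "playwright_page_methods": [
                PageMethod("wait_for_selector", "article.product"),
            ],
        },
    )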

Only flag the URLs that actually need a browser. Each Playwright request is dramatically more expensive than a plain Scrapy fetch, both in CPU and in latency, so a hybrid spider (HTML for listings, Playwright for product detail) is usually the right shape. If you want a deeper walkthrough or a comparison with the older Splash backend, our Scrapy-Playwright tutorial covers the patterns in detail.
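
One way to keep the opt-in decision in a single place is a small helper that builds the request meta. This is only a sketch: playwright_meta and the JS_PATH_PREFIXES rule are hypothetical, and it assumes the pages that need rendering can be recognized by their URL path.

```python
from urllib.parse import urlparse

# Hypothetical rule: only product detail pages need JavaScript rendering.
JS_PATH_PREFIXES = ("/products/",)

def playwright_meta(url, js_prefixes=JS_PATH_PREFIXES):
    """Return Request meta that opts into Playwright only for matching paths."""
    if urlparse(url).path.startswith(js_prefixes):
        return {"playwright": True}
    return {}

print(playwright_meta("https://example-spa.com/products/42"))
print(playwright_meta("https://example-spa.com/category/books"))
```

In a spider this would be used as scrapy.Request(url, meta=playwright_meta(url)), so listing pages keep using the plain downloader.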

Logging, Contracts, and Deployment

Production-grade web scraping with Scrapy needs three things the tutorials usually skip.

Logging. Set LOG_LEVEL = "INFO" in settings.py for normal runs and "DEBUG" only when something is wrong. Pipe logs to a file with LOG_FILE or stream them to a structured backend.

Spider contracts. Add docstring contracts to callbacks and run scrapy check in CI. A typical contract pins the URL, expected fields, and minimum item count, so a silent site change breaks the build instead of the dataset.

def parse(self, response):
    """
    @url https://books.toscrape.com/
    @returns items 20 20
    @scrapes title price rating
    """

Scheduling and deployment. scrapyd runs your project as a long-lived daemon you can deploy via scrapyd-client. For container-based stacks, build a thin Docker image with your project and run scrapy crawl on a cron schedule (or a Kubernetes CronJob). Either way, persist outputs to durable storage, not the container filesystem.

Common Pitfalls and How to Debug Them

  • Empty selectors. The selector works in your browser's DevTools but returns None in the shell or the spider. Almost always JavaScript-rendered. Confirm with view(response) and switch to scrapy-playwright for that URL.
  • 403 and 429 storms. Your fingerprint is obvious. Add a random User-Agent middleware, lower CONCURRENT_REQUESTS_PER_DOMAIN, raise AUTOTHROTTLE_START_DELAY, and confirm RETRY_HTTP_CODES includes 429.
  • Infinite pagination loops. The "next" selector also matches on the last page. Anchor it on a CSS class that disappears at the end, or set DEPTH_LIMIT.
  • Items silently dropped. A pipeline raised DropItem and you never noticed. Dropped items are logged at WARNING level by default, so make sure LOG_LEVEL is not set above WARNING, search the log for Dropped:, and validate your range checks.
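  • Duplicate URLs slipping through. RFPDupeFilter compares request fingerprints built from a canonicalized URL, so a reordered query string is already caught, but URLs that differ by a tracking parameter, a session ID, or a trailing slash still count as distinct. Normalize URLs before yielding requests.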
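
One way to normalize URLs before yielding requests, sketched with the standard library only. The TRACKING_PARAMS set is a hypothetical example; which parameters are safe to drop depends on the target site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of parameters that never change the page content.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}

def strip_tracking(url, drop=TRACKING_PARAMS):
    """Remove known tracking parameters and the fragment from a URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in drop
    ]
    return urlunsplit(parts._replace(query=urlencode(kept), fragment=""))

print(strip_tracking("https://example.com/list?page=2&utm_source=mail#top"))
```

Call it on each URL before building the Request, so two links to the same page produce one request.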

Key Takeaways

  • Web scraping with Scrapy pays off when you need scheduling, deduplication, retries, throttling, and pipelines wired together out of the box, not when a 20-line script will do.
  • Map every symptom to a lifecycle stage: blocks live in downloader middlewares, missing items live in pipelines, and selector misses usually point to JavaScript rendering.
  • Item Loaders with MapCompose, TakeFirst, and Join keep cleaning logic reusable across spiders instead of copy-pasted across callbacks.
  • Persist with FEEDS for portable formats and a custom pipeline for relational storage. Use both, with pipeline priorities ordering validation before the database writer.
  • Treat AutoThrottle, retry codes, and a managed scraping API as a tiered defense against bans. Reach for scrapy-playwright only when the HTML is genuinely empty.

FAQ

Is Scrapy still worth learning in 2026 compared with newer async libraries?

Yes, for crawls beyond a few hundred pages. Newer async stacks like httpx plus selectolax are great for one-off scripts, but Scrapy bundles the scheduler, dupe filter, retry middleware, signals, and feed exports that you would otherwise write yourself. For a recurring production crawler, that batteries-included design still wins on maintenance cost.

Can Scrapy scrape JavaScript-rendered pages on its own, or do I need Playwright or Splash?

Not on its own. Scrapy fetches raw HTML and does not run JavaScript, so single-page apps return shell markup. The current best option is scrapy-playwright, which swaps the downloader for a real headless Chromium per request. scrapy-splash still works for some teams, but Playwright has broader browser support and active maintenance.

How does Scrapy compare to Beautiful Soup and Selenium for different project sizes?

Beautiful Soup is a parser, not a crawler, and pairs well with requests for small static scrapes. Selenium drives a full browser and is best for stateful, interactive flows like logged-in dashboards. Scrapy sits between them: a high-throughput crawling framework for hundreds to millions of pages, with browser rendering bolted on via scrapy-playwright when needed.

How do I deploy a Scrapy spider to run on a schedule in production?

Three common patterns. Run scrapyd as a daemon and trigger jobs through its HTTP API. Build a Docker image with your project and schedule scrapy crawl <name> via cron or a Kubernetes CronJob. Or use a managed scraping platform that hosts spiders for you. In all cases, persist outputs to durable storage like S3 or a database, never to a container filesystem.

How do I keep my Scrapy spider from getting blocked or IP-banned?

Layer your defenses. Enable AutoThrottle, randomize User-Agent headers via a downloader middleware, include 429 in RETRY_HTTP_CODES, and lower CONCURRENT_REQUESTS_PER_DOMAIN. For tougher sites, route requests through residential proxies or a managed scraper API that handles rotation and CAPTCHA solving behind one endpoint. Respect robots.txt and rate limits when you can.

Wrapping Up

The point of web scraping with Scrapy is not that you write less code than with requests plus BeautifulSoup. You usually write more on day one. The point is that the code you write on day one still works on day ninety, because the engine, scheduler, dupe filter, throttle, retry layer, and pipeline contract do not change underneath you. You buy yourself a stable substrate, then specialize the spiders, items, and middlewares to each target site.

If you internalize one thing from this guide, make it the request and response lifecycle. Every Scrapy bug you will ever hit lives at a specific stage of that loop, and naming the stage is half the fix. Selectors fail in the callback. Items vanish in the pipeline. Bans happen in the downloader. Pagination loops forever in your callback logic. Match the symptom to the stage and the fix gets obvious.

When the anti-bot pressure outgrows what you can build in middlewares.py, that is the right moment to offload the request layer. At WebScrapingAPI we built Scraper API for exactly that handoff: keep your Scrapy spiders, your Items, your pipelines, and let a managed endpoint deal with proxies, CAPTCHA solving, and JavaScript rendering. Your spider stays Scrapy. The blocks become someone else's problem.

About the Author
Mihai Maxim, Full Stack Developer @ WebScrapingAPI

Mihai Maxim is a Full Stack Developer at WebScrapingAPI, contributing across the product and helping build reliable tools and features for the platform.
