Suciu Dan | Last updated on Apr 30, 2026 | 27 min read

How to Build a Python Web Crawler: From Start to Scale

TL;DR: A Python web crawler automates the tedious work of following links across a website to discover and collect content. This guide walks you through building one from scratch with requests and BeautifulSoup, then graduating to Scrapy for concurrent crawling, item pipelines, and structured data exports. You will also learn how to crawl responsibly, rotate proxies to avoid blocks, and handle JavaScript-rendered pages.

A Python web crawler is a program that automatically navigates websites by following hyperlinks, discovering new pages, and collecting their content along the way. If web scraping is about extracting specific data points from a single page, web crawling is about traversing an entire site (or even multiple sites) to find those pages in the first place.

Python is arguably the most popular language for this job. Between its readable syntax, battle-tested HTTP libraries, and a framework literally named for web spiders, the ecosystem makes crawling accessible without sacrificing power. Whether you need to map every product page on an e-commerce site, build a backlink index for SEO analysis, or feed structured data into machine-learning pipelines, a well-built crawler is the engine that drives the whole process.

This tutorial covers the full lifecycle of building a web crawler in Python: fetching your first page with requests, parsing and extracting links with BeautifulSoup, and then scaling up with Scrapy's spiders, selectors, and item pipelines. Along the way, you will learn how to handle edge cases like relative URLs and JSON APIs, respect robots.txt, throttle your requests, and avoid getting blocked by anti-bot systems. Every section includes runnable code you can copy, adapt, and extend for your own projects. By the end, you will have a clear path from a 20-line prototype to a production-ready crawling pipeline.

What Is a Python Web Crawler and Why Build One?

At its core, a Python web crawler is an automated script that starts from one or more seed URLs, fetches the page content, extracts every link it finds, and then repeats the cycle for each new URL. Think of it as a methodical visitor that reads the directory on every floor of a building before deciding which rooms to enter next.

The distinction between crawling and scraping trips people up constantly. Crawling is the discovery phase: finding pages by walking the link graph. Scraping is the extraction phase: pulling structured fields (titles, prices, dates) from pages you have already located. In practice, most projects need both, but they are separate concerns with different tooling requirements. Understanding this distinction helps you choose the right tools and structure your project properly.

So why build one in Python? A few concrete reasons:

  • SEO auditing and backlink mapping: Crawl your own site to find broken links, orphan pages, or missing meta tags. You can also traverse blogs, partner sites, and news outlets to discover who links to you or your competitors.
  • Data collection for ML and analytics: Gather training data from hundreds of pages and pipe it straight into pandas DataFrames, feature stores, or LLM training pipelines. The structured output from a well-designed crawler feeds directly into downstream analysis.
  • Price and inventory monitoring: Walk product category pages nightly to track pricing changes and stock levels across thousands of SKUs.
  • Research and archival: Academic researchers crawl forums, government databases, and public datasets that do not offer bulk download APIs.
  • Content aggregation: News organizations and market research firms crawl industry sites to build curated feeds and competitive intelligence dashboards.

Python's ecosystem (requests, BeautifulSoup, Scrapy, and many more) means you can prototype a working crawler in under 30 lines, then scale the same logic to millions of pages without switching languages. That prototype-to-production path is exactly what this guide covers.

How Web Crawlers Work Under the Hood

Every Python web crawler, from a ten-line script to a distributed system, follows the same fundamental loop:

  1. Start with seed URLs. You provide one or more starting addresses. These go into a queue (often called the "frontier").
  2. Fetch the page. The crawler sends an HTTP GET request for the next URL in the queue and receives the HTML response.
  3. Parse the HTML. A parser (BeautifulSoup, lxml, Scrapy selectors) reads the document and exposes its structure as a traversable tree.
  4. Extract links. The parser pulls every <a href="..."> from the page, along with any other discoverable URLs.
  5. Filter and deduplicate. Not every link is worth following. The crawler checks each URL against a seen-URL set, applies domain or path filters, and discards duplicates. This step also includes URL normalization: stripping fragments, sorting query parameters, and lowercasing the scheme and host so that Example.com/page and example.com/page are not treated as different URLs. Paths can be case-sensitive, so lowercase them only when you know the target server ignores case.
  6. Enqueue new URLs. Surviving links join the frontier queue.
  7. Repeat until the queue is empty or a stopping condition is met (max depth, max pages, time limit).
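The seven steps above can be sketched without any network access. This is a minimal illustration, not a production crawler: FAKE_SITE, crawl_graph, and the get_links callable are hypothetical stand-ins for real fetching and parsing, and the same-site check is hard-coded to example.com.

```python
from collections import deque

# Hypothetical in-memory "site": each URL maps to the links found on that page.
FAKE_SITE = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/", "https://example.com/c"],
    "https://example.com/b": ["https://example.com/c", "https://other.example.org/"],
    "https://example.com/c": [],
}


def crawl_graph(seed, get_links, max_depth=2, max_pages=100):
    """Breadth-first crawl; returns URLs in the order they were fetched."""
    frontier = deque([(seed, 0)])  # step 1: seed the frontier
    seen = {seed}                  # every URL that has ever been enqueued
    order = []

    while frontier and len(order) < max_pages:  # step 7: stopping conditions
        url, depth = frontier.popleft()
        links = get_links(url)  # steps 2-4: fetch, parse, extract (faked here)
        order.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in links:
            # step 5: filter (same site only) and deduplicate
            if link.startswith("https://example.com/") and link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))  # step 6: enqueue
    return order


print(crawl_graph("https://example.com/", lambda u: FAKE_SITE.get(u, [])))
```

Adding a URL to the seen set at enqueue time, rather than at fetch time, keeps the same link from entering the frontier twice.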

This loop is deceptively simple, but the real engineering lives in the details. How do you handle pages that return a 301 redirect to a URL you have already visited? What happens when the server is slow and you have 500 URLs waiting in the queue? How do you avoid crawling the same content under different URL patterns (session IDs, tracking parameters, calendar widgets)?

A naive implementation fetches URLs one at a time, which is fine for a few dozen pages. Once you need to crawl thousands, you need concurrency (multiple requests in flight), persistent queues, and retry logic for transient failures. Simple crawlers that lack retries and fetch pages sequentially are genuinely unsuitable for production-scale work. That is exactly the gap Scrapy was designed to fill: it gives you an asynchronous engine, a built-in scheduler with deduplication, and middleware hooks for every stage of the loop.
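To make "retry logic for transient failures" concrete, here is one possible shape for it: a small wrapper that retries a fetch with exponential backoff. The function name, its parameters, and the flaky_fetch stand-in are assumptions made for illustration; Scrapy ships its own retry middleware, so this is only relevant to hand-rolled crawlers.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); after a failure wait base_delay * 2**n, then try again."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as exc:  # a real crawler would catch narrower errors
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise last_error


# Demo with a hypothetical fetcher that fails twice before succeeding.
state = {"calls": 0}


def flaky_fetch(url):
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("temporary failure")
    return f"<html>content of {url}</html>"


waits = []  # record the delays instead of actually sleeping
page = fetch_with_retries(flaky_fetch, "https://example.com/", sleep=waits.append)
print(page, waits)
```

Passing the sleep function as a parameter keeps the wrapper easy to test, because the delays can be recorded rather than waited out.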

Understanding this loop is not just academic. When your crawler misbehaves (misses pages, revisits the same URL, or slows to a halt), the bug almost always maps to one of these seven steps. Diagnosing problems becomes much faster when you can pinpoint which stage is failing.

Choosing the Right Python Crawling Tool

Before writing a single line of code, it pays to pick the right library for your project's scale and complexity. Here is a practical decision matrix for building a web crawler in Python:

| Criteria | requests + BeautifulSoup | Scrapy | Managed API service |
|---|---|---|---|
| Setup time | Minutes | 15-30 min (project scaffolding) | Minutes (API key) |
| Concurrency | Manual (threads/asyncio) | Built-in async engine | Handled for you |
| Deduplication | You build it | Built-in scheduler filter | Handled for you |
| JS rendering | Not supported | Needs plugin (e.g., scrapy-playwright) | Often included |
| Data export | Manual (write to file) | CLI flags for JSON/CSV | Varies by provider |
| Anti-bot handling | DIY (headers, proxies) | Middleware hooks | Built-in proxy rotation, CAPTCHA solving |
| Best for | Small, one-off crawls | Medium to large, recurring crawls | Sites with heavy bot defenses or JS rendering |

requests + BeautifulSoup is the go-to combination when you need a quick prototype or a crawler that hits a handful of pages. You control every detail, which is great for learning and terrible for scaling. Basic crawlers built this way often re-visit the same pages or get stuck following repeated links without careful deduplication logic.

Scrapy is an entire framework purpose-built for web crawling at scale. It handles concurrency, retries, deduplication, and data pipelines out of the box. The trade-off is a steeper learning curve and an opinionated project structure. But once you internalize the spider/pipeline pattern, building new crawlers becomes remarkably fast.

Managed API services make sense when the hard part of your crawl is not the parsing logic but the infrastructure: rotating proxies, solving CAPTCHAs, rendering JavaScript. Instead of maintaining that stack yourself, you send a request and get HTML (or JSON) back.

Pick the simplest option that meets your requirements. You can always upgrade later, and this guide will show you how to progress through each tier.

Setting Up Your Python Environment

A clean environment prevents dependency conflicts and keeps your project reproducible. Here is the minimal setup for your Python web crawler project:

# Create and activate a virtual environment
python3 -m venv crawler-env
source crawler-env/bin/activate   # macOS / Linux
crawler-env\Scripts\activate      # Windows

# Install core libraries
pip install requests beautifulsoup4 lxml scrapy

Your project folder should look something like this:

my-crawler/
├── crawler-env/
├── simple_crawler.py      # requests + BS4 version
├── scrapy_project/        # generated by scrapy startproject
│   ├── scrapy_project/
│   │   ├── spiders/
│   │   ├── items.py
│   │   ├── pipelines.py
│   │   └── settings.py
│   └── scrapy.cfg
└── requirements.txt

Pin your dependencies with pip freeze > requirements.txt so anyone cloning the repo gets the same versions. If you plan to use Scrapy with a headless browser for JavaScript-rendered pages, add scrapy-playwright to the install list as well.

lxml is included as the parser backend for BeautifulSoup. It is significantly faster than Python's built-in html.parser and handles malformed HTML more gracefully, which matters when you are crawling pages whose markup was clearly not written by humans.

With these packages in place, you are ready to write code. The next section builds a complete working crawler from scratch.

Building a Basic Python Web Crawler with Requests and BeautifulSoup

Time to write actual code. The goal of this first crawler is simple: start from a seed URL, fetch the page, find every link on it, and then visit each of those links while staying on the same domain. It is intentionally minimal so you can see the moving parts before adding complexity.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
import time

def crawl(seed_url, max_pages=20, delay=1):
    visited = set()
    queue = [seed_url]
    allowed_domain = urlparse(seed_url).netloc

    while queue and len(visited) < max_pages:
        url = queue.pop(0)
        if url in visited:
            continue

        try:
            response = requests.get(
                url,
                headers={"User-Agent": "MyCrawler/1.0 (contact@example.com)"},
                timeout=10,
            )
            response.raise_for_status()
        except requests.RequestException as e:
            print(f"Failed to fetch {url}: {e}")
            continue

        visited.add(url)
        print(f"Crawled: {url} ({response.status_code})")

        soup = BeautifulSoup(response.text, "lxml")

        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            parsed = urlparse(link)
            # Strip fragments and stay on the same domain
            clean_link = parsed._replace(fragment="").geturl()
            if parsed.netloc == allowed_domain and clean_link not in visited:
                queue.append(clean_link)

        time.sleep(delay)  # Be polite

    print(f"Done. Visited {len(visited)} pages.")
    return visited

if __name__ == "__main__":
    crawl("https://example.com")

Let's walk through the key decisions in this Python web crawler:

  • visited set: This is your deduplication mechanism. Before fetching any URL, you check whether it is already in the set. Without this, the crawler would loop forever on sites with circular navigation links. Even small sites can have navigation menus that create cycles.
  • urljoin: Converts relative paths like /about into absolute URLs like https://example.com/about. This is critical because most sites use relative hrefs in their navigation and internal linking.
  • Domain filtering: The urlparse check keeps the crawler on a single domain. Without it, one external link could send your crawler spiraling across the entire internet. This is the difference between a focused crawl and an uncontrolled one.
  • Fragment stripping: The _replace(fragment="") call removes #section anchors from URLs. These point to different positions on the same page, not different pages, so treating them as distinct URLs would cause redundant fetches.
  • max_pages cap: A safety net that is especially important during development. You do not want to accidentally fire off thousands of requests while debugging your parser.
  • time.sleep(delay): A basic politeness measure. Even one second between requests makes a significant difference to the target server's load.
  • Custom User-Agent: Identifying your crawler with a descriptive header (including contact information) is both ethical and practical. Sites are far less likely to block a crawler that identifies itself honestly.
  • Error handling: The try/except block catches connection timeouts, DNS failures, and HTTP errors (via raise_for_status). A production crawler needs this; a crashed process means lost progress and potentially incomplete data.

This crawler is synchronous, meaning it fetches one page at a time. For 20 pages that is perfectly fine. For 2,000 it would be painfully slow. We will address that limitation when we move to Scrapy later in this guide.

The link extraction logic in the basic crawler works, but real-world pages are messy. Anchor tags point to PDFs, mailto addresses, JavaScript void calls, fragment-only links, and query-string variations of the same page. A smarter filter saves you from wasting requests on junk URLs and keeps your crawl focused.

from urllib.parse import urljoin, urlparse

IGNORED_EXTENSIONS = {".pdf", ".jpg", ".png", ".gif", ".zip", ".exe", ".mp4"}

def extract_links(soup, base_url, allowed_domain):
    links = set()
    for anchor in soup.find_all("a", href=True):
        raw = anchor["href"]

        # Skip non-HTTP schemes
        if raw.startswith(("mailto:", "javascript:", "tel:", "#")):
            continue

        full_url = urljoin(base_url, raw)
        parsed = urlparse(full_url)

        # Strip fragments for deduplication
        clean = parsed._replace(fragment="").geturl()

        # Domain filter
        if parsed.netloc != allowed_domain:
            continue

        # Extension filter: test the path only, so a query string
        # (e.g. /report.pdf?download=1) cannot hide the extension
        if parsed.path.lower().endswith(tuple(IGNORED_EXTENSIONS)):
            continue

        links.add(clean)
    return links

This function handles the most common edge cases when you crawl a website with Python: it resolves relative URLs, strips fragment identifiers (the #section part that does not change the actual page), ignores non-HTTP schemes, and skips binary file extensions. The result is a clean set of same-domain URLs ready for the crawl queue.

For even stronger duplicate URL detection, consider normalizing query parameters by sorting them alphabetically. Two URLs that differ only in parameter order (?a=1&b=2 vs. ?b=2&a=1) typically return the same content, and treating them as distinct wastes bandwidth. You can also apply canonical URL handling by checking for <link rel="canonical"> tags in the HTML, which tell you the preferred URL for a piece of content.
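A minimal sketch of that query-parameter normalization, using only the standard library. The normalize_url name is an assumption, and the function deliberately leaves the path's case untouched because paths can be case-sensitive. Sorting is only safe when the server ignores parameter order, which is typical but not guaranteed.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url(url):
    """Return a canonical form of url for use as a deduplication key."""
    parts = urlsplit(url)
    # Sort parameters so ?a=1&b=2 and ?b=2&a=1 map to the same key.
    # Note: values are decoded and re-encoded, so spaces come back as "+".
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),  # host names are case-insensitive
        parts.path or "/",     # treat an empty path as the root
        query,
        "",                    # drop the fragment
    ))


print(normalize_url("https://Example.com/Shop?b=2&a=1#reviews"))
print(normalize_url("https://example.com/Shop?a=1&b=2"))
```

Both calls produce the same string, so storing normalize_url(link) in the visited set collapses the two variants into one entry.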

Another useful technique is URL pattern detection. If you notice the crawler generating thousands of URLs that match a pattern like /calendar?date=2024-01-01, /calendar?date=2024-01-02, and so on, you can add a regex deny-list to short-circuit those paths before they ever enter the queue.
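One way to express such a deny-list is a handful of compiled regular expressions checked before a URL is enqueued. The patterns below are illustrative assumptions (the sessionid parameter name in particular is hypothetical); a real list would come from inspecting the URLs your crawler actually generates.

```python
import re

# Hypothetical deny-list: adjust the patterns to the site being crawled.
DENY_PATTERNS = [
    re.compile(r"/calendar\?date=\d{4}-\d{2}-\d{2}"),  # endless calendar pages
    re.compile(r"[?&]sessionid=", re.IGNORECASE),      # session-specific URLs
]


def is_denied(url):
    """Return True if url matches any deny pattern and should be skipped."""
    return any(pattern.search(url) for pattern in DENY_PATTERNS)


candidates = [
    "https://example.com/calendar?date=2024-01-01",
    "https://example.com/blog/post-1",
    "https://example.com/products?page=2&SessionID=abc123",
]
print([url for url in candidates if not is_denied(url)])
```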

Handling Relative URLs and Edge Cases

Relative URLs are the most common source of bugs in a first-time Python web crawler. A page at example.com/blog/ might contain links like ../about, ./post-1, or even //cdn.example.com/image.png. Python's urllib.parse.urljoin handles all of these correctly, which is why it appears in both link-extraction examples above.

Beyond relative paths, watch out for these edge cases:

  • Redirect chains: A 301 or 302 redirect means the final URL differs from the one you requested. Use response.url (not the original request URL) when adding to your visited set, or you will crawl the same page twice under different addresses.
  • Soft 404s: Some sites return a 200 status but serve a generic "page not found" body. If you are extracting data, check for a content marker (like a product title) before treating the page as valid.
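  • URL-encoded characters: %20 vs. a literal space, or %7E vs. ~. Normalize percent-encoding before deduplication so equivalent variants are not treated as separate URLs, but leave encoded reserved characters such as %2F alone, because decoding them changes the structure of the path.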
  • Infinite URL patterns: Calendar widgets, session IDs in the path, or filter combinations can generate an unlimited number of unique-looking URLs that all serve similar content. Set a maximum crawl depth or use URL pattern detection to break the loop.

Handling these up front saves hours of debugging later. A crawler that silently double-counts pages or misses content because of a trailing slash is harder to fix than one that crashes loudly on a malformed URL.

Crawling and Parsing JSON APIs

Not every website serves its data as HTML. Many modern web applications load content from internal APIs that return JSON instead of embedding data directly in the page markup. If you inspect network requests in your browser's developer tools (the Network tab, filtered by XHR/Fetch), you will often find endpoints that hand you structured data without any HTML parsing needed.

Here is a pattern for crawling a paginated JSON API:

import requests
import time

def crawl_json_api(base_url, max_pages=10, delay=1):
    all_items = []
    page = 1

    while page <= max_pages:
        response = requests.get(
            base_url,
            params={"page": page, "per_page": 50},
            headers={"Accept": "application/json"},
            timeout=10,
        )
        response.raise_for_status()
        data = response.json()

        items = data.get("results", [])
        if not items:
            break  # No more data

        all_items.extend(items)
        print(f"Page {page}: fetched {len(items)} items")

        # Check for explicit pagination metadata
        if not data.get("has_next", True):
            break

        page += 1
        time.sleep(delay)

    return all_items

# Example usage
items = crawl_json_api("https://api.example.com/products")
print(f"Total items collected: {len(items)}")

The approach is nearly identical to HTML crawling, with two key differences. First, you skip the HTML parsing step entirely because response.json() gives you native Python objects (typically a dictionary or a list). Second, pagination is usually explicit: the API returns a next URL or a has_next flag, or you increment a page parameter until the results array comes back empty.

When the API requires authentication (an API key or session token), pass it via headers rather than query parameters to avoid leaking credentials in server logs. And always check for rate-limit headers (X-RateLimit-Remaining, Retry-After). Respecting those headers is both polite and practical, because the server will cut you off if you ignore them.
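As a rough sketch of honoring those headers, the helper below turns a response's headers into a wait time. The function name, the default delay, and the tenfold back-off when X-RateLimit-Remaining reaches zero are all assumptions; X-RateLimit-* headers are not standardized, so check what the specific API documents. A plain dict is case-sensitive, whereas response.headers from requests is not.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def seconds_to_wait(headers, default_delay=1.0, now=None):
    """Decide how long to pause before the next request."""
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        value = retry_after.strip()
        if value.isdigit():  # form 1: a number of seconds
            return float(value)
        try:                 # form 2: an HTTP date
            when = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            return default_delay
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        now = now or datetime.now(timezone.utc)
        return max(0.0, (when - now).total_seconds())
    if headers.get("X-RateLimit-Remaining") == "0":
        return default_delay * 10  # assumed back-off once the quota is spent
    return default_delay


print(seconds_to_wait({"Retry-After": "30"}))
print(seconds_to_wait({"X-RateLimit-Remaining": "12"}))
```

In the pagination loop above, the result could replace the fixed delay, for example time.sleep(seconds_to_wait(response.headers)).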

This approach pairs well with tools like pandas. Once you have a list of dictionaries from your JSON crawl, loading them into a DataFrame for analysis or export is a single pd.DataFrame(items) call. You can then use pandas to clean, filter, and analyze the collected data or train machine-learning models on it directly.

Scaling Up with Scrapy

Once your crawling needs outgrow a synchronous requests loop, Scrapy is the natural next step. It is a full-featured Python framework designed specifically for web crawling at scale, and it handles the hard infrastructure problems (concurrency, retries, deduplication, and throttling) so you can focus on the parsing logic.

Scrapy's architecture has five core components that work together:

  • Engine: The central coordinator. It passes requests to the Downloader and responses to Spiders, orchestrating the entire crawl cycle.
  • Scheduler: Manages the request queue and deduplicates URLs automatically using fingerprinting. You never build a visited set by hand.
  • Downloader: Sends HTTP requests asynchronously using Twisted's event loop, which means hundreds of requests can be in flight simultaneously without threading overhead.
  • Spiders: Your code. Each spider class defines which URLs to start from and how to parse each response. This is where your domain-specific logic lives.
  • Item Pipelines: Post-processing stages that clean, validate, and store the data your spiders extract. You can chain multiple pipelines for different concerns.

What makes this architecture powerful is that every component is pluggable. Need custom headers on every request? Write a downloader middleware. Want to drop items with missing fields? Add a validation pipeline. Need to funnel results into a database instead of a JSON file? Swap the default exporter for a custom pipeline.
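For instance, the "custom headers on every request" case might look like the sketch below. A downloader middleware is a plain class; the class name, the X-Crawler-Contact header, and the myproject module path in the comment are hypothetical.

```python
class ContactHeaderMiddleware:
    """Downloader middleware that tags outgoing requests with a contact header."""

    def process_request(self, request, spider):
        # setdefault leaves the header alone if a spider already set it.
        request.headers.setdefault("X-Crawler-Contact", "contact@example.com")
        # Returning None tells Scrapy to keep processing the request normally.
        return None


# Enabled in settings.py (module path assumed for illustration):
# DOWNLOADER_MIDDLEWARES = {
#     "myproject.middlewares.ContactHeaderMiddleware": 543,
# }
```

The number in DOWNLOADER_MIDDLEWARES controls where the middleware sits relative to Scrapy's built-in ones.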

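Scrapy can process pages significantly faster than a single-threaded requests loop. The framework handles concurrent requests across multiple domains while staying within the throttle limits you define. Projects generated with scrapy startproject also obey robots.txt out of the box, because the generated settings.py sets ROBOTSTXT_OBEY = True. The framework-level default is False, so enable it explicitly when you run a spider outside a generated project. Many hand-rolled crawlers forget to implement this at all.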
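The throttle limits mentioned above live in settings.py. The fragment below uses real Scrapy setting names, but the values are illustrative assumptions rather than recommendations; tune them to what the target site can tolerate.

```python
# settings.py (illustrative values, not recommendations)
ROBOTSTXT_OBEY = True               # honor robots.txt rules
CONCURRENT_REQUESTS = 16            # total requests in flight
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # cap per target domain
DOWNLOAD_DELAY = 0.5                # seconds between requests to one domain
AUTOTHROTTLE_ENABLED = True         # adapt the delay to server latency
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
RETRY_TIMES = 2                     # retries for transient failures
USER_AGENT = "MyCrawler/1.0 (contact@example.com)"
```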

The trade-off is complexity. Scrapy has an opinionated project structure, a learning curve, and its own vocabulary (spiders, items, pipelines, middlewares). But once you internalize the pattern, you can build production-grade crawlers remarkably quickly. The rest of this guide shows you how.

Creating a Scrapy Project and Your First Spider

Let's scaffold a real Scrapy project. Open a terminal inside your virtual environment and run:

scrapy startproject bookstore
cd bookstore
scrapy genspider books books.toscrape.com

This generates the full directory tree: settings.py, items.py, pipelines.py, and a spiders/ folder with your new books.py spider. The genspider command pre-fills the spider with the correct allowed_domains and a starter start_urls list. Open that spider file and replace the boilerplate with a working parser:

import scrapy

class BooksSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css(".price_color::text").get(),
                "availability": book.css(
                    ".instock.availability::text"
                ).getall()[-1].strip(),
            }

        # Follow the "next" pagination link
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Run the spider with:

scrapy crawl books -o books.json

That single command starts the engine, fetches every paginated listing page, extracts book data, and writes the results to a JSON file. No manual loop, no visited set, no file-writing boilerplate. This is the power of a Python web crawler built on a proper framework.

A few things worth noting about this spider:

  • allowed_domains restricts the crawler to the target site. Any off-domain links discovered during parsing are silently dropped, preventing your spider from wandering to external sites.
  • response.css() uses CSS selectors on the response body. Scrapy parses the HTML once and caches the parsed tree, so calling multiple selectors on the same response is cheap.
  • yield instead of return: Scrapy spiders are generators. You yield items (dictionaries or Item objects) and requests. The engine decides when and how to schedule them, managing concurrency for you.
  • response.follow() handles relative URLs internally (no urljoin needed) and automatically deduplicates against previously seen URLs through the scheduler's fingerprint filter.

This is a complete, working web crawling spider in about 20 lines of code. Everything else (HTTP handling, scheduling, concurrent downloads, export) is managed by the Scrapy framework. From here, you can extend the spider with deeper link following, additional parse callbacks, and pipeline processing.

Pagination is one of the most common patterns when building a Python web crawler. Most listing sites split results across pages, and your spider needs to follow those "Next" links until they run out. The Scrapy spider above already demonstrates the basic approach, but let's look at a more robust version that handles deeper link structures using Scrapy's LinkExtractor.

import scrapy
from scrapy.linkextractors import LinkExtractor

class DeepCrawlSpider(scrapy.Spider):
    name = "deepcrawl"
    start_urls = ["https://example.com/catalog"]
    allowed_domains = ["example.com"]

    link_extractor = LinkExtractor(
        allow=r"/catalog/",
        deny=[r"/login", r"/cart", r"/account"],
    )

    def parse(self, response):
        # Extract data from the current page
        for product in response.css(".product-card"):
            yield {
                "name": product.css("h2::text").get(),
                "url": response.urljoin(product.css("a::attr(href)").get()),
                "price": product.css(".price::text").get(),
            }

        # Follow all matching links found on the page
        for link in self.link_extractor.extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse)

LinkExtractor is Scrapy's utility for pulling links from a page based on regex patterns. The allow parameter keeps only URLs matching /catalog/, while deny filters out login, cart, and account pages that would waste requests. This is far more maintainable than hand-coding URL checks inside your parse method, especially as the number of exclusion patterns grows.

For sites that use "Load More" buttons instead of traditional pagination links, you will typically find an underlying API endpoint in the Network tab. Construct the next request manually by incrementing a page or offset parameter, exactly like the JSON API crawling pattern discussed earlier in this guide.

A common mistake is forgetting to set a depth limit. Scrapy's DEPTH_LIMIT setting caps how many link-hops from the seed URL the crawler will follow. Without it, a spider on a large site can queue millions of URLs before you notice. Start with a conservative limit (3-5) during development, then increase it once your spider is stable and you have confidence in your filtering logic.
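A development-time safety net along those lines might look like this in settings.py. Both names are standard Scrapy settings; the specific numbers are assumptions chosen to be conservative.

```python
# settings.py (development-time limits; values are illustrative)
DEPTH_LIMIT = 3                # stop following links more than 3 hops from a start URL
CLOSESPIDER_PAGECOUNT = 200    # close the spider after roughly this many responses
```

CLOSESPIDER_PAGECOUNT is a second guard in case a shallow but very wide site still produces more pages than expected.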

Another useful technique is combining CrawlSpider (a built-in Scrapy class) with Rule objects. This approach lets you define link-following rules declaratively, separating the navigation logic from the data extraction logic. It makes complex multi-level crawls easier to reason about.

XPath vs CSS Selectors for Data Extraction

Scrapy supports both CSS selectors and XPath expressions for parsing HTML, and you can mix them freely within the same spider. Knowing when to reach for each one saves time and keeps your selectors readable.

| Feature | CSS Selectors | XPath |
|---|---|---|
| Syntax | Familiar to front-end devs | XML query language |
| Text extraction | ::text pseudo-element | text() function |
| Attribute access | ::attr(href) | @href |
| Parent traversal | Not supported | .. or ancestor:: axis |
| Conditional logic | Limited (:nth-child, :not) | Rich (contains(), starts-with(), boolean operators) |
| Readability | Generally more concise | Verbose but more expressive |

Use CSS selectors when you are targeting elements by class, ID, or simple hierarchy. They are shorter, easier to read, and sufficient for the majority of extraction tasks:

# CSS: get all product titles
titles = response.css("h3.product-title::text").getall()

# CSS: get the href of every link inside a nav element
nav_links = response.css("nav a::attr(href)").getall()

Use XPath when you need to navigate upward in the DOM tree, match partial text content, or apply conditional logic that CSS cannot express:

# XPath: find links whose visible text contains "Next"
next_link = response.xpath('//a[contains(text(), "Next")]/@href').get()

# XPath: get the parent div of a specific span
parent = response.xpath('//span[@class="price"]/..').get()

# XPath: select items where the price is not empty
priced_items = response.xpath(
    '//div[@class="product"][.//span[@class="price" and text()]]'
).getall()

In practice, many Scrapy spiders rely on CSS selectors for most of their extraction work and switch to XPath for the edge cases where CSS falls short. Scrapy converts CSS selectors to XPath internally before executing them, so any performance difference between the two is negligible. Choose whichever makes your intent clearer for each specific extraction task.

One practical tip: when debugging selectors, use scrapy shell "https://target-url.com" to open an interactive session. You can test both CSS and XPath expressions against a live page without running your full spider, which speeds up development significantly.

Exporting Crawled Data to JSON and CSV

Scrapy's built-in feed exports handle the most common output formats with zero custom code. You already saw the basic export command:

# Export to JSON
scrapy crawl books -o output.json

# Export to CSV
scrapy crawl books -o output.csv

# Export to JSON Lines (one JSON object per line, better for large datasets)
scrapy crawl books -o output.jsonl

The -o flag appends to the file if it already exists, which can produce malformed JSON on repeat runs. Use -O (capital O, available in Scrapy 2.4+) to overwrite instead.

For more control over your data export pipeline, configure exports in settings.py:

FEEDS = {
    "data/books.json": {
        "format": "json",
        "encoding": "utf-8",
        "indent": 2,
        "overwrite": True,
    },
    "data/books.csv": {
        "format": "csv",
        "fields": ["title", "price", "availability"],
    },
}

The FEEDS dictionary lets you output to multiple formats simultaneously, control field ordering in CSV, and set encoding. This is particularly useful when different consumers need different formats: your analytics team wants CSV with specific columns, your API consumers want JSON Lines for streaming ingestion, and your archive needs a pretty-printed JSON snapshot.

JSON Lines format (.jsonl) deserves special attention for larger crawls. Unlike standard JSON, which wraps everything in a single array, JSON Lines writes one complete JSON object per line. This means you can stream-process the file line by line, append new records without re-parsing the whole file, and recover partial results if a crawl crashes midway.
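The stream-processing and crash-recovery points can be illustrated with a few lines of standard-library code. The iter_jsonl helper and the sample records are made up for this sketch; the last sample line is cut off on purpose to mimic a crawl that died mid-write.

```python
import io
import json


def iter_jsonl(lines):
    """Yield one parsed record per non-empty line, skipping unreadable lines."""
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            print(f"Skipping unreadable line {number}")


# Sample data standing in for open("output.jsonl", encoding="utf-8").
sample = io.StringIO(
    '{"title": "Book A", "price": "10.00"}\n'
    '{"title": "Book B", "price": "12.50"}\n'
    '{"title": "Book C", "pri'
)
records = list(iter_jsonl(sample))
print(f"Recovered {len(records)} complete records")
```

Because the generator reads one line at a time, memory use stays flat regardless of file size.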

If you need to push data into a database, a message queue, or a cloud storage bucket, skip the file exporter entirely and write a custom item pipeline. That approach gives you full control over validation, transformation, and storage logic.
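As one possible shape for such a pipeline, the sketch below writes items to SQLite using only the standard library. The class name, database path, and table layout are assumptions; the open_spider, process_item, and close_spider hooks are the ones Scrapy calls on item pipelines.

```python
import sqlite3


class SQLitePipeline:
    """Item pipeline that stores each item as a row in a SQLite table."""

    def __init__(self, db_path="books.db"):
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the DB and ensure the table exists.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS books "
            "(title TEXT PRIMARY KEY, price TEXT, availability TEXT)"
        )

    def process_item(self, item, spider):
        # INSERT OR REPLACE keeps one row per title across repeated crawls.
        self.conn.execute(
            "INSERT OR REPLACE INTO books VALUES (?, ?, ?)",
            (item.get("title"), item.get("price"), item.get("availability")),
        )
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes: persist and release the connection.
        self.conn.commit()
        self.conn.close()
```

Like any other pipeline, it only runs once it is listed in ITEM_PIPELINES in settings.py.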

Cleaning Data with Scrapy Item Pipelines

Raw crawled data is almost never clean. Prices have trailing whitespace, titles include stray newlines, and some pages return incomplete records. Scrapy's item pipeline system lets you process every item between extraction and export, ensuring your output is consistent and valid.

Here is a pipeline that handles the three most common cleaning tasks:

from scrapy.exceptions import DropItem

class CleaningPipeline:
    def __init__(self):
        self.seen_titles = set()

    def process_item(self, item, spider):
        # 1. Strip whitespace from all string fields
        for field in item:
            if isinstance(item[field], str):
                item[field] = item[field].strip()

        # 2. Validate required fields
        if not item.get("title"):
            raise DropItem(f"Missing title: {item}")

        # 3. Drop duplicates based on title
        if item["title"] in self.seen_titles:
            raise DropItem(f"Duplicate: {item['title']}")
        self.seen_titles.add(item["title"])

        return item

To activate the pipeline, register it in settings.py:

ITEM_PIPELINES = {
    "bookstore.pipelines.CleaningPipeline": 300,
}

The integer (300) is the priority. Lower numbers run first, so you can chain multiple pipelines: a cleaning pipeline at 300, a validation pipeline at 400, and a database-write pipeline at 500. Each pipeline receives the item, processes it, and either returns the modified item (passing it to the next pipeline) or raises DropItem to discard it entirely.
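A second pipeline in that chain could look like the sketch below. The class name PriceNormalizationPipeline, the extra currency field, and the assumed price format (an optional symbol followed by a decimal number, such as "£51.77") are illustrative choices, not part of the original project.

```python
import re

class PriceNormalizationPipeline:
    """Convert a price string such as '£51.77' into a float plus a currency symbol."""

    # Optional non-numeric prefix (the currency symbol), then a decimal number.
    PRICE_RE = re.compile(r"^\s*([^\d\s.,-]*)\s*(\d+(?:\.\d+)?)\s*$")

    def process_item(self, item, spider):
        raw = item.get("price")
        if isinstance(raw, str):
            match = self.PRICE_RE.match(raw)
            if match:
                item["currency"] = match.group(1) or None
                item["price"] = float(match.group(2))
        # Unparseable prices are left untouched for a later pipeline to judge.
        return item

# settings.py (hypothetical module path):
# ITEM_PIPELINES = {
#     "bookstore.pipelines.CleaningPipeline": 300,
#     "bookstore.pipelines.PriceNormalizationPipeline": 400,
# }
```

Because it runs at 400, this pipeline only ever sees items that the cleaning step at 300 has already stripped and deduplicated.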

Raising DropItem removes the item from the output and logs a message. This is cleaner than filtering after the fact because dropped items never reach the exporter or database. You can monitor your crawl's drop rate to identify parsing problems early.

For projects that extract data from dozens of different page types, consider defining formal Scrapy Items (or dataclass-based items) instead of plain dictionaries. Items declare a fixed set of fields, so a misspelled field name fails loudly instead of silently creating a new column, and dataclass items can also carry type hints and default values. Both work with Scrapy's Item Loaders, whose processors apply field-level transformations like MapCompose(str.strip, str.lower). This is especially valuable on team projects where multiple developers write spiders against the same data model.
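For instance, a dataclass-based item for the bookstore example might be sketched as follows. The class name BookItem and the "unknown" default are assumptions; the field names follow the ones used in the export configuration.

```python
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass
class BookItem:
    title: str                       # required: every record must have one
    price: Optional[str] = None      # optional: some pages omit the price
    availability: str = "unknown"    # default used when the page omits the field

# A spider would yield BookItem(...) instead of a plain dict.
book = BookItem(title="A Light in the Attic", price="£51.77")
print(asdict(book))
```

A misspelled keyword such as BookItem(titel="...") raises TypeError at the point of creation. Note that pipelines written for plain dicts (item["title"], item.get(...)) would need to access fields through itemadapter.ItemAdapter to handle dataclass items.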

Crawling Responsibly: Robots.txt, Rate Limits, and Ethics

A Python web crawler that ignores a site's rules will eventually get blocked and, in some jurisdictions, could create legal exposure. Responsible crawling is not just good manners; it is a practical requirement for any crawler that needs to run reliably over time.

Respecting robots.txt

By convention, a site's robots.txt file lives at its root (e.g., https://example.com/robots.txt) and tells crawlers which paths are off-limits. Some sites also use the nonstandard Crawl-delay directive to say how fast crawlers should make requests. Here is how to parse the file programmatically with Python's standard library:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("*", "https://example.com/private/data"):
    print("Allowed")
else:
    print("Blocked by robots.txt")

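# crawl_delay() returns None when robots.txt sets no Crawl-delay for this agent
crawl_delay = rp.crawl_delay("*")
if crawl_delay is not None:
    print(f"Recommended delay: {crawl_delay} seconds")
else:
    print("No crawl delay specified")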

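Scrapy checks robots.txt automatically when ROBOTSTXT_OBEY = True, which is the value set in the settings.py that scrapy startproject generates (Scrapy's built-in fallback has historically been False, so check your settings if you created the project by hand). If you are using requests and BeautifulSoup, you need to implement this check yourself before every fetch.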
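For the requests and BeautifulSoup route, one way to avoid re-downloading robots.txt on every fetch is to cache one parser per host. The sketch below is a minimal version of that idea; the class name RobotsCache and the one-second fallback delay are assumptions.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

class RobotsCache:
    """Keep one parsed robots.txt per host so it is downloaded only once."""

    def __init__(self, user_agent="*", default_delay=1.0):
        self.user_agent = user_agent
        self.default_delay = default_delay  # assumed fallback, in seconds
        self._parsers = {}

    def _parser_for(self, url):
        parts = urlsplit(url)
        root = f"{parts.scheme}://{parts.netloc}"
        if root not in self._parsers:
            parser = RobotFileParser(root + "/robots.txt")
            parser.read()  # network call; happens once per host
            self._parsers[root] = parser
        return self._parsers[root]

    def allowed(self, url):
        """True if robots.txt permits this user agent to fetch the URL."""
        return self._parser_for(url).can_fetch(self.user_agent, url)

    def delay(self, url):
        """Crawl-delay from robots.txt, or the fallback when none is set."""
        value = self._parser_for(url).crawl_delay(self.user_agent)
        return self.default_delay if value is None else value
```

In the crawl loop, you would call cache.allowed(url) before each requests.get(url) and sleep for cache.delay(url) between requests to the same host.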

Configuring Crawl Delays and Throttling

Even if robots.txt does not specify a crawl delay, hammering a server with hundreds of concurrent requests is a fast way to get IP-banned. In Scrapy, four settings control your crawl pace:

# settings.py
DOWNLOAD_DELAY = 1                      # seconds between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 8      # max parallel requests to one domain
AUTOTHROTTLE_ENABLED = True             # dynamically adjusts delay based on load
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0   # target number of parallel requests

AUTOTHROTTLE is particularly useful because it adapts automatically based on server response times. If the server responds quickly, Scrapy speeds up. If response times spike (indicating server strain), it backs off. This balances throughput with politeness without requiring you to guess the right fixed delay.
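The adjustment rule described in Scrapy's AutoThrottle documentation can be modeled in a few lines. This is a simplified sketch for building intuition, not Scrapy's actual code: the function name next_delay is made up, and the 60-second ceiling assumes AUTOTHROTTLE_MAX_DELAY is left at its documented default.

```python
def next_delay(prev_delay, latency, target_concurrency=2.0,
               min_delay=1.0, max_delay=60.0, status=200):
    """Simplified model of one AutoThrottle adjustment step.

    min_delay plays the role of DOWNLOAD_DELAY and max_delay the role of
    AUTOTHROTTLE_MAX_DELAY; all values are in seconds.
    """
    # Delay that would keep roughly `target_concurrency` requests in flight.
    target_delay = latency / target_concurrency
    # Move halfway from the previous delay toward the target.
    new_delay = (prev_delay + target_delay) / 2.0
    # Never go below the configured floor or above the ceiling.
    new_delay = min(max(min_delay, new_delay), max_delay)
    # Error responses are often fast; they must not speed the crawl up.
    if status != 200 and new_delay <= prev_delay:
        return prev_delay
    return new_delay

print(next_delay(1.0, latency=0.4))   # fast server: stays at the 1.0s floor
print(next_delay(1.0, latency=6.0))   # slow server: backs off to 2.0s
```

With DOWNLOAD_DELAY = 1 as the floor, a fast server never pushes the delay below one second, while a six-second response doubles it on the next request.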

Ethical Guidelines

Beyond the technical settings, follow these principles:

  • Identify your crawler with a descriptive User-Agent string that includes contact information.
  • Do not crawl pages behind login walls or paywalls unless you have explicit permission.
  • Cache responses locally during development so you are not hitting live servers on every test run.
  • Honor the Robots Exclusion Protocol as a baseline, even when it is not legally binding in your jurisdiction.

When you scale up, remember that websites are not built to handle hundreds of simultaneous bot requests. Overloading a server affects real users, and responsible crawling ensures you can return to the same site tomorrow without finding your IP on a blocklist.

Avoiding Blocks: User Agents, Proxies, and Anti-Bot Strategies

Even polite crawlers get blocked. Websites deploy anti-bot systems that look for patterns: repeated requests from one IP, missing or generic User-Agent headers, and request timing that no human would produce. Here is how to make your Python web crawler more resilient.

User-Agent Rotation

The default User-Agent for most HTTP libraries identifies them as bots. Sites that filter on this header will reject your requests immediately. Set a realistic browser-like header:

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/124.0.0.0 Safari/537.36"
}
response = requests.get(url, headers=headers, timeout=10)

For longer crawl sessions, rotate through a list of User-Agent strings to avoid presenting the same fingerprint on every request. In Scrapy, the scrapy-fake-useragent middleware handles this rotation automatically.
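With plain requests, rotation can be as simple as picking a header at random for each call. The strings below are illustrative examples of browser-style User-Agents, not a maintained list, and the helper name rotating_headers is an assumption.

```python
import random

# Illustrative values only; a real crawl should use a current, maintained list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
]

def rotating_headers(rng=random):
    """Return request headers with a randomly chosen User-Agent."""
    return {"User-Agent": rng.choice(USER_AGENTS)}

# Usage with requests (not executed here):
# response = requests.get(url, headers=rotating_headers(), timeout=10)
```

Passing a seeded random.Random instance as rng makes the choice reproducible in tests.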

Proxy Rotation

IP-based rate limits are the most common blocking mechanism. If all your requests come from a single address, the site sees the pattern instantly. Routing traffic through rotating proxies distributes your requests across many IPs, making each one look like an independent visitor.

Residential proxies are particularly effective because they use IP addresses assigned to real households, making them virtually indistinguishable from regular user traffic. Datacenter IPs, while faster and cheaper, are easier for anti-bot systems to fingerprint and block in bulk.
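A simple round-robin rotation can be sketched as follows. The proxy URLs are hypothetical placeholders, and ProxyRotator is an illustrative helper rather than part of any library; the returned dictionary is in the shape that the proxies argument of requests expects.

```python
from itertools import cycle

# Hypothetical proxy endpoints; replace with the ones your provider gives you.
PROXIES = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]

class ProxyRotator:
    """Hand out proxies in round-robin order."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._pool = cycle(proxies)

    def next_proxies(self):
        """Return a mapping suitable for requests' `proxies` argument."""
        proxy = next(self._pool)
        return {"http": proxy, "https": proxy}

rotator = ProxyRotator(PROXIES)

# Usage with requests (not executed here):
# response = requests.get(url, proxies=rotator.next_proxies(), timeout=10)
```

Round-robin is the simplest policy; a production rotator would likely also retire proxies that keep returning blocks.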

Recognizing When You Are Blocked

Before investing in countermeasures, learn to recognize the symptoms:

  • HTTP 403 or 429: Explicit denial or rate-limit response.
  • Redirect to a CAPTCHA page: The server wants proof you are human.
  • Empty or placeholder HTML: The page loads but contains no meaningful content, just a skeleton or a "please wait" message.
  • Sudden response time spikes: The server is intentionally slowing you down (a technique called tarpitting).
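The symptoms listed above can be turned into a rough automated check. The thresholds and marker phrases in this sketch are assumptions to tune per site, and block_reasons is a hypothetical helper name.

```python
BLOCK_STATUSES = {403, 429}
# Illustrative phrases only; real block pages vary by site and vendor.
BLOCK_MARKERS = ("captcha", "please wait", "access denied")

def block_reasons(status_code, final_url, body, elapsed_seconds,
                  min_body_chars=500, slow_seconds=10.0):
    """Return a list of reasons the response looks like a block (empty if none)."""
    reasons = []
    if status_code in BLOCK_STATUSES:
        reasons.append(f"status {status_code}")
    if "captcha" in final_url.lower():
        reasons.append("redirected to CAPTCHA")
    lowered = body.lower()
    if len(body) < min_body_chars or any(m in lowered for m in BLOCK_MARKERS):
        reasons.append("empty or placeholder HTML")
    if elapsed_seconds > slow_seconds:
        reasons.append("slow response")
    return reasons
```

With requests, the four inputs map to response.status_code, response.url, response.text, and response.elapsed.total_seconds(). Logging the returned reasons per domain makes it easier to see when a site starts pushing back.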

When you detect a block, back off before retrying. Exponential backoff (wait 1s, then 2s, then 4s, then 8s) is a reasonable default. Scrapy's retry middleware handles transient failures automatically, but persistent blocks usually require a change in strategy: different IPs, slower request rates, or a rendering layer for sites that serve content only to real browsers.
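The 1s, 2s, 4s, 8s schedule can be sketched as a small retry helper. The function names and the injectable fetch, is_blocked, and sleep callables are assumptions made so the logic can be tested without a network.

```python
import random
import time

def backoff_delays(base=1.0, factor=2.0, max_retries=4, jitter=0.0):
    """Yield base, base*factor, base*factor**2, ... plus optional random jitter."""
    for attempt in range(max_retries):
        delay = base * factor ** attempt
        if jitter:
            delay += random.uniform(0, jitter)
        yield delay

def fetch_with_backoff(fetch, url, is_blocked, sleep=time.sleep, **backoff_kwargs):
    """Call fetch(url); while the response looks blocked, wait and retry."""
    response = fetch(url)
    for delay in backoff_delays(**backoff_kwargs):
        if not is_blocked(response):
            break
        sleep(delay)
        response = fetch(url)
    return response  # may still be a blocked response once retries run out

print(list(backoff_delays()))
```

A small jitter value helps when several workers back off at once, because it keeps them from retrying in lockstep.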

Anti-bot defenses are an arms race. Once JavaScript challenges, browser fingerprinting, and CAPTCHAs are involved, simple crawling scripts hit a wall. The pragmatic choice is often to offload that complexity to a purpose-built service rather than maintaining your own proxy pool and browser automation infrastructure.

Handling JavaScript-Rendered Pages

A growing number of websites rely on client-side JavaScript to render their content. When you fetch one of these pages with requests, you get an HTML shell with empty <div> containers and a bundle of JavaScript. The actual data loads after the scripts execute in a browser environment, which means traditional HTTP-based crawlers see nothing useful.

You have three main options for dealing with this when building a web crawler in Python:

1. Find the underlying API. Before reaching for a headless browser, open your browser's developer tools and check the Network tab. Many single-page applications fetch data from a JSON API that you can call directly, bypassing the rendering problem entirely. This is the fastest and most resource-efficient approach when it works.

2. Use a headless browser. Tools like Playwright and Puppeteer let you control a real (headless) Chrome or Firefox instance from your code. The browser executes JavaScript, waits for content to render, and then you extract data from the fully loaded DOM. Scrapy integrates with Playwright through the scrapy-playwright plugin, which lets you selectively mark specific requests for browser rendering while keeping the rest as fast, lightweight HTTP calls:

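from scrapy_playwright.page import PageMethod

# In a Scrapy spider, mark a request for Playwright rendering.
# Page actions are passed as PageMethod objects: the method name first,
# then its arguments.
yield scrapy.Request(
    url,
    meta={"playwright": True, "playwright_page_methods": [
        PageMethod("wait_for_selector", ".product-list"),
    ]},
    callback=self.parse_products,
)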

3. Use a managed rendering service. If you do not want to run and maintain headless browser infrastructure (which consumes significant memory and CPU), managed API services can handle the rendering for you. They return the fully loaded HTML so you can parse it with your existing BeautifulSoup or Scrapy selectors.

The right choice depends on volume and complexity. For a few hundred JS-heavy pages, a local headless browser is perfectly manageable. For thousands of pages across sites with anti-bot protections, the operational overhead of managing browser instances, handling memory leaks, and recovering from crashes adds up quickly.
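When the first option works, the crawler often reduces to a pagination loop over JSON. The sketch below assumes a hypothetical API that accepts page and per_page query parameters and returns its records under an "items" key; real endpoints differ, so check the Network tab for the actual shape.

```python
def crawl_json_api(get_json, base_url, page_size=50, max_pages=100):
    """Yield items from a paginated JSON API until a page comes back empty."""
    for page in range(1, max_pages + 1):
        # Parameter names and the "items" key are assumptions for illustration.
        payload = get_json(base_url, {"page": page, "per_page": page_size})
        items = payload.get("items", [])
        if not items:
            break
        yield from items

def requests_get_json(url, params):
    """Fetch one page with requests and decode the JSON body."""
    import requests  # imported lazily so the sketch loads without requests installed
    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()
    return response.json()

# Usage (hypothetical URL, not executed here):
# for product in crawl_json_api(requests_get_json, "https://example.com/api/products"):
#     print(product)
```

Keeping the HTTP call behind get_json means the pagination logic can be tested with canned responses before it ever touches the live site.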

Simplifying Complex Crawls with a Managed API

At some point, the hardest part of building a Python web crawler stops being the parsing logic and starts being everything else: maintaining proxy pools, solving CAPTCHAs, rotating browser fingerprints, and keeping up with anti-bot systems that update their defenses weekly. When the infrastructure burden outweighs the extraction work, it makes sense to offload that layer and focus on what your code actually cares about: the data.

A managed API service sits between your crawler and the target website. You send a request with the target URL, and the service handles proxy rotation, JavaScript rendering, retries, and anti-bot countermeasures behind the scenes. What comes back is clean HTML (or structured JSON) that you parse with the same BeautifulSoup or Scrapy code you already have. Your crawling logic does not change; only the fetch layer does.

This approach is especially practical when:

  • You are crawling sites with aggressive bot detection that blocks datacenter IPs within minutes.
  • You need JavaScript rendering at scale but do not want to manage a fleet of headless browser instances and their associated memory and CPU costs.
  • Your team's engineering time is better spent on data analysis and pipeline development than on proxy infrastructure maintenance.
  • You need to crawl across many different target sites, each with its own anti-bot stack, making a one-size-fits-all local solution impractical.

The trade-off is cost. You are paying per successful request instead of running your own infrastructure. For high-volume, long-running crawls where the targets are not heavily defended, the economics tilt back toward self-managed setups. For most projects that involve bot-protected sites, though, the developer time saved more than offsets the per-request fee.
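To show how little the crawling code has to change, the sketch below swaps only the step that decides which URL gets requested. The endpoint, parameter names, and API key are placeholders that do not describe any specific provider; consult your provider's documentation for the real interface.

```python
from urllib.parse import urlencode

# Placeholder endpoint for illustration; not a real service.
MANAGED_ENDPOINT = "https://api.managed-scraper.example/v1"

def direct_request(url):
    """Fetch layer A: request the target URL itself."""
    return url

def managed_request(url, api_key="YOUR_API_KEY", render_js=True):
    """Fetch layer B: wrap the target URL in a call to the managed service."""
    params = {"api_key": api_key, "url": url, "render_js": int(render_js)}
    return f"{MANAGED_ENDPOINT}?{urlencode(params)}"

def crawl(urls, build_request, fetch):
    """Same crawl loop either way; only build_request differs."""
    return {url: fetch(build_request(url)) for url in urls}

# Usage (not executed here), where fetch could be
# lambda u: requests.get(u, timeout=30).text:
# pages = crawl(seed_urls, managed_request, fetch)
```

Because both builders take a target URL and return the URL to request, the parsing code downstream never needs to know which one was used.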

Key Takeaways

  • Start simple, then scale intentionally. A basic requests + BeautifulSoup crawler is enough for small jobs and learning. Move to Scrapy when you need concurrency, automatic deduplication, and structured data pipelines.
  • Deduplication is non-negotiable. Use a seen-URL set with proper normalization (or let Scrapy's scheduler handle it) to prevent infinite loops and wasted bandwidth.
  • Crawl responsibly every time. Respect robots.txt, configure crawl delays, use AUTOTHROTTLE, and identify your bot with a descriptive User-Agent. This protects both the target site and your own IP reputation.
  • Handle JavaScript intentionally. Check for underlying APIs first, use headless browsers when necessary, and consider managed services when you need JS rendering at scale.
  • Clean data during the crawl, not after. Scrapy's item pipelines let you validate, deduplicate, and transform records before they ever reach your export file or database.

FAQ

What is the difference between web crawling and web scraping?

Crawling is the discovery process: an automated program follows hyperlinks across pages to map a site's structure and find URLs. Scraping is the extraction step: pulling specific data fields (prices, titles, dates) from pages that have already been located. Most real-world projects combine both, but they solve different problems and often benefit from different tools and strategies.

Is web crawling legal?

It depends on the jurisdiction, the website's terms of service, and the type of data you collect. In the United States, the Ninth Circuit's 2022 ruling in hiQ Labs v. LinkedIn held that scraping publicly accessible data likely does not violate the Computer Fraud and Abuse Act. However, terms-of-service restrictions, copyright law, and privacy regulations like GDPR may still apply. Always consult legal counsel before crawling at scale, especially for commercial use.

How do I handle JavaScript-heavy websites when crawling?

Check for an underlying API first by inspecting the browser's Network tab for XHR/Fetch requests. If the data is only available after client-side rendering, use a headless browser like Playwright or Puppeteer to execute JavaScript and extract the fully rendered DOM. For high-volume JS crawling, a managed rendering service can handle browser orchestration so you do not need to maintain that infrastructure yourself.

How can I prevent my Python crawler from getting blocked?

Rotate User-Agent strings, use residential proxies to distribute requests across multiple IPs, add random delays between requests, and respect robots.txt crawl-delay directives. Monitor response codes closely: a spike in 403 or 429 responses means the site has detected your traffic pattern. Backing off and reducing concurrency is almost always more effective than trying to push through blocks with brute force.

When should I use Scrapy instead of requests and BeautifulSoup?

Use Scrapy when your crawl involves more than a few hundred pages, needs concurrent requests, requires built-in deduplication, or benefits from structured data pipelines and export. For quick, one-off scripts targeting a small number of pages, requests and BeautifulSoup are faster to set up and simpler to debug. If your project grows beyond a single script file, Scrapy's architecture will save you from reinventing its features.

Conclusion

Building a Python web crawler is a progression, not a single step. You start with a handful of lines using requests and BeautifulSoup to understand the fetch-parse-extract loop. From there, you move to Scrapy to get concurrency, automatic deduplication, selector flexibility, and pipeline-based data cleaning without writing that plumbing yourself.

The fundamentals stay the same regardless of scale: respect the sites you crawl, deduplicate aggressively, handle errors gracefully, and clean your data before storing it. When the target sites fight back with CAPTCHAs, IP blocks, or JavaScript-only rendering, you have a clear decision tree: check for an underlying API first, use headless browsers for moderate volumes, and lean on managed services for heavy-duty work.

If you find yourself spending more time on proxy rotation, CAPTCHA solving, and anti-bot workarounds than on actual data processing, WebScrapingAPI can handle that infrastructure layer for you. It manages proxies, JavaScript rendering, and retries behind a single endpoint, so your Scrapy spiders or BeautifulSoup scripts keep working with minimal code changes. That way, you stay focused on what the data is telling you, not on how to get it through the door.

About the Author
Suciu Dan, Co-founder @ WebScrapingAPI

Suciu Dan is the co-founder of WebScrapingAPI and writes practical, developer-focused guides on Python web scraping, Ruby web scraping, and proxy infrastructure.
