Ștefan Răcilă · Last updated on Apr 29, 2026 · 11 min read

Scrapy Splash Tutorial: Render JavaScript Pages

TL;DR: Scrapy Splash pairs Scrapy's fast crawling engine with the Splash headless browser to render JavaScript-heavy pages. This scrapy splash tutorial walks you through Docker setup, Scrapy project configuration, SplashRequest basics, Lua scripts for scrolling and clicking, proxy integration, and fixing the most common errors you will encounter.

Scrapy is one of the most efficient web crawling frameworks in the Python ecosystem, but it has a well-known blind spot: it cannot execute JavaScript. Content that a site loads through client-side rendering, AJAX calls, or a single-page application framework never appears in the HTML a vanilla Scrapy spider receives. This is exactly the problem Scrapy Splash solves.

Scrapy Splash is an integration layer between Scrapy and the Splash headless browser. Splash is a lightweight, Qt-based rendering service developed by Zyte (the same team behind Scrapy) that exposes an HTTP API. Instead of running a full desktop browser, Splash loads a page in a stripped-down WebKit engine, executes the JavaScript, and returns fully rendered HTML back to your spider. Your parse methods keep working with standard CSS and XPath selectors as if nothing changed.

In this guide you will set up Docker and Splash from scratch, configure your Scrapy project, write spiders that render dynamic pages, create Lua scripts for advanced interactions, wire up proxies, and troubleshoot the errors that trip up most newcomers.

What Is Scrapy Splash and When Should You Use It?

Splash is a headless browser with an HTTP API that renders JavaScript-loaded web pages. Unlike full-weight browsers, Splash is designed to be lightweight: it spins up inside a Docker container, listens on a port, and returns rendered HTML (or PNG screenshots, or HAR logs) over HTTP. Scrapy Splash is the official library connecting Scrapy to this rendering service.

Reach for Scrapy Splash when your target site loads critical content via JavaScript or AJAX and you need Scrapy's built-in pipeline, middleware, and crawl-management features. Zyte built Splash specifically for scraping workflows, and it plugs into Scrapy's request/response lifecycle without friction. That said, Splash uses an older WebKit engine, so its JavaScript support is not as modern as Chromium-based alternatives. If your target relies on cutting-edge browser APIs, evaluate headless browser tools with a Chromium backend.

For straightforward JS rendering at scale (product pages, directory listings, paginated results), Splash remains a practical and resource-efficient option.

Prerequisites and Environment Setup

Before diving into this scrapy splash tutorial, confirm you have the following:

  • Python 3.7+ installed on your system
  • A Scrapy project (or the intention to create one)
  • Docker Desktop (or the Docker engine on Linux) for running the Splash container
  • A terminal for running Docker and Scrapy CLI commands

That is the full list. The next section covers Docker installation.

Installing Docker and Running the Splash Container

Splash runs inside a Docker container, so Docker needs to be installed first. Grab Docker Desktop for macOS or Windows, or install the Docker engine directly on Linux. Once Docker is running, pull the official Splash image:

docker pull scrapinghub/splash

Then start the container:

docker run -it -p 8050:8050 --rm scrapinghub/splash

This maps port 8050 on the container to port 8050 on your host. The --rm flag removes the container when you stop it.

Open http://localhost:8050/ in your browser. If you see the default Splash interactive page, the service is up. Test a URL right there to confirm rendering works before writing any Scrapy code.
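The same smoke test can be scripted against the Splash HTTP API. The sketch below uses only the standard library and assumes Splash is listening on localhost:8050 as started above; build_render_url and fetch_rendered_html are illustrative helper names, not part of Splash or scrapy-splash.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

SPLASH_URL = "http://localhost:8050"  # assumption: the container started above


def build_render_url(target_url, wait=2.0, splash_url=SPLASH_URL):
    """Return a render.html URL that asks Splash to render target_url."""
    query = urlencode({"url": target_url, "wait": wait})
    return f"{splash_url}/render.html?{query}"


def fetch_rendered_html(target_url, wait=2.0, timeout=30):
    """Fetch rendered HTML from Splash. Requires the container to be running."""
    with urlopen(build_render_url(target_url, wait), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling fetch_rendered_html("http://quotes.toscrape.com/js/") with the container up should return markup that already contains the quote elements; a connection error at this stage points at Docker rather than at Scrapy.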

For a production-oriented Scrapy Splash Docker setup, set resource limits. By default a render times out after 30 seconds, and Splash rejects any requested timeout above roughly 90 seconds; pass --max-timeout to Splash (after the image name in the docker run command) to raise that ceiling. Verify the exact values against the current Splash documentation, since defaults can change between releases. Cap memory with Docker's --memory flag so runaway pages cannot exhaust your host.
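If you launch the container from a script, the ordering of the two flags matters: --memory is a Docker option and goes before the image name, while --max-timeout is a Splash option and goes after it. A minimal sketch follows; the 2g and 300 values are placeholder assumptions, and splash_docker_command is a hypothetical helper.

```python
import shlex


def splash_docker_command(memory="2g", max_timeout=300, port=8050):
    """Build a `docker run` argv for Splash with a memory cap and a raised timeout ceiling."""
    return [
        "docker", "run", "--rm",
        "-p", f"{port}:8050",
        "--memory", memory,                 # Docker option: before the image name
        "scrapinghub/splash",
        "--max-timeout", str(max_timeout),  # Splash option: after the image name
    ]


command = splash_docker_command()
print(shlex.join(command))
```

Pass the list to subprocess.run(command) to start the container, or copy the printed line into a terminal.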

Configuring Your Scrapy Project for Splash

If you do not have a project yet, create one:

scrapy startproject myproject

Install the scrapy-splash plugin:

pip install scrapy scrapy-splash

Open settings.py and add these entries:

SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

SPLASH_URL tells scrapy-splash where the rendering service lives. SplashCookiesMiddleware handles cookie forwarding between Scrapy and Splash. SplashMiddleware intercepts SplashRequest objects and routes them through the Splash HTTP API. SplashDeduplicateArgsMiddleware avoids storing and resending large repeated arguments, such as Lua scripts, with every request. The custom DUPEFILTER_CLASS ensures Scrapy's duplicate-request filter accounts for Splash-specific arguments, so requests to the same URL with different rendering parameters are not filtered out as duplicates.

With these settings in place, your project is ready for the spiders built in the following sections.
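A missed entry in settings.py usually surfaces later as unrendered HTML rather than a clear error, so a quick check can help. The sketch below inspects a plain dict of settings for the entries listed above; missing_splash_settings is a hypothetical helper, not something scrapy-splash ships.

```python
REQUIRED_DOWNLOADER_MIDDLEWARES = (
    "scrapy_splash.SplashCookiesMiddleware",
    "scrapy_splash.SplashMiddleware",
)


def missing_splash_settings(settings):
    """Return the names of Splash-related settings absent from a settings dict."""
    missing = []
    if not settings.get("SPLASH_URL"):
        missing.append("SPLASH_URL")
    middlewares = settings.get("DOWNLOADER_MIDDLEWARES", {})
    for path in REQUIRED_DOWNLOADER_MIDDLEWARES:
        if path not in middlewares:
            missing.append(path)
    if settings.get("DUPEFILTER_CLASS") != "scrapy_splash.SplashAwareDupeFilter":
        missing.append("DUPEFILTER_CLASS")
    return missing
```

An empty list means the Splash entries shown above are all present; anything else names what still needs to be added.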

Your First Scrapy Splash Tutorial Spider: SplashRequest in Action

Generate a spider skeleton:

scrapy genspider quotes_js quotes.toscrape.com

Replace the generated start_urls list with a start_requests method. Requests built from start_urls are plain scrapy.Request objects, which go straight to the target site and bypass Splash:

import scrapy
from scrapy_splash import SplashRequest

class QuotesJsSpider(scrapy.Spider):
    name = 'quotes_js'

    def start_requests(self):
        yield SplashRequest(
            url='http://quotes.toscrape.com/js/',
            callback=self.parse,
            args={'wait': 2}
        )

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }

The key difference from a standard scrapy.Request is that SplashRequest sends the URL to Splash first. Splash renders the page, waits the specified seconds for JavaScript to execute, and returns fully rendered HTML. Inside parse, you work with the response exactly as you would normally: CSS selectors, XPath, everything works because Splash responses carry all the standard response properties.

Run it with scrapy crawl quotes_js and you should see rendered quote data in your output.

Controlling Page Rendering with SplashRequest Arguments

SplashRequest accepts several arguments that control how Splash renders the page:

| Argument | Type | Purpose |
| --- | --- | --- |
| wait | float | Seconds to wait after the page loads before returning HTML |
| timeout | float | Max render time in seconds. Default 30; values above roughly 90 are rejected unless Splash is started with a higher --max-timeout |
| images | int (0/1) | Set to 0 to disable image loading, speeding up renders |
| resource_timeout | float | Timeout per individual resource (CSS, JS file, image) |
| http_method | string | HTTP method Splash uses for the target page; use POST for form submissions |
| body | string | POST body content, paired with http_method='POST' |

For example, to send a POST request (useful for form submissions):

yield SplashRequest(
    url='https://example.com/search',
    args={
        'wait': 1,
        'http_method': 'POST',
        'body': 'query=scrapy+splash',
    },
    callback=self.parse_results,
)

The http_method and body arguments are handy for sites that process search forms or login actions server-side. This covers the scrapy splash javascript rendering basics, but for page interaction (clicking, scrolling, waiting on dynamic elements), you need Lua scripts.
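Hand-writing the body string gets error-prone once a form has several fields or values with spaces. As a sketch, the standard library can do the URL encoding; form_post_args and the field names are illustrative assumptions, and some sites may additionally expect a Content-Type header, which is not shown here.

```python
from urllib.parse import urlencode


def form_post_args(fields, wait=1.0):
    """Build SplashRequest `args` that submit `fields` as a URL-encoded POST body."""
    return {
        "wait": wait,
        "http_method": "POST",
        "body": urlencode(fields),  # spaces become '+', reserved characters are escaped
    }


args = form_post_args({"query": "scrapy splash", "page": 2})
```

The resulting dict can be passed as args=args in the SplashRequest shown above.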

Writing Lua Scripts for Advanced Interactions

The render.html endpoint handles simple cases, but once you need to interact with a page, you move to the execute endpoint with a Lua script. A scrapy splash lua script gives you step-by-step control over the browser:

function main(splash, args)
  splash:go(args.url)
  splash:wait(1)
  return splash:html()
end

Send this via SplashRequest using endpoint='execute' and pass the script in args={'lua_source': script}. From here, layer in element waits, scrolling loops, and click actions.
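On the Python side, the pieces described above fit together roughly as follows. This is a sketch: execute_request_kwargs is a hypothetical helper, and the script reads its wait time from args.wait, relying on Splash exposing extra request arguments to the Lua script through args.

```python
LUA_RENDER_SCRIPT = """
function main(splash, args)
  splash:go(args.url)
  splash:wait(args.wait)
  return splash:html()
end
"""


def execute_request_kwargs(url, lua_source, **script_args):
    """Keyword arguments for a SplashRequest aimed at the execute endpoint."""
    return {
        "url": url,
        "endpoint": "execute",
        # Extra keys travel alongside lua_source and show up as args.<name> in Lua.
        "args": {"lua_source": lua_source, **script_args},
    }


kwargs = execute_request_kwargs(
    "http://quotes.toscrape.com/js/", LUA_RENDER_SCRIPT, wait=1
)
```

Inside a spider this becomes yield SplashRequest(callback=self.parse, **kwargs). Keeping the script in a module-level constant also makes it easy to reuse across requests.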

Waiting for Specific Elements to Load

A fixed wait works when you know the page's load time, but it is brittle. Poll for a specific DOM element instead:

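function main(splash, args)
  assert(splash:go(args.url))
  -- Poll for the element, but give up after 20 tries (about 10 s)
  -- instead of looping until Splash's own timeout kills the script.
  local tries = 0
  while not splash:select('.target-element') do
    tries = tries + 1
    if tries > 20 then
      error("'.target-element' never appeared")
    end
    splash:wait(0.5)
  end
  splash:wait(0.5)
  return splash:html()
end

This script loops until splash:select() finds an element matching .target-element, waiting half a second between retries and raising an error if the element has not shown up after 20 attempts. Without that cap, a selector that never matches would keep the script spinning until Splash's render timeout. Once the element appears, one final brief wait handles remaining rendering. This pattern is far more reliable than guessing a static delay.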

Scrolling Through Infinite-Scroll Pages

The simplest way to scroll in Splash is to inject JavaScript that moves the scroll position (the splash.scroll_position property is an alternative when you already know the target offset). Here is a Lua script for Scrapy Splash infinite scroll:

function main(splash, args)
  splash:go(args.url)
  splash:wait(2)
  local scroll_count = 5
  for i = 1, scroll_count do
    splash:runjs("window.scrollTo(0, document.body.scrollHeight)")
    splash:wait(2)
  end
  return splash:html()
end

The script scrolls to the bottom, waits for new content, and repeats. Adjust scroll_count and the wait duration to match the site. Compare document.body.scrollHeight before and after each scroll to detect when no new content appears.
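The height comparison mentioned above could look something like this. It is a sketch, not a drop-in solution: the Lua script is kept in a Python constant so it can be passed as lua_source, and scroll_args, max_scrolls, and scroll_wait are names invented for this example.

```python
SCROLL_UNTIL_STABLE = """
function main(splash, args)
  assert(splash:go(args.url))
  splash:wait(args.scroll_wait)
  local last_height = splash:evaljs("document.body.scrollHeight")
  for i = 1, args.max_scrolls do
    splash:runjs("window.scrollTo(0, document.body.scrollHeight)")
    splash:wait(args.scroll_wait)
    local height = splash:evaljs("document.body.scrollHeight")
    if height == last_height then
      break  -- page did not grow, so assume no more content
    end
    last_height = height
  end
  return splash:html()
end
"""


def scroll_args(max_scrolls=10, scroll_wait=2.0):
    """Build `args` for an execute-endpoint request that scrolls until the page stops growing."""
    if max_scrolls < 1:
        raise ValueError("max_scrolls must be at least 1")
    return {
        "lua_source": SCROLL_UNTIL_STABLE,
        "max_scrolls": max_scrolls,   # upper bound, in case the page never stops growing
        "scroll_wait": scroll_wait,   # seconds to let new content load after each scroll
    }
```

Use it with SplashRequest(url, callback=self.parse, endpoint='execute', args=scroll_args(5, 1.5)). The max_scrolls bound still matters, because a feed that keeps loading content would otherwise run into the render timeout.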

Clicking Buttons and Navigating Pages

"Load More" buttons and pagination links require mouse interaction. Use splash:select() to find the element and trigger a click:

function main(splash, args)
  splash:go(args.url)
  splash:wait(2)
  local btn = splash:select('.load-more-btn')
  if btn then
    btn:mouse_click()
    splash:wait(2)
  end
  return splash:html()
end

Wrap this in a loop for pages with multiple triggers. For pagination, select the "Next" link, click it, wait for the new page, and collect HTML at each step.

Running Custom JavaScript Inside Splash

Sometimes you do not need a full Lua workflow. Splash lets you inject arbitrary JavaScript with two methods: splash:evaljs() (returns a value) and splash:runjs() (executes without returning).

function main(splash, args)
  splash:go(args.url)
  splash:wait(1)
  local title = splash:evaljs("document.title")
  splash:runjs("document.querySelector('.popup-close').click()")
  splash:wait(0.5)
  return {html = splash:html(), title = title}
end

This is useful for dismissing cookie banners, closing modals, or extracting a computed value before grabbing the page HTML. You can also pass JavaScript through the js_source parameter on a standard SplashRequest (no Lua required), which executes the JS after the page loads but before the HTML snapshot is taken.
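For the js_source route, a small sketch of the request arguments follows. The .popup-close selector is the same placeholder used above, js_render_args is a hypothetical helper, and the snippet checks for a missing element first, since calling click() on a null result would throw.

```python
DISMISS_POPUP_JS = """
var btn = document.querySelector('.popup-close');
if (btn) { btn.click(); }
"""


def js_render_args(js_source, wait=1.0):
    """Build `args` for a render.html request that runs js_source in the page."""
    return {"wait": wait, "js_source": js_source}


args = js_render_args(DISMISS_POPUP_JS)
```

Pass the dict as args=args on a SplashRequest that uses the default render.html endpoint.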

Using Proxies with Scrapy Splash

Rotating your IP address helps prevent blocks at any meaningful scraping volume. Route requests through a scrapy splash proxy by passing details in SplashRequest arguments:

yield SplashRequest(
    url='https://example.com',
    callback=self.parse,
    args={
        'wait': 2,
        'proxy': 'http://user:pass@proxyhost:port',
    },
)

You can also configure the proxy inside a Lua script using splash:on_request():

function main(splash, args)
  splash:on_request(function(request)
    request:set_proxy{
      host = "proxyhost",
      port = 8080,
      username = "user",
      password = "pass",
    }
  end)
  splash:go(args.url)
  splash:wait(2)
  return splash:html()
end

The Lua approach lets you apply different proxies to different sub-requests within the same page load. Keep in mind that Splash itself does not bypass anti-bot systems; it only renders the page. You still need properly rotated residential or datacenter proxies to avoid IP-level blocks.
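Rotation itself happens on the Scrapy side: each SplashRequest simply receives a different proxy value. A minimal round-robin sketch, assuming a static list of proxy URLs (the hosts and credentials below are placeholders, and proxied_args is a hypothetical helper):

```python
from itertools import cycle

# Placeholder proxy URLs; substitute the ones from your provider.
PROXIES = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
]


def proxied_args(proxy_pool, wait=2.0):
    """Yield SplashRequest `args` dicts, cycling through proxy_pool in order."""
    for proxy in cycle(proxy_pool):
        yield {"wait": wait, "proxy": proxy}


args_source = proxied_args(PROXIES)
first, second, third = (next(args_source) for _ in range(3))
```

In a spider, call next(args_source) once per request and pass the result as args. Round-robin is the simplest policy; providers that rotate IPs behind a single gateway address make this unnecessary.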

Common Errors and Troubleshooting

This is where most scrapy splash tutorial guides leave you hanging. Here are the errors you will hit most often:

Connection refused on localhost:8050. The Splash Docker container is not running. Verify with docker ps. If it is running but unreachable, check that port 8050 is not blocked by your firewall or occupied by another process.

504 Gateway Timeout. The page took longer to render than the allowed timeout. Increase the timeout argument in your SplashRequest (the default is 30 seconds). Splash rejects values above its ceiling, which is approximately 90 seconds by default. For longer renders, start the container with a higher --max-timeout value (verify against current Splash docs, as specifics may vary between releases).

Lua script errors ("bad argument," "attempt to index a nil value"). These usually mean splash:select() returned nil because the element was not in the DOM yet. Add a wait or polling loop before interacting with it.

Docker container killed (OOM). Splash can consume significant memory on heavy pages. Set Docker memory limits with --memory 2g and disable image loading (images=0). For multiple instances, use Docker Compose with per-container resource constraints.

Blank or incomplete HTML returned. The page's JavaScript may need more time. Increase wait. If third-party resources are slow, set resource_timeout to skip them.

Scrapy Splash vs Scrapy-Playwright vs Selenium

Choosing the right rendering tool depends on your project. Here is how Scrapy Splash compares with its two most common alternatives:

| Feature | Scrapy Splash | Scrapy-Playwright | Selenium |
| --- | --- | --- | --- |
| Browser engine | WebKit (Qt-based) | Chromium, Firefox, WebKit | Chrome, Firefox, Edge |
| Scrapy integration | Native (scrapy-splash) | Native (scrapy-playwright) | Requires custom middleware |
| Async support | Limited (HTTP API) | Full async (built on Playwright) | Sync by default |
| Resource usage | Low to moderate | Moderate | High |
| Modern JS support | Partial (older WebKit) | Full (Chromium) | Full |
| Anti-bot bypass | None built-in | None built-in | None built-in |
| Best for | Lightweight JS rendering at scale | Complex SPAs, modern JS sites | Non-Scrapy projects, testing |

Splash is the right pick when you want minimal overhead and your target pages do not rely on bleeding-edge browser APIs. For modern single-page applications, Scrapy-Playwright with its Chromium backend is likely a better fit. Selenium works but lacks native Scrapy integration. None of these tools handle anti-bot protection on their own, so you will still need a proxy layer for production scraping. Use this scrapy splash tutorial as your foundation, and branch out to alternatives when the project demands it.

Key Takeaways

  • Splash runs in Docker and connects to Scrapy via an HTTP API. Once the container is on port 8050 and settings.py is configured, your spiders can render JavaScript pages with a single SplashRequest call.
  • Use Lua scripts when you need interaction. Fixed waits cover simple cases, but polling for elements, scrolling loops, and click actions require the execute endpoint with a Lua script.
  • Proxies are essential for production scraping. Splash renders pages but does not bypass anti-bot protections. Route requests through rotating proxies using SplashRequest arguments or splash:on_request() in Lua.
  • Splash is lightweight but aging. It integrates cleanly with Scrapy, but its WebKit engine lacks support for some modern JavaScript APIs. Evaluate Scrapy-Playwright for sites that need a Chromium backend.
  • Troubleshoot systematically. Most Splash issues boil down to timeouts, missing elements, or Docker resource limits.

FAQ

Can Scrapy Splash handle single-page applications built with React or Vue?

It can render many React and Vue apps, but results depend on which JavaScript APIs the app uses. Splash runs on an older WebKit engine, so apps relying on modern browser features (like IntersectionObserver or ES2020+ syntax) may not render correctly. Test your target URL in the Splash web interface at localhost:8050 before building a full spider.

How much memory does a Splash Docker container need for production scraping?

Plan for at least 1 to 2 GB per instance for typical workloads. Pages with heavy images or complex JavaScript can push memory higher. Disable image loading with images=0 to reduce consumption, and set Docker's --memory flag to prevent a single container from exhausting host resources.

Is Scrapy Splash still maintained, and what are the active alternatives?

Splash receives infrequent updates and is no longer under active feature development. It still works for many use cases, but the community has largely shifted toward Scrapy-Playwright for new projects. Selenium remains an option outside the Scrapy ecosystem. Each tool has trade-offs around browser engine support, async capabilities, and resource usage.

How do you pass cookies or custom headers through SplashRequest?

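Include a cookies key in the SplashRequest args dictionary, and set headers with the headers argument as you would on a regular scrapy.Request. In a Lua script, use splash:set_custom_headers() before calling splash:go(). SplashCookiesMiddleware can forward cookies from Scrapy's cookie jar for you, but per the scrapy-splash README this session handling is built around the execute endpoint: the Lua script needs to load the incoming cookies with splash:init_cookies(splash.args.cookies) and return a cookies field so that updates flow back to Scrapy.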

Conclusion

This scrapy splash tutorial has walked you through the complete workflow: standing up a Splash container, configuring your Scrapy project, writing spiders with SplashRequest, authoring Lua scripts for scrolling and clicking, and wiring up proxies. The troubleshooting patterns covered here should save you hours of debugging.

Splash handles straightforward rendering tasks well, but it is older technology at this point. If your targets push the limits of modern JavaScript, evaluate Chromium-based alternatives. Regardless of which rendering tool you choose, the real bottleneck in production scraping is rarely the browser; it is getting past anti-bot defenses and managing proxy infrastructure at scale.

If you would rather skip the infrastructure headaches entirely, WebScrapingAPI handles proxy rotation, CAPTCHA solving, and JavaScript rendering behind a single API endpoint, so you can focus on parsing data instead of fighting the plumbing.

About the Author
Ștefan Răcilă, Full Stack Developer @ WebScrapingAPI

Ștefan Răcilă is a DevOps and Full Stack Engineer at WebScrapingAPI, building product features and maintaining the infrastructure that keeps the platform reliable.
