Splash is a lightweight, headless browser designed specifically for web scraping. It is based on the WebKit engine, which is the same engine that powers the Safari browser. The great thing about Splash is that it's easy to configure, especially if you use Docker. It is also integrated with Scrapy through the scrapy-splash middleware.
In order to use the middleware, you’ll first need to install this package with pip:
$ pip install scrapy-splash
Setting up Splash with Docker is easy. All you need to do is run an instance of Splash on your local machine using Docker (https://docs.docker.com/get-docker/).
$ docker run -p 8050:8050 scrapinghub/splash
After that, you should be able to access the local Splash instance at http://localhost:8050/
Splash has a REST API that makes it easy to use with Scrapy or any other web scraping tool. You can test the server by making a fetch request inside the Scrapy shell:
fetch('http://localhost:8050/render.html?url=<target_url>')
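If the target URL contains its own query string, it needs to be percent-encoded before it is passed as the url parameter, otherwise Splash may read part of it as its own arguments. The following is a minimal sketch of building such a request URL with the standard library; the helper name splash_render_url and the example target are hypothetical, and it assumes the local Splash instance started above.

```python
from urllib.parse import urlencode

# Assumption: Splash is running locally on the port mapped by the docker command.
SPLASH_URL = "http://localhost:8050"


def splash_render_url(target_url, wait=0.5, splash_url=SPLASH_URL):
    """Return a render.html URL that asks Splash to render target_url.

    `wait` is the number of seconds Splash waits after the page loads.
    """
    # urlencode percent-encodes the target URL so its own '?' and '&'
    # characters are not mistaken for Splash arguments.
    query = urlencode({"url": target_url, "wait": wait})
    return f"{splash_url}/render.html?{query}"


if __name__ == "__main__":
    # Hypothetical target page, used only for illustration.
    print(splash_render_url("https://example.com/search?q=books&page=2", wait=2))
```

The resulting string can be passed to fetch() in the Scrapy shell in place of the hand-written URL.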
To configure the middleware, add the following lines to your settings.py file:
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
Visit https://github.com/scrapy-plugins/scrapy-splash to learn more about each setting.
The easiest way to render requests with Splash is to use scrapy_splash.SplashRequest inside your spider:
import scrapy
from scrapy_splash import SplashRequest


class RandomSpider(scrapy.Spider):
    name = 'random_spider'

    def start_requests(self):
        start_urls = [
            '<first_url>',
            '<second_url>',
        ]
        for url in start_urls:
            # Render each page through Splash, waiting 5 seconds after it loads
            yield SplashRequest(url=url, callback=self.parse, args={'wait': 5})

    def parse(self, response):
        # Scrapy callbacks must yield items (such as dicts) or requests, not bare lists
        yield {'headings': response.css('h3::text').getall()}
The 'wait' argument specifies how long, in seconds, Splash waits after the page has loaded before returning the rendered result. This gives JavaScript on the page time to run.
One potential drawback of Splash is that interacting with a page, such as clicking buttons, filling out forms, or navigating between pages, requires writing scripts in Lua.
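To give a sense of what such a script looks like, here is a minimal sketch of a Lua script that loads a page, clicks the first element matching a CSS selector, and returns the resulting HTML. It relies on Splash's execute endpoint and the lua_source argument; the helper click_args, the selector and wait arguments, and the button.load-more selector are assumptions made for illustration, not part of the original setup.

```python
# Lua script for Splash's "execute" endpoint. Every key of the request's
# args dict is available on the Lua side as args.<name>.
CLICK_SCRIPT = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(args.wait))
  -- args.selector is a hypothetical argument supplied by click_args below
  local element = splash:select(args.selector)
  if element then
    element:mouse_click()
    assert(splash:wait(args.wait))
  end
  return {html = splash:html()}
end
"""


def click_args(selector, wait=2):
    """Build the args dict for a SplashRequest that clicks `selector`."""
    return {"lua_source": CLICK_SCRIPT, "selector": selector, "wait": wait}


# Inside a spider (requires scrapy-splash), the args would be used like this:
#   yield SplashRequest(url, callback=self.parse, endpoint='execute',
#                       args=click_args('button.load-more'))
```

scrapy-splash adds the request URL to the arguments itself, which is why the script can read args.url without it appearing in click_args. Because the script returns a table, the rendered page should be available in the callback as response.data['html'].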