How to Run a Headless Browser with Python for Web Scraping: Tips and Tricks
Mihnea-Octavian Manolache on Feb 03 2023
Using a Python headless browser with Selenium is almost the norm in web scraping. But what exactly is a headless browser? What is the best headless browser for Selenium? And why even use a headless browser in Python when you have `requests`? There are a lot of questions around this topic, which means we have a lot of answers to discover together. But before digging deeper, let’s set some learning objectives. By the end of this article, you should be able to:
- Understand what a headless browser is and what its use cases are
- Know how to open a headless browser in Python
- Create a web scraper with Python and Selenium
And finally, we’ll also talk about alternatives to Python headless browsers. Even though the focus is on Python, my goal is to discover the best scraping solution, and that means accounting for response time, resource usage and so on. So, without further ado, let’s jump into the subject!
What does Python headless browser mean?
At a high level, a browser is a computer program that allows users to navigate and interact with a web page. A headless browser is just that, but without a graphical user interface. Which means that a Python headless browser is a program that can:
- Navigate to any website on the internet
- Interact with the components of that web page
Understanding that there’s no GUI associated with it raises some questions on interaction. Yet the answer is quite simple. Because there is no GUI, humans cannot directly interact with the page. And that is where web drivers come into play. A web driver is an interface that allows for introspection and control. Simply put, web drivers are frameworks that allow us to programmatically control various web browsers.
There are a couple of frameworks that allow for browser automation in Python. But the main one is Selenium. Selenium is a suite of tools primarily built for automated testing. But in practice, it’s widely used for web scraping as well.
Why use a headless browser in Python?
According to Selenium’s front page header:
“Selenium automates browsers. That's it! What you do with that power is entirely up to you.”
Which leads us to believe that automated browsers have various use cases. But why run them in headless mode? Well, the answer is yet again simple. A Python headless browser uses fewer resources (CPU and memory) compared to a headful browser. And that is mainly because there are no graphical elements to be drawn.
A headless browser also executes JavaScript, so it can render pages whose content is built dynamically in the browser. While you can’t do that with `requests`, you can do it with a headless browser. And that answers one of our initial questions. Modern web scrapers use headless browsers instead of `requests` because otherwise the response would be incomplete for such pages.
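One common case where a plain HTTP response falls short is content that JavaScript fills in after the page loads. The sketch below uses only the standard library and a hypothetical HTML snippet (`RAW_HTML`, invented for this illustration) to show what a non-browser client would actually see: the markup as sent, with the script never executed.

```python
from html.parser import HTMLParser

# Hypothetical server response: the price is filled in by JavaScript at runtime.
RAW_HTML = """
<html><body>
  <div id="price"></div>
  <script>document.getElementById("price").textContent = "19.99";</script>
</body></html>
"""


class TextById(HTMLParser):
    """Collect the text directly inside the element with a given id.

    Deliberately simple: it does not handle nested elements.
    """

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.inside = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("id") == self.target_id:
            self.inside = True

    def handle_endtag(self, tag):
        self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.text += data


parser = TextById("price")
parser.feed(RAW_HTML)
# The parser never runs the script, so the div is still empty.
print(repr(parser.text.strip()))  # prints ''
```

A headless browser would run the script first, so the same lookup would find the price in the rendered page.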
What are the drawbacks of a headless browser?
The main disadvantage of a Python headless browser (and of almost all automated browsers) is its fingerprint. If you follow my articles, you know I sometimes speak about stealthiness. That is the ability of an automated browser to go undetected.
And in Python, headless browsers are easily distinguishable. To start with, checking a simple property of the browser such as `navigator.webdriver` is an instant tell if a browser is controlled by a web driver. In web scraping, one of the main challenges is finding ways to avoid detection. We call these evasion methods or techniques.
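To see that tell from the scraper’s side, you can ask the page for the property yourself. This is a minimal sketch: `reports_webdriver` and `FakeDriver` are names made up for the example, and the only Selenium call it relies on is `execute_script()`. The fake driver is there so the snippet runs without a browser installed.

```python
def reports_webdriver(driver) -> bool:
    """Return True if the page sees navigator.webdriver as truthy.

    `driver` is expected to be a Selenium WebDriver instance; only its
    execute_script() method is used.
    """
    return bool(driver.execute_script("return navigator.webdriver"))


class FakeDriver:
    """Hypothetical stand-in so the sketch runs without a browser."""

    def __init__(self, flag):
        self.flag = flag

    def execute_script(self, script):
        return self.flag


print(reports_webdriver(FakeDriver(True)))   # True
print(reports_webdriver(FakeDriver(None)))   # False

# With a real browser you would pass the Selenium driver instead:
#   reports_webdriver(driver)
```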
At Web Scraping API for example, we have a dedicated team working constantly on our Stealth Mode. That’s to ensure our browser’s fingerprint is unique and undetectable with every request.
Headless browsers available with Python Selenium
In the world of web scraping, the most used Python headless browsers are Chrome and Firefox. I think that is mainly because these two browsers are both performant and cross-platform. You can for example develop your web scraping project in a macOS environment and then easily deploy it on Linux.
How to open a headless browser in Python
Now that we’ve covered some theoretical concepts, I think it’s safe to go ahead and explore the practical part. In this section, I will show you how to build a web scraper with Selenium. For this project, make sure your machine is equipped with Python and Chrome.
#1: Set up the environment
As usual, in Python we should encapsulate everything inside a virtual environment. If you’re not familiar with virtual environments, read up on them first. Now let’s open a new terminal window and we’ll:
- Create a new folder
- Navigate to the folder
- Create a new virtual environment
- Activate the virtual environment
~ mkdir headless_scraper
~ cd headless_scraper
~ python3 -m venv env
~ source env/bin/activate
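If you want to confirm that the activation worked, a quick check from Python itself can help. This is a small sketch using only the standard library; `in_virtualenv` is a helper name chosen for the example.

```python
import sys


def in_virtualenv() -> bool:
    """True when the interpreter runs inside a venv-style environment."""
    # In a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
```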
#2: Install dependencies
It’s pretty straightforward that we need Selenium and a web driver for our project. Luckily, we can install both using Python’s package manager, `pip`. Inside the same terminal window, type the following command:
~ pip install selenium webdriver-manager
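To double check that both packages landed in the active environment, you could query their installed versions. A minimal sketch, assuming Python 3.8 or newer for `importlib.metadata`; `installed_version` is a helper name invented here.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for name in ("selenium", "webdriver-manager"):
    print(name, installed_version(name) or "not installed")
```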
Now you’re all set up for success! We can move on to some actual coding. Quick disclaimer, this article focuses on interacting with a headless browser. A complete web scraping solution requires much more effort. But I am sure if you follow our blog posts you can pull it off in little to no time.
#3: Open an automated browser
So far, we have the project, but there’s no file we can execute. Let’s create a new `.py` file and open it up in our IDE:
~ touch headless_scraper.py
~ code .
You should now be inside Visual Studio Code, or your IDE. You can start by importing the necessary packages inside `headless_scraper.py`:
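from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
The latter is a package that helps you easily manage web drivers for the different browsers supported by Selenium. You can read more about it in its documentation. Next, we want to create a new browser with Selenium and open a website:
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
# Placeholder URL; replace it with the site you want to open
driver.get('https://example.com')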
Run this file now and you’ll see that it works. But instead of using a Python headless browser, it opens up a headful Chrome window.
#4: Make it headless
We set out to build a resource-friendly web scraper. So ideally, we want to open a headless browser with Selenium. Fortunately, there is an easy method we can use to switch Selenium from headful to headless. We just need to make use of the Chrome web driver options. So let’s import `Options` and add two more lines of code:
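from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
options = Options()
# The options.headless attribute was removed from recent Selenium releases
options.add_argument('--headless=new')
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)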
Run your script again. As you can see, this time no window pops up. But is it really working in the background? A quick way to visualize and check that is to take a screenshot using Selenium. Just add a call to `driver.save_screenshot()`, with a file name of your choice, at the bottom of your script.
If everything went well, the saved image shows the page you navigated to.
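The screenshot step can also be wrapped in a small helper. This is a sketch rather than the original line from the script: `save_page_screenshot` and the default file name are made up for the example, and the only Selenium call involved is `driver.save_screenshot()`.

```python
from pathlib import Path


def save_page_screenshot(driver, path: str = "screenshot.png") -> bool:
    """Save a PNG of the current page and return True on success.

    `driver` is expected to be a Selenium WebDriver; `path` is a
    hypothetical file name, so pick whatever suits your project.
    """
    if Path(path).suffix.lower() != ".png":
        raise ValueError("save_screenshot writes PNG data, so use a .png file name")
    # Selenium's save_screenshot() returns True when the file was written
    return bool(driver.save_screenshot(path))

# With the headless driver created earlier:
#   save_page_screenshot(driver, "headless.png")
```

If the file shows the rendered page, the browser really did its work in the background.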
#5: Add scraping capabilities
What is a web scraper? Well, at its core, a web scraper is a program that calls an endpoint on a server and collects data from it. With websites, this data usually consists of HTML files. But certain servers nowadays also serve JSON objects. So let’s stick to this term: data. For the following section, let’s set our goals higher. Let’s use some object-oriented programming! Our goals are:
- Create a Scraper class
- Add a method to extract raw data
- Add a method to extract data from a single element
- Add a method to extract data from elements of the same class
So we have three basic methods we want to build. Yet for learning purposes, these three methods are opening a path not only to web scraping, but to OOP with Python. And I think that’s pretty cool! Now let’s remove everything we coded and we’ll start from scratch:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.remote.webelement import WebElement

class Scraper:
    def __init__(self, headless: bool = True) -> None:
        self.headless = headless
        self.setup_scraper()

    def setup_scraper(self) -> None:
        self.options = Options()
        if self.headless:
            # options.headless was removed from recent Selenium releases
            self.options.add_argument('--headless=new')
        # Selenium 4.6+ locates a matching driver through Selenium Manager
        self.driver = webdriver.Chrome(options=self.options)

    def navigate(self, target: str) -> None:
        if target:
            self.driver.get(target)
        else:
            print('[!] No target given. Please specify a URL.')

    def extract_raw_data(self) -> str:
        # Return the full HTML source of the current page
        return self.driver.page_source

    def extract_single_element(self, selector: str, selector_type: str = By.CSS_SELECTOR) -> WebElement:
        return self.driver.find_element(selector_type, selector)

    def extract_all_elements(self, selector: str, selector_type: str = By.CSS_SELECTOR) -> list[WebElement]:
        return self.driver.find_elements(selector_type, selector)
I added type annotations to facilitate understanding, rather than performance. This way, you can actually visualize the app from an I/O perspective. Now, the methods are pretty much self-explanatory. We’re not performing any kind of action on the data, we’re just returning it. If you want, that can be a starting point for you to build a complex scraper with a Python headless browser.
So far, executing the file does nothing. That’s because we’ve only declared our Scraper and its methods. We now have to use them. So let’s add the following bits of code:
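# Initialize a new Scraper and navigate to a target
scraper = Scraper()
scraper.navigate('https://example.com')  # placeholder URL; use your own target
# Extract and print the entire HTML document
raw_data = scraper.extract_raw_data()
print(raw_data)
# Extract and print an element by its class name
# (raises NoSuchElementException if the page has no element with class 'title')
single_element = scraper.extract_single_element('title', By.CLASS_NAME)
print(single_element.text)
# Extract and print all elements belonging to a tag type
all_elements = scraper.extract_all_elements('a', By.TAG_NAME)
print([el.get_attribute('href') for el in all_elements])
# Close the browser once we are done
scraper.driver.quit()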
And that’s it. If you run your script now, you'll be able to see some action happening. Again this is merely a prototype designed to get you started. If you want to learn more about how a Python headless browser can be used in web scraping, I challenge you to:
- Read the Selenium documentation
- Add more functionality to the script we built today
This way, you’ll get to both acquire knowledge and add a project to your portfolio.
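As one possible starting point for the second challenge, here is a small helper that cleans up the list of links printed earlier. It is only a sketch: `collect_links` and `FakeElement` are names invented for the example, and the fake element exists so the snippet runs without a browser. With Selenium, you would pass it the list returned by `find_elements()`.

```python
def collect_links(elements) -> list[str]:
    """Return unique, non-empty href values in first-seen order.

    `elements` is any iterable of objects with get_attribute(), such as
    the WebElement list returned by find_elements().
    """
    seen = set()
    links = []
    for el in elements:
        href = el.get_attribute("href")
        if href and href not in seen:
            seen.add(href)
            links.append(href)
    return links


class FakeElement:
    """Hypothetical stand-in for a WebElement, used only for this demo."""

    def __init__(self, href):
        self.href = href

    def get_attribute(self, name):
        return self.href if name == "href" else None


demo = [FakeElement("/a"), FakeElement(None), FakeElement("/a"), FakeElement("/b")]
print(collect_links(demo))  # ['/a', '/b']
```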
What are the best alternatives to Python headless browsers?
Python is one of the most popular programming languages to build web scrapers. Yet it is not the only solution. Nor is it the best! In this section we will discuss alternatives to a Python headless browser. We’ll start with why you might look into alternative solutions, and then we’ll look at specific examples.
The main reason why you would opt for an alternative to building a Python web scraper yourself is resources. A complete web scraping solution requires you to implement an IP rotation system, some evasion techniques, account for performance, and that’s just to name a few. So building a web scraper is not only expensive, but also time consuming. Not to mention that maintaining the infrastructure generates even more costs.
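To give a sense of what even the simplest piece of that list involves, here is a rough sketch of round-robin proxy rotation. The proxy addresses are hypothetical placeholders and `proxy_arguments` is a name made up for the example. The only real-world detail it leans on is Chrome’s `--proxy-server` command-line switch.

```python
from itertools import cycle

# Hypothetical proxy endpoints; substitute addresses you actually control.
PROXIES = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]


def proxy_arguments(proxies):
    """Yield one Chrome --proxy-server argument per request, round-robin."""
    for proxy in cycle(proxies):
        yield f"--proxy-server={proxy}"


rotation = proxy_arguments(PROXIES)
first_four = [next(rotation) for _ in range(4)]
print(first_four)  # the fourth entry wraps around to the first proxy
```

Each value could be handed to `options.add_argument()` before creating a new driver. A production setup would also need health checks, retries and a pool of working proxies, which is where the cost starts to add up.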
The second drawback of a Python headless browser has to do with performance. While Python is a great language and it’s very user friendly, it is not specifically known for its speed. Unlike Java for example (which also has a Selenium package), Python is both dynamically typed and interpreted. These two features generally make it slower when compared to other languages. Now that we have some general understanding, let’s be specific. Here are the top alternatives to Selenium and the Python headless browser:
#1: Web Scraping API
If you want to address the first drawback we identified, then you need to look into third party scraping providers. And Web Scraping API features a complete scraping suite. Plus, our service is packed with features like:
- IP rotation system for both datacenter and residential proxies
- Stealth mode
- Captcha solvers
These three alone make it nearly impossible for a target to catch on to our scraper and block it. And then there are the scraping features. With Web Scraping API, you can extract data based on selectors, switch between device types, take screenshots and much more. A full list of features can be found in our documentation.
Playwright is another web automation tool, developed by Microsoft contributors. It’s popular mainly because it offers support for various languages and platforms. Their motto is actually ‘Any browser, any platform’. Their API can be accessed on any operating system and with any of the following languages:
- JavaScript and TypeScript
- Python
- Java
- .NET
These are the main alternatives to a Python headless browser. But there are also other tools available. ZombieJS or HtmlUnit are just two more from a list of many. I guess choosing one technology is both a matter of performance and personal preference. So I encourage you to test them all and pick your favorite.
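For a feel of how Playwright compares with the Selenium code above, here is a minimal sketch of fetching a rendered page with its Python sync API. `fetch_html` is a helper name chosen for the example. It takes the object yielded by `sync_playwright()` as a parameter, so the function itself can be read (and exercised with stand-ins) without Playwright installed.

```python
def fetch_html(playwright, url: str) -> str:
    """Open `url` in headless Chromium and return the rendered HTML.

    `playwright` is the object yielded by `sync_playwright()`.
    """
    browser = playwright.chromium.launch(headless=True)
    try:
        page = browser.new_page()
        page.goto(url)
        return page.content()
    finally:
        # Always release the browser, even if navigation fails
        browser.close()

# Typical use, assuming `pip install playwright` and `playwright install chromium`:
#   from playwright.sync_api import sync_playwright
#   with sync_playwright() as p:
#       print(fetch_html(p, "https://example.com"))
```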
Using a Python headless browser has its pros and cons. On one hand, you get to build a custom solution to which you can always add more features. On the other hand, development and maintenance can be quite expensive. And there’s also the issue of stealthiness. If you need a professional solution, I think it’s best you go with a third party provider. Otherwise, for learning purposes, I will always encourage you to play around with technologies.