The Ultimate Guide To Building a Web Scraper With Pyppeteer

Mihnea-Octavian Manolache on Feb 28 2023


When it comes to Python and web automation, Selenium has pretty much been the go-to. At least up until now. Thanks to Puppeteer’s success in the JavaScript community, Python developers started looking more and more into it, and that is how Pyppeteer came into existence. But what exactly is Pyppeteer? Why should we choose it over Selenium? Is it reliable enough to build a complex solution with? We will answer all these questions and more in today’s article. My aim is that, by the time you’ve read through this material, you will leave with at least:

  • A definition of Pyppeteer and its use cases
  • An understanding of how Pyppeteer compares to Selenium
  • An actual implementation of a web scraper using Pyppeteer

So get ready, because today we’ll cover some theory and then get our hands dirty with some actual coding!

What is Pyppeteer really and how can you use it?

If you’re reading this, chances are you are already familiar with web scraping in general. And you’ve probably already heard of Puppeteer or Selenium, depending on your favorite programming language. Pyppeteer, however, is newer to the web scraping scene. To cut it short, Pyppeteer is much more like Puppeteer than it is like Selenium.

Puppeteer is a Node.js library that facilitates control of a headless version of Chrome via the DevTools protocol. Pyppeteer is its Python port: a library, written in Python, that automates a browser in much the same way. In other words, Pyppeteer is a Python implementation of the Puppeteer API, which lets you use Puppeteer’s features from a Python environment. The main difference between the two is the language being used.

Pyppeteer terminology you should know

Before moving forward, I think we should discuss some terms commonly used in the context of Pyppeteer:

  • Headless: Running a browser without a graphical user interface (GUI). In other words, it runs "behind the scenes" and you can’t see it on the screen. It’s usually used to reduce resource usage while scraping.
  • Headful: Conversely, a “headful” browser is one that runs with a GUI. This is the opposite of a headless browser and it is often used for testing, debugging, or interacting with web pages manually.
  • Browser Context: The session state shared by a set of pages in a browser. It’s usually used to isolate or set browser-wide settings, such as cookies, HTTP headers, and geolocation.
  • DOM: The Document Object Model (DOM) is a programming interface for HTML and XML documents. It represents the structure of a web page in a tree-like format, with nodes that represent elements. Pyppeteer allows you to interact with a page’s elements by manipulating the DOM.
  • Elements: The building blocks of a web page, defined using tags, attributes, and values.

Of course, there’s more to it and you’ll learn some more along the way. But I wanted you to get a grasp of it so that we have a solid start. I’m satisfied that knowing these terms will help you better understand the essence of this article.
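
To make the headless/headful distinction concrete, here is a minimal sketch (the `open_browsers` helper is just an illustrative name, not part of Pyppeteer) showing how each mode is launched:

import asyncio
from pyppeteer import launch

async def open_browsers():
    # Headless is the default: no window is drawn on screen
    headless_browser = await launch()
    await headless_browser.close()

    # Pass headless=False to get a headful browser you can watch
    headful_browser = await launch(headless=False)
    await headful_browser.close()

asyncio.get_event_loop().run_until_complete(open_browsers())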

Why use Pyppeteer in your scraping project?

I think there are two sides to this question. The first is why Pyppeteer is a good choice for web scraping in general. The second is why you would use Pyppeteer over Selenium. Generally speaking, some of Pyppeteer’s advantages include:

  • Evaluating JavaScript: Pyppeteer provides a `page.evaluate()` function that lets you execute JavaScript code within the context of the page (see the sketch after this list).
  • Network control: Pyppeteer provides a `page.on()` method that lets you listen for network events, such as requests and responses, happening on a page.
  • Tracing and logging: Pyppeteer lets you trace the browser’s activity and log browser messages from a page. This makes it easier to debug, trace, and understand what a website is doing.
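
To give you a feel for the first two, here is a minimal sketch (the `demo` function and target URL are mine, purely for illustration) that listens for network responses and evaluates JavaScript inside the page:

import asyncio
from pyppeteer import launch

async def demo():
    browser = await launch()
    page = await browser.newPage()

    # Network control: print the status and URL of every response
    page.on('response', lambda response: print('[response]', response.status, response.url))

    await page.goto('https://www.example.com')

    # Evaluating JavaScript: run code in the page and get the result back in Python
    title = await page.evaluate('() => document.title')
    print('Page title:', title)

    await browser.close()

asyncio.get_event_loop().run_until_complete(demo())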

Compared to Selenium, it is quite similar, in that both are used to automate a web browser. However, there are a few key differences and advantages that Pyppeteer has over Selenium:

  • Simplicity: Pyppeteer has a simpler and more consistent API than Selenium, which makes it easier for beginners to pick up. The API is built directly on top of the DevTools protocol, so it stays close to the browser while remaining easy to learn and use.
  • Performance: Pyppeteer can be faster than Selenium because it talks to the browser over the DevTools protocol directly, whereas Selenium goes through the WebDriver layer.
  • Better network control: Pyppeteer gives you finer control over the browser’s network behavior, such as request interception and request/response blocking (see the sketch after this list). This makes it easier to test and diagnose network-related issues.
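
As a taste of that network control, here is a short sketch (the `block_images` helper and the URL are illustrative, not from the original article) that uses Pyppeteer’s request interception to drop image requests:

import asyncio
from pyppeteer import launch

async def block_images(url):
    browser = await launch()
    page = await browser.newPage()

    # Tell the browser we want to decide the fate of every request
    await page.setRequestInterception(True)

    async def handle_request(request):
        # Abort image requests to save bandwidth; let everything else through
        if request.resourceType == 'image':
            await request.abort()
        else:
            await request.continue_()

    page.on('request', lambda request: asyncio.ensure_future(handle_request(request)))

    await page.goto(url)
    await browser.close()

asyncio.get_event_loop().run_until_complete(block_images('https://www.example.com'))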

And of course, there is also the matter of personal choice. Take me, for example. On a day-to-day basis I code in JavaScript, and I am quite familiar with Puppeteer. My favorite programming language, on the other hand, is Python. So if I were to build a scraper with a technology I know, in a language I prefer, I’d probably go with Pyppeteer.

And with that said, I think we’ve covered the ‘talk’ part of this article. It’s time to get to some actual coding.

How to create a web scraper with Pyppeteer

Before we start coding, let me point you to the official Pyppeteer documentation. I am an advocate of consulting the official documentation whenever you feel stuck, before asking questions in the community (for example on Stack Overflow). I usually find that most answers are right there if you just read the docs first. So take this as a kind request from me: whenever you’re stuck, check the documentation, then search for answers, and only ask questions as a last resort.

#1: Set up the environment

First things first: as a Python developer, you’re probably familiar with virtual environments, so let’s create one for our project. This is the sequence of commands I generally use:

# Create a new directory and navigate into it
~ » mkdir py_project && cd py_project

# Create the virtual environment
~ » python3 -m venv env

# Activate the virtual environment
~ » source env/bin/activate

With regard to the virtual environment, you’re all set up now. It’s time to move on and install Pyppeteer. Since you’ve got your terminal open, just type:

# Install the package using pip
~ » python3 -m pip install pyppeteer

# Open the project in your IDE
~ » code .

#2: Create a simple Pyppeteer scraper

The last command opens Visual Studio Code (swap in the equivalent for your preferred IDE). Now that you’re in the ‘development environment’, let’s create a new `.py` file that will hold our code. I’ll call my file `scraper.py`. Note that Pyppeteer natively supports asynchronous execution, so let’s import both `asyncio` and `pyppeteer` into our file:

import asyncio
from pyppeteer import launch

With this done, we can move ahead to more complicated code. In general, I am not the biggest advocate of functional programming. Yet, I think splitting code into small chunks allows for better learning. So let’s wrap our code inside a function:

async def scrape(url):
    browser = await launch()
    page = await browser.newPage()
    await page.goto(url)
    content = await page.content()
    await browser.close()

    return content


This function takes a URL as input, launches a headless browser using Pyppeteer, navigates to the provided URL, retrieves the page’s content, and closes the browser. Its return value is nothing other than the HTML collected from the page. You can use this function to scrape almost any website. To call it, wrap it in an `asyncio` event loop, like this:

async def main():
    content = await scrape('https://www.example.com')
    print(content)

loop = asyncio.get_event_loop()
loop.run_until_complete(main())

#3: Add more functionality

Up to this point, we have a working scraper. But that is pretty much all we have. If you want to build a more advanced web scraper with Pyppeteer, you will have to add more functionality to it. Spoiler alert: we’re about to dive into the world of object-oriented programming. But first, let’s outline our objectives. What do we want our scraper to be able to do?

  1. Initialize the browser with some custom values
  2. Navigate and extract content from a web page
  3. Write text to an input field
  4. Extract a single element’s value
  5. Extract value from multiple elements

3.1. Custom options

So let’s create a new `Scraper` class for now and we’ll add its methods afterward:

class Scraper:
    def __init__(self, launch_options: dict) -> None:
        self.options = launch_options['options']
        self.viewPort = launch_options['viewPort'] if 'viewPort' in launch_options else None

The only argument our Scraper takes is a `launch_options` dictionary. As you can see, it holds two keys. The first defines Pyppeteer’s launcher options. The second is either `None` or a dictionary holding the `width` and `height` of the viewport; we’ll use the latter a bit later, when we set the page’s viewport.
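
Just to make the shape of that dictionary explicit, here is how you might instantiate the class (the values below are purely illustrative):

scraper = Scraper({
    'options': {'headless': True},  # passed straight through to Pyppeteer's launch()
    'viewPort': {'width': 1280, 'height': 720}  # optional; omit it to keep the default viewport
})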

3.2. Navigate to a page

If you look at the function we used before, you’ll see that it covers both navigating to a specific URL and extracting its raw data. The only thing we need to do is tweak the function and turn it into a method of our Scraper:

async def goto(self, url: str) -> None:
    self.browser = await launch(options=self.options)
    self.page = await self.browser.newPage()
    if self.viewPort is not None:
        await self.page.setViewport(self.viewPort)
    else:
        print('[i] Using default viewport')
    await self.page.goto(url)

This method is quite simple. First, it launches a new browser with the custom options we set earlier. It then creates a new page and, if our `launch_options` dictionary includes a `viewPort`, sets the page’s viewport; otherwise, it logs a simple message. Last but not least, it navigates to the target URL.

3.3. Extract raw data from a page

Again, we already had this logic in our initial `scrape` function. We simply await `page.content()` and return its value:

async def get_full_content(self) -> str:
    content = await self.page.content()
    return content

3.4. Write text to an input field

In order to write something to an input field using Pyppeteer, you need two things: first, locate the element; second, type some value into it. Luckily, Pyppeteer has methods for both actions:

async def type_value(self, selector: str, value: str) -> None:
    element = await self.page.querySelector(selector)
    await element.type(value)
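
As a side note, Pyppeteer also exposes a one-call shortcut, `page.type(selector, text)`, which locates the element and types into it in a single step. Here is a sketch of the same method rewritten with it (an equivalent alternative, not a change the original tutorial makes):

async def type_value(self, selector: str, value: str) -> None:
    # Shortcut: find the element and type into it in one call
    await self.page.type(selector, value)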

3.5. Extract value from page

Remember, we want to be able to extract either the value of a single element or the values of multiple elements. We could use one single method for both, but I usually like to keep things separate. So for now, I’ll add two more methods:

async def extract_one(self, selector) -> str:
    element = await self.page.querySelector(selector)
    text = await element.getProperty("textContent")
    return await text.jsonValue()

Here, we locate the element using the `querySelector` method, await its `textContent` property, and return its `jsonValue()`. When we want to select multiple elements, on the other hand, we’ll use `querySelectorAll`:

async def extract_many(self, selector) -> list:
    result = []
    elements = await self.page.querySelectorAll(selector)
    for element in elements:
        text = await element.getProperty("textContent")
        result.append(await text.jsonValue())
    return result

This method works much like `extract_one`; the only difference is its return value. This time, we return a list with the text of every selected element. And with this added, we’ve touched on all our goals.

#4: Make it stealthy

In web scraping, stealthiness can be described as the ability to go undetected. Of course, building a fully undetectable scraper takes a lot of work. Web Scraping API’s Stealth Mode, for example, is maintained by a dedicated team, and the effort put into it makes the scraper’s fingerprint unique with every request.

But my overall aim for this tutorial is to set you on the right path. And the right path to a complete web scraper with Pyppeteer involves adding some stealth functionality. Luckily, just as there is `puppeteer-extra-plugin-stealth` in Node, there is a package for Python too, intuitively called `pyppeteer-stealth`. To add it to your project, first install it using pip:

~ » python3 -m pip install pyppeteer_stealth

Then import it into your project and add just one extra line of code to the `goto` method:

from pyppeteer_stealth import stealth

async def goto(self, url: str) -> None:
    self.browser = await launch(options=self.options)
    self.page = await self.browser.newPage()
    # Make it stealthy
    await stealth(self.page)
    if self.viewPort is not None:
        await self.page.setViewport(self.viewPort)
    else:
        print('[i] Using default viewport')
    await self.page.goto(url)
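
If you want a quick way to check that the patches took effect, one option (an illustrative sketch of mine, not part of the package’s documentation) is to evaluate `navigator.webdriver` in the page: in vanilla headless Chrome it is `true`, and after `stealth()` has run it should be falsy:

# Hypothetical sanity check, run after scraper.goto(...)
is_webdriver = await scraper.page.evaluate('() => navigator.webdriver')
print('[i] navigator.webdriver:', is_webdriver)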

And here is how you run your scraper. I added some comments to the code to highlight what each step is doing:

async def main():
    # Define the launch options dictionary
    launch_options = {
        'options': {
            'headless': False,
            'autoClose': True
        },
        'viewPort': {
            'width': 1600,
            'height': 900
        }
    }

    # Initialize a new scraper
    scraper = Scraper(launch_options)

    # Navigate to your target
    await scraper.goto('https://russmaxdesign.github.io/accessible-forms/accessible-name-input01.html')

    # Type `This is me` inside the input field
    await scraper.type_value('#fish', 'This is me')

    # Scrape the entire page
    content = await scraper.get_full_content()
    print(content)

    # Scrape one single element
    el = await scraper.extract_one('body > div:nth-child(14) > ul')
    print(el)

    # Scrape multiple elements
    els = await scraper.extract_many('p')
    print(els)

loop = asyncio.get_event_loop()
loop.run_until_complete(main())

Conclusion

Pyppeteer is an amazing tool for web scraping. It ports the entire Puppeteer API to Python, making it possible for the Python community to use this technology without learning JavaScript. That said, I don’t think it’s a replacement for Selenium, but it is certainly a good alternative.

I hope today’s article added value to your learning path. And since I like to push everyone’s limits, I challenge you to build on what you learned today. The scraper we built together is a really good starting point and introduces a key element of programming: OOP. So go ahead, add more methods to `Scraper` and make it truly amazing.
