Mihnea-Octavian Manolache · Last updated on May 12, 2026 · 10 min read

How to Build a Web Scraper with Pyppeteer (2026 Guide)

TL;DR: Pyppeteer is the unofficial Python port of Puppeteer and still works for driving a real Chromium from asyncio. In this guide you will install it, write a modern web scraper with Pyppeteer using asyncio.run and try/finally, handle waits, forms, screenshots, infinite scroll, cookies, and proxies, and learn when to migrate to Playwright, Selenium, or a hosted scraping API.

If you have outgrown requests plus BeautifulSoup because the data you need only appears after JavaScript runs, you have probably already looked at building a web scraper with Pyppeteer. Pyppeteer is the Python port of Puppeteer, and it lets you launch a real Chromium instance, wait for selectors, click buttons, and run arbitrary JavaScript inside the page from async Python code. That is enough to scrape single-page apps, infinite-scroll feeds, search UIs, and anything else that hides behind a fetch call.

This guide is written for intermediate Python developers in 2026. We will cover an honest status check on the project, a comparison with Selenium, Playwright, and Node Puppeteer, modern async patterns (asyncio.run, try/finally, structured waits), and a full end-to-end example that loops over multiple keywords on a JavaScript-driven search UI. By the end, you will have a working Pyppeteer scraper template plus a clear decision framework for when Pyppeteer is the right tool and when it is not.

Pyppeteer in 2026: where it fits and what changed

Pyppeteer is, at its core, a Python wrapper that mirrors Puppeteer's API: launch a browser, open a page, call waitForSelector, run evaluate, repeat. The mental model maps one-to-one with the original Puppeteer project on GitHub, which is helpful if you have ever read a Node tutorial and wanted to stay in Python.

The honest caveat for 2026 is that Pyppeteer is only lightly maintained. The maintainers state on the project README that it is minimally maintained, and several newer Puppeteer features have never been ported across. That does not mean your scraper will break tomorrow, but it does mean you should not pick Pyppeteer for a long-running production system without weighing Playwright and a managed scraping API as alternatives. We will come back to that decision at the end.

Pyppeteer vs Selenium, Playwright, and Puppeteer

Before you commit, it helps to see Pyppeteer next to its closest alternatives. The table below is a quick cheat sheet so you can pick the right tool for your stack rather than defaulting to whatever shows up first on Google.

| Tool | Language | Async model | Browsers | Stealth options | Maintenance |
| --- | --- | --- | --- | --- | --- |
| Pyppeteer | Python | Native asyncio | Chromium | Manual, no native plugin | Lightly maintained |
| Playwright (Python) | Python | Sync + asyncio | Chromium, Firefox, WebKit | Built-in stealth-friendly defaults | Actively developed by Microsoft |
| Selenium | Python (and others) | Sync (async via wrappers) | Chromium, Firefox, Edge, Safari | selenium-stealth, undetected drivers | Actively maintained, mature |
| Puppeteer (Node) | JavaScript / TypeScript | Native promises | Chromium, Firefox (experimental) | puppeteer-extra-plugin-stealth | Actively developed by Chrome team |

Practical read: pick Puppeteer in Node if you want the freshest features, Playwright for new Python projects that need stable cross-browser scraping, Selenium when you must support Safari or legacy IE-style flows, and Pyppeteer when a small Python script or an existing codebase already speaks asyncio. For a wider comparison, see our roundups on Python headless browser libraries and Puppeteer alternatives.

Setting up Pyppeteer (Python, Chromium, and M1/M2 fixes)

Use Python 3.10 or newer and a virtual environment. With uv, the install is a one-liner:

uv init pyppeteer-demo && cd pyppeteer-demo
uv add pyppeteer
uv run pyppeteer-install   # downloads bundled Chromium

If you prefer plain pip, swap in python -m venv .venv && pip install pyppeteer && pyppeteer-install. On the first run Pyppeteer may pull a bundled Chromium build (somewhere around 150MB at the time of writing, so re-check the latest release notes before you ship). To skip that download and use a system Chromium, set PYPPETEER_SKIP_CHROMIUM_DOWNLOAD=1 and pass executablePath to launch:

# macOS:  /Applications/Google Chrome.app/Contents/MacOS/Google Chrome
# Linux:  /usr/bin/google-chrome
# Windows: C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe
await launch(executablePath='/usr/bin/google-chrome', headless=True)

M1/M2 Mac gotcha: Pyppeteer can be finicky on arm64. If Chromium refuses to launch or crashes immediately, re-run your terminal under Rosetta and the install usually completes cleanly.

Build a minimal web scraper with Pyppeteer: a modern template

Here is a reusable starter for a web scraper with Pyppeteer that uses asyncio.run, wraps the browser in try/finally, and hands the rendered HTML to BeautifulSoup. We will scrape quotes.toscrape.com/js/, a sandbox page that renders quotes via JavaScript, so plain HTTP clients see an empty <body>.

import asyncio
from bs4 import BeautifulSoup
from pyppeteer import launch

URL = 'https://quotes.toscrape.com/js/'

async def scrape() -> list[dict]:
    browser = await launch(headless=True, args=['--no-sandbox'])
    try:
        page = await browser.newPage()
        await page.goto(URL, {'waitUntil': 'networkidle2'})
        await page.waitForSelector('.quote')
        html = await page.content()
        soup = BeautifulSoup(html, 'html.parser')
        return [
            {
                'text': q.select_one('.text').get_text(strip=True),
                'author': q.select_one('.author').get_text(strip=True),
            }
            for q in soup.select('.quote')
        ]
    finally:
        await browser.close()

if __name__ == '__main__':
    for row in asyncio.run(scrape()):
        print(row)

Three things matter here. asyncio.run replaces the get_event_loop().run_until_complete pattern that older tutorials still show. try/finally guarantees Chromium gets closed even when your code raises. And waitForSelector is the explicit synchronization point, not a fixed sleep that wastes time on fast pages and times out on slow ones.

Waiting for elements the right way

Pyppeteer ships several waits, and the choice matters. waitFor() is easy to misuse because it dispatches on its argument type — a selector string, a JS function, or a number of milliseconds — while waitForSelector() is explicit and resolves only when the target node exists in the DOM. Reach for waitForNavigation after a submit, and use waitUntil='networkidle2' when the page issues background fetch calls. As a fallback you can call page.waitFor(5000) to pause for five seconds, but treat any fixed-time wait as a last resort because it is the single biggest source of flaky scrapers.
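Sometimes a page can resolve to one of several states, say a results list or an empty-state banner, and you want whichever appears first. The helper below is a sketch of one way to race several waitForSelector calls with asyncio; wait_for_any is a name I am introducing here, not a Pyppeteer API, and the selectors in the docstring are illustrative.

```python
import asyncio

async def wait_for_any(page, selectors, timeout_ms: int = 10_000) -> str:
    """Return the first selector that appears in the DOM, cancelling the rest.

    Useful when a search can land on either '.result-card' or an
    empty-state banner and you need to know which branch you are on.
    """
    # One waitForSelector task per candidate selector
    tasks = {
        asyncio.ensure_future(page.waitForSelector(sel, {'timeout': timeout_ms})): sel
        for sel in selectors
    }
    try:
        done, _ = await asyncio.wait(
            tasks, timeout=timeout_ms / 1000, return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            if task.exception() is None:
                return tasks[task]
        raise TimeoutError(f'none of {selectors} appeared within {timeout_ms}ms')
    finally:
        # Cancel the losers so they do not linger past the browser session
        for task in tasks:
            task.cancel()
```

Call it as `branch = await wait_for_any(page, ['.result-card', '.empty-state'])` and dispatch on the returned selector.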

Clicking, typing, and submitting forms

For richer interactions, combine page.click, page.type (with a small delay to look more human), and page.keyboard.press. After submitting a form, await navigation in parallel with the click so the URL change is not missed:

await page.type('input[name="q"]', 'pyppeteer', {'delay': 80})
await asyncio.gather(
    page.waitForNavigation({'waitUntil': 'networkidle2'}),
    page.keyboard.press('Enter'),
)

That pattern works for login forms, search bars, and any UI where a POST triggers a redirect.
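Applied to a login form, the same pattern looks like the sketch below. The URL, field names, and submit-button selector are placeholders for whatever your real target uses, so inspect the form and swap them in before running this.

```python
import asyncio

async def log_in(page, username: str, password: str) -> None:
    """Sketch of a login flow; URL and selectors are hypothetical placeholders."""
    await page.goto('https://example.com/login', {'waitUntil': 'networkidle2'})
    await page.type('input[name="username"]', username, {'delay': 60})
    await page.type('input[name="password"]', password, {'delay': 60})
    # Await navigation in parallel with the click so the redirect is not missed
    await asyncio.gather(
        page.waitForNavigation({'waitUntil': 'networkidle2'}),
        page.click('button[type="submit"]'),
    )
```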

Screenshots and PDF exports

page.screenshot() captures the visible viewport by default. Pass fullPage=True for top-to-bottom captures, and call setViewport first if you want a specific resolution. PDFs come from page.pdf() and work best on pages with clean print styles:

await page.setViewport({'width': 1440, 'height': 900})
await page.screenshot({'path': 'page.png', 'fullPage': True})
await page.pdf({
    'path': 'page.pdf',
    'format': 'A4',
    'printBackground': True,
    'margin': {'top': '20mm', 'bottom': '20mm', 'left': '15mm', 'right': '15mm'},
})

Handling infinite scroll and lazy loading

Infinite-scroll pages only render their next batch when you push the viewport toward the bottom. Use page.evaluate to run a small JS loop that watches document.body.scrollHeight and stops when it stops growing:

await page.evaluate('''async () => {
  await new Promise(resolve => {
    let last = 0;
    const timer = setInterval(() => {
      window.scrollBy(0, 800);
      const h = document.body.scrollHeight;
      if (h === last) { clearInterval(timer); resolve(); }
      last = h;
    }, 400);
  });
}''')

Cap the loop with a max iteration count if the feed is truly endless.
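One way to add that cap is to drive the loop from Python instead of a single evaluate call, so each round is observable and bounded. The helper below is a sketch; scroll_to_bottom is a name of my own, not a Pyppeteer API.

```python
import asyncio

async def scroll_to_bottom(page, max_rounds: int = 30, pause_s: float = 0.5) -> int:
    """Scroll until document.body.scrollHeight stops growing or max_rounds is hit.

    Returns the number of rounds performed, so callers can log whether
    the cap, rather than the end of the feed, stopped the loop.
    """
    last_height = 0
    for round_no in range(1, max_rounds + 1):
        await page.evaluate('window.scrollBy(0, document.body.scrollHeight)')
        await asyncio.sleep(pause_s)          # give lazy content time to load
        height = await page.evaluate('document.body.scrollHeight')
        if height == last_height:             # page stopped growing: we are done
            return round_no
        last_height = height
    return max_rounds
```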

Managing cookies, sessions, and user agents

For logged-in pages, log in once, dump cookies, and replay them on the next run so you do not trigger fresh-login challenges:

cookies = await page.cookies()          # save somewhere safe
await page.setCookie(*saved_cookies)    # restore later
await page.setUserAgent('Mozilla/5.0 ... Chrome/124 Safari/537.36')
await page.setViewport({'width': 1366, 'height': 768})

Pair setUserAgent with a matching setViewport so the device fingerprint stays internally consistent. A desktop UA with a 320-pixel viewport is a classic giveaway for bot detection.
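To make the save-and-replay idea concrete, here is a minimal sketch that persists cookies to a JSON file between runs. The two helper names are my own; only page.cookies and page.setCookie are real Pyppeteer calls.

```python
import json
from pathlib import Path

async def save_cookies(page, path: Path) -> None:
    """Dump the current session's cookies to a JSON file."""
    path.write_text(json.dumps(await page.cookies(), indent=2))

async def load_cookies(page, path: Path) -> bool:
    """Restore a saved session; returns False when no cookie file exists yet."""
    if not path.exists():
        return False
    for cookie in json.loads(path.read_text()):
        await page.setCookie(cookie)
    return True
```

On each run, try load_cookies first and only walk the login flow when it returns False, then save_cookies before closing the browser.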

End-to-end web scraper with Pyppeteer: scraping a JavaScript-driven search UI

Let's pull everything together. The script below loops over multiple keywords, types each one into a search bar that renders results client-side, waits for the result cards to appear, extracts their titles with querySelectorAllEval, and clears the input before the next keyword. Swap the URL and selectors to fit your real target.

import asyncio
from pyppeteer import launch

KEYWORDS = ['python', 'pyppeteer', 'asyncio']
SEARCH_URL = 'https://example.com/search'   # JS-rendered UI

async def search_one(page, keyword: str) -> list[str]:
    await page.click('input[name="q"]', {'clickCount': 3})
    await page.keyboard.press('Backspace')
    await page.type('input[name="q"]', keyword, {'delay': 60})
    await page.keyboard.press('Enter')
    await page.waitForSelector('.result-card', {'timeout': 10000})
    return await page.querySelectorAllEval(
        '.result-card h3',
        '(nodes) => nodes.map(n => n.innerText.trim())',
    )

async def main():
    browser = await launch(headless=True, args=['--no-sandbox'])
    try:
        page = await browser.newPage()
        await page.goto(SEARCH_URL, {'waitUntil': 'networkidle2'})
        results = {}
        for kw in KEYWORDS:
            results[kw] = await search_one(page, kw)
            await asyncio.sleep(2)   # be polite
        return results
    finally:
        await browser.close()

if __name__ == '__main__':
    print(asyncio.run(main()))

Two upgrades this pattern unlocks. First, you reuse one browser and one page across all keywords, which keeps the run cheap. Second, the explicit waitForSelector makes the scraper resilient to network jitter, so the job does not collapse the moment a request takes 600ms instead of 200ms. From here, dropping in retries and concurrency with asyncio.gather is a natural next step.
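As a sketch of that retry step, the wrapper below adds exponential backoff with jitter around any async call. with_retries is a helper name I am introducing; wire it in as `results[kw] = await with_retries(lambda: search_one(page, kw))`.

```python
import asyncio
import random

async def with_retries(coro_factory, attempts: int = 3, base_delay: float = 1.0):
    """Retry an async call with exponential backoff and jitter.

    coro_factory must be a zero-argument callable that returns a fresh
    coroutine each time (e.g. a lambda), because a coroutine object can
    only be awaited once.
    """
    for attempt in range(1, attempts + 1):
        try:
            return await coro_factory()
        except Exception:
            if attempt == attempts:
                raise                      # out of attempts: surface the error
            # 1x, 2x, 4x ... base delay, scaled by up to 2x random jitter
            await asyncio.sleep(base_delay * 2 ** (attempt - 1) * (1 + random.random()))
```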

Using proxies and rotation with Pyppeteer

Pyppeteer handles browser automation beautifully but does not manage proxies on its own, so you wire them in at launch. The --proxy-server Chromium flag accepts a single endpoint, and page.authenticate adds credentials before the first request:

import random
from pyppeteer import launch

PROXIES = [
    'http://user:pass@proxy-a.example.com:8000',
    'http://user:pass@proxy-b.example.com:8000',
    'http://user:pass@proxy-c.example.com:8000',
]

async def launch_with_proxy():
    proxy = random.choice(PROXIES)   # naive rotation
    host = proxy.split('@')[-1]
    browser = await launch(args=[f'--proxy-server=http://{host}'])
    page = await browser.newPage()
    await page.authenticate({'username': 'user', 'password': 'pass'})
    return browser, page

Even with a clean proxy you will hit rate limits eventually, so rotate per-session or per-keyword. For a deeper pattern, see our Python proxy rotation guide. If managing pools yourself feels like yak-shaving, a managed residential proxy product, or a request-level API like the WebScrapingAPI Scraper API, will offload that work.

Stealth and fingerprint hygiene checklist

Pyppeteer ships without a native stealth plugin, so you have to harden the browser yourself. The minimum viable checklist:

  • Set a realistic, current desktop user agent with page.setUserAgent.
  • Match it with a plausible viewport via page.setViewport (1366x768 or 1440x900 are safe defaults).
  • Patch the navigator.webdriver flag in an evaluateOnNewDocument hook so it returns undefined instead of true.
  • Keep cookie hygiene clean: clear cookies between sessions, or rotate sessions when you rotate IPs.
  • Rotate IPs through residential or mobile proxies for any target with serious bot defenses.
  • Throttle requests and humanize timing with delay on type and small asyncio.sleep gaps between actions.
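The checklist above can be bundled into one setup call per page. The snippet below is a minimal sketch, not a substitute for a real stealth plugin: the injected script and the hardcoded user agent and viewport are illustrative values you should keep current, and harden is a helper name of my own.

```python
import asyncio

# Injected before any page script runs, so detection code sees the patched values
STEALTH_JS = """() => {
  Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
  window.chrome = window.chrome || { runtime: {} };
  Object.defineProperty(navigator, 'languages', { get: () => ['en-US', 'en'] });
}"""

async def harden(page) -> None:
    """Apply minimal fingerprint hygiene to a fresh page."""
    await page.evaluateOnNewDocument(STEALTH_JS)
    await page.setUserAgent(
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36'
    )
    # Viewport must stay consistent with the desktop UA above
    await page.setViewport({'width': 1366, 'height': 768})
```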

Production-grade best practices for 2026

If you want a web scraper with Pyppeteer that survives a real schedule, codify these rules:

  • Run the entrypoint with asyncio.run(main()). Forget get_event_loop() and loop.run_until_complete(); the modern function is cleaner and less buggy.
  • Wrap every browser in try/finally so the Chromium process is killed even when your code raises. Leaked browsers are the number one cause of dead CI runners.
  • Prefer waitForSelector (explicit) over waitFor (vague). Reserve fixed sleeps for documented anti-bot delays only.
  • Throttle politely. Respect robots.txt, scope to public data, and add jitter so 100 requests do not arrive in 100 milliseconds.
  • Add structured logging (one JSON line per page) and capture the URL, status, response time, and selector match counts. You will thank yourself the first time a target site changes its HTML.
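The one-JSON-line-per-page idea can be as small as the helper below; log_page and its field names are my own choices, so adapt them to whatever your log pipeline expects.

```python
import json
import logging
import time

log = logging.getLogger('scraper')

def log_page(url: str, status: int, elapsed_s: float, matches: int) -> str:
    """Emit one JSON line per scraped page; returns the line so callers can pipe it."""
    line = json.dumps({
        'ts': round(time.time(), 3),      # when the page finished
        'url': url,
        'status': status,                  # HTTP status of the main document
        'elapsed_ms': round(elapsed_s * 1000),
        'matches': matches,                # selector match count: 0 means broken HTML
    })
    log.info(line)
    return line
```

A sudden drop in `matches` across many pages is usually the first signal that the target changed its markup.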

When Pyppeteer is the wrong tool (and what to use instead)

Pyppeteer is a great fit for one-off scripts, internal automation, and small Python codebases that already use asyncio. It starts to show its age once you need cross-browser coverage, fresh CDP features, official stealth, or large-scale concurrency. Use this rough decision rule:

  • Stay with Pyppeteer for prototypes, weekend scrapers, and scripts under a few hundred pages a day.
  • Move to Playwright (Python) when you need Firefox or WebKit, robust auto-waiting, or first-class tracing.
  • Move to Selenium if you must support Safari or hook into an existing test grid.
  • Use a hosted scraping API when you are spending more time on proxy rotation, CAPTCHAs, and headless infrastructure than on actual data.

Key Takeaways

  • Pyppeteer is the lightly maintained Python port of Puppeteer; it still works in 2026 for asyncio-based scraping but is not the right pick for long-lived production systems without a backup plan.
  • Use asyncio.run, try/finally, and waitForSelector instead of the older event-loop and waitFor patterns shown in stale tutorials.
  • A complete web scraper with Pyppeteer covers waits, form input, screenshots, PDFs, infinite scroll, cookie reuse, and proxies, not just goto.
  • There is no native stealth plugin, so user agent, viewport, navigator.webdriver, cookie hygiene, and rotating IPs are your responsibility.
  • Pick Playwright, Selenium, or a managed scraping API the moment your scraper outgrows a single machine, a single browser, or a single proxy.

FAQ

Is Pyppeteer still maintained and safe to use in 2026?

Not really. The maintainers explicitly state on the project's GitHub README that Pyppeteer is minimally maintained, and newer Puppeteer features rarely get ported across. It still runs and still scrapes, but for a long-lived production system you should evaluate Playwright (Python) or a hosted scraping API as a more actively developed alternative before committing.

What is the difference between Pyppeteer and Puppeteer?

Puppeteer is the official Node.js library from the Chrome team for automating Chromium. Pyppeteer is an unofficial Python port that mirrors most of Puppeteer's API but uses asyncio instead of Promises. Pyppeteer typically lags behind Puppeteer on new features, and some Puppeteer APIs are missing entirely, so the ecosystems are similar in shape but not in surface area.

Should I pick Pyppeteer, Playwright, or Selenium for a new Python scraping project?

For a new project in 2026, default to Playwright in Python. It is actively developed, supports Chromium, Firefox, and WebKit, and ships with auto-waiting that removes a lot of flakiness. Choose Selenium if you need Safari or an existing test grid. Choose Pyppeteer only when you are extending a legacy script that already uses it.

Can Pyppeteer bypass Cloudflare, bot detection, or CAPTCHAs on its own?

No. Pyppeteer ships without a stealth plugin and has no built-in CAPTCHA solver. You can reduce your fingerprint manually by setting a realistic user agent, patching navigator.webdriver, and rotating residential IPs, but defeating modern Cloudflare or hCaptcha challenges reliably usually means a hardened framework or a request-level scraping API that handles unblocking for you.

Why does my Pyppeteer script crash on M1 or M2 Macs?

The bundled Chromium is sensitive on Apple Silicon. The most common fix is to relaunch your terminal under Rosetta and rerun pyppeteer-install, which lets the x86_64 Chromium build install and start cleanly. As an alternative, set PYPPETEER_SKIP_CHROMIUM_DOWNLOAD=1 and point executablePath at an arm64-native Google Chrome you already have installed.

Wrapping up

Building a web scraper with Pyppeteer in 2026 is still a reasonable choice when you want a small, async-friendly Python script that drives a real Chromium. You have a working starter template, patterns for waits, forms, screenshots, infinite scroll, cookies, and proxies, plus a stealth checklist and a clear sense of when to graduate to Playwright, Selenium, or a managed alternative.

The honest bottom line: Pyppeteer's lightly maintained status means you should treat it as a tactical tool rather than a long-term platform. Wrap your browser in try/finally, prefer waitForSelector over fixed sleeps, and budget for a migration path the day a target site upgrades its bot defences faster than Pyppeteer ports the next CDP feature.

If proxy rotation, CAPTCHAs, or Chromium upgrades start eating more time than the scraping itself, hand the request layer off to WebScrapingAPI's Scraper API and keep your Pyppeteer code focused on parsing the data you actually care about.

About the Author
Mihnea-Octavian Manolache, Full Stack Developer @ WebScrapingAPI

Mihnea-Octavian Manolache is a Full Stack and DevOps Engineer at WebScrapingAPI, building product features and maintaining the infrastructure that keeps the platform running smoothly.
