Playwright Web Scraping: Guide for Python and Node.js

TL;DR: Playwright gives you full browser automation for scraping JavaScript-heavy sites, with first-class support for both Python and Node.js. This guide walks you through installation, element extraction, proxy configuration, anti-detection, pagination, image downloads, and exporting data to CSV or JSON, all with side-by-side code examples in both languages.

If you have tried scraping a modern single-page application with a simple HTTP client, you already know the pain: the HTML you get back is an empty shell, and the data you want lives inside JavaScript that never executes. Playwright web scraping solves this by driving a real browser (Chromium, Firefox, or WebKit) programmatically, letting your script see exactly what a human visitor would see.

Playwright is an open-source browser automation framework maintained by Microsoft. Unlike older tools, it ships with built-in auto-waiting, network interception, and support for multiple browser engines out of the box. Whether you write Python or Node.js, the API surface is nearly identical, so you can pick whichever language fits your stack.

This guide covers everything you need to go from a blank terminal to production-ready playwright scraping scripts: setup, selectors, text and image extraction, pagination, request interception, proxy configuration, stealth techniques, error handling, and structured data export. Every technique includes code for both Python and Node.js.

What Is Playwright and Why Use It for Web Scraping?

Playwright is a browser automation library created by the team behind Puppeteer at Google, who later moved to Microsoft. It controls Chromium, Firefox, and WebKit through a single, unified API. For web scraping with Playwright, the key advantages boil down to a short list:

  • Multi-browser support. You are not locked into Chromium. Need to test rendering differences or rotate browser engines to reduce fingerprinting? Playwright handles all three engines with identical method calls.
  • Multi-language SDKs. Official bindings exist for Python, Node.js (JavaScript/TypeScript), Java, and .NET. This guide focuses on the two most popular choices for scraping: Python and Node.js.
  • Auto-waiting. Playwright automatically waits for elements to become actionable before interacting with them. No more manual sleep() calls to work around race conditions on dynamic pages.
  • Headless and headed modes. Run headless for speed in production, switch to headed for debugging, with a single boolean flag.
  • Network interception. You can block images, stylesheets, and tracking scripts to speed up page loads, or inspect API responses the page makes behind the scenes.
  • Browser contexts. Spin up isolated, cookie-separated sessions inside a single browser instance. This is cheaper than launching a new browser per task and is perfect for concurrent scraping.

These features make Playwright a strong fit for scraping dynamic, JavaScript-rendered pages where a plain HTTP request returns little useful markup.

Core Playwright Features for Scraping

Before writing your first script, it helps to understand which Playwright capabilities map directly to common scraping challenges.

Auto-waiting and smart assertions. When you call page.locator('.price').text_content(), Playwright waits for that element to be attached to the DOM before returning a value, and actions like click() additionally wait for visibility and stability. This eliminates the flaky timing issues that plague older automation tools.

Network interception with page.route(). You can intercept every outgoing request, block resource types you do not need (images, fonts, analytics), modify headers, or even return mock responses. Blocking unnecessary resources can cut page load times significantly, which matters when you are scraping thousands of pages.

Concurrent browser contexts. Instead of launching a separate browser process per scraping job, you can open multiple BrowserContext objects within a single browser. Each context has its own cookies, local storage, and cache, so sessions stay isolated without the memory overhead of extra browser processes.
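
A minimal sketch of that isolation using the sync API: a cookie set in one context never leaks into the other, even though both share one browser process.

# Python - two isolated sessions inside one browser process (a sketch)
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    ctx_a = browser.new_context()
    ctx_b = browser.new_context()
    # A cookie added to one context is invisible to the other
    ctx_a.add_cookies([{"name": "session", "value": "a-only", "url": "https://example.com"}])
    print(len(ctx_a.cookies()), len(ctx_b.cookies()))  # 1 0
    browser.close()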

Built-in stealth capabilities. Playwright's newer versions patch several common automation indicators out of the box, such as the navigator.webdriver flag. Combined with community stealth plugins, you can reduce your detection surface area considerably.

Request and response event hooks. Listen for network events with page.on('response', ...) to capture API payloads directly. Many modern SPAs fetch JSON from internal endpoints; intercepting those responses often gives you cleaner data than parsing the rendered DOM.

Tracing and debugging. The trace viewer lets you replay a scraping session frame by frame, including screenshots, DOM snapshots, and network logs. This is invaluable when a selector breaks and you need to understand what changed.
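
A sketch of enabling tracing on a context; the resulting zip opens in the trace viewer with the playwright show-trace command.

# Python - record a trace for one scraping session
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context()
    context.tracing.start(screenshots=True, snapshots=True)
    page = context.new_page()
    page.goto("https://books.toscrape.com/")
    context.tracing.stop(path="trace.zip")  # view: playwright show-trace trace.zip
    browser.close()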

Setting Up Playwright for Python and Node.js

Getting Playwright running takes under two minutes in either language. Here is the setup for both.

Python:

# Create a virtual environment and install playwright
python -m venv .venv && source .venv/bin/activate
pip install playwright
playwright install  # downloads Chromium, Firefox, and WebKit binaries

Node.js:

# Initialize a project and install playwright
npm init -y
npm install playwright
npx playwright install  # downloads browser binaries

Both installation commands download browser binaries locally. These binaries are approximately 200-400 MB total, so plan accordingly if you are working in a CI/CD environment or a Docker container. You can also install a single browser engine (for example, playwright install chromium) to save disk space.

Minimum requirements: Python 3.8+ or Node.js 16+. Playwright manages its own browser binaries, so you do not need a system-level Chrome or Firefox installation.

Choosing Between Synchronous and Asynchronous APIs

Playwright's Python SDK offers both synchronous and asynchronous interfaces. The sync API is simpler and works well for single-threaded scripts that scrape pages sequentially.

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())
    browser.close()

The async API uses Python's asyncio and is the better choice when you need to scrape multiple pages concurrently. It lets you run several browser contexts or pages in parallel within a single event loop.

import asyncio
from playwright.async_api import async_playwright

async def scrape():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto("https://example.com")
        print(await page.title())
        await browser.close()

asyncio.run(scrape())

When to pick which: Use sync for quick scripts, prototyping, and sequential scraping jobs. Switch to async when you are scraping hundreds or thousands of pages and need concurrency without spawning multiple processes. In Node.js, the API is inherently promise-based (async), so there is no separate sync mode to choose from.
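
As a sketch of what async concurrency buys you, here is title extraction fanned out over several pages with asyncio.gather (the URL list is a placeholder):

# Python (async) - scrape several pages concurrently in one browser
import asyncio
from playwright.async_api import async_playwright

async def get_title(browser, url):
    page = await browser.new_page()
    await page.goto(url)
    title = await page.title()
    await page.close()
    return title

async def main():
    urls = ["https://example.com", "https://example.org", "https://example.net"]
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        titles = await asyncio.gather(*(get_title(browser, u) for u in urls))
        print(titles)
        await browser.close()

asyncio.run(main())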

Writing Your First Playwright Scraping Script

Let's build a minimal scraper that navigates to a page, extracts the <h1> text, and prints it. This illustrates the core workflow you will use in every playwright web scraping project.

Python (sync):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://books.toscrape.com/")
    heading = page.locator("h1").text_content()
    print(f"Page heading: {heading}")
    browser.close()

Node.js:

const { chromium } = require('playwright');

(async () => {
    const browser = await chromium.launch({ headless: true });
    const page = await browser.newPage();
    await page.goto('https://books.toscrape.com/');
    const heading = await page.locator('h1').textContent();
    console.log(`Page heading: ${heading}`);
    await browser.close();
})();

The pattern is the same in both languages: launch a browser, create a page, navigate, locate an element, extract its content, and clean up. From here, everything scales by adding more locators, navigation steps, and data-handling logic.

A few things to note: headless: true (or headless=True in Python) runs the browser without a visible window. Set it to false/False when you need to watch the browser interact with the page for debugging. The locator() method is Playwright's recommended way to find elements; it returns a lazy reference that auto-waits and auto-retries.

Locating Elements with CSS and XPath Selectors

Finding the right elements on a page is the core skill of any web scraping with Playwright workflow. Playwright supports CSS selectors, XPath expressions, and its own text-based selectors.

CSS selectors are the most common choice. They are concise and performant:

# Python
titles = page.locator("article.product_pod h3 a")
for i in range(titles.count()):
    print(titles.nth(i).text_content())

// Node.js
const titles = page.locator('article.product_pod h3 a');
const count = await titles.count();
for (let i = 0; i < count; i++) {
    console.log(await titles.nth(i).textContent());
}

XPath selectors are useful when the DOM structure is complex or class names are dynamically generated:

# Python
price = page.locator("xpath=//p[@class='price_color']").first.text_content()

Text-based selectors let you find elements by their visible text, which can be more resilient than class names that change between deployments:

# Python
page.locator("text=Add to basket").click()

A practical tip: use the browser DevTools "Copy selector" feature as a starting point, but always simplify the generated selector. Auto-generated selectors tend to be brittle because they include deeply nested paths that break when the layout changes. Prefer short, specific selectors anchored to stable attributes like data-testid or semantic class names.
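
For instance, assuming the site exposes a data-testid attribute (the value below is hypothetical), either of these is far more resilient than a copied DevTools path:

# Python - anchor selectors to stable attributes;
# "product-card" is a hypothetical test id
page.locator('[data-testid="product-card"]').first.click()
page.get_by_test_id("product-card").first.click()  # built-in shorthand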

Extracting Text Data from Web Pages

With your selectors in place, the next step is pulling structured data from a page. Here is a practical playwright scraping tutorial that extracts book titles and prices from a demo bookstore site.

Python:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://books.toscrape.com/")

    books = page.locator("article.product_pod")
    results = []
    for i in range(books.count()):
        book = books.nth(i)
        title = book.locator("h3 a").get_attribute("title")
        price = book.locator(".price_color").text_content()
        results.append({"title": title, "price": price})

    for r in results:
        print(f"{r['title']}: {r['price']}")
    browser.close()

Node.js:

const { chromium } = require('playwright');

(async () => {
    const browser = await chromium.launch({ headless: true });
    const page = await browser.newPage();
    await page.goto('https://books.toscrape.com/');

    const books = page.locator('article.product_pod');
    const count = await books.count();
    const results = [];
    for (let i = 0; i < count; i++) {
        const book = books.nth(i);
        const title = await book.locator('h3 a').getAttribute('title');
        const price = await book.locator('.price_color').textContent();
        results.push({ title, price });
    }
    console.log(results);
    await browser.close();
})();

A few patterns worth highlighting: get_attribute("title") (Python) and getAttribute('title') (Node.js) pull attribute values, not inner text. This is useful for tooltip text, href values, and data attributes. The text_content() method returns the raw text inside an element, while inner_text() returns the rendered (visible) text. For scraping structured data, text_content() is usually what you want since it is faster and does not trigger a layout calculation.
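
The difference shows up with hidden elements. A quick sketch using an inline document:

# Python - text_content() includes hidden nodes; inner_text() does not
page.set_content('<div id="d">visible<span style="display:none">hidden</span></div>')
print(page.locator("#d").text_content())  # visiblehidden
print(page.locator("#d").inner_text())    # visible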

Scraping and Downloading Images

Text is only part of the picture. Many scraping projects require downloading images, PDFs, or other binary files. Here is how to extract image URLs and save files locally using playwright web scraping.

Python:

import httpx
from urllib.parse import urljoin
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://books.toscrape.com/")

    images = page.locator("article.product_pod img")
    base_url = "https://books.toscrape.com/"
    for i in range(images.count()):
        src = images.nth(i).get_attribute("src")
        full_url = urljoin(base_url, src)
        response = httpx.get(full_url)
        with open(f"image_{i}.jpg", "wb") as f:
            f.write(response.content)
        print(f"Saved image_{i}.jpg")
    browser.close()

Node.js:

const { chromium } = require('playwright');
const fs = require('fs');

(async () => {
    const browser = await chromium.launch({ headless: true });
    const page = await browser.newPage();
    await page.goto('https://books.toscrape.com/');

    const images = page.locator('article.product_pod img');
    const count = await images.count();
    const baseUrl = 'https://books.toscrape.com/';
    for (let i = 0; i < count; i++) {
        const src = await images.nth(i).getAttribute('src');
        const response = await page.request.get(baseUrl + src);
        fs.writeFileSync(`image_${i}.jpg`, await response.body());
        console.log(`Saved image_${i}.jpg`);
    }
    await browser.close();
})();

The Node.js version uses Playwright's built-in request context (page.request), which shares cookies and headers with the page session. In Python, an external HTTP library such as httpx or requests works well for public assets. If the images sit behind authentication, use Playwright's own request context so session cookies are carried automatically.
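
For reference, a sketch of the same download done entirely through Playwright's request context in Python, so no extra HTTP library is needed:

# Python - download one image via page.request; shares the page's cookies
src = page.locator("article.product_pod img").first.get_attribute("src")
response = page.request.get("https://books.toscrape.com/" + src)
with open("first_image.jpg", "wb") as f:
    f.write(response.body())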

Handling Pagination and Infinite Scroll

Real-world scraping rarely fits on a single page. You will encounter numbered pagination, "Load More" buttons, and infinite scroll patterns. Here are reusable approaches for each.

Numbered pagination (click "Next" until it disappears):

# Python
results = []
while True:
    # Extract data from current page
    items = page.locator(".product_pod h3 a")
    for i in range(items.count()):
        results.append(items.nth(i).get_attribute("title"))

    next_btn = page.locator("li.next a")
    if next_btn.count() == 0:
        break
    next_btn.click()
    page.wait_for_load_state("networkidle")

print(f"Collected {len(results)} items across all pages")

Infinite scroll (scroll down until no new content loads):

// Node.js
let previousHeight = 0;
while (true) {
    await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
    await page.waitForTimeout(2000); // allow content to load
    const currentHeight = await page.evaluate('document.body.scrollHeight');
    if (currentHeight === previousHeight) break;
    previousHeight = currentHeight;
}
// Now extract all loaded items
const items = await page.locator('.item').allTextContents();

The infinite scroll pattern compares page height before and after scrolling. When the height stops changing, you have reached the bottom. Adjust the timeout based on how quickly the target site loads new content. For "Load More" buttons, the approach is similar to numbered pagination: locate the button, click it, wait for new content, and repeat.
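
A sketch of the "Load More" variant, assuming hypothetical button and item selectors:

# Python - click "Load More" until the button disappears;
# "button.load-more" and ".item" are placeholder selectors
while True:
    load_more = page.locator("button.load-more")
    if load_more.count() == 0:
        break
    load_more.click()
    page.wait_for_load_state("networkidle")
items = page.locator(".item").all_text_contents()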

Intercepting and Modifying HTTP Requests

One of Playwright's most powerful features for scraping is the ability to intercept network requests with page.route(). This lets you block unnecessary resources, modify headers, or capture API responses directly.

Blocking images and stylesheets for faster scraping:

# Python
def block_resources(route):
    if route.request.resource_type in ["image", "stylesheet", "font"]:
        route.abort()
    else:
        route.continue_()

page.route("**/*", block_resources)
page.goto("https://example.com")
// Node.js
await page.route('**/*', (route) => {
    const type = route.request().resourceType();
    if (['image', 'stylesheet', 'font'].includes(type)) {
        return route.abort();
    }
    return route.continue();
});
await page.goto('https://example.com');

Blocking images, fonts, and CSS can reduce page load times by 40-60% on media-heavy sites, which adds up fast when you are scraping at scale.

Intercepting API responses: Many SPAs fetch data from internal REST or GraphQL endpoints. Instead of parsing the rendered DOM, you can listen for those responses and grab the raw JSON:

# Python
def capture_api(response):
    if "/api/products" in response.url:
        data = response.json()
        print(f"Captured {len(data['items'])} products from API")

page.on("response", capture_api)
page.goto("https://example-spa.com/products")

This technique often yields cleaner, more structured data than DOM scraping, and it is more resilient to layout changes. As the Playwright documentation on network events explains, you can filter by URL patterns, response status codes, and content types.
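
For a one-off capture, expect_response with a predicate is a tidy alternative to a global handler (the /api/products pattern is a placeholder):

# Python - wait for a single matching JSON response during navigation
with page.expect_response(
    lambda r: "/api/products" in r.url and r.status == 200
) as resp_info:
    page.goto("https://example-spa.com/products")
data = resp_info.value.json()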

Configuring Proxies in Playwright

When scraping at any meaningful volume, you will need proxies to distribute requests across different IP addresses. Playwright supports proxy configuration at the browser level or per-context.

Browser-level proxy:

# Python
browser = p.chromium.launch(
    headless=True,
    proxy={"server": "http://proxy-host:8080"}
)

Authenticated proxy:

// Node.js
const browser = await chromium.launch({
    proxy: {
        server: 'http://proxy-host:8080',
        username: 'user',
        password: 'pass'
    }
});

Context-level proxy (for rotation): A practical pattern for rotating proxies is to create a new browser context for each proxy, allowing you to cycle through a list of proxy addresses:

# Python
proxies = ["http://proxy1:8080", "http://proxy2:8080", "http://proxy3:8080"]
for i, url in enumerate(urls_to_scrape):
    proxy = proxies[i % len(proxies)]
    context = browser.new_context(proxy={"server": proxy})
    page = context.new_page()
    page.goto(url)
    # ... scrape data ...
    context.close()

For production workloads, consider a proxy provider that handles rotation and session management for you, so your scraping code stays focused on data extraction rather than infrastructure.

Anti-Detection Techniques and Stealth Configuration

Websites use various techniques to detect and block automated browsers. Effective playwright headless scraping requires minimizing your automation fingerprint.

User-agent rotation is the simplest starting point. Set a realistic user-agent string via the browser context:

# Python
context = browser.new_context(
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
)
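
To actually rotate, draw a user agent per context from a pool; the strings below are illustrative, not a vetted list:

# Python - pick a random user agent for each new context
import random

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]
context = browser.new_context(user_agent=random.choice(user_agents))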

Stealth plugins patch common automation leaks. In Python, the playwright-stealth package applies a set of evasion techniques (hiding navigator.webdriver, spoofing WebGL vendor strings, randomizing plugin arrays):

# Python
from playwright_stealth import stealth_sync

page = browser.new_page()
stealth_sync(page)
page.goto("https://example.com")

In Node.js, the playwright-extra and puppeteer-extra-plugin-stealth packages provide similar functionality.

Behavioral mimicry adds another layer of protection. Instead of instantly jumping to your target element, add small random delays, scroll the page, and move the mouse. These micro-behaviors make your traffic pattern look more human:

# Python
import random, time

page.mouse.move(random.randint(100, 500), random.randint(100, 500))
time.sleep(random.uniform(0.5, 2.0))

Viewport and locale randomization. Setting a realistic viewport size, timezone, and locale on each context further reduces your fingerprint. Avoid using the default 800x600 viewport that many automation scripts ship with.
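
A sketch of a randomized context; the candidate values are illustrative:

# Python - vary viewport, locale, and timezone per context
import random

viewports = [{"width": 1366, "height": 768}, {"width": 1920, "height": 1080}]
context = browser.new_context(
    viewport=random.choice(viewports),
    locale=random.choice(["en-US", "en-GB"]),
    timezone_id=random.choice(["America/New_York", "Europe/London"]),
)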

No single technique is a silver bullet. Effective anti-detection combines stealth plugins, proxy rotation, realistic browser fingerprints, and human-like browsing patterns.

Error Handling, Retries, and Data Export

Production scrapers need to handle the unexpected: timeouts, navigation failures, missing elements, and rate limiting. Here is how to build resilience into your playwright web scraping scripts, plus how to save your results.

Timeout handling:

# Python
from playwright.sync_api import TimeoutError as PlaywrightTimeout

try:
    page.goto("https://example.com", timeout=15000)
    title = page.locator("h1").text_content(timeout=5000)
except PlaywrightTimeout:
    print("Page or element load timed out")

Exponential backoff for retries:

import time

def scrape_with_retry(page, url, max_retries=3):
    for attempt in range(max_retries):
        try:
            page.goto(url, timeout=15000)
            return page.locator("h1").text_content()
        except Exception as e:
            wait = 2 ** attempt
            print(f"Attempt {attempt+1} failed: {e}. Retrying in {wait}s")
            time.sleep(wait)
    return None

Exporting to CSV:

import csv

with open("results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(results)

Exporting to JSON:

// Node.js
const fs = require('fs');
fs.writeFileSync('results.json', JSON.stringify(results, null, 2));

For larger scraping projects, consider writing results to a database (SQLite for local work, PostgreSQL for production) instead of flat files. This lets you resume interrupted jobs, deduplicate records, and query your data without loading everything into memory.
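
A minimal SQLite sketch that makes inserts idempotent, so a resumed job can safely re-process pages it has already seen:

# Python - incremental, deduplicated storage with the stdlib sqlite3 module
import sqlite3

conn = sqlite3.connect("results.db")
conn.execute("CREATE TABLE IF NOT EXISTS books (title TEXT PRIMARY KEY, price TEXT)")
conn.executemany(
    "INSERT OR IGNORE INTO books (title, price) VALUES (:title, :price)",
    results,  # the list of dicts built earlier
)
conn.commit()
conn.close()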

Playwright vs Puppeteer vs Selenium for Web Scraping

If you are choosing a browser automation tool for scraping, these three are the main contenders. Here is how they compare on the dimensions that matter most for data extraction.

| Feature | Playwright | Puppeteer | Selenium |
| --- | --- | --- | --- |
| Browser engines | Chromium, Firefox, WebKit | Chromium (Firefox experimental) | Chrome, Firefox, Edge, Safari |
| Language support | Python, Node.js, Java, .NET | Node.js (community Python port) | Python, Java, C#, Ruby, JS |
| Auto-waiting | Built-in | Manual waits needed | Manual waits needed |
| Network interception | Full (page.route()) | Full (page.setRequestInterception()) | Limited (requires proxy tools) |
| Parallel contexts | Native browser contexts | Incognito contexts | Separate WebDriver instances |
| Headless mode | Built-in, all browsers | Built-in, Chromium only | Depends on browser driver |
| Community size | Growing rapidly | Large, mature | Largest, most established |
| Stealth ecosystem | playwright-stealth, playwright-extra | puppeteer-extra-plugin-stealth | Limited built-in options |

When to choose Playwright: You want multi-browser support, your team uses Python or Node.js, and you need built-in auto-waiting and network interception without extra libraries. It is the most modern option and is well suited for scraping JavaScript-heavy applications.

When to choose Puppeteer: Your stack is Node.js only and you are targeting Chromium exclusively. Puppeteer's ecosystem is mature, and many existing scraping recipes are written for it.

When to choose Selenium: You need to support legacy browser testing alongside scraping, or your organization already has a Selenium infrastructure. Its language support is the broadest, but it requires more boilerplate for modern scraping tasks.

Key Takeaways

  • Playwright handles dynamic content natively. Its auto-waiting, network interception, and multi-browser support make it a strong choice for scraping JavaScript-rendered sites that simpler HTTP clients cannot handle.
  • Use both Python and Node.js. Playwright's API is nearly identical across languages. Pick whichever fits your existing stack, and switch freely when a project calls for it.
  • Block unnecessary resources. Intercepting and aborting image, font, and stylesheet requests can cut scraping times significantly, especially at scale.
  • Layer your anti-detection. Stealth plugins alone are not enough. Combine them with proxy rotation, user-agent randomization, viewport settings, and human-like delays for reliable access.
  • Build for failure. Wrap every navigation and extraction call in error handling with exponential backoff. Export data incrementally to avoid losing progress on long runs.

FAQ

Is Playwright better than Selenium for web scraping?

For scraping modern, JavaScript-heavy sites, Playwright generally offers a smoother experience. It includes built-in auto-waiting, native network interception, and multi-browser support without requiring extra drivers. Selenium has a broader language ecosystem and a larger community, but it requires more boilerplate code and third-party tools to match Playwright's scraping-specific features.

Can Playwright be detected by anti-bot systems?

Yes. Out of the box, headless Playwright exposes several automation indicators that sophisticated anti-bot services can detect, including the navigator.webdriver property and specific browser fingerprint anomalies. Stealth plugins reduce this surface area, but no browser automation tool is fully undetectable. Sites that deploy advanced bot protection may still flag automated sessions based on behavioral analysis.

Does Playwright support headless and headed browser modes?

Yes. Pass headless=True (Python) or headless: true (Node.js) when launching the browser for headless operation. Set the value to false to open a visible browser window. Headless mode is faster and uses less memory, making it the default for production scraping. Headed mode is primarily useful during development and debugging.

How do you handle CAPTCHAs when scraping with Playwright?

CAPTCHAs are designed to block automation, and there is no reliable programmatic solution built into any browser automation tool. Common approaches include using third-party CAPTCHA-solving services that integrate via API callbacks, rotating residential proxies to reduce CAPTCHA trigger rates, and slowing request frequency to stay below detection thresholds. For high-volume scraping, a managed scraping service that handles CAPTCHAs transparently is often the most practical path.

What is the best programming language to use with Playwright for scraping?

Python and Node.js are the two most popular choices, and both have excellent Playwright support. Python is favored when your pipeline involves data analysis libraries like pandas or when your team already works in Python. Node.js is a natural fit if your stack is JavaScript-centric or you want to reuse front-end selectors. Performance differences between the two bindings are negligible since the actual browser automation runs in the same underlying engine regardless of the calling language.

Conclusion

Playwright web scraping gives you a modern, well-supported toolkit for extracting data from the kinds of sites that defeat simpler approaches. From auto-waiting and network interception to concurrent browser contexts and built-in stealth, it handles the hard parts of browser automation so you can focus on the data.

The techniques covered here, including selector strategies, request interception, proxy configuration, anti-detection, pagination, and structured export, form the foundation of any serious scraping project. Start with a simple script, layer in error handling and retries, and scale up with async contexts as your workload grows.

If you reach the point where managing proxies, CAPTCHAs, and browser infrastructure is eating more engineering time than the actual data extraction, consider offloading that layer to a dedicated service. Our Scraper API handles proxy rotation, CAPTCHA solving, and anti-bot bypasses behind a single endpoint, so you can keep your Playwright parsing logic and stop worrying about the request pipeline.

About the Author

Mihnea-Octavian Manolache, Full Stack Developer @ WebScrapingAPI

Mihnea-Octavian Manolache is a Full Stack and DevOps Engineer at WebScrapingAPI, building product features and maintaining the infrastructure that keeps the platform running smoothly.
