Mihai Maxim | Last updated on May 13, 2026 | 13 min read

XPath Cheat Sheet for Web Scraping: Syntax, Axes, and Real Code

TL;DR: This XPath cheat sheet covers the syntax, predicates, axes, and functions you actually need for web scraping, plus a CSS-to-XPath translation table and runnable Puppeteer and Scrapy examples. Use it as a desk reference next time a CSS selector quietly breaks on a site you depend on.

Introduction

You came here because a selector broke, a layout changed, or someone on your team told you to "just use XPath" and you want it on one page. Fair enough. This XPath cheat sheet is built for that exact moment: a scraper developer mid-project who needs syntax, examples, and a few resilience rules without scrolling through a tutorial.

XPath, short for XML Path Language, is a query language that uses path-style expressions to navigate XML and XML-like documents, including HTML. In a scraping context that gives you something CSS selectors cannot: the ability to walk up the tree, match by visible text, and chain conditions like a small predicate logic. We will cover the core syntax, predicates, axes, the most useful functions, browser-side testing, a CSS-to-XPath translation table, and two short worked examples in Puppeteer and Scrapy that you can drop into a project today. By the end you should be able to read an HTML fragment, write a selector that survives small DOM changes, and verify it before you ship.

What XPath is and why it matters for web scraping

XPath is the XML Path Language: a query syntax that walks the document tree using path-style expressions to identify nodes. Browsers expose the HTML DOM as an XML-like tree, which is why XPath works for web scraping even though HTML is not strict XML. It is a core element of the XSLT standard and shows up across XML tooling and most scraping stacks.

So why do scraper developers reach for it? Three reasons. First, XPath can move both up and down the tree, while CSS selectors only go down and sideways. Second, it can match nodes by their visible text, which is invaluable when classes are randomized but labels are stable. Third, it lets you chain conditions in a single expression. This XPath cheat sheet covers every primitive you need to use those abilities in production.

How XPath thinks about a page: paths, nodes, and steps

An XPath expression is a path made of steps. Each step has an axis (which direction to look), a node test (what kind of node to match), and zero or more predicates (filters in square brackets). Most of the time the axis is implicit because child:: is the default, which is why short forms like //div/span work.

Take //div[@class='quote']/span[1]. The // is shorthand for the descendant-or-self:: axis, so this reads as: find any div whose class attribute equals quote, then take the first span child. There are three node types you care about: element nodes (div, span), attribute nodes (@class, @href), and text nodes accessed via text(). Predicates use 1-based indexing, so [1] is the first match, not the second.
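To see how those steps evaluate, here is a minimal sketch that uses Python's standard-library xml.etree.ElementTree as a stand-in engine. Two assumptions to keep in mind: ElementTree implements only a subset of XPath and needs well-formed markup, so the sample page below is hypothetical, and the path starts with .// (relative to the root element) instead of //. A full engine such as lxml or Scrapy would accept the expression from the text unchanged.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (ElementTree cannot parse sloppy HTML).
SAMPLE = """
<html><body>
  <div class="quote"><span>First quote</span><span>Author A</span></div>
  <div class="note"><span>Not a quote</span></div>
  <div class="quote"><span>Second quote</span><span>Author B</span></div>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# Step 1: any div whose class equals "quote".
# Step 2: the first span child of each such div ([1] is 1-based).
first_spans = root.findall(".//div[@class='quote']/span[1]")
first_texts = [span.text for span in first_spans]

# [2] picks the second span child of each matching div.
second_texts = [s.text for s in root.findall(".//div[@class='quote']/span[2]")]

print(first_texts)
print(second_texts)
```

The div with class "note" never reaches the second step, which is the attribute predicate doing its job before the positional one runs.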

Test XPath expressions live in your browser

Before you write a single line of scraper code, prove the expression works in the browser. Two workflows cover most testing needs.

Ctrl+F (Cmd+F) inside the Elements panel. Open DevTools in Chrome or Firefox, click the Elements tab, then press Ctrl+F or Cmd+F. The search bar accepts CSS selectors and XPath expressions, and the page highlights every match in the DOM tree along with a count like "3 of 12". This is the fastest way to test selectivity.

The $x() console helper. Switch to the Console tab and run $x("//div[@class='quote']"). The helper returns matches as a JavaScript array, so you can probe deeper with $x("//a/@href").length. According to current vendor documentation, $x() ships with Chromium-based DevTools and Firefox DevTools, though you should sanity-check it in your target browser version. As one engineering write-up put it, this is "a great exercise to test your expressions before spending time on your code editor and without putting any stress on the site's server."

For a deeper reference on the language itself, the MDN XPath documentation is the most reliable starting point.

XPath vs. CSS selectors: a quick decision framework

CSS selectors are usually shorter and easier to read, which is why many developers consider CSS the default and reach for XPath only when CSS cannot express the query. That framing is opinion, not benchmark, but it maps to how most production codebases evolve. The honest rule is: pick the simplest selector that survives a small DOM change.

Use this table as a tiebreaker when choosing between XPath and CSS selectors:

| Situation | Pick |
| --- | --- |
| Stable id or unique class | CSS |
| Need to traverse to a parent or ancestor | XPath |
| Match by visible text content | XPath |
| :nth-of-type style positional pick | Either, CSS is shorter |
| Multiple conditions on one element | XPath (predicate chains) |
| Quick one-liner for a known structure | CSS |
| Deeply nested or irregular markup | XPath |

CSS selectors have no parent:: direction (the newer :has() pseudo-class covers some of these cases, but support varies across scraping libraries) and they cannot filter on text, so any "click the row that contains 'Active'" pattern is naturally XPath territory.

Core XPath cheat sheet: syntax essentials

These are the building blocks every other section of this XPath cheat sheet composes on top of. An XPath expression at minimum has a tag name, optionally an attribute, and optionally a value for that attribute. Predicates and functions are layered on top.

| Symbol | Meaning | Example | Reads as |
| --- | --- | --- | --- |
| / | Select from the root node | /html/body | The body directly under the root html |
| // | Descendant-or-self shorthand | //a | Every a anywhere in the document |
| . | Current node | .//span | A span descended from the context node |
| .. | Parent of the current node | //span/.. | The parent element of any span |
| @ | Attribute selector | //img/@src | The src attribute of every img |
| * | Any element node | //div/* | Any element child of any div |
| tag[@attr='value'] | Tag filtered by attribute equality | //a[@rel='nofollow'] | Every a whose rel equals nofollow |
| `expr1 \| expr2` | Union of two expressions | `//h1 \| //h2` | All h1 and h2 elements combined |

A useful mental model: / anchors you somewhere, the tag name names the step, the predicate filters the step. Most resilient XPath in real scrapers is a chain of steps with attribute predicates, not a positional path from the root.
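The symbols above can be tried offline with a short sketch. It assumes Python's standard-library xml.etree.ElementTree, which supports only part of XPath: paths must be relative (.// instead of //), attribute values are read with .get() instead of an @src step, and the union operator is not available. The sample markup is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment used only for illustration.
PAGE = ET.fromstring(
    "<html><body><div id='main'>"
    "<span>Price</span>"
    "<img src='/a.webp' alt='A'/>"
    "<a href='/x' rel='nofollow'>X</a>"
    "</div></body></html>"
)

body = PAGE.find("./body")                        # ~ /html/body
nofollow = PAGE.findall(".//a[@rel='nofollow']")  # ~ //a[@rel='nofollow']
span_parents = PAGE.findall(".//span/..")         # ~ //span/..
div_children = PAGE.findall(".//div/*")           # ~ //div/*

# ElementTree returns elements only, so attributes are read with .get().
img_srcs = [img.get("src") for img in PAGE.findall(".//img")]  # ~ //img/@src

print([child.tag for child in div_children], img_srcs)
```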

Predicates: filtering by index, attribute, and condition

Predicates are the bracketed conditions that turn a generic step into a precise one. They can be chained, and they evaluate left to right. Remember that XPath predicates are 1-based, so //li[3] is the third li, not the fourth.

Common patterns you will reuse:

| Predicate | What it does |
| --- | --- |
| [n] | The n-th matching child of each parent. //tr[1] is the first row of every table section; use (//tr)[1] for the first row in the whole document. |
| [last()] | The final match under each parent. //li[last()] is the last item of every list. |
| [position()<=3] | First three matches. |
| [@class='quote'] | Exact attribute equality. |
| [contains(@class,'btn')] | Substring match on an attribute. |
| [starts-with(@href,'https')] | Prefix match on an attribute. |
| [not(@disabled)] | Negate any predicate. |
| [@type='text' and @name='q'] | Multiple conditions on one element. |

You can chain predicates: //div[@class='quote'][position()<=5] keeps the first five quote divs under each parent, because the second predicate counts positions among the nodes that passed the first. For the broader trade-off between predicate chains and CSS pseudo-classes, see our companion piece on XPath vs CSS selectors.
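A few of these predicates can be checked with a minimal sketch, again assuming Python's standard-library xml.etree.ElementTree and a hypothetical list. ElementTree handles [n], [last()], and exact attribute equality but not contains() or position() comparisons, so only the first group is shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical list; note the second item carries two class tokens.
MENU = ET.fromstring("""
<ul>
  <li class="item">one</li>
  <li class="item featured">two</li>
  <li class="item">three</li>
  <li class="item">four</li>
</ul>
""".strip())

third = MENU.find("./li[3]").text        # 1-based: the third li
last = MENU.find("./li[last()]").text    # the final li
# Exact equality compares the whole attribute value, so "item featured" is skipped.
exact = [li.text for li in MENU.findall("./li[@class='item']")]

print(third, last, exact)
```

The skipped second item is the reason contains(@class, ...) shows up so often in real scrapers: exact equality fails as soon as an element carries more than one class.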

Axes: traversing up, down, and across the DOM

This is where XPath earns its place over CSS. An axis says which direction to walk from the context node, and the explicit axis::node-test form unlocks moves CSS cannot make. This section of the cheat sheet lists the XPath axes with one scraping use per row.

| Axis | What it returns | Scraping example |
| --- | --- | --- |
| child:: | Direct children (the default axis) | //ul/child::li lists the immediate li children. |
| descendant:: | All descendants | //article/descendant::a gets every link in an article. |
| descendant-or-self:: | The // shorthand | //section//h3 walks down to any h3 inside section. |
| parent:: | The immediate parent | //span[@class='price']/parent::div returns the wrapper div. |
| ancestor:: | Every ancestor up to the root | //a[@class='sku']/ancestor::tr jumps from a link to its row. |
| ancestor-or-self:: | Same plus the node itself | Handy when the match may be the wrapper or a child. |
| following-sibling:: | Siblings after this node | //dt[.='Price']/following-sibling::dd[1] reads a value. |
| preceding-sibling:: | Siblings before this node | //h2[.='Specs']/preceding-sibling::p[1] grabs the paragraph above. |
| attribute:: | Attributes of the node (@ shorthand) | //img/attribute::alt is the alt text of every image. |
| self:: | The current node | Used in predicates: *[self::h2 or self::h3]. |

The two you will use constantly are ancestor:: and following-sibling::. following:: and preceding:: axes also exist but are rarely needed.
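To make the two workhorse axes concrete, the sketch below emulates ancestor:: and following-sibling:: by hand in Python. This is an illustration of what the axes return, not how an XPath engine implements them: the standard-library xml.etree.ElementTree used here has no explicit axes, so the helper names (ancestors, following_siblings) and the sample table are hypothetical. With lxml or Scrapy you would pass the axis expressions from the table directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical product table.
TABLE = ET.fromstring("""
<table>
  <tr><td><a class="sku" href="/p/1">SKU-1</a></td><td>9.99</td></tr>
  <tr><td><a class="sku" href="/p/2">SKU-2</a></td><td>19.99</td></tr>
</table>
""".strip())

# ElementTree elements have no parent pointer, so build a child -> parent map.
parent_of = {child: parent for parent in TABLE.iter() for child in parent}

def ancestors(node):
    """Yield parent, grandparent, ... nearest first (the ancestor:: axis)."""
    while node in parent_of:
        node = parent_of[node]
        yield node

def following_siblings(node):
    """Return the siblings after node in document order (following-sibling::*)."""
    siblings = list(parent_of[node])
    return siblings[siblings.index(node) + 1:]

link = TABLE.find(".//a[@class='sku']")                      # first sku link
row = next(n for n in ancestors(link) if n.tag == "tr")      # ~ ancestor::tr[1]
price_cell = following_siblings(parent_of[link])[0]          # ~ parent::td/following-sibling::td[1]

print(row.tag, price_cell.text)
```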

Wildcards and node tests

Wildcards loosen the node test on a step, which is useful when tags differ but structure does not.

| Token | Matches |
| --- | --- |
| * | Any element node. //div/* returns every element child of any div. |
| @* | Any attribute node. //img/@* lists every attribute on every img. |
| node() | Any node, including text and comments. //div/node() returns the full child sequence. |
| text() | Text nodes only. //p/text() is paragraph text, excluding child elements. |
| comment() | HTML comments, occasionally useful for marker-driven scraping. |

The key distinction: //div/* skips text and comments, while //div/node() includes them. Reach for text() when you specifically want strings.
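The difference between element children, direct text nodes, and the full string value can be approximated in Python. This is a rough sketch assuming the standard-library xml.etree.ElementTree, which does not model text as nodes: it stores text in the .text and .tail attributes, so the text() step is emulated here instead of queried. The paragraph is a hypothetical sample.

```python
import xml.etree.ElementTree as ET

P = ET.fromstring("<p>Hello <b>bold</b> world</p>")

# ~ p/* : element children only, text is skipped.
element_children = [child.tag for child in P.findall("./*")]

# ~ p/text() : the paragraph's own text nodes, excluding text inside <b>.
own_text = [P.text] + [child.tail for child in P]

# ~ string(p) : every piece of text under the paragraph, concatenated.
full_text = "".join(P.itertext())

print(element_children, own_text, full_text)
```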

Essential XPath functions for scraping

Functions inside predicates are how you handle messy real-world HTML: stray whitespace, dynamic class names, case mismatches, and counts. These XPath functions cover most scraping cases.

String functions

| Function | Example | Use case |
| --- | --- | --- |
| contains(s, sub) | //div[contains(@class,'product-card')] | Match dynamic or compound class names. |
| starts-with(s, prefix) | //a[starts-with(@href,'/product/')] | Filter internal product links. |
| ends-with(s, suffix) | //img[ends-with(@src,'.webp')] | XPath 2.0+ only. XPath 1.0 engines (browsers, lxml, Scrapy) need a substring() and string-length() workaround. |
| normalize-space(s) | //h1[normalize-space()='New Arrivals'] | Strip surrounding whitespace before comparing text. |
| translate(s, from, to) | //a[translate(text(),'SHOP','shop')='shop'] | Crude lowercase for case-insensitive matching. Only the characters listed in from are mapped. |
| substring(s, start, len) | substring(@data-id, 1, 4) | Extract a slice of an attribute value (start is 1-based). |
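For the ends-with() row, the usual XPath 1.0 workaround takes the tail of the string with substring() and string-length() and compares it to the suffix. The sketch below only builds that expression as a string; ends_with_xpath1 is a hypothetical helper name, and the suffix is assumed to contain no single quotes.

```python
def ends_with_xpath1(value_expr: str, suffix: str) -> str:
    """Build an XPath 1.0 test that behaves like ends-with(value_expr, suffix).

    substring(s, start) is 1-based, so the tail of length n starts at
    string-length(s) - n + 1, written below as string-length(s) - (n - 1).
    Assumes the suffix contains no single quote.
    """
    n = len(suffix)
    return (
        f"substring({value_expr}, string-length({value_expr}) - {n - 1})"
        f" = '{suffix}'"
    )

webp_test = ends_with_xpath1("@src", ".webp")
webp_images = f"//img[{webp_test}]"

print(webp_images)
```

When the attribute is shorter than the suffix, the start position drops to zero or below and substring() returns the whole value, which cannot equal the longer suffix, so the test is simply false.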

Numeric and node functions

| Function | Example | Use case |
| --- | --- | --- |
| count(nodes) | count(//tr) | Number of rows on a results page. |
| position() | [position() mod 2 = 0] | Pick even-indexed nodes. |
| last() | //li[last()] | The final item, regardless of length. |
| name() | [name()='figure'] | Filter by tag dynamically. |
| not(expr) | //input[not(@disabled)] | Skip disabled inputs. |

For attribute-only outputs, a function-style call like string(@href) ensures you get the value as a string, not the attribute node, which matters in some scrapers when chaining into a regex or trim step.

CSS-to-XPath quick translation table

If your team thinks in CSS because you came from BeautifulSoup or Cheerio, this table is a quick migration aid. The XPath equivalents are not always shorter, but each can be extended with predicates, axes, and functions that CSS lacks.

| CSS | XPath |
| --- | --- |
| #main | //*[@id='main'] |
| .btn | //*[contains(concat(' ',normalize-space(@class),' '),' btn ')] |
| div.btn | //div[contains(concat(' ',normalize-space(@class),' '),' btn ')] |
| a[href] | //a[@href] |
| a[href='/x'] | //a[@href='/x'] |
| a[href^='/'] | //a[starts-with(@href,'/')] |
| ul > li | //ul/li |
| ul li | //ul//li |
| li:first-child | //*[1][self::li] (//li[1] is li:first-of-type) |
| li:nth-of-type(2) | //li[2] |
| li:last-child | //*[last()][self::li] (//li[last()] is li:last-of-type) |
| h2 + p | //h2/following-sibling::*[1][self::p] |
| :not([disabled]) | //*[not(@disabled)] |

The concat(' ',normalize-space(@class),' ') pattern looks ugly but is the safest way to mimic CSS class semantics in XPath, since XPath 1.0 has no native "token in a space-separated list" operator. For a side-by-side primer organized by CSS feature, see our CSS Selectors Cheat Sheet.
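The class-token pattern is easy to mistype, so one option is to generate it. The sketch below is a hypothetical pair of helpers, not part of any library: class_token_xpath builds the expression, and has_class_token mirrors the same logic in plain Python so you can see why a bare contains() is not enough.

```python
def class_token_xpath(token: str, tag: str = "*") -> str:
    """Build the XPath 1.0 equivalent of the CSS selector `tag.token`."""
    return (
        f"//{tag}[contains(concat(' ', normalize-space(@class), ' '), "
        f"' {token} ')]"
    )

def has_class_token(class_attr: str, token: str) -> bool:
    """Plain-Python mirror of the predicate, for checking the logic offline."""
    # Collapse whitespace and pad with spaces, like concat + normalize-space.
    padded = f" {' '.join(class_attr.split())} "
    return f" {token} " in padded

print(class_token_xpath("btn", "div"))
print(has_class_token("btn  btn-lg", "btn"))   # token present
print(has_class_token("btn-primary", "btn"))   # substring only, not a token
```

A plain contains(@class,'btn') would accept btn-primary as well, which is the false positive the padded form avoids.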

Worked example: extracting data with Puppeteer and Scrapy

To prove the same expression travels across stacks, here are two short scrapers that hit quotes.toscrape.com and pull every quote and its author. We confirmed in DevTools that //div[@class='quote'] matches roughly ten quotes on the homepage at the time of writing, which lines up with what the rendered page shows.

Puppeteer (Node.js). Initialise the project with npm init -y and npm install puppeteer, then save this as index.mjs so Node treats the file as an ES module, which the import statement and top-level await below require. Note that recent Puppeteer releases have moved toward locator APIs, so confirm page.$x against the version in your package.json.

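import puppeteer from 'puppeteer';

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://quotes.toscrape.com/');

// One ElementHandle per quote card.
const rows = await page.$x("//div[@class='quote']");
const out = [];
for (const row of rows) {
  // $eval expects a CSS selector, so the inner XPath lookups go through $x.
  // The leading "." keeps each query relative to the current quote card.
  const [textEl] = await row.$x(".//span[@class='text']");
  const [authorEl] = await row.$x(".//small[@class='author']");
  const text = await textEl.evaluate(el => el.textContent);
  const author = await authorEl.evaluate(el => el.textContent);
  out.push({ text, author });
}
console.log(out);
await browser.close();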

Scrapy (Python). Inside a spider, response.xpath accepts the same expressions and exposes .get() and .getall() for single and multi-value extraction.

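import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"  # hypothetical spider name
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        for q in response.xpath("//div[@class='quote']"):  # one node per quote card
            yield {
                "text": q.xpath(".//span[@class='text']/text()").get(),  # "." keeps it relative to q
                "author": q.xpath(".//small[@class='author']/text()").get(),
            }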

According to community reports, Cheerio and Beautiful Soup do not support XPath directly, so Scrapy or lxml is the usual Python pairing and Puppeteer or Playwright is the JavaScript pairing. Verify against your installed versions before standardising on either. For a full project walkthrough, our Scrapy and Puppeteer guides cover setup, scheduling, and proxy wiring.

Tips for writing resilient XPath in production scrapers

Five rules close out the cheat sheet, learned the hard way from spiders that broke at 3 a.m.:

  • Avoid absolute paths starting from /html/body/.... They break on the first layout change a designer ships.
  • Prefer attribute predicates over positional indices whenever a stable class or data-* attribute is available.
  • Wrap text comparisons in normalize-space() so leading and trailing whitespace cannot quietly break equality checks.
  • Use contains() for dynamic class names and or to handle known variants: //a[contains(@class,'btn-primary') or contains(@class,'btn-cta')].
  • Keep selectors descriptive. //div[@class='quote']/span[@class='text'] survives more redesigns than //span.
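The first and last rules can be demonstrated with a small before-and-after check. This is a sketch under stated assumptions: the two page versions are hypothetical, and Python's standard-library xml.etree.ElementTree stands in for a real engine, so the absolute path is written relative to the html root (./body/div/span mirrors /html/body/div/span).

```python
import xml.etree.ElementTree as ET

# Hypothetical page before and after a redesign that adds a <main> wrapper.
BEFORE = (
    "<html><body><div class='quote'>"
    "<span class='text'>Hi</span></div></body></html>"
)
AFTER = (
    "<html><body><main><div class='quote'>"
    "<span class='text'>Hi</span></div></main></body></html>"
)

ABSOLUTE = "./body/div/span"                                # positional path from the root
RESILIENT = ".//div[@class='quote']/span[@class='text']"    # attribute-anchored path

def extract(html: str, path: str) -> list:
    """Return the text of every element matched by path."""
    return [el.text for el in ET.fromstring(html).findall(path)]

for label, page in (("before", BEFORE), ("after", AFTER)):
    print(label, extract(page, ABSOLUTE), extract(page, RESILIENT))
```

The absolute path returns nothing once the wrapper appears, while the attribute-anchored path keeps matching on both versions.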

Key Takeaways

  • XPath beats CSS when you need to walk up the tree, match by visible text, or chain multiple conditions in a single selector.
  • The core syntax is small: /, //, ., .., @, predicates, and the union operator cover most expressions in this XPath cheat sheet.
  • Axes are the feature that most justifies switching from CSS, and ancestor:: and following-sibling:: are the two you will reach for most.
  • Verify every expression in DevTools with Ctrl+F or $x() before you wire it into a scraper. It is faster than redeploying a broken spider.
  • Production-safe XPath leans on attributes, contains(), and normalize-space(), and stays well away from hard-coded positional paths.

FAQ

Does XPath work the same way in Selenium, Puppeteer, Playwright, and Scrapy?

Mostly yes, with two caveats. All four engines accept XPath 1.0 expressions and return matching nodes, but the wrapper methods differ: Selenium uses find_element(By.XPATH, ...), Scrapy uses response.xpath(...), Playwright uses page.locator("xpath=..."), and Puppeteer historically used page.$x() though recent releases prefer locator APIs. Check your library version before copy-pasting code from older tutorials.

Why does my XPath work in Chrome DevTools but return nothing in my Python script?

Almost always because the page renders content with JavaScript and your script fetched the raw HTML, so the nodes the browser shows do not exist in the response. Confirm by viewing the page source with Ctrl+U rather than the rendered Elements tab. The fix is to use a headless browser like Playwright or to call a documented JSON endpoint the page is hitting.

How do I write a case-insensitive XPath match for text or attributes?

XPath 1.0 has no lower-case() function, so the common workaround uses translate() to fold characters: //a[translate(text(),'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz')='shop']. If your engine supports XPath 2.0 or 3.1, lower-case() and the matches() regex function are much cleaner. Browser DevTools implement XPath 1.0 only, so test the translate() form there and the 2.0+ form in the engine that will actually run it.
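Typing both alphabets by hand invites typos, so the translate() form is a good candidate for a small generator. The helper below is a hypothetical sketch (ci_text_equals is not a library function); it folds ASCII letters only, and it assumes the target text contains no single quotes.

```python
import string

def ci_text_equals(tag: str, text: str) -> str:
    """Build an XPath 1.0 expression matching tag elements whose text
    equals `text`, ignoring ASCII case."""
    upper = string.ascii_uppercase
    lower = string.ascii_lowercase
    # translate() maps each character in `upper` to the one at the same
    # index in `lower`; the comparison value is lowercased to match.
    return f"//{tag}[translate(text(),'{upper}','{lower}')='{text.lower()}']"

shop_link = ci_text_equals("a", "Shop")
print(shop_link)
```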

How can I scrape an element whose class name changes on every page load?

Anchor on something stable: an attribute like data-id or aria-label, a child element with a fixed tag and text, or a parent landmark. If only part of the class is stable, contains(@class,'product-card') works. When even that is dynamic, walk up to a stable ancestor and back down: //section[@aria-label='Results']//article[1] is more durable than any class-based selector.

How do I select an element based on the text of one of its children?

Use a predicate that filters by descendant text. For example, //tr[td[normalize-space()='Active']] returns table rows that contain a cell whose trimmed text equals Active. If you need the matching cell instead of the row, anchor on it directly: //td[normalize-space()='Active']. Wrapping comparisons in normalize-space() is what makes the match robust to whitespace.
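The row-versus-cell distinction can be checked with a minimal sketch. It assumes Python's standard-library xml.etree.ElementTree, which has no normalize-space(), so the hypothetical sample table contains no stray whitespace and the predicates use plain equality ([td='Active'] and [.='Active']). With lxml or Scrapy, use the normalize-space() forms from the answer above.

```python
import xml.etree.ElementTree as ET

# Hypothetical status table.
USERS = ET.fromstring(
    "<table>"
    "<tr id='r1'><td>alice</td><td>Active</td></tr>"
    "<tr id='r2'><td>bob</td><td>Suspended</td></tr>"
    "<tr id='r3'><td>carol</td><td>Active</td></tr>"
    "</table>"
)

# Rows that contain a cell whose text equals "Active".
active_rows = USERS.findall(".//tr[td='Active']")
active_ids = [tr.get("id") for tr in active_rows]

# The matching cells themselves, not their rows.
active_cells = USERS.findall(".//td[.='Active']")

print(active_ids, len(active_cells))
```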

Conclusion

A good XPath cheat sheet is less about exotic syntax and more about a small, repeatable toolkit you can apply under deadline pressure. Walk the tree with axes when CSS runs out of moves, anchor on attributes and visible text instead of positional paths, and verify every expression in DevTools before it lands in a scraper. If you keep this page open while you write selectors for the next week, the patterns will stick.

XPath solves the parsing problem, but it does not solve the fetch problem. Real scraping projects spend most of their wall-clock time fighting rate limits, fingerprinting, and rendered-vs-raw HTML mismatches, which is exactly what we built WebScrapingAPI to handle. Point our Scraper API at any URL, get clean HTML back, and parse it with the same XPath expressions you tested in DevTools. That way the only selector tuning you do is on the parser side, where it belongs.

About the Author
Mihai Maxim, Full Stack Developer @ WebScrapingAPI

Mihai Maxim is a Full Stack Developer at WebScrapingAPI, contributing across the product and helping build reliable tools and features for the platform.
