TL;DR: This XPath cheat sheet covers the syntax, predicates, axes, and functions you actually need for web scraping, plus a CSS-to-XPath translation table and runnable Puppeteer and Scrapy examples. Use it as a desk reference next time a CSS selector quietly breaks on a site you depend on.
Introduction
You came here because a selector broke, a layout changed, or someone on your team told you to "just use XPath" and you want it on one page. Fair enough. This XPath cheat sheet is built for that exact moment: a scraper developer mid-project who needs syntax, examples, and a few resilience rules without scrolling through a tutorial.
XPath, short for XML Path Language, is a query language that uses path-style expressions to navigate XML and XML-like documents, including HTML. In a scraping context that gives you something CSS selectors cannot: the ability to walk up the tree, match by visible text, and chain conditions like a small predicate logic. We will cover the core syntax, predicates, axes, the most useful functions, browser-side testing, a CSS-to-XPath translation table, and two short worked examples in Puppeteer and Scrapy that you can drop into a project today. By the end you should be able to read an HTML fragment, write a selector that survives small DOM changes, and verify it before you ship.
What XPath is and why it matters for web scraping
XPath is the XML Path Language: a query syntax that walks the document tree using path-style expressions to identify nodes. Browsers expose the HTML DOM as an XML-like tree, which is why XPath works for web scraping even though HTML is not strict XML. It is a core element of the XSLT standard and shows up across XML tooling and most scraping stacks.
So why do scraper developers reach for it? Three reasons. First, XPath can move both up and down the tree, while CSS selectors only go down and sideways. Second, it can match nodes by their visible text, which is invaluable when classes are randomized but labels are stable. Third, it lets you chain conditions in a single expression. This XPath cheat sheet covers every primitive you need to use those abilities in production.
How XPath thinks about a page: paths, nodes, and steps
An XPath expression is a path made of steps. Each step has an axis (which direction to look), a node test (what kind of node to match), and zero or more predicates (filters in square brackets). Most of the time the axis is implicit because child:: is the default, which is why short forms like //div/span work.
Take //div[@class='quote']/span[1]. The // is shorthand for the descendant-or-self:: axis, so this reads as: find any div whose class attribute equals quote, then take the first span child. There are three node types you care about: element nodes (div, span), attribute nodes (@class, @href), and text nodes accessed via text(). Predicates use 1-based indexing, so [1] is the first match, not the second.
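To make the axis, node test, and predicate breakdown concrete, here is a minimal sketch that runs the expression above in both its short and fully spelled-out forms. It assumes the lxml library is installed, and the HTML fragment is a hypothetical stand-in rather than the real page.

```python
from lxml import html

# Hypothetical fragment, loosely modelled on the quote markup used in this article.
doc = html.fromstring("""
<html><body>
  <div class="quote"><span class="text">First quote</span><span class="tags">a b</span></div>
  <div class="quote"><span class="text">Second quote</span></div>
  <div class="other"><span class="text">Not a quote</span></div>
</body></html>
""")

# Abbreviated syntax: child:: is implied and // stands in for descendant-or-self.
short = doc.xpath("//div[@class='quote']/span[1]")

# The same steps with every axis written out.
verbose = doc.xpath(
    "/descendant-or-self::node()/child::div[attribute::class='quote']"
    "/child::span[position()=1]"
)

short_texts = [el.text for el in short]
verbose_texts = [el.text for el in verbose]
print(short_texts)                   # ['First quote', 'Second quote']
print(short_texts == verbose_texts)  # True
```

Both forms select the same nodes. The short one is what you will normally write, and the long one is useful mainly for seeing which axis each step actually uses.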
Test XPath expressions live in your browser
Before you write a single line of scraper code, prove the expression works in the browser. Two workflows cover the cheat sheet's testing layer.
Ctrl+F (Cmd+F) inside the Elements panel. Open DevTools in Chrome or Firefox, click the Elements tab (Firefox calls it Inspector), then press Ctrl+F or Cmd+F. The search bar accepts CSS selectors and XPath expressions, and the page highlights every match in the DOM tree along with a count like "3 of 12". This is the fastest way to test selectivity.
The $x() console helper. Switch to the Console tab and run $x("//div[@class='quote']"). The helper returns matches as a JavaScript array, so you can probe deeper with $x("//a/@href").length. According to current vendor documentation, $x() ships with Chromium-based DevTools and Firefox DevTools, though you should sanity-check it in your target browser version. As one engineering write-up put it, this is "a great exercise to test your expressions before spending time on your code editor and without putting any stress on the site's server."
For a deeper reference on the language itself, the MDN XPath documentation is the most reliable starting point.
XPath vs. CSS selectors: a quick decision framework
CSS selectors are usually shorter and easier to read, which is why many developers consider CSS the default and reach for XPath only when CSS cannot express the query. That framing is opinion, not benchmark, but it maps to how most production codebases evolve. The honest rule is: pick the simplest selector that survives a small DOM change.
Use this table as a tiebreaker for our XPath vs CSS selectors discussion:
| Situation | Pick |
|---|---|
| Stable | CSS |
| Need to traverse to a parent or ancestor | XPath |
| Match by visible text content | XPath |
| | Either, CSS is shorter |
| Multiple conditions on one element | XPath (predicate chains) |
| Quick one-liner for a known structure | CSS |
| Deeply nested or irregular markup | XPath |
CSS selectors do not have a parent:: direction and they cannot filter on text, so any "click the row that contains 'Active'" pattern is naturally XPath territory.
Core XPath cheat sheet: syntax essentials
These are the building blocks every other section of this XPath cheat sheet composes on top of. An XPath expression at minimum has a tag name, optionally an attribute, and optionally a value for that attribute. Predicates and functions are layered on top.
| Symbol | Meaning |
|---|---|
| `/` | Select from the root node |
| `//` | Descendant-or-self shorthand |
| `.` | Current node |
| `..` | Parent of the current node |
| `@` | Attribute selector |
| `*` | Any element node |
| `tag[@attr='value']` | Tag filtered by attribute equality |
| `expr1 \| expr2` | Union of two expressions, for example `//h1 \| //h2` |
A useful mental model: / anchors you somewhere, the tag name names the step, the predicate filters the step. Most resilient XPath syntax in real scrapers is a chain of attribute predicates separated by /, not a positional path from the root.
Predicates: filtering by index, attribute, and condition
Predicates are the bracketed conditions that turn a generic step into a precise one. They can be chained, and they evaluate left to right. Remember that XPath predicates are 1-based, so //li[3] is the third li, not the fourth.
Common patterns you will reuse:
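| Predicate | What it does |
|---|---|
| `[n]` | The n-th matching element. |
| `[last()]` | The final match. |
| `[position()<=3]` | First three matches. |
| `[@attr='value']` | Exact attribute equality. |
| `[contains(@attr,'value')]` | Substring match on an attribute. |
| `[starts-with(@attr,'value')]` | Prefix match on an attribute. |
| `[not(condition)]` | Negate any predicate. |
| `[condition1 and condition2]` | Multiple conditions on one element. |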
You can chain predicates: //div[@class='quote'][position()<=5] gives the first five quotes. For a deeper dive into the topic-level decision between predicate chains and CSS pseudo-classes, see our companion piece on XPath vs CSS selectors.
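The predicate patterns above can be sketched against a small list. This is an illustrative example that assumes lxml is available; the markup, class names, and `data-id` values are invented for the demonstration.

```python
from lxml import html

# Hypothetical product list; class names and data-id values are made up.
doc = html.fromstring("""
<html><body><ul>
  <li class="item featured" data-id="p-1">Alpha</li>
  <li class="item" data-id="p-2">Beta</li>
  <li class="item sold-out" data-id="x-3">Gamma</li>
  <li class="item" data-id="p-4">Delta</li>
</ul></body></html>
""")


def texts(expr):
    """Return the text of every <li> matched by an XPath expression."""
    return [li.text for li in doc.xpath(expr)]


third = texts("//li[3]")                      # 1-based index
final = texts("//li[last()]")                 # last match
first_two = texts("//li[position()<=2]")      # positional range
exact = texts("//li[@class='item']")          # whole attribute must equal 'item'
prefixed = texts("//li[starts-with(@data-id,'p-')]")
available = texts("//li[not(contains(@class,'sold-out'))]")
both = texts("//li[contains(@class,'featured') and @data-id='p-1']")

# Chained predicates apply left to right: filter by prefix first, then take the second survivor.
chained = texts("//li[starts-with(@data-id,'p-')][2]")

print(third, final, first_two, exact, chained)
```

Note that `[@class='item']` skips the items with two classes, which is why `contains()` or a token match is usually the safer choice on real pages.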
Axes: traversing up, down, and across the DOM
This is where XPath earns its place over CSS. An axis says which direction to walk from the context node, and the explicit axis::node-test form unlocks moves CSS cannot make. The table below lists the axes that matter most in a scraper.
| Axis | What it returns |
|---|---|
| `child::` | Direct children (the default axis) |
| `descendant::` | All descendants |
| `descendant-or-self::` | The axis that the `//` shorthand abbreviates |
| `parent::` | The immediate parent |
| `ancestor::` | Every ancestor up to the root |
| `ancestor-or-self::` | Same plus the node itself. Handy when the match may be the wrapper or a child. |
| `following-sibling::` | Siblings after this node |
| `preceding-sibling::` | Siblings before this node |
| `attribute::` | Attributes of the node (`@` is the shorthand) |
| `self::` | The current node. Used in predicates. |
The two you will use constantly are ancestor:: and following-sibling::. following:: and preceding:: axes also exist but are rarely needed.
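As a rough illustration of those two axes (plus `preceding-sibling::`), the sketch below starts from a cell or label found by its text and walks up or sideways from there. It assumes lxml, and the table, the `data-row` attribute, and the definition list are hypothetical.

```python
from lxml import html

# Hypothetical markup: an orders table and a definition list of product facts.
doc = html.fromstring("""
<html><body>
<table id="orders">
  <tr><th>Order</th><th>Status</th></tr>
  <tr data-row="1"><td>A-100</td><td>Active</td></tr>
  <tr data-row="2"><td>A-101</td><td>Closed</td></tr>
</table>
<dl><dt>Price</dt><dd>19.99</dd><dt>Stock</dt><dd>4</dd></dl>
</body></html>
""")

# Walk UP: from the cell that says "Active" to its row, then read an attribute.
active_row = doc.xpath("//td[normalize-space()='Active']/ancestor::tr/@data-row")

# Walk SIDEWAYS (backwards): the cell before "Active" in the same row.
active_order = doc.xpath("//td[normalize-space()='Active']/preceding-sibling::td/text()")

# Walk SIDEWAYS (forwards): the first <dd> after the "Price" label.
price = doc.xpath("//dt[normalize-space()='Price']/following-sibling::dd[1]/text()")

# ancestor:: keeps climbing past the parent, here all the way to the table.
table_id = doc.xpath("//td[normalize-space()='Closed']/ancestor::table/@id")

print(active_row, active_order, price, table_id)
```

The label-then-sibling pattern in `price` is the one that tends to survive redesigns, because it depends on visible text rather than on class names.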
Wildcards and node tests
Wildcards loosen the node test on a step, which is useful when tags differ but structure does not.
| Token | Matches |
|---|---|
| `*` | Any element node. |
| `@*` | Any attribute node. |
| `node()` | Any node, including text and comments. |
| `text()` | Text nodes only. |
| `comment()` | HTML comments, occasionally useful for marker-driven scraping. |
The key distinction: //div/* skips text and comments, while //div/node() includes them. Reach for text() when you specifically want strings.
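The difference between the node tests is easiest to see on an element with mixed content. The following is a small sketch, assuming lxml and a made-up `div` that holds text, elements, and a comment.

```python
from lxml import html

# Hypothetical element with mixed content: text, two child elements, one comment.
doc = html.fromstring(
    '<html><body><div id="box">intro <b>bold</b>'
    "<!-- marker:start --> tail <i>it</i></div></body></html>"
)

elements = doc.xpath("//div[@id='box']/*")          # element children only
all_nodes = doc.xpath("//div[@id='box']/node()")    # elements, text, and comments
text_nodes = doc.xpath("//div[@id='box']/text()")   # direct text children only
comments = doc.xpath("//div[@id='box']/comment()")  # comment nodes only
attributes = doc.xpath("//div[@id='box']/@*")       # every attribute value

element_tags = [el.tag for el in elements]
clean_text = [t.strip() for t in text_nodes]
marker = comments[0].text.strip()

print(element_tags, len(all_nodes), clean_text, marker, attributes)
```

Here `*` returns two nodes while `node()` returns five, which is the distinction to keep in mind when a count looks off.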
Essential XPath functions for scraping
Functions inside predicates are how you handle messy real-world HTML: stray whitespace, dynamic class names, case mismatches, and counts. The XPath functions below cover the large majority of everyday scraping cases.
String functions
| Function | Use case |
|---|---|
| `contains()` | Match dynamic or compound class names. |
| `starts-with()` | Filter internal product links. |
| `ends-with()` | XPath 2.0 only. In XPath 1.0, combine `substring()` and `string-length()` instead. |
| `normalize-space()` | Strip surrounding whitespace before comparing text. |
| `translate()` | Crude lowercase for case-insensitive matching. |
| `substring()` | Extract a slice of an attribute value. |
Numeric and node functions
| Function | Use case |
|---|---|
| `count()` | Number of rows on a results page. |
| `position()` | Pick even-indexed nodes. |
| `last()` | The final item, regardless of length. |
| `name()` | Filter by tag dynamically. |
| `not()` | Skip disabled inputs. |
For attribute-only outputs, a function-style call like string(@href) ensures you get the value as a string, not the attribute node, which matters in some scrapers when chaining into a regex or trim step.
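Here is a hedged sketch of the most common functions in action, including the `string(@href)` form and an XPath 1.0 stand-in for `ends-with()`. It assumes lxml, which evaluates XPath 1.0; the links and class names are invented.

```python
from lxml import html

# Hypothetical links; hrefs and class names are invented for the demo.
doc = html.fromstring("""
<html><body>
  <a class="btn btn-primary" href="/products/42">  Shop   now </a>
  <a class="btn" href="https://example.com/out">External</a>
  <a class="btn btn-cta" href="/products/7">SHOP</a>
</body></html>
""")

UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = "abcdefghijklmnopqrstuvwxyz"

by_class = doc.xpath("//a[contains(@class,'btn-')]/@href")
internal = doc.xpath("//a[starts-with(@href,'/products/')]/@href")

# normalize-space() trims the ends and collapses inner runs of whitespace.
label = doc.xpath("normalize-space(//a[1])")

# translate() folds case so 'SHOP' matches 'shop'.
shop = doc.xpath(
    f"//a[translate(normalize-space(.),'{UPPER}','{LOWER}')='shop']/@href"
)

# XPath 1.0 substitute for ends-with(@href, '/out').
external = doc.xpath(
    "//a[substring(@href, string-length(@href) - string-length('/out') + 1) = '/out']/@href"
)

link_count = doc.xpath("count(//a)")              # returned as a float
product_id = doc.xpath("substring(//a[1]/@href, 11)")
href_string = doc.xpath("string(//a[2]/@href)")   # plain string, not a node list

print(label, shop, link_count, product_id, href_string)
```

Expressions that start with a function name, such as `count(...)` or `string(...)`, return a single value instead of a list of nodes, so they are convenient when the next step is a regex or a numeric comparison.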
CSS-to-XPath quick translation table
If your team thinks in CSS because you came from BeautifulSoup or Cheerio, this table is a quick migration aid. The XPath cheat sheet equivalents are not always shorter, but they are always more expressive.
| CSS | XPath |
|---|---|
| `.classname` | `//*[contains(concat(' ',normalize-space(@class),' '),' classname ')]` |
The concat(' ',normalize-space(@class),' ') pattern looks ugly but is the safest way to mimic CSS class semantics in XPath, since XPath 1.0 has no native "token in a space-separated list" operator. For a side-by-side primer organized by CSS feature, see our CSS Selectors Cheat Sheet.
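To show why the padded form is safer than a bare `contains()`, the sketch below compares the two on three near-identical class attributes. It assumes lxml; `has_class` is a hypothetical helper written for this example, not part of any library.

```python
from lxml import html

# Hypothetical cards: one exact class, one multi-class with extra spaces, one lookalike.
doc = html.fromstring("""
<html><body>
  <div class="card">plain</div>
  <div class="card  featured">two classes</div>
  <div class="card-wide">lookalike</div>
</body></html>
""")


def has_class(name):
    """Hypothetical helper: build an XPath 1.0 predicate matching one class token."""
    return f"contains(concat(' ', normalize-space(@class), ' '), ' {name} ')"


# Naive substring match also catches 'card-wide'.
naive = [d.text for d in doc.xpath("//div[contains(@class,'card')]")]

# Token match behaves like the CSS selector div.card.
token = [d.text for d in doc.xpath(f"//div[{has_class('card')}]")]

print(naive)  # ['plain', 'two classes', 'lookalike']
print(token)  # ['plain', 'two classes']
```

Padding both sides with spaces turns the class attribute into a list where every token is surrounded by spaces, so a search for `' card '` can only match a whole token.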
Worked example: extracting data with Puppeteer and Scrapy
To prove the same expression travels across stacks, here are two short scrapers that hit quotes.toscrape.com and pull every quote and its author. We confirmed in DevTools that //div[@class='quote'] matches roughly ten quotes on the homepage at the time of writing, which lines up with what the rendered page shows.
Puppeteer (Node.js). Initialise the project with npm init -y and npm install puppeteer, then drop this into index.js. The snippet uses ES module syntax and top-level await, so set "type": "module" in package.json or name the file index.mjs. Note that recent Puppeteer releases have moved toward locator APIs, so confirm page.$x against the version in your package.json.

```js
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://quotes.toscrape.com/');

// page.$x() exists only in older Puppeteer releases; where it is missing,
// page.$$() with the "xpath/" selector prefix is the usual replacement.
const rows = await page.$x("//div[@class='quote']");
const out = [];
for (const row of rows) {
  // $eval() takes CSS selectors, not XPath, so the inner lookups use class selectors.
  const text = await row.$eval('span.text', el => el.textContent);
  const author = await row.$eval('small.author', el => el.textContent);
  out.push({ text, author });
}
console.log(out);
await browser.close();
```

Scrapy (Python). Inside a spider, response.xpath accepts the same expressions and exposes .get() and .getall() for single and multi-value extraction.
```python
def parse(self, response):
    # The outer query finds each quote block; the leading "." keeps the
    # inner queries relative to that block instead of the whole page.
    for q in response.xpath("//div[@class='quote']"):
        yield {
            "text": q.xpath(".//span[@class='text']/text()").get(),
            "author": q.xpath(".//small[@class='author']/text()").get(),
        }
```

According to community reports, Cheerio and Beautiful Soup do not support XPath directly, so Scrapy or lxml is the usual Python pairing and Puppeteer or Playwright is the JavaScript pairing. Verify against your installed versions before standardising on either. For a full project walkthrough, our Scrapy and Puppeteer guides cover setup, scheduling, and proxy wiring.
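For the lxml pairing mentioned above, a minimal offline sketch of the same extraction might look like this. The HTML is a static, simplified stand-in for the quotes page (an assumption, so the example needs no network access), and it also shows what happens when the leading `.` is dropped from a nested query.

```python
from lxml import html

# Static, simplified stand-in for the quotes page; the real markup may differ.
page = html.fromstring("""
<html><body>
  <div class="quote">
    <span class="text">Quote one</span>
    <span>by <small class="author">Author A</small></span>
  </div>
  <div class="quote">
    <span class="text">Quote two</span>
    <span>by <small class="author">Author B</small></span>
  </div>
</body></html>
""")

quotes = page.xpath("//div[@class='quote']")

# Relative queries (leading ".") stay inside the current quote block.
rows = [
    {
        "text": q.xpath("string(.//span[@class='text'])"),
        "author": q.xpath("string(.//small[@class='author'])"),
    }
    for q in quotes
]

# Without the ".", the path is absolute and searches the whole document,
# even though it is called on a single quote element.
leaked = quotes[0].xpath("//small[@class='author']/text()")

print(rows)
print(leaked)  # ['Author A', 'Author B']
```

The `leaked` result is a common source of "every row has the same author" bugs, which is why the nested expressions in both scrapers start with `.//`.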
Tips for writing resilient XPath in production scrapers
Five rules close out the cheat sheet, learned the hard way from spiders that broke at 3 a.m.:
- Avoid absolute paths starting from `/html/body/...`. They break on the first layout change a designer ships.
- Prefer attribute predicates over positional indices whenever a stable `class` or `data-*` attribute is available.
- Wrap text comparisons in `normalize-space()` so leading and trailing whitespace cannot quietly break equality checks.
- Use `contains()` for dynamic class names and `or` to handle known variants: `//a[contains(@class,'btn-primary') or contains(@class,'btn-cta')]`.
- Keep selectors descriptive. `//div[@class='quote']/span[@class='text']` survives more redesigns than `//span`.
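The first and last rules can be checked with a small before-and-after experiment. This is an illustrative sketch assuming lxml; both HTML versions and the wrapper `div` names are hypothetical, standing in for a redesign that adds containers around unchanged content.

```python
from lxml import html

# Version 1 of a hypothetical page.
before = html.fromstring("""
<html><body>
  <div class="quote"><span class="text">Stay curious.</span></div>
</body></html>
""")

# Same content after a hypothetical redesign wraps it in two new containers.
after = html.fromstring("""
<html><body>
  <div id="wrapper"><div class="content">
    <div class="quote"><span class="text">Stay curious.</span></div>
  </div></div>
</body></html>
""")

absolute = "/html/body/div/span/text()"
anchored = "//div[@class='quote']/span[@class='text']/text()"

results = {
    "absolute": (before.xpath(absolute), after.xpath(absolute)),
    "anchored": (before.xpath(anchored), after.xpath(anchored)),
}

for name, (old, new) in results.items():
    print(name, old, new)
```

The absolute path returns the quote on the old layout and nothing on the new one, while the attribute-anchored path returns it on both.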
Key Takeaways
- XPath beats CSS when you need to walk up the tree, match by visible text, or chain multiple conditions in a single selector.
- The core syntax is small: `/`, `//`, `.`, `..`, `@`, predicates, and the union operator cover most expressions in this XPath cheat sheet.
- Axes are the feature that most justifies switching from CSS, and `ancestor::`, `following-sibling::`, and `preceding-sibling::` are the three you will reach for most.
- Verify every expression in DevTools with Ctrl+F or `$x()` before you wire it into a scraper. It is faster than redeploying a broken spider.
- Production-safe XPath leans on attributes, `contains()`, and `normalize-space()`, and stays well away from hard-coded positional paths.
FAQ
Does XPath work the same way in Selenium, Puppeteer, Playwright, and Scrapy?
Mostly yes, with two caveats. All four engines accept XPath 1.0 expressions and return matching nodes, but the wrapper methods differ: Selenium uses find_element(By.XPATH, ...), Scrapy uses response.xpath(...), Playwright uses page.locator("xpath=..."), and Puppeteer historically used page.$x() though recent releases prefer locator APIs. Check your library version before copy-pasting code from older tutorials.
Why does my XPath work in Chrome DevTools but return nothing in my Python script?
Almost always because the page renders content with JavaScript and your script fetched the raw HTML, so the nodes the browser shows do not exist in the response. Confirm by viewing the page source with Ctrl+U rather than the rendered Elements tab. The fix is to use a headless browser like Playwright or to call a documented JSON endpoint the page is hitting.
How do I write a case-insensitive XPath match for text or attributes?
XPath 1.0 has no lower-case() function, so the common workaround uses translate() to fold characters: //a[translate(text(),'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz')='shop']. If your engine supports XPath 2.0 or 3.1, lower-case() and the matches() regex function are much cleaner. Browser DevTools evaluate XPath 1.0 only, so test the translate() form there and the 2.0 form in the engine that will actually run it.
How can I scrape an element whose class name changes on every page load?
Anchor on something stable: an attribute like data-id or aria-label, a child element with a fixed tag and text, or a parent landmark. If only part of the class is stable, contains(@class,'product-card') works. When even that is dynamic, walk up to a stable ancestor and back down: //section[@aria-label='Results']//article[1] is more durable than any class-based selector.
How do I select an element based on the text of one of its children?
Use a predicate that filters by descendant text. For example, //tr[td[normalize-space()='Active']] returns table rows that contain a cell whose trimmed text equals Active. If you need the matching cell instead of the row, anchor on it directly: //td[normalize-space()='Active']. Wrapping comparisons in normalize-space() is what makes the match robust to whitespace.
Conclusion
A good XPath cheat sheet is less about exotic syntax and more about a small, repeatable toolkit you can apply under deadline pressure. Walk the tree with axes when CSS runs out of moves, anchor on attributes and visible text instead of positional paths, and verify every expression in DevTools before it lands in a scraper. If you keep this page open while you write selectors for the next week, the patterns will stick.
XPath solves the parsing problem, but it does not solve the fetch problem. Real scraping projects spend most of their wall-clock time fighting rate limits, fingerprinting, and rendered-vs-raw HTML mismatches, which is exactly what we built WebScrapingAPI to handle. Point our Scraper API at any URL, get clean HTML back, and parse it with the same XPath expressions you tested in DevTools. That way the only selector tuning you do is on the parser side, where it belongs.




