The ultimate XPath Cheat Sheet. How to easily write powerful selectors.
Mihai Maxim on Dec 16 2022
An XPath Cheat Sheet?
Have you ever needed to write a selector that doesn't depend on CSS classes? If not, consider yourself lucky. If so, our XPath Cheat Sheet is what you need. The web is full of data, and entire businesses depend on combining some of it into new services. APIs are of great use, but not every website offers an open API. Sometimes you'll have to get what you need the old way: by building a scraper for the website. Many modern websites hinder scraping by frequently renaming their CSS classes, which breaks class-based selectors. As a result, it is better to write selectors that rely on something more stable. In this article, you'll learn how to write selectors based on the DOM node layout of the page.
What is XPath and how do I try it?
XPath stands for XML Path Language. It uses a path notation (as in URLs) to provide a flexible way of pointing to any part of an XML document.
XPath is mainly used in XSLT, but can also be used as a much more powerful way of navigating through the DOM of any XML-like language document using XPathExpression, such as HTML and SVG, instead of relying on the Document.getElementById() or Document.querySelectorAll() methods, the Node.childNodes properties, and other DOM Core features. XPath | MDN (mozilla.org)
A path notation?
<html>
<head>
<title>Nothing to see here</title>
</head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
<div>
<h2>My Second Heading</h2>
<p>My second paragraph.</p>
<div>
<h3>My Third Heading</h3>
<p>My third paragraph.</p>
</div>
</div>
</body>
</html>
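There are two types of paths: relative and absolute.
The unique path (or absolute path) to "My third paragraph." is /html/body/div/div/p
A relative path to "My third paragraph." is //body/div/div/p
For "My Second Heading" => //body/div/h2
For "My first paragraph." => //body/p
Notice that I'm using //body. Relative paths use // to skip straight to the desired element instead of spelling out every step from the root. (Strictly speaking, the XPath specification calls any path that starts with / or // an absolute location path; "relative" is used here in the everyday sense of "not anchored at /html".)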
The usage of //<path> also implies that it should look for all occurrences of <path> in the document, regardless of what came before <path>.
For example, //div/p returns both My second paragraph. and My third paragraph.
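If you prefer a script to the browser, the same paths can be tried with Python's standard library. This is only a minimal sketch: xml.etree.ElementTree supports a limited subset of XPath, its paths are written relative to the root element (so ./body/... stands in for /html/body/... and .//div/p for //div/p), and the page layout below is the one assumed from the paths quoted above.

```python
import xml.etree.ElementTree as ET

# Assumed layout of the sample page, inferred from the paths in the text.
PAGE = """<html>
<head><title>Nothing to see here</title></head>
<body>
  <h1>My First Heading</h1>
  <p>My first paragraph.</p>
  <div>
    <h2>My Second Heading</h2>
    <p>My second paragraph.</p>
    <div>
      <h3>My Third Heading</h3>
      <p>My third paragraph.</p>
    </div>
  </div>
</body>
</html>"""

root = ET.fromstring(PAGE)  # root is the <html> element


def texts(path):
    """Return the text of every element matched by an ElementTree path."""
    return [el.text for el in root.findall(path)]


# Stand-in for /html/body/div/div/p: one exact route from the root.
absolute = texts("./body/div/div/p")
# Stand-in for //div/p: every <p> that is a child of any <div>.
anywhere = texts(".//div/p")

print(absolute)
print(anywhere)
```

The first query follows one exact route, while the second one collects matches from every <div> in the document.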
You can test this example in your browser to get a better overview!
Paste the code into a .html file and open it with your browser. Open the developer tools, go to the Elements tab and press Ctrl + F. Paste the XPath locator into the small search bar and press Enter.
You can also get the XPath of any tag by right-clicking on it in the Elements tab and selecting "Copy XPath".
With //div/p in the search bar, pressing Enter again switches between "My second paragraph." and "My third paragraph."
Also, another important thing to know is that it is not necessary for a path to contain // in order to return multiple elements. Let's see what happens when I add another <p> in the last <div>.
/html/body/div/div/p no longer points to a single element: it now matches both <p> tags in the last <div>, even though it contains no //.
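The same effect can be reproduced in a few lines of Python. This is a rough sketch with xml.etree.ElementTree, where ./body/div/div/p stands in for /html/body/div/div/p; the trimmed-down page and the text of the extra <p> are placeholders of mine.

```python
import xml.etree.ElementTree as ET

PAGE = """<html><body>
  <p>My first paragraph.</p>
  <div>
    <p>My second paragraph.</p>
    <div>
      <p>My third paragraph.</p>
    </div>
  </div>
</body></html>"""

root = ET.fromstring(PAGE)
path = "./body/div/div/p"  # stands in for /html/body/div/div/p

before = [p.text for p in root.findall(path)]

# Add another <p> to the innermost <div> (placeholder text).
inner_div = root.find("./body/div/div")
extra = ET.SubElement(inner_div, "p")
extra.text = "My fourth paragraph."

after = [p.text for p in root.findall(path)]

print(before)
print(after)
```

The path itself did not change; only the document did, and that was enough to turn a single match into two.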
If you followed me this far, congratulations, you’re on the right track to XPath mastery. You are now ready to dive into the fun stuff.
The Square Brackets
You can use the square brackets to select specific elements.
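In this case, //body/div/div/p[2] selects only the last <p> tag, the one we just added. Indexes start at 1, and //body/div/div/p[last()] selects the same element without hard-coding its position.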
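Positional predicates are among the few XPath features that Python's xml.etree.ElementTree understands, so they are easy to try in a script. A minimal sketch, assuming a page where the innermost <div> holds two <p> tags (the text of the second one is made up):

```python
import xml.etree.ElementTree as ET

PAGE = """<html><body>
  <div>
    <p>My second paragraph.</p>
    <div>
      <p>My third paragraph.</p>
      <p>My fourth paragraph.</p>
    </div>
  </div>
</body></html>"""

root = ET.fromstring(PAGE)

# Indexes are 1-based: p[1] is the first <p> child of the innermost <div>.
first = [p.text for p in root.findall("./body/div/div/p[1]")]
second = [p.text for p in root.findall("./body/div/div/p[2]")]
# last() avoids hard-coding how many <p> tags there are.
last = [p.text for p in root.findall("./body/div/div/p[last()]")]

print(first, second, last)
```

When the number of siblings may change, [last()] is usually the safer choice than a fixed index.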
You can also use attributes to select your elements.
//body//p[@class="not-important"] => select all the <p> tags that are inside a <body> tag and have the "not-important" class.
//div[@id] => select all the <div> tags that have an id attribute.
//div[@class="p-children"][@id="important"]/p[3] => select the third <p> that is within a <div> tag that has both class="p-children" and id="important"
//div[@class="p-children" and @id="important"]/p[3] => same as above
//div[@class="p-children" or @id="important"]/p[3] => select the third <p> that is within a <div> that has class="p-children" or id="important"
Notice that @ marks the start of an attribute name.
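Attribute predicates can also be tried from Python. The sketch below uses xml.etree.ElementTree, which supports [@attr], [@attr='value'] and chained predicates but not the and/or keywords, so the "or" case is approximated by merging two queries. The page is a hypothetical one that only borrows the class names and ids used above.

```python
import xml.etree.ElementTree as ET

# Hypothetical page; only the class names and ids come from the text.
PAGE = """<html><body>
  <div class="no-content"></div>
  <div class="content">
    <p class="not-important">skip me</p>
    <div class="p-children" id="important">
      <p>I am the first child</p>
      <p>I am the second child</p>
      <p>I am the third child</p>
    </div>
    <div class="p-children">
      <p class="not-important">me too</p>
    </div>
  </div>
</body></html>"""

root = ET.fromstring(PAGE)

# //body//p[@class="not-important"]
not_important = [p.text for p in root.findall("./body//p[@class='not-important']")]

# //div[@id]
divs_with_id = root.findall(".//div[@id]")

# //div[@class="p-children"][@id="important"]/p[3]
third = [p.text for p in
         root.findall(".//div[@class='p-children'][@id='important']/p[3]")]

# Approximation of //div[@class="p-children" or @id="important"]:
# run both queries and drop duplicates while keeping the order found.
either = list(dict.fromkeys(
    root.findall(".//div[@class='p-children']")
    + root.findall(".//div[@id='important']")
))

print(not_important, len(divs_with_id), third, len(either))
```

A full XPath engine would evaluate the and/or forms directly; the merging step is only needed because of ElementTree's reduced syntax.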
XPath provides a set of useful functions that you can use inside the square brackets.
position() => returns the position of the current element within the set being filtered (counting starts at 1)
Ex: //body/div[position()=1] selects the first <div> in the <body>
last() => returns the position of the last element in the set being filtered, so [last()] selects the last element
Ex: //div/p[last()] selects all the last <p> children of all the <div> tags
count(element) => returns the number of elements
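Ex: count(//body/div) returns the number of child <div> tags inside the <body>. The form //body/count(div) only works in XPath 2.0 and later; browsers implement XPath 1.0.
node() or * => node() matches any node (elements, text, comments), while * matches any element
Ex: //div/* selects all the element children of all the <div> tags; //div/node() also includes their text and comment nodes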
text() => returns the text of the element
Ex: //p/text() returns the text of all the <p> elements
concat(string1, string2) => merges string1 with string2
contains(@attribute, "value") => returns true if @attribute contains "value"
Ex: //p[contains(text(),"I am the third child")] selects all the <p> tags whose text contains "I am the third child".
starts-with(@attribute, "value") => returns true if @attribute starts with "value"
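ends-with(@attribute, "value") => returns true if @attribute ends with "value" (XPath 2.0 and later; not available in the XPath 1.0 engines that browsers ship)
substring(@attribute, start_index, length) => returns length characters of the attribute value, starting at start_index (the first character has index 1)
Ex: //p[substring(text(),3,12)="am the third"] selects the <p> tags whose text has "am the third" as its characters 3 to 14, such as "I am the third child"
normalize-space() => returns the text of the element with leading and trailing whitespace removed and inner runs of whitespace collapsed into a single space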
Ex: normalize-space(" example ") = "example"
string-length() => returns the length of the text
Ex: //p[string-length()=20] returns all the <p> tags that have the text length of 20
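To get a feel for what these functions compute, here is a rough Python sketch. xml.etree.ElementTree does not implement the XPath function library, so each function is imitated with plain Python on the element text; the helper names string_value and normalize_space and the sample page are assumptions of mine, not part of any library.

```python
import xml.etree.ElementTree as ET

# Hypothetical page with some messy whitespace in the second <p>.
PAGE = """<html><body>
  <div>
    <p>I am the first child</p>
    <p>  I am the   second child </p>
    <p>I am the third child</p>
  </div>
  <div></div>
</body></html>"""

root = ET.fromstring(PAGE)
paragraphs = root.findall(".//p")


def string_value(el):
    """All the text inside an element, concatenated."""
    return "".join(el.itertext())


def normalize_space(s):
    """Close equivalent of normalize-space(): trim and collapse whitespace."""
    return " ".join(s.split())


# count(//body/div)
div_count = len(root.findall("./body/div"))

# //p[contains(text(), "third")]
has_third = [string_value(p) for p in paragraphs if "third" in string_value(p)]

# //p[starts-with(normalize-space(), "I am the second")]
second = [normalize_space(string_value(p)) for p in paragraphs
          if normalize_space(string_value(p)).startswith("I am the second")]

# //p[string-length() = 20]
length_20 = [string_value(p) for p in paragraphs if len(string_value(p)) == 20]

print(div_count, has_third, second, length_20)
```

In a browser or any full XPath engine you would write the expressions from the comments directly; the Python versions only mirror their results.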
The functions can be a little tricky to remember. Luckily, The Ultimate XPath Cheat Sheet provides helpful examples. Take this one:
//p[text()=concat(substring(//p[@class="not-important"]/text(),1,15), substring(text(),16,20))]
//p[text()=<expression_return_value>] selects all the <p> elements whose text value equals the return value of the expression.
//p[@class="not-important"]/text() returns the text values of all the <p> tags that have class="not-important".
If only one <p> tag satisfies this condition, we can pass that return_value to the substring function.
substring(return_value,1,15) returns the first 15 characters of the return_value string.
substring(text(),16,20) returns up to 20 characters, starting at character 16, of the same text() value that we used in //p[text()=<expression_return_value>]. For a 20-character text, that is the last 5 characters.
Finally, concat() merges the two substrings to produce <expression_return_value>.
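The index arithmetic of substring() is easy to get wrong, so here is a small Python sketch of it. The helper xpath_substring is a hypothetical name of mine that imitates XPath 1.0 substring() for integer arguments only, and the sample string is just an assumed text() value.

```python
def xpath_substring(s, start, length):
    """Imitate XPath 1.0 substring() for integer arguments.

    XPath counts characters from 1 and takes a length, not an end index:
    the result holds the characters at positions p with
    start <= p < start + length.
    """
    begin = max(start, 1) - 1
    end = max(start + length, 1) - 1
    return s[begin:end]


text = "I am the third child"  # assumed text() value, 20 characters long

head = xpath_substring(text, 1, 15)    # first 15 characters
tail = xpath_substring(text, 16, 20)   # from character 16 to the end
rebuilt = head + tail                  # what concat(head, tail) returns
middle = xpath_substring(text, 3, 12)  # the example from the function list

print(repr(head), repr(tail), repr(middle))
```

Because the third argument is a length, asking for 20 characters from position 16 simply stops at the end of the string.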
XPath supports path nesting. That’s cool, but what exactly do I mean by path nesting?
Let's try something new: /html/body/div[./div[./p]]
You can read it as "Select all the <div> children of the <body> that have a <div> child which, in turn, is the parent of a <p> element."
If you don't care about the parent of the <p> element, you can write: /html/body/div[.//p]
This now translates to "Select all the div children of the body that have a <p> descendant"
In this particular example, /html/body/div[./div[./p]] and /html/body/div[.//p] yield the same result.
By now, I'm sure that you are wondering what is up with those dots in ./ and .//
The dot represents the self element. When used in a pair of brackets, it references the specific tag that opened them. Let's dive a little deeper.
In our example, /html/body/div returns two divs:
<div class="no-content"> and <div class="content">
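/html/body/div[.//p] translates to:
For each <div> returned by /html/body/div, check whether that <div> has a <p> descendant (.//p). If it does, keep it.
In our case, the dot ensures that the predicate is evaluated relative to the very <div> being tested, so /html/body/div and .//p refer to the same <div>.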
Now let's look at what would have happened if it didn't.
/html/body/div[/html/body/div//p] would return both
<div class="no-content"> and <div class="content">
Why? Because a path that starts with / is evaluated from the document root, not from the <div> being tested, so /html/body/div//p gives the same answer for both <div> tags.
/html/body/div[/html/body/div//p] actually translates to "Select all the <div> children of the <body> if /html/body/div//p is true".
/html/body/div//p is true if the <body> has a <div> child and that child has a <p> descendant. In our case, this statement is always true.
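The difference between a predicate that starts with a dot and one that starts at the root can be imitated in Python. xml.etree.ElementTree cannot parse nested predicates, so in this sketch each predicate becomes an explicit test on the candidate <div>; the page is a hypothetical one that reuses the class names from the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical page reusing the class names from the example.
PAGE = """<html><body>
  <div class="no-content"></div>
  <div class="content">
    <div class="p-children">
      <p>I am a paragraph</p>
    </div>
  </div>
</body></html>"""

root = ET.fromstring(PAGE)
candidates = root.findall("./body/div")  # /html/body/div

# /html/body/div[./div[./p]]: the test runs on each candidate in turn.
nested = [d.get("class") for d in candidates
          if d.find("./div/p") is not None]

# /html/body/div[.//p]: again relative to the candidate being tested.
relative = [d.get("class") for d in candidates
            if d.find(".//p") is not None]

# /html/body/div[/html/body/div//p]: the predicate starts at the root,
# so it is computed once and gives the same answer for every candidate.
document_has_p = root.find("./body/div//p") is not None
absolute = [d.get("class") for d in candidates if document_has_p]

print(nested, relative, absolute)
```

Only the last query keeps the empty <div>, because its predicate never looks at the <div> it is supposed to filter.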
Many XPath cheat sheets don't mention nesting, which is a shame, because I consider it amazing. It enables you to scan the document for one pattern and return something else. The only downside is that queries written this way can become hard to follow. The good news is that there are other ways of doing it.
You can use axes to locate nodes relative to other context nodes.
Let’s explore some of them.
The Four Main Axes
//p/ancestor::div => selects all the divs that are ancestors of <p>
How I read it: Get all the <p> tags, for each <p> look through its ancestors. If you find <div> tags, select them.
//p/parent::div => selects all the <div> tags that are parents of <p>
How I read it: Get all the <p> tags and look at each one's parent. If the parent is a <div>, select it.
//div/child::p => selects all the <p> tags that are children of <div> tags.
How I read it: Get all the <div> tags and their children, if the child is a <p>, select it.
//div/descendant::p => selects all the <p> tags that are descendants of <div> tags.
How I read it: Get all the <div> tags and their descendants, if the descendant is a <p>, select it.
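The "How I read it" descriptions translate almost word for word into code. Python's xml.etree.ElementTree has no axis support and elements do not know their parent, so this sketch builds a parent lookup first; parent_of and ancestors are hypothetical helper names, and the page is made up.

```python
import xml.etree.ElementTree as ET

# Made-up page: one <p> directly in the outer <div>, one in a nested <div>.
PAGE = """<html><body>
  <div class="content">
    <p>outer</p>
    <div class="p-children">
      <p>inner</p>
    </div>
  </div>
</body></html>"""

root = ET.fromstring(PAGE)

# Elements do not store their parent, so build a child -> parent lookup.
parent_of = {child: parent for parent in root.iter() for child in parent}


def ancestors(node):
    """Return the ancestors of node, nearest first."""
    chain = []
    while node in parent_of:
        node = parent_of[node]
        chain.append(node)
    return chain


paragraphs = root.findall(".//p")  # [outer, inner] in document order

# //p/parent::div
parents = [parent_of[p].get("class") for p in paragraphs
           if parent_of[p].tag == "div"]

# ancestor::div evaluated for the inner <p> only
inner_ancestors = [a.get("class") for a in ancestors(paragraphs[1])
                   if a.tag == "div"]

# //div/child::p is the long form of //div/p
children = [p.text for p in paragraphs if parent_of[p].tag == "div"]

# //div/descendant::p: every <p> that has a <div> somewhere above it
descendants = [p.text for p in paragraphs
               if any(a.tag == "div" for a in ancestors(p))]

print(parents, inner_ancestors, children, descendants)
```

A full XPath engine does this bookkeeping for you; the lookup table is only needed because of the reduced ElementTree API.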
Now it’s time to rewrite the previous expression:
/html/body/div[./div[./p]] is equivalent to /html/body/div/div/p/parent::div/parent::div
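But /html/body/div[.//p] is NOT equivalent to /html/body/div//p/ancestor::div, because ancestor::div also returns the nested <div> tags that sit between the <body> and the <p>.
The good news is that we can tweak it a little bit.
/html/body/div//p/ancestor::div[last()] is equivalent to /html/body/div[.//p]. ancestor is a reverse axis, so positions are counted from the <p> upwards and [last()] keeps only the outermost <div> ancestor.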
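Here is a rough Python check of that claim on a small, hypothetical page. Since xml.etree.ElementTree has no ancestor axis, the sketch walks up a hand-built parent lookup; div_ancestors is a helper name of mine.

```python
import xml.etree.ElementTree as ET

# Hypothetical page reusing the class names from the example.
PAGE = """<html><body>
  <div class="no-content"></div>
  <div class="content">
    <div class="p-children">
      <p>I am a paragraph</p>
    </div>
  </div>
</body></html>"""

root = ET.fromstring(PAGE)
parent_of = {child: parent for parent in root.iter() for child in parent}


def div_ancestors(node):
    """Return the <div> ancestors of node, nearest first (reverse axis order)."""
    found = []
    while node in parent_of:
        node = parent_of[node]
        if node.tag == "div":
            found.append(node)
    return found


top_divs = root.findall("./body/div")                   # /html/body/div
paragraphs = [p for d in top_divs for p in d.iter("p")]  # /html/body/div//p

# /html/body/div[.//p]
with_predicate = {d.get("class") for d in top_divs if d.find(".//p") is not None}

# /html/body/div//p/ancestor::div: every <div> above each <p>
all_ancestors = {a.get("class") for p in paragraphs for a in div_ancestors(p)}

# /html/body/div//p/ancestor::div[last()]: only the outermost one
outermost = {div_ancestors(p)[-1].get("class") for p in paragraphs}

print(with_predicate, all_ancestors, outermost)
```

The plain ancestor::div version picks up the nested <div> as well, which is exactly why the [last()] filter is needed.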
Other Important Axes
//p/following-sibling::span => for each <p> tag, select its following <span> siblings.
//p/preceding-sibling::span => for each <p> tag, select its preceding <span> siblings.
//title/following::span => selects all the <span> tags that appear in the DOM after the <title>.
In our example, //title/following::span selects all the <span> tags in the document.
//p/preceding::div => for each <p> tag, selects all the <div> tags that appear in the document before it. The preceding axis ignores ancestors, attribute nodes and namespace nodes.
In our case, //p/preceding::div only selects <div class="p-children"> and <div class="no-content">.
Most of the <p> tags are in <div class="content">, but this <div> is not selected because it is a common ancestor of all of them. As I mentioned, the preceding axis ignores ancestors.
<div class="p-children"> is selected because it is not an ancestor of the <p> tags inside <div class="p-children" id="important">.
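The sibling and preceding axes can be imitated in Python to see why ancestors drop out. This is a sketch under my own assumptions: xml.etree.ElementTree has no axis support, so document order comes from root.iter(), parents come from a lookup table, and the page and helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical page with <span> siblings around a single <p>.
PAGE = """<html><head><title>Nothing to see here</title></head><body>
  <div class="no-content"><span>elsewhere</span></div>
  <div class="content">
    <div class="p-children">
      <span>before</span>
      <p>paragraph</p>
      <span>after</span>
    </div>
  </div>
</body></html>"""

root = ET.fromstring(PAGE)
order = list(root.iter())  # every element, in document order
parent_of = {child: parent for parent in root.iter() for child in parent}


def ancestors(node):
    """Return the ancestors of node, nearest first."""
    chain = []
    while node in parent_of:
        node = parent_of[node]
        chain.append(node)
    return chain


def following_siblings(node, tag):
    """Siblings with the given tag that come after node."""
    siblings = list(parent_of[node])
    return [s for s in siblings[siblings.index(node) + 1:] if s.tag == tag]


def preceding_siblings(node, tag):
    """Siblings with the given tag that come before node."""
    siblings = list(parent_of[node])
    return [s for s in siblings[:siblings.index(node)] if s.tag == tag]


def preceding(node, tag):
    """Elements before node in document order, ancestors excluded."""
    skip = ancestors(node)
    return [e for e in order[:order.index(node)]
            if e.tag == tag and e not in skip]


p = root.find(".//p")
after = [s.text for s in following_siblings(p, "span")]   # p/following-sibling::span
before = [s.text for s in preceding_siblings(p, "span")]  # p/preceding-sibling::span
divs_before = [d.get("class") for d in preceding(p, "div")]  # p/preceding::div

print(after, before, divs_before)
```

Both enclosing <div> tags open before the <p> in the markup, yet neither is returned, because they are its ancestors.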
Congratulations, you made it. You added a brand new tool to your selector toolbox! If you're building a web scraper or automating web tests, this XPath Cheat Sheet will come in handy. If you're looking for a smoother way of traversing the DOM, you're in the right place. Either way, it is worth giving XPath a try. Who knows, maybe you'll uncover even more use cases for it.
Does the concept of web scraping sound interesting to you? You can contact us here: WebScrapingAPI - Contact. If you want to scrape the web, we are happy to assist you along the way. In the meantime, consider trying WebScrapingAPI - Product for free.