Mihai Maxim · Last updated on May 12, 2026 · 11 min read

HTML Parsing in Java with Jsoup

TL;DR: Jsoup is the default library for HTML parsing in Java. This guide walks the full lifecycle (Maven setup, loading a Document, CSS selectors, DOM traversal, extraction, modification, and serialization), plus a runnable scraping project, error handling, pagination, and the limits that push you toward a headless browser or scraping API.

If you need to extract or rewrite HTML inside a JVM service, you have a few options, but for most real jobs HTML parsing in Java still starts and ends with Jsoup. Web scraping is the automated extraction of data from a site's HTML source, and Jsoup is the open-source library that turns that source into a navigable DOM you can query with CSS selectors and modify in place.

This Jsoup tutorial is built for intermediate Java developers (backend engineers, data engineers, SEO and QA folks, anyone running content migrations) who want a hands-on walkthrough instead of a marketing overview. We cover Maven setup, loading a Document from a String, File, or URL, configuring the HTTP request, handling errors, traversing and selecting elements, extracting text and attributes, modifying nodes, and serializing the result back to clean HTML. A full runnable scraping project closes the article, with pagination and rate-limiting notes.

We are also honest about the limits: Jsoup does not run JavaScript, rotate IPs, or bypass anti-bot defences. The closing section maps where it runs out of road and what to reach for next.

Why Jsoup Is the Default Choice for HTML Parsing in Java

When the data you need lives in a public web page and the site has no API, you write a scraper. For HTML parsing in Java, Jsoup has been the practical default for years: open source, steady releases, solid docs, and a fluent API that ports cleanly from jQuery or vanilla DOM JavaScript. Crucially, it covers both halves of the workflow: read HTML and write HTML.

What Jsoup Can and Cannot Do at a Glance

Jsoup implements the WHATWG HTML5 specification, so it parses just about any markup, from pristine to genuinely broken, the way a modern browser would. You get a DOM tree, jQuery-style selectors, and methods for both reading and writing. What it does not do is execute JavaScript. Anything injected by a client-side framework after the initial response (a React store, lazy-loaded rows, hydration-gated content) is invisible to Jsoup. That ceiling drives the limitations section later.

Setting Up Jsoup in a Maven Project

Spin up a Maven skeleton with mvn archetype:generate -DarchetypeArtifactId=maven-archetype-quickstart, then add the Jsoup dependency to pom.xml. At the time of writing, the current line is the 1.17.x series, but always confirm the latest stable release on Maven Central before pinning a version in production:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.17.x</version>
</dependency>

If you want to run the examples with mvn exec:java, register the Exec Maven Plugin in your <plugins> block. Use whatever version its plugin page lists as current; older tutorials cite 3.0.0, which may already be stale. Not using Maven? Jsoup ships as a single JAR you can drop onto the classpath, and Gradle users can declare implementation 'org.jsoup:jsoup:1.17.x'.
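
For reference, a minimal plugin entry might look like the sketch below; the version is a placeholder for whatever the plugin page currently lists, and the mainClass is a hypothetical entry point you would swap for your own:

<plugin>
    <groupId>org.codehaus.mojo</groupId>
    <artifactId>exec-maven-plugin</artifactId>
    <!-- placeholder: pin the release the plugin page lists as current -->
    <version>3.x.x</version>
    <configuration>
        <!-- hypothetical entry point; replace with your own main class -->
        <mainClass>com.example.App</mainClass>
    </configuration>
</plugin>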

Loading HTML into a Jsoup Document

The entry point for HTML parsing in Java with this library is the Jsoup class. You can parse HTML from a String, a File, an InputStream, or (most commonly) fetch it directly from a URL. A clean Jsoup connect example looks like this:

// From a string, great for unit tests
Document fromString = Jsoup.parse("<html><body><p>Hello</p></body></html>");

// From a local file
Document fromFile = Jsoup.parse(new File("page.html"), "UTF-8");

// From a live URL: issues the HTTP request, then parses the response
Document doc = Jsoup.connect("https://example.com").get();

Compared with rolling your own HttpURLConnection, the fluent Jsoup.connect(...) API saves a lot of boilerplate: it manages the socket, reads the body, decodes the charset, and returns a parsed Document in one call. That Document is the in-memory DOM you work with for everything else, from CSS selectors to DOM modification.

Configuring the Jsoup Connection (headers, cookies, timeouts, user agent)

Jsoup.connect(url) returns a Connection object you can configure before issuing the request. Defaults are fine for friendly endpoints, but most real targets need at least a real User-Agent and a sensible timeout:

Document doc = Jsoup.connect("https://example.com/listing")
    .userAgent("Mozilla/5.0 (compatible; MyJavaScraper/1.0; +https://yourdomain.tld/bot)")
    .referrer("https://example.com")
    .header("Accept-Language", "en-US,en;q=0.9")
    .cookie("session", "abc123")
    .timeout(10_000)
    .data("q", "java")
    .method(Connection.Method.GET)
    .get();

Pick a User-Agent that identifies your bot honestly. Many servers return a stripped-down response, or block outright, when the UA looks like a default Java HTTP client.

Handling HTTP Errors, Status Codes, and Timeouts

Two exceptions matter here. HttpStatusException is thrown when the server returns a 4xx or 5xx and gives you both the offending URL and status code. IOException covers everything else: DNS failures, connection resets, socket timeouts. Catch both:

try {
    Document doc = Jsoup.connect(url).timeout(10_000).get();
} catch (HttpStatusException e) {
    log.warn("Bad status {} for {}", e.getStatusCode(), e.getUrl());
} catch (IOException e) {
    // retry with exponential backoff, then escalate
}

If you actually need the body of a 404 page (for soft-404 detection), chain .ignoreHttpErrors(true) before .get(). Wrap network calls in a retry loop with exponential backoff for production scrapers; transient 5xx and reset errors are normal at scale.
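
One possible shape for that loop, as a sketch (the attempt count and backoff base here are arbitrary choices, not Jsoup defaults):

// needs: org.jsoup.Jsoup, org.jsoup.HttpStatusException, org.jsoup.nodes.Document
static Document fetchWithRetry(String url, int maxAttempts) throws IOException, InterruptedException {
    IOException last = null;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
        try {
            return Jsoup.connect(url).timeout(10_000).get();
        } catch (HttpStatusException e) {
            if (e.getStatusCode() < 500) throw e;   // 4xx: retrying will not help
            last = e;                               // 5xx: often transient
        } catch (IOException e) {
            last = e;                               // DNS failure, reset, timeout
        }
        Thread.sleep((1L << attempt) * 1_000);      // 1 s, 2 s, 4 s, ...
    }
    throw last;
}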

Selecting Elements with Jsoup CSS Selectors

Once you have a Document, querying it is a one-liner. Document.select(String cssQuery) accepts the same syntax you would use in querySelectorAll and returns an Elements collection that is never null, even when nothing matches. That alone removes a whole class of NullPointerExceptions you would otherwise hit with naïve DOM code.
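
A quick illustration of the contract (the class name in the selector is made up):

Elements none = doc.select("div.no-such-class");
System.out.println(none.size());   // 0 -- an empty list, never null

// selectFirst(), by contrast, CAN return null when nothing matches
Element first = doc.selectFirst("div.no-such-class");
if (first != null) {
    System.out.println(first.text());
}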

The Jsoup CSS selectors vocabulary goes well past tag and class. A short tour worth bookmarking next to any CSS selectors cheat sheet:

Selector             Matches
div.post-card        <div> elements with class post-card
article > h2         h2 elements that are direct children of an article
a[href^=https]       links whose href starts with https
img[src*=authors]    images whose src contains the substring authors
li:nth-child(2)      li elements that are the second child of their parent
section:has(h2)      sections that contain at least one h2
p:contains(error)    paragraphs whose text contains "error" (case-insensitive)

Combine these freely. A durable pattern is to scope a child selector against a previously selected Element rather than re-running queries from the document root.
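
For instance, composition and scoping side by side (a sketch reusing the post-card markup from the project later in this guide):

// one combined query from the document root...
Elements externalLinks = doc.select("article.post-card a[href^=https]");

// ...or scope follow-up queries to an element you already hold
Element card = doc.selectFirst("article.post-card");
if (card != null) {
    Elements tags = card.select("span.tag");   // searches only inside this card
}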

getElementById, getElementsByClass, and select Compared

Jsoup mirrors the JavaScript DOM API for readers who want explicit getters. getElementById(id) returns the single Element (or null) with that ID, identical to document.getElementById in a browser. getElementsByClass(name) returns all matches, just like document.getElementsByClassName. select(cssQuery) is the equivalent of querySelectorAll and is the most flexible of the three.

Use the explicit getters when intent is obvious (a stable ID or single semantic class) and select() when you need composition or attribute filtering. One real-world warning: avoid framework-generated utility classes as anchor selectors. A Tailwind class like p-[10px] or text-slate-700 is a build-output detail that can vanish on the next deploy. Lean on stable IDs, ARIA roles, or semantic tags, and your scrapers age far better.
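
Side by side, with hypothetical IDs and classes:

Element nav    = doc.getElementById("main-nav");            // one element, or null
Elements cards = doc.getElementsByClass("post-card");       // every element with that class
Elements links = doc.select("nav#main-nav a[href^=https]"); // full CSS composition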

Traversing the DOM Tree: parents, siblings, children, first/last/nth

Selectors get you in the door; traversal gets you to siblings and ancestors. Every Element exposes parent(), parents(), children(), and siblingElements(), while the Elements collection adds indexed access via first(), last(), and get(int n). Every Element also has its own select() method that scopes a query to that subtree, which is the cleanest way to write resilient selectors:

Element card = doc.selectFirst("article.post-card");
String title  = card.select("h2 > a").text();
String author = card.parent().select(".byline").text();
Elements tags = card.children().select("span.tag");

Walking up to a stable ancestor and back down is far more durable than chaining brittle class selectors from the document root, especially on pages that ship CSS-in-JS or utility-class frameworks.

Extracting Text, HTML, and Attributes from Elements

Once you have selected an Element, four methods cover almost every case when you extract data from HTML in Java. text() returns visible, whitespace-collapsed text (analogous to innerText). html() returns inner HTML as a string. outerHtml() includes the element's own tags. ownText() returns only the element's direct text nodes, skipping descendants.
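
A tiny demonstration of the difference, on an inline snippet:

Document d = Jsoup.parse("<div>Hello <b>world</b></div>");
Element div = d.selectFirst("div");
System.out.println(div.text());      // Hello world          (all visible text)
System.out.println(div.ownText());   // Hello                (direct text nodes only)
System.out.println(div.html());      // Hello <b>world</b>   (inner HTML)
System.out.println(div.outerHtml()); // the same, wrapped in the element's own <div> tags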

For attributes, attr("href") reads a value and absUrl("href") resolves relative URLs against the document's base URI, which is invaluable when scraping link lists. Iteration is straightforward, since Elements is Iterable:

for (Element link : doc.select("a[href]")) {
    System.out.println(link.text() + " -> " + link.absUrl("href"));
}

You can also stream, use forEach, or pull by index with get(n). Whichever feels most idiomatic to your codebase is fine.
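
Elements extends ArrayList<Element>, so the usual collection idioms all apply:

Elements links = doc.select("a[href]");

// stream pipeline (toList() needs Java 16+; use Collectors.toList() on older JDKs)
List<String> urls = links.stream()
        .map(a -> a.absUrl("href"))
        .filter(u -> !u.isEmpty())
        .toList();

// forEach
links.forEach(a -> System.out.println(a.text()));

// index access -- guard against an empty result first
if (!links.isEmpty()) {
    Element first = links.get(0);
}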

Modifying and Outputting HTML with Jsoup

Most tutorials stop at extraction, but HTML parsing in Java with Jsoup is genuinely bidirectional. The same attr(), text(), and html() methods double as setters. You can build new nodes with new Element(Tag.valueOf("...")), attach them with appendChild() or appendElement(), and delete with remove(). The Jsoup modify HTML surface looks like this:

Document doc = Jsoup.parse(rawHtml);

// Edit existing nodes
doc.select("a.tracker").forEach(a -> a.attr("rel", "nofollow"));
doc.selectFirst("h1").text("Updated Title");

// Add a new node
Element note = new Element(Tag.valueOf("p"), "")
    .text("Edited by my scraper at " + Instant.now());
doc.body().appendChild(note);

// Remove ad slots
doc.select("div.ad-slot").remove();

// Serialize back to a clean HTML string
String cleaned = doc.html();

That round-trip (parse, mutate, serialize) is what makes Jsoup useful for content migrations, HTML sanitization, and feed normalization, not just one-off scraping.

Practical Project: Scraping a Blog Listing End-to-End

To tie everything together, build a small scraper that extracts the title, link, header image, and author avatar from each post card on a public blog listing. Open the page in DevTools first; manual reconnaissance beats guesswork every time. Identify a stable container selector per card, then write field-by-field selectors against it.

Document doc = Jsoup.connect("https://example.com/blog")
    .userAgent("MyJavaScraper/1.0")
    .timeout(10_000)
    .get();

for (Element card : doc.select("article.post-card")) {
    String title   = card.select("h2 > a").text();
    String url     = card.select("h2 > a").absUrl("href");
    String header  = card.selectFirst("img.header-image").absUrl("src");
    String avatar  = card.select("img[src*=authors]").attr("abs:src");

    System.out.printf("%s | %s | %s | %s%n", title, url, header, avatar);
}

Each field has its own intent-driven selector. img[src*=authors] filters by attribute substring, which holds up better than chaining structural selectors when the markup shifts. That kind of structured Java web scraping with Jsoup beats brittle index-based parsing every time.

Looping Through Paginated Pages

Most listings follow a predictable URL scheme such as /blog, /blog/page/2/, /blog/page/3/. Treat page 1 as a special case and loop until you hit an empty result set or a 404 from HttpStatusException. Sleep a second or two between requests, randomize it slightly, and respect the target's robots.txt (RFC 9309). Pagination without rate limiting is the fastest way to get banned and the most common reason people end up reading articles on why you get blocked.
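
A sketch of that loop, assuming the /blog/page/N/ scheme and the post-card selector from the project above:

// needs: org.jsoup.Jsoup, org.jsoup.HttpStatusException, org.jsoup.nodes.Document,
// org.jsoup.nodes.Element, org.jsoup.select.Elements, java.util.concurrent.ThreadLocalRandom
static void scrapeAllPages() throws IOException, InterruptedException {
    for (int page = 1; ; page++) {
        String url = (page == 1)
                ? "https://example.com/blog"
                : "https://example.com/blog/page/" + page + "/";
        Document doc;
        try {
            doc = Jsoup.connect(url).userAgent("MyJavaScraper/1.0").timeout(10_000).get();
        } catch (HttpStatusException e) {
            if (e.getStatusCode() == 404) break;    // walked past the last page
            throw e;
        }
        Elements cards = doc.select("article.post-card");
        if (cards.isEmpty()) break;                 // empty result set: done
        for (Element card : cards) {
            // extract fields exactly as in the listing example above
        }
        Thread.sleep(1_000 + ThreadLocalRandom.current().nextLong(1_000)); // 1-2 s, jittered
    }
}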

Jsoup Limitations and When to Reach for an Alternative

Jsoup's hard ceiling is JavaScript. It parses what the server initially returns, so anything rendered client-side (React, Vue, or Angular SPAs, lazy-loaded infinite scroll, content gated behind hydration) is invisible. It also ships no support for headless rendering, proxy rotation, or anti-bot bypass.

When the page is dynamic, pair Jsoup with a headless browser: Selenium and Playwright drive a real Chromium; HtmlUnit is a lighter JVM-native option; Jaunt offers a similar Java API with built-in HTTP. When the page is static but hostile (Cloudflare, frequent IP bans, fingerprinting), route the request through a managed scraping API that handles proxies and CAPTCHAs, then feed the response HTML straight back into Jsoup. That keeps your parsing code clean and cuts moving parts.

Wrap-Up: Building Resilient Java HTML Parsers

The full workflow for HTML parsing in Java with Jsoup is four verbs: load, select, extract or modify, then output. For deeper reading, the Jsoup cookbook and Javadocs are the canonical references. Before you start a new scraper, walk a quick decision checklist: Is the page static or JS-rendered? Is the target likely to block me? Do I need to mutate HTML or only read it? Those three answers tell you whether Jsoup alone is enough.

Key Takeaways

  • Use Jsoup for any HTML parsing in Java job where the markup is server-rendered. It handles malformed HTML the way a modern browser would.
  • Jsoup.connect(url).get() collapses fetching and parsing into one call. Always set a real User-Agent and a non-default timeout, and catch both HttpStatusException and IOException.
  • select() returns an Elements list that may be empty but is never null. Prefer stable IDs, ARIA roles, and semantic selectors over framework-generated utility classes.
  • Jsoup is bidirectional: attr, text, and html as setters, plus appendChild and remove, let you edit and re-serialize HTML, not just read it.
  • Jsoup does not execute JavaScript. For SPAs, pair it with Selenium, Playwright, or HtmlUnit; for blocked targets, route the request through a managed scraping API.

FAQ

Can Jsoup scrape JavaScript-rendered or single-page apps?

No. Jsoup only parses the raw HTML the server returns, so anything generated by a client-side framework after page load is invisible to it. To scrape SPAs or pages that hydrate on the client, drive a real or headless browser with Selenium, Playwright, or HtmlUnit, capture the fully rendered HTML, and then hand that string to Jsoup.parse(...) for selector-based extraction.

How is Jsoup different from HtmlUnit, Jaunt, or Selenium for HTML parsing?

Jsoup is a pure HTML parser. It does not execute JavaScript, run a JS engine, or simulate a browser. HtmlUnit and Selenium both render pages with a JS engine (HtmlUnit inside the JVM, Selenium via a real browser driver). Jaunt sits closer to Jsoup as a parser plus simple HTTP client. Use Jsoup when the page is static; use the others when you need rendering or interaction.

How do I avoid getting blocked or rate-limited while parsing pages with Jsoup?

Identify your bot honestly in the User-Agent, throttle requests to a few per second per host, randomize delays, and reuse cookies where appropriate. Read and respect robots.txt. For higher-volume jobs or hostile targets, route requests through a residential or rotating proxy pool, because Jsoup itself has no IP rotation, fingerprint spoofing, or CAPTCHA handling built in.

Can Jsoup parse XML, RSS feeds, or malformed HTML?

Yes to all three. Pass an XML parser explicitly with Jsoup.parse(input, baseUri, Parser.xmlParser()) for RSS feeds, sitemaps, and other XML documents. For malformed HTML, the default parser is forgiving and normalizes markup the way a modern browser would, so unclosed tags and stray characters typically still produce a usable Document.
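
For example, parsing a toy RSS snippet with the XML parser (the feed content is made up; Parser lives in org.jsoup.parser):

String rss = "<rss><channel><item><title>First post</title></item></channel></rss>";
Document feed = Jsoup.parse(rss, "", Parser.xmlParser());
for (Element item : feed.select("item")) {
    System.out.println(item.selectFirst("title").text()); // First post
}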

What is the latest stable Jsoup version and how do I keep it up to date?

Check Maven Central directly, because version numbers change frequently and any number quoted in a tutorial may already be stale. Subscribe to release notes on the Jsoup GitHub repository, or run a Maven dependency-update plugin such as versions:display-dependency-updates in CI to surface available upgrades automatically. Renovate and Dependabot both work if your repo is hosted accordingly.

Conclusion

If you finish this guide and remember one thing, let it be the four-step rhythm: load HTML into a Document, select what you care about, extract or modify it, and serialize back out. That sequence is the spine of every Jsoup-based scraper, every content migration, and every HTML sanitizer you will write. Add a real User-Agent, sane timeouts, structured exception handling, and a retry policy with backoff, and you have a parser that survives production traffic.

The honest caveat still applies: Jsoup does not run JavaScript and does not bypass anti-bot defences. If the page renders client-side, you need a headless browser. If the target blocks your IP or fingerprints your fetcher, you need a smarter request layer.

That second case is where a managed scraping API earns its keep. WebScrapingAPI's Scraper API returns the raw HTML of even hostile targets, handling proxy rotation, CAPTCHAs, and browser fingerprinting on its side, so you can keep your Jsoup parsing code unchanged and just swap the fetch step. It is the cleanest way we have found to bolt production resilience onto a lean Java parser.

About the Author
Mihai Maxim, Full Stack Developer @ WebScrapingAPI

Mihai Maxim is a Full Stack Developer at WebScrapingAPI, contributing across the product and helping build reliable tools and features for the platform.
