Andrei Ogiolan · Last updated on May 7, 2026 · 15 min read

How to Scrape HTML Tables Using Python

TL;DR: Most HTML tables can be scraped with a single line of pandas.read_html. When the table is paginated, JavaScript-rendered, or has merged headers, switch to Requests + BeautifulSoup or a headless browser like Playwright. This guide gives you a decision matrix, working code for all three approaches, and the cleaning steps that turn scraped rows into pipeline-ready data.

Tabular data is everywhere on the public web, from Wikipedia infoboxes and stock screeners to government statistics, sports stats, and product comparison pages. If you know how to scrape HTML tables using Python, you can turn those rows into clean DataFrames, JSON documents, or rows in your own database in minutes.

The catch is that "HTML table" is a deceptively wide category. Some tables sit cleanly inside <table> markup that pandas can parse with one line. Others are hand-rolled grids of <div>s, paginated across dozens of pages, or only populated after JavaScript runs in the browser. A method that works perfectly on Wikipedia might silently return zero rows on a single-page app.

This guide walks through three Python approaches and frames the entire article around two practical questions: which method should you reach for, and how do you keep your scraper running when the site changes its markup next quarter?

How to Scrape HTML Tables Using Python: a Quick Decision Matrix

Before you write a line of code, decide which tool fits the table in front of you. Picking the wrong one is the most common reason tutorials don't survive contact with real websites. Use the matrix below to pick your starting point.

| Criterion | pandas.read_html | Requests + BeautifulSoup | Playwright (or Selenium) |
| --- | --- | --- | --- |
| Best when | Table is in initial HTML and well-formed | You need per-cell control or filtering | Table is rendered by JavaScript |
| Lines of code | ~3 | 30 to 80 | 40 to 100 |
| Speed per page | Fast | Fast | Slow (full browser) |
| Handles JS | No | No | Yes |
| Pagination | Manual loop | Manual loop or hidden API | Click and scroll |
| Resilience to markup churn | Medium | High (you write selectors) | High |
| Memory footprint | Low | Low | High |

Three rules of thumb:

  • If pd.read_html(url) returns the rows you expect, stop there. The one-liner is the most maintainable code you will ever write.
  • If the table is in the HTML but you need to filter, merge, or normalize cells before they hit a DataFrame, reach for Requests + BeautifulSoup.
  • If "View Page Source" shows an empty <div id="grid"> and the data only appears after the page loads, you need Playwright or a hidden JSON endpoint.

The rest of this article shows how to scrape HTML tables using Python under each of those scenarios, plus the edge cases that derail otherwise-working code.

HTML Table Anatomy (and What Makes Scraping Tricky)

A textbook HTML table looks like this:

<table id="employees" class="stripe">
  <thead><tr><th>Name</th><th>Position</th><th>Salary</th></tr></thead>
  <tbody>
    <tr><td>Ada Lovelace</td><td>Engineer</td><td>$120,000</td></tr>
    <tr><td>Alan Turing</td><td>Researcher</td><td>$135,000</td></tr>
  </tbody>
</table>

Five tags do most of the work: <table> is the container, <thead> and <tbody> group rows, <tr> is a row, and <th> or <td> are header and data cells respectively. Two attributes complicate things: colspan makes a cell span multiple columns, and rowspan makes it span multiple rows. Both are used heavily in financial and sports tables.
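
For example, a two-row financial header might merge cells like this (a hand-written illustration, not taken from a real site):

<table>
  <thead>
    <tr><th rowspan="2">Team</th><th colspan="2">Q1 2024</th></tr>
    <tr><th>Revenue</th><th>Profit</th></tr>
  </thead>
  <tbody>
    <tr><td>North</td><td>$1.2M</td><td>$0.3M</td></tr>
  </tbody>
</table>

A naive parser sees two cells in the first header row but three in every body row, which is exactly the mismatch the colspan/rowspan section below untangles.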

In the wild, half of these conventions are skipped. Plenty of pages omit <thead> and <tbody>, leave closing tags off, or render tables as nested <div> grids that no parser will recognize as a table at all. Real-world scraping is mostly about coping with that drift, which is why pandas alone is not enough on every site.
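
When you meet one of those <div> grids, the table-specific tools are useless, but the header-plus-rows idea survives. A hedged sketch, assuming hypothetical grid, row, and cell class names:

from bs4 import BeautifulSoup

# `html` is the fetched page source; the class names below are assumptions
soup = BeautifulSoup(html, "lxml")
rows = [
    [cell.get_text(strip=True) for cell in row.select("div.cell")]
    for row in soup.select("div.grid div.row")
]
headers, data = rows[0], rows[1:]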

Method 1: pandas.read_html, the One-Liner

pandas.read_html is a convenience function in the pandas data-manipulation library that takes a URL or HTML string and returns a list of DataFrames, one per <table> it can find. According to the pandas documentation, it requires either lxml, html5lib, or bs4 under the hood, and it identifies tables by looking for standard table elements.

The whole appeal is that you can write three lines of code and have a typed, queryable DataFrame:

import pandas as pd

tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue")
df = tables[0]
print(df.head())

The catch is that read_html only sees what's already in the response body. If the table is filled in by JavaScript after the page loads, the function raises ValueError: No tables found even though the table is plainly visible in your browser. Knowing that limitation up front saves a lot of debugging.
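
A quick way to confirm that limitation before blaming your code: fetch the raw HTML yourself and search it for a value you can see in the browser.

import requests

# Hypothetical URL; substitute the page you're scraping
html = requests.get("https://example.com/dashboard", timeout=15).text
# If a value visible in the browser is absent from the raw response,
# the table is hydrated client-side and read_html will never see it
print("Ada Lovelace" in html)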

Setting Up Your Python Environment

You can run every example in this guide with a fresh virtual environment and a handful of packages:

python -m venv .venv
source .venv/bin/activate
pip install pandas requests beautifulsoup4 lxml html5lib playwright
playwright install chromium

lxml is the fastest HTML parser available to Python and is what most pros default to. html5lib is slower but follows the WHATWG parsing algorithm, which makes it the most forgiving choice on broken markup. Install both so you can swap parsers when one chokes.

A Complete pandas.read_html Walkthrough

Let's scrape a real, well-formed table: the list of countries by GDP on Wikipedia. The full workflow is a handful of lines.

import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
tables = pd.read_html(url)
print(f"Found {len(tables)} tables on the page")

gdp = tables[2]            # pick the right one by index
gdp.columns = [c[1] if isinstance(c, tuple) else c for c in gdp.columns]
print(gdp.head())

Three things to notice. First, read_html returned a list, so you index into it. Second, Wikipedia tables often have multi-level headers, which pandas exposes as a MultiIndex. The list comprehension flattens it by keeping the lower level. Third, no manual row iteration: every cell already lives in a typed column you can call .sort_values, .groupby, or .to_csv on.
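
From here it's ordinary pandas. For instance (the column name is a placeholder; check gdp.columns for the real one):

top10 = gdp.sort_values("GDP (US$ million)", ascending=False).head(10)
top10.to_csv("gdp_top10.csv", index=False)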

When you only need the data for a quick analysis, this is genuinely all the code you should write.

Troubleshooting pandas.read_html: Common Errors

pd.read_html fails in predictable ways. Memorize these four and you'll resolve most issues in under a minute.

  1. ValueError: No tables found. The page is either JavaScript-rendered or behind a login wall. Skip ahead to the Playwright section.
  2. HTTP 403 or 429 returned by pandas's internal fetcher. The default urllib user agent is being blocked. Fetch the HTML yourself with Requests and pass the string to read_html:
import requests, pandas as pd
from io import StringIO

headers = {"User-Agent": "Mozilla/5.0 (compatible; analytics-bot/1.0)"}
html = requests.get(url, headers=headers, timeout=15).text
tables = pd.read_html(StringIO(html))  # pandas 2.1+ expects raw HTML wrapped in StringIO
  3. Wrong table index. Use match= to filter by a string that appears inside the target table, for example pd.read_html(StringIO(html), match="Population"). This is far more stable than relying on tables[3].
  4. Garbled characters in non-ASCII content. Force an encoding by reading bytes explicitly: response = requests.get(url); response.encoding = "utf-8"; tables = pd.read_html(StringIO(response.text)).

If you're still hitting walls after these fixes, the table almost certainly needs Requests + BeautifulSoup or a headless browser, not more read_html workarounds.

Method 2: Requests + BeautifulSoup, When You Need Control

pandas.read_html is great when you want every cell exactly as it appears in the HTML. The moment you need to filter rows during extraction, join values from two columns, strip currency symbols on the fly, or pull the href out of a linked cell, it stops being the right tool.

That's where Requests + BeautifulSoup comes in. Requests handles the HTTP layer (headers, cookies, sessions, retries), and BeautifulSoup gives you a parse tree you can traverse with CSS selectors, attribute matching, or sibling navigation. If you're new to BeautifulSoup, our deep dive on extracting and parsing web data with Python and BeautifulSoup walks through the API surface in detail. The combo is also what most production scrapers eventually settle on, because every step (fetch, parse, extract, transform) is something you control.

The next three sections show how to scrape HTML tables using Python with this stack: a polite request, a robust selector for the table, and a row loop that doesn't break when a column is added.

Sending Polite, Realistic HTTP Requests

Anti-bot defenses key on a few cheap signals: a missing or default User-Agent, no Accept-Language, no cookies, and traffic that drains a session in one second. Mimic a real browser and reuse a connection:

import requests
from bs4 import BeautifulSoup

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
})

response = session.get("https://example.com/employees", timeout=15)
response.raise_for_status()
soup = BeautifulSoup(response.text, "lxml")

Three small habits matter. Session() keeps cookies and connection pooling between calls. raise_for_status() turns silent 4xx/5xx responses into exceptions you can retry. And passing "lxml" as the parser is roughly five to ten times faster than the built-in html.parser on large pages.

Locating the Right Table on the Page

Once you have a BeautifulSoup object, the next problem is grabbing the right <table>. Pages routinely have eight to fifteen of them (think: layout tables, sidebar widgets, hidden pagination controls). Try selectors in this order of stability:

# 1. By stable id (best)
table = soup.find("table", id="employees")

# 2. By a class that's specific to this table
table = soup.find("table", class_="data-grid")

# 3. By a CSS selector
table = soup.select_one("section#payroll table.stripe")

# 4. By the heading that precedes it (when classes are dynamic)
heading = soup.find(["h2", "h3"], string=lambda s: s and "Employees" in s)
table = heading.find_next("table") if heading else None

When class names are auto-generated and change on every deploy (a common React pattern), prefer XPath via lxml, since it can express "the third table inside the section whose heading text contains 'Employees'" in one expression. We have a separate guide on XPath versus CSS selectors that goes deeper on this tradeoff.
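
A sketch of that XPath approach with lxml (the heading text and surrounding structure are assumptions):

from lxml import html as lxml_html

# `page_source` is the HTML string you fetched earlier
tree = lxml_html.fromstring(page_source)
# "the first table anywhere after an h2 whose text contains 'Employees'"
tables = tree.xpath("//h2[contains(., 'Employees')]/following::table[1]")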

Iterating Rows and Extracting Cells the Safe Way

Most scraping tutorials show row loops that index cells positionally: cells[0] is name, cells[1] is position, cells[2] is salary. That code breaks the day someone adds a "Department" column. The robust pattern is to read the headers once and zip them with each row.

# Read headers from <thead> if present, else from the first row
header_cells = table.select("thead th") or table.select("tr:first-of-type th, tr:first-of-type td")
headers = [th.get_text(strip=True) for th in header_cells]

rows = []
for tr in table.select("tbody tr") or table.select("tr")[1:]:
    cells = [td.get_text(strip=True) for td in tr.find_all(["td", "th"])]
    if not cells:
        continue
    rows.append(dict(zip(headers, cells)))

print(f"Extracted {len(rows)} rows with {len(headers)} columns")

Three things this gives you for free. New columns flow through automatically because keys come from headers, not indices. Empty rows (often used as visual separators) are skipped. And every cell goes through get_text(strip=True), which collapses whitespace and removes the \n characters that haunt naive cell.text calls. This is the row loop you should copy into every BeautifulSoup project.

Saving Scraped Rows to JSON, CSV, or Parquet

Once you have a list of dicts, persisting them is a one-liner per format:

import json
import pandas as pd

# JSON, human-readable, UTF-8 safe
with open("employees.json", "w", encoding="utf-8") as f:
    json.dump(rows, f, indent=2, ensure_ascii=False)

# CSV via pandas (handles quoting, encoding, and missing keys)
df = pd.DataFrame(rows)
df.to_csv("employees.csv", index=False, encoding="utf-8")

# Parquet for analytical pipelines (smaller files, typed columns)
df.to_parquet("employees.parquet", index=False)

Reach for JSON when the consumer is a script or a frontend, CSV when a human is going to open it in Excel or BigQuery, and Parquet when the dataset crosses a few hundred thousand rows or feeds Spark, Snowflake, or DuckDB. Parquet files are typically 5 to 10 times smaller than equivalent CSVs and preserve dtypes. For anything destined for a relational database, jump straight to df.to_sql so you skip the intermediate file entirely.

Handling Complex Headers with colspan and rowspan

Two-row headers are common in finance, government statistics, and sports tables. The top row groups columns ("Q1 2024", "Q2 2024"), and the bottom row labels them ("Revenue", "Profit"). Hardcoding column names like ["Name", "Position", "Contact"] works once and breaks forever. Here's a generic algorithm that respects colspan and rowspan.

def expand_header(table):
    # Return a flat list of column labels from a multi-row <thead>
    rows = table.select("thead tr")
    if not rows:
        return [th.get_text(strip=True) for th in table.select("tr:first-of-type th")]

    grid = []  # grid[row_index] = list of column labels at that row
    for r, tr in enumerate(rows):
        while len(grid) <= r:
            grid.append([])
        col = 0
        for th in tr.find_all(["th", "td"]):
            # skip already-filled slots from previous rowspans
            while col < len(grid[r]) and grid[r][col] is not None:
                col += 1
            text = th.get_text(strip=True)
            colspan = int(th.get("colspan", 1))
            rowspan = int(th.get("rowspan", 1))
            for dr in range(rowspan):
                while len(grid) <= r + dr:
                    grid.append([])
                row_buf = grid[r + dr]
                # pad
                while len(row_buf) < col + colspan:
                    row_buf.append(None)
                for dc in range(colspan):
                    row_buf[col + dc] = text
            col += colspan

    # Combine the columns of each row, top-down, into a single label per column
    n_cols = max(len(r) for r in grid)
    flat = []
    for c in range(n_cols):
        parts = [grid[r][c] for r in range(len(grid)) if c < len(grid[r]) and grid[r][c]]
        # de-dup adjacent identical strings: ['Q1 2024', 'Q1 2024', 'Revenue'] -> 'Q1 2024 Revenue'
        seen = []
        for p in parts:
            if not seen or seen[-1] != p:
                seen.append(p)
        flat.append(" ".join(seen))
    return flat

Pair this with the same zip(headers, cells) row loop from earlier and you have header parsing that survives any combination of merged cells. The same idea (a 2D grid you fill in colspan-by-colspan) extends to the body when rowspans repeat values down columns: track which slots are already claimed and skip them in subsequent <tr> iterations.
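
Here's a condensed sketch of that body-side bookkeeping; pair it with expand_header by passing n_cols = len(expand_header(table)):

def expand_body(table, n_cols):
    # Expand <tbody> rows into a rectangular grid, repeating spanned values
    pending = {}   # column index -> (text, rows still owed by a rowspan)
    matrix = []
    for tr in table.select("tbody tr"):
        row = [None] * n_cols
        # 1. Fill slots claimed by rowspans from earlier rows
        for c in list(pending):
            text, left = pending[c]
            row[c] = text
            if left > 1:
                pending[c] = (text, left - 1)
            else:
                del pending[c]
        # 2. Place this row's cells in the remaining open slots
        col = 0
        for td in tr.find_all(["td", "th"]):
            while col < n_cols and row[col] is not None:
                col += 1
            text = td.get_text(strip=True)
            colspan = int(td.get("colspan", 1))
            rowspan = int(td.get("rowspan", 1))
            for dc in range(colspan):
                if col + dc < n_cols:
                    row[col + dc] = text
                    if rowspan > 1:
                        pending[col + dc] = (text, rowspan - 1)
            col += colspan
        matrix.append(row)
    return matrix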

Scraping Paginated HTML Tables (Three Strategies)

Pagination is the single most underestimated part of how to scrape HTML tables using Python. Most tutorials only show "click the next button in a headless browser", which is the slowest and most fragile approach. Try these three first, in order of preference.

1. Increase the page-size query parameter. Many tables accept ?per_page=500 or ?length=1000. One request, all rows, no looping. Inspect the URL when you click the page-size dropdown and you'll often find this for free.
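
In code that's a single parameterized request (the parameter name is whatever the site's own dropdown puts in the URL):

import requests

html = requests.get("https://example.com/employees",
                    params={"per_page": 500}, timeout=15).text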

2. Hit the underlying JSON API. Open DevTools, switch to the Network tab, filter by Fetch/XHR, and click the next page. Almost every modern data table is backed by an endpoint that returns JSON. Calling it directly skips HTML parsing entirely:

import requests
url = "https://example.com/api/employees"
all_rows = []
for page in range(1, 20):
    payload = requests.get(url, params={"page": page, "size": 100}, timeout=15).json()
    if not payload["items"]:
        break
    all_rows.extend(payload["items"])

3. Loop through page query strings. When the URL contains the page number (?page=2, &start=20), iterate it explicitly and stop when the table comes back empty. This is more reliable than driving a browser because there's nothing to click and no animation to wait for.
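
A sketch of that loop, stopping at the first empty page (the URL and selector are assumptions):

import requests
from bs4 import BeautifulSoup

all_rows = []
for page in range(1, 200):
    html = requests.get("https://example.com/employees",
                        params={"page": page}, timeout=15).text
    trs = BeautifulSoup(html, "lxml").select("table#employees tbody tr")
    if not trs:
        break  # ran past the last page
    all_rows.extend(trs)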

A headless browser is your last resort, not your first. Save it for tables where the next-page link is bound to a JavaScript handler with no URL change.

Method 3: Playwright for JavaScript-Rendered Tables

When the table only appears after the page hydrates, you need something that runs JavaScript. Playwright is the modern choice: it ships official Python bindings, runs Chromium, Firefox, or WebKit, and has solid auto-wait behavior. Here's the full template for how to scrape HTML tables using Python that depend on JS:

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
from io import StringIO
import pandas as pd

URL = "https://example.com/dashboard"

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page(user_agent="Mozilla/5.0 ... Chrome/124.0 Safari/537.36")
    page.goto(URL, wait_until="domcontentloaded")

    # Wait for the actual data, not just the page load
    page.wait_for_selector("table#grid tbody tr", timeout=15000)

    html = page.content()
    browser.close()

# Hand the rendered HTML off to your existing parser
soup = BeautifulSoup(html, "lxml")
table = soup.find("table", id="grid")
# ... use the same row loop from earlier ...

# Or, when the table is well-formed, skip BeautifulSoup entirely
# (pandas 2.1+ expects raw HTML wrapped in StringIO):
df = pd.read_html(StringIO(html), match="Department")[0]
print(df.head())

The pattern is always the same: navigate, wait for data (not just load), grab page.content(), then feed that string into the same parsing code you'd use for static HTML. Refer to the Playwright for Python documentation for installation, async APIs, and tracing.
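
And when you truly must click a JS-bound next button, the same template extends to a loop. A hedged sketch, with a hypothetical button selector:

from playwright.sync_api import sync_playwright

pages_html = []
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/dashboard", wait_until="domcontentloaded")
    while True:
        page.wait_for_selector("table#grid tbody tr", timeout=15000)
        pages_html.append(page.content())       # parse each snapshot later
        next_btn = page.locator("button.next")  # hypothetical selector
        if next_btn.count() == 0 or not next_btn.is_enabled():
            break
        next_btn.click()
    browser.close()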

Selenium and Pyppeteer are valid alternatives. Selenium has the larger ecosystem and is the safe pick if your team already uses it for end-to-end tests, and our Selenium step-by-step tutorial covers the equivalent setup. Pyppeteer is leaner but less actively maintained. For a fuller comparison of headless tooling, see our Playwright web scraping guide. For new projects, Playwright tends to be the most ergonomic.

Choosing an HTML Parser and Handling Empty Cells

BeautifulSoup is a wrapper. The actual parsing is delegated to one of three backends, and the choice matters more than most tutorials admit.

| Parser | Speed | Tolerance for bad HTML | Install |
| --- | --- | --- | --- |
| html.parser | Slow | Medium (built into Python) | None |
| lxml | Fast | Strict-ish, but pragmatic | pip install lxml |
| html5lib | Slowest | Highest, follows WHATWG | pip install html5lib |

Default to lxml. Switch to html5lib only when lxml returns a partial tree on a page with broken markup (missing closing </td>, unclosed <tr>, stray < characters). You can verify quickly:

import time
from bs4 import BeautifulSoup

# `html` is the page source you fetched earlier
for parser in ["lxml", "html.parser", "html5lib"]:
    t0 = time.perf_counter()
    soup = BeautifulSoup(html, parser)
    rows = soup.select("tbody tr")
    print(f"{parser:10} {len(rows):4} rows in {time.perf_counter()-t0:.3f}s")

For empty cells, write a helper that returns a sensible default rather than crashing:

def cell_text(cell, default=""):
    if cell is None:
        return default
    text = cell.get_text(" ", strip=True)
    return text if text else default

Use it everywhere you index into a row. None checks at every call site clutter the loop and miss the case where the cell exists but contains only &nbsp;. This helper handles both.
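
Dropped into the header-aware row loop from earlier, it replaces the bare get_text call:

row = dict(zip(headers, (cell_text(td) for td in tr.find_all(["td", "th"]))))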

Avoiding Blocks: Headers, Sessions, and Proxies

A 200 status code means the request was accepted. Anything else (especially 403, 429, or 503) usually means the site spotted your scraper. Climb this ladder in order, stopping at the first rung that works.

  1. Realistic headers. Set User-Agent, Accept-Language, and Referer to values a real Chrome session would send. This alone fixes a surprising number of blocks.
  2. Persistent sessions. Use requests.Session() so cookies set by the home page are sent on subsequent calls. Many sites issue a session cookie on the first hit and reject requests that lack it.
  3. Exponential backoff on 429 and 503. Sleep 2 ** attempt seconds and retry up to five times. Honor Retry-After headers when the server provides them; see the sketch after this list.
  4. Datacenter proxies. Cheap, fast, and enough for most static sites. Rotate IPs across your worker pool.
  5. Residential proxies. Real residential IPs from 195 countries, used when datacenter ranges are already blocked. Slower but harder to detect.
  6. Managed scraping APIs. When you want to focus on parsing rather than infrastructure, services like our Scraper API at WebScrapingAPI handle proxy rotation, header generation, and retries behind a single endpoint, so the same BeautifulSoup or pandas code keeps working.
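
A minimal sketch of rung three; note that Retry-After can also be an HTTP date, which this version doesn't handle:

import time
import requests

def get_with_backoff(session, url, max_tries=5, **kwargs):
    # Retry 429/503 with exponential backoff, honoring a seconds-style Retry-After
    for attempt in range(max_tries):
        response = session.get(url, timeout=15, **kwargs)
        if response.status_code not in (429, 503):
            return response
        wait = int(response.headers.get("Retry-After", 2 ** attempt))
        time.sleep(wait)
    return response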

Most projects need rungs one through three. For a longer checklist of detection signals, our guide on why scrapers get blocked or IP-banned digs into TLS fingerprinting, header order, and rate-limit math. If you're getting blocked on a Wikipedia article, something else is wrong.

Cleaning, Type Coercion, and Export for Production

Scraped tables are almost never ready for analysis. Currency symbols, percent signs, footnote markers, and trailing whitespace all sneak in as strings. Fix them in one pass before saving:

import pandas as pd

df = pd.DataFrame(rows)

# 1. Strip whitespace on every text column
str_cols = df.select_dtypes(include="object").columns
df[str_cols] = df[str_cols].apply(lambda s: s.str.strip())

# 2. Coerce numeric columns (errors='coerce' turns junk into NaN)
df["salary"] = pd.to_numeric(df["salary"].str.replace(r"[^0-9.\-]", "", regex=True),
                             errors="coerce")
df["growth_pct"] = pd.to_numeric(df["growth_pct"].str.rstrip("%"), errors="coerce")

# 3. Coerce dates
df["hired_at"] = pd.to_datetime(df["hired_at"], errors="coerce")

# 4. Drop rows where the primary key failed to parse
df = df.dropna(subset=["employee_id"])

# 5. Persist (`engine` below is a SQLAlchemy engine, e.g. from create_engine(...))
df.to_parquet("employees.parquet", index=False)
df.to_sql("employees", con=engine, if_exists="replace", index=False)

The errors="coerce" flag is the underrated hero of this pipeline: bad cells become NaN instead of raising, and you can investigate them later with df[df["salary"].isna()]. For production pipelines, write Parquet for storage and use to_sql to land cleaned data in Postgres or your warehouse of choice.

Scraping Legally and Ethically

This is risk-reduction guidance, not legal advice. Talk to a lawyer before scraping anything sensitive.

  • Read robots.txt. It expresses the site owner's preference, not a legal rule, but ignoring it is a fast way to get blocked. The spec is documented in RFC 9309.
  • Read the Terms of Service. Logged-in scraping in particular often violates the ToS even when robots.txt is silent.
  • Rate limit yourself. One request per second is a reasonable default for small projects. Add jitter so you don't look like a clock.
  • Avoid personal data unless you have a lawful basis. GDPR and similar laws apply even when the data is technically public.
  • Attribute when republishing. Cite the source URL and the scrape date.

Knowing how to scrape HTML tables using Python is half technical, half ethical. The technical half breaks once; the ethical half can break your company.

Key Takeaways

  • Pick the simplest tool that works. pandas.read_html for clean static tables, Requests + BeautifulSoup for control, Playwright for JS-rendered or interaction-driven tables.
  • Headers, not indices. Zip header text with cell text so your scraper survives an added column. Hardcoded cells[0], cells[1] is technical debt.
  • Pagination has three layers. Try per_page=500, then a hidden JSON API, then page-number loops. A headless browser is the last resort.
  • Clean before you save. pd.to_numeric, pd.to_datetime, and errors="coerce" turn dirty scraped rows into a typed DataFrame ready for analysis.
  • Respect the site. Honor robots.txt, throttle requests, and avoid personal data unless you have a clear lawful basis.

FAQ

What is the difference between pandas.read_html and BeautifulSoup for scraping tables?

pandas.read_html is a high-level shortcut: it returns DataFrames directly but only handles tables already in the HTML response. BeautifulSoup is a lower-level HTML parser that gives you full control over which cells you keep, how you transform them, and how to navigate non-standard markup. Use read_html for analysis-ready data, BeautifulSoup when the rules you need can't be expressed as "give me table N".

How do I scrape an HTML table that only appears after JavaScript runs?

First confirm it's actually JavaScript-rendered: view page source (Ctrl+U), search for a word from the table, and if it's missing, the table is hydrated client-side. The fastest fix is finding the underlying JSON endpoint in DevTools' Network tab and calling it directly. If that's not viable, drive a headless browser like Playwright, wait on a row selector, then pass page.content() to your usual parser.

What should I do when a table has merged cells (rowspan or colspan)?

Treat the table as a 2D grid you fill in cell-by-cell, respecting colspan and rowspan attributes, instead of a list of rows. For each <th> or <td>, repeat its value across the slots its span covers, and skip slots already filled by an earlier rowspan. This produces a rectangular matrix you can hand to pd.DataFrame without column-count mismatches.

How do I keep numeric and date columns correctly typed after scraping a table?

Strip non-numeric characters with a regex (str.replace(r"[^0-9.\-]", "", regex=True)), then call pd.to_numeric(series, errors="coerce") so unparseable values become NaN instead of raising. For dates, pd.to_datetime(series, errors="coerce", format="%Y-%m-%d") is the equivalent. Adding the format argument makes parsing roughly 10x faster on large columns and prevents false positives from ambiguous strings.

Can I run pandas.read_html on a local HTML file or a raw HTML string?

Yes. pd.read_html accepts a URL, a path to a local file, or raw HTML. For a raw string, pandas 2.1+ expects it wrapped in io.StringIO, while pd.read_html("page.html") still works for a file path. This is useful for unit tests (commit a known-good HTML fixture) and for separating fetching from parsing in production scrapers.
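
For example, a unit test against a committed fixture might look like this (the fixture path is an assumption):

from io import StringIO
import pandas as pd

with open("tests/fixtures/page.html", encoding="utf-8") as f:
    tables = pd.read_html(StringIO(f.read()))
assert len(tables) >= 1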

Wrapping Up

Knowing how to scrape HTML tables using Python is mostly about matching the tool to the table. Reach for pandas.read_html first, graduate to Requests + BeautifulSoup when you need cell-level control, and only spin up Playwright when JavaScript renders the data. Layer on header-aware row loops, generic colspan/rowspan parsing, smart pagination, and a pandas cleaning pass, and you have a scraper that survives markup changes instead of breaking on the next deploy.

When you outgrow do-it-yourself proxy rotation and JavaScript rendering, WebScrapingAPI offers a Scraper API that handles the request layer behind a single endpoint, so your parsing code keeps working. From here, browse our deeper guides on JavaScript tables and on avoiding blocks.

About the Author
Andrei Ogiolan, Full Stack Developer @ WebScrapingAPI

Andrei Ogiolan is a Full Stack Developer at WebScrapingAPI, contributing across the product and helping build reliable tools and features for the platform.
