Andrei Ogiolan · Last updated on May 7, 2026 · 15 min read

How to Scrape HTML Tables Using Python

TL;DR: Most HTML tables can be scraped with a single line of pandas.read_html. When the table is paginated, JavaScript-rendered, or has merged headers, switch to Requests + BeautifulSoup or a headless browser like Playwright. This guide gives you a decision matrix, working code for all three approaches, and the cleaning steps that turn scraped rows into pipeline-ready data.

Tabular data is everywhere on the public web, from Wikipedia infoboxes and stock screeners to government statistics, sports stats, and product comparison pages. If you know how to scrape HTML tables using Python, you can turn those rows into clean DataFrames, JSON documents, or rows in your own database in minutes.

The catch is that "HTML table" is a deceptively wide category. Some tables sit cleanly inside <table> markup that pandas can parse with one line. Others are hand-rolled grids of <div>s, paginated across dozens of pages, or only populated after JavaScript runs in the browser. A method that works perfectly on Wikipedia might silently return zero rows on a single-page app.

This guide walks through three Python approaches and frames the entire article around two practical questions: which method should you reach for, and how do you keep your scraper running when the site changes its markup next quarter?

How to Scrape HTML Tables Using Python: a Quick Decision Matrix

Before you write a line of code, decide which tool fits the table in front of you. Picking the wrong one is the most common reason tutorials don't survive contact with real websites. Use the matrix below to pick your starting point.

| Criterion | pandas.read_html | Requests + BeautifulSoup | Playwright (or Selenium) |
| --- | --- | --- | --- |
| Best when | Table is in initial HTML and well-formed | You need per-cell control or filtering | Table is rendered by JavaScript |
| Lines of code | ~3 | 30 to 80 | 40 to 100 |
| Speed per page | Fast | Fast | Slow (full browser) |
| Handles JS | No | No | Yes |
| Pagination | Manual loop | Manual loop or hidden API | Click and scroll |
| Resilience to markup churn | Medium | High (you write selectors) | High |
| Memory footprint | Low | Low | High |

Three rules of thumb:

  • If pd.read_html(url) returns the rows you expect, stop there. The one-liner is the most maintainable code you will ever write.
  • If the table is in the HTML but you need to filter, merge, or normalize cells before they hit a DataFrame, reach for Requests + BeautifulSoup.
  • If "View Page Source" shows an empty <div id="grid"> and the data only appears after the page loads, you need Playwright or a hidden JSON endpoint.

The rest of this article shows how to scrape HTML tables using Python under each of those scenarios, plus the edge cases that derail otherwise-working code.

HTML Table Anatomy (and What Makes Scraping Tricky)

A textbook HTML table looks like this:

<table id="employees" class="stripe">
  <thead><tr><th>Name</th><th>Position</th><th>Salary</th></tr></thead>
  <tbody>
    <tr><td>Ada Lovelace</td><td>Engineer</td><td>$120,000</td></tr>
    <tr><td>Alan Turing</td><td>Researcher</td><td>$135,000</td></tr>
  </tbody>
</table>

Five tags do most of the work: <table> is the container, <thead> and <tbody> group rows, <tr> is a row, and <th> or <td> are header and data cells respectively. Two attributes complicate things: colspan makes a cell span multiple columns, and rowspan makes it span multiple rows. Both are used heavily in financial and sports tables.
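
For example, a two-row financial header might merge cells like this (a hand-written illustration, not taken from a real site):

<table>
  <thead>
    <tr><th rowspan="2">Team</th><th colspan="2">Q1 2024</th></tr>
    <tr><th>Revenue</th><th>Profit</th></tr>
  </thead>
  <tbody>
    <tr><td>North</td><td>$1.2M</td><td>$0.3M</td></tr>
  </tbody>
</table>

A naive parser sees two cells in the first header row but three in every body row, which is exactly the mismatch the colspan/rowspan section below untangles.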

In the wild, half of these conventions are skipped. Plenty of pages omit <thead> and <tbody>, leave closing tags off, or render tables as nested <div> grids that no parser will recognize as a table at all. Real-world scraping is mostly about coping with that drift, which is why pandas alone is not enough on every site.
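
When you meet one of those <div> grids, the table-specific tools are useless, but the header-plus-rows idea survives. A hedged sketch, assuming hypothetical grid, row, and cell class names:

from bs4 import BeautifulSoup

# `html` is the fetched page source; the class names below are assumptions
soup = BeautifulSoup(html, "lxml")
rows = [
    [cell.get_text(strip=True) for cell in row.select("div.cell")]
    for row in soup.select("div.grid div.row")
]
headers, data = rows[0], rows[1:]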

Method 1: pandas.read_html, the One-Liner

pandas.read_html is a convenience function in the pandas data-manipulation library that takes a URL or HTML string and returns a list of DataFrames, one per <table> it can find. According to the pandas documentation, it requires either lxml, html5lib, or bs4 under the hood, and it identifies tables by looking for standard table elements.

The whole appeal is that you can write three lines of code and have a typed, queryable DataFrame:

import pandas as pd

tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue")
df = tables[0]
print(df.head())

The catch is that read_html only sees what's already in the response body. If the table is filled in by JavaScript after the page loads, the function raises ValueError: No tables found even though the table is plainly visible in your browser. Knowing that limitation up front saves a lot of debugging.
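
A quick way to confirm that limitation before blaming your code: fetch the raw HTML yourself and search it for a value you can see in the browser.

import requests

# Hypothetical URL; substitute the page you're scraping
html = requests.get("https://example.com/dashboard", timeout=15).text
# If a value visible in the browser is absent from the raw response,
# the table is hydrated client-side and read_html will never see it
print("Ada Lovelace" in html)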

Setting Up Your Python Environment

You can run every example in this guide with a fresh virtual environment and a handful of packages:

python -m venv .venv
source .venv/bin/activate
pip install pandas requests beautifulsoup4 lxml html5lib playwright
playwright install chromium

lxml is the fastest HTML parser available to Python and is what most pros default to. html5lib is slower but follows the WHATWG parsing algorithm, which makes it the most forgiving choice on broken markup. Install both so you can swap parsers when one chokes.

A Complete pandas.read_html Walkthrough

Let's scrape a real, well-formed table: the list of countries by GDP on Wikipedia. The full workflow is a handful of lines.

import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
tables = pd.read_html(url)
print(f"Found {len(tables)} tables on the page")

gdp = tables[2]            # pick the right one by index
gdp.columns = [c[1] if isinstance(c, tuple) else c for c in gdp.columns]
print(gdp.head())

Three things to notice. First, read_html returned a list, so you index into it. Second, Wikipedia tables often have multi-level headers, which pandas exposes as a MultiIndex. The list comprehension flattens it by keeping the lower level. Third, no manual row iteration: every cell already lives in a typed column you can call .sort_values, .groupby, or .to_csv on.
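
From here it's ordinary pandas. For instance (the column name is a placeholder; check gdp.columns for the real one):

top10 = gdp.sort_values("GDP (US$ million)", ascending=False).head(10)
top10.to_csv("gdp_top10.csv", index=False)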

When you only need the data for a quick analysis, this is genuinely all the code you should write.

Troubleshooting pandas.read_html: Common Errors

pd.read_html fails in predictable ways. Memorize these four and you'll resolve most issues in under a minute.

  1. ValueError: No tables found. The page is either JavaScript-rendered or behind a login wall. Skip ahead to the Playwright section.
  2. HTTP 403 or 429 returned by pandas's internal fetcher. The default urllib user agent is being blocked. Fetch the HTML yourself with Requests and pass the string to read_html:
import requests, pandas as pd
from io import StringIO

headers = {"User-Agent": "Mozilla/5.0 (compatible; analytics-bot/1.0)"}
html = requests.get(url, headers=headers, timeout=15).text
tables = pd.read_html(StringIO(html))  # pandas 2.1+ expects raw HTML wrapped in StringIO
  3. Wrong table index. Use match= to filter by a string that appears inside the target table, for example pd.read_html(StringIO(html), match="Population"). This is far more stable than relying on tables[3].
  4. Garbled characters in non-ASCII content. Force an encoding by reading bytes explicitly: response = requests.get(url); response.encoding = "utf-8"; tables = pd.read_html(StringIO(response.text)).

If you're still hitting walls after these fixes, the table almost certainly needs Requests + BeautifulSoup or a headless browser, not more read_html workarounds.

Method 2: Requests + BeautifulSoup, When You Need Control

pandas.read_html is great when you want every cell exactly as it appears in the HTML. The moment you need to filter rows during extraction, join values from two columns, strip currency symbols on the fly, or pull the href out of a linked cell, it stops being the right tool.

That's where Requests + BeautifulSoup comes in. Requests handles the HTTP layer (headers, cookies, sessions, retries), and BeautifulSoup gives you a parse tree you can traverse with CSS selectors, attribute matching, or sibling navigation. If you're new to BeautifulSoup, our deep dive on extracting and parsing web data with Python and BeautifulSoup walks through the API surface in detail. The combo is also what most production scrapers eventually settle on, because every step (fetch, parse, extract, transform) is something you control.

The next three sections show how to scrape HTML tables using Python with this stack: a polite request, a robust selector for the table, and a row loop that doesn't break when a column is added.

Sending Polite, Realistic HTTP Requests

Anti-bot defenses key on a few cheap signals: a missing or default User-Agent, no Accept-Language, no cookies, and traffic that drains a session in one second. Mimic a real browser and reuse a connection:

import requests
from bs4 import BeautifulSoup

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
})

response = session.get("https://example.com/employees", timeout=15)
response.raise_for_status()
soup = BeautifulSoup(response.text, "lxml")

Three small habits matter. Session() keeps cookies and connection pooling between calls. raise_for_status() turns silent 4xx/5xx responses into exceptions you can retry. And passing "lxml" as the parser is roughly five to ten times faster than the built-in html.parser on large pages.

Locating the Right Table on the Page

Once you have a BeautifulSoup object, the next problem is grabbing the right <table>. Pages routinely have eight to fifteen of them (think: layout tables, sidebar widgets, hidden pagination controls). Try selectors in this order of stability:

# 1. By stable id (best)
table = soup.find("table", id="employees")

# 2. By a class that's specific to this table
table = soup.find("table", class_="data-grid")

# 3. By a CSS selector
table = soup.select_one("section#payroll table.stripe")

# 4. By the heading that precedes it (when classes are dynamic)
heading = soup.find(["h2", "h3"], string=lambda s: s and "Employees" in s)
table = heading.find_next("table") if heading else None

When class names are auto-generated and change on every deploy (a common React pattern), prefer XPath via lxml, since it can express "the third table inside the section whose heading text contains 'Employees'" in one expression. We have a separate guide on XPath versus CSS selectors that goes deeper on this tradeoff.
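
A sketch of that XPath approach with lxml (the heading text and surrounding structure are assumptions):

from lxml import html as lxml_html

# `page_source` is the HTML string you fetched earlier
tree = lxml_html.fromstring(page_source)
# "the first table anywhere after an h2 whose text contains 'Employees'"
tables = tree.xpath("//h2[contains(., 'Employees')]/following::table[1]")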

Iterating Rows and Extracting Cells the Safe Way

Most scraping tutorials show row loops that index cells positionally: cells[0] is name, cells[1] is position, cells[2] is salary. That code breaks the day someone adds a "Department" column. The robust pattern is to read the headers once and zip them with each row.

# Read headers from <thead> if present, else from the first row
header_cells = table.select("thead th") or table.select("tr:first-of-type th, tr:first-of-type td")
headers = [th.get_text(strip=True) for th in header_cells]

rows = []
for tr in table.select("tbody tr") or table.select("tr")[1:]:
    cells = [td.get_text(strip=True) for td in tr.find_all(["td", "th"])]
    if not cells:
        continue
    rows.append(dict(zip(headers, cells)))

print(f"Extracted {len(rows)} rows with {len(headers)} columns")

Three things this gives you for free. New columns flow through automatically because keys come from headers, not indices. Empty rows (often used as visual separators) are skipped. And every cell goes through get_text(strip=True), which collapses whitespace and removes the \n characters that haunt naive cell.text calls. This is the row loop you should copy into every BeautifulSoup project.

Saving Scraped Rows to JSON, CSV, or Parquet

Once you have a list of dicts, persisting them is a one-liner per format:

import json
import pandas as pd

# JSON, human-readable, UTF-8 safe
with open("employees.json", "w", encoding="utf-8") as f:
    json.dump(rows, f, indent=2, ensure_ascii=False)

# CSV via pandas (handles quoting, encoding, and missing keys)
df = pd.DataFrame(rows)
df.to_csv("employees.csv", index=False, encoding="utf-8")

# Parquet for analytical pipelines (smaller files, typed columns)
df.to_parquet("employees.parquet", index=False)

Reach for JSON when the consumer is a script or a frontend, CSV when a human is going to open it in Excel or BigQuery, and Parquet when the dataset crosses a few hundred thousand rows or feeds Spark, Snowflake, or DuckDB. Parquet files are typically 5 to 10 times smaller than equivalent CSVs and preserve dtypes. For anything destined for a relational database, jump straight to df.to_sql so you skip the intermediate file entirely.

Handling Complex Headers with colspan and rowspan

Two-row headers are common in finance, government statistics, and sports tables. The top row groups columns ("Q1 2024", "Q2 2024"), and the bottom row labels them ("Revenue", "Profit"). Hardcoding column names like ["Name", "Position", "Contact"] works once and breaks forever. Here's a generic algorithm that respects colspan and rowspan.

def expand_header(table):
    # Return a flat list of column labels from a multi-row <thead>
    rows = table.select("thead tr")
    if not rows:
        return [th.get_text(strip=True) for th in table.select("tr:first-of-type th")]

    grid = []  # grid[row_index] = list of column labels at that row
    for r, tr in enumerate(rows):
        while len(grid) <= r:
            grid.append([])
        col = 0
        for th in tr.find_all(["th", "td"]):
            # skip already-filled slots from previous rowspans
            while col < len(grid[r]) and grid[r][col] is not None:
                col += 1
            text = th.get_text(strip=True)
            colspan = int(th.get("colspan", 1))
            rowspan = int(th.get("rowspan", 1))
            for dr in range(rowspan):
                while len(grid) <= r + dr:
                    grid.append([])
                row_buf = grid[r + dr]
                # pad
                while len(row_buf) < col + colspan:
                    row_buf.append(None)
                for dc in range(colspan):
                    row_buf[col + dc] = text
            col += colspan

    # Combine the columns of each row, top-down, into a single label per column
    n_cols = max(len(r) for r in grid)
    flat = []
    for c in range(n_cols):
        parts = [grid[r][c] for r in range(len(grid)) if c < len(grid[r]) and grid[r][c]]
        # de-dup adjacent identical strings: ['Q1 2024', 'Q1 2024', 'Revenue'] -> 'Q1 2024 Revenue'
        seen = []
        for p in parts:
            if not seen or seen[-1] != p:
                seen.append(p)
        flat.append(" ".join(seen))
    return flat

Pair this with the same zip(headers, cells) row loop from earlier and you have header parsing that survives any combination of merged cells. The same idea (a 2D grid you fill in colspan-by-colspan) extends to the body when rowspans repeat values down columns: track which slots are already claimed and skip them in subsequent <tr> iterations.
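
Here's a condensed sketch of that body-side bookkeeping; pair it with expand_header by passing n_cols = len(expand_header(table)):

def expand_body(table, n_cols):
    # Expand <tbody> rows into a rectangular grid, repeating spanned values
    pending = {}   # column index -> (text, rows still owed by a rowspan)
    matrix = []
    for tr in table.select("tbody tr"):
        row = [None] * n_cols
        # 1. Fill slots claimed by rowspans from earlier rows
        for c in list(pending):
            text, left = pending[c]
            row[c] = text
            if left > 1:
                pending[c] = (text, left - 1)
            else:
                del pending[c]
        # 2. Place this row's cells in the remaining open slots
        col = 0
        for td in tr.find_all(["td", "th"]):
            while col < n_cols and row[col] is not None:
                col += 1
            text = td.get_text(strip=True)
            colspan = int(td.get("colspan", 1))
            rowspan = int(td.get("rowspan", 1))
            for dc in range(colspan):
                if col + dc < n_cols:
                    row[col + dc] = text
                    if rowspan > 1:
                        pending[col + dc] = (text, rowspan - 1)
            col += colspan
        matrix.append(row)
    return matrix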

Scraping Paginated HTML Tables (Three Strategies)

Pagination is the single most underestimated part of how to scrape HTML tables using Python. Most tutorials only show "click the next button in a headless browser", which is the slowest and most fragile approach. Try these three first, in order of preference.

1. Increase the page-size query parameter. Many tables accept ?per_page=500 or ?length=1000. One request, all rows, no looping. Inspect the URL when you click the page-size dropdown and you'll often find this for free.
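
In code that's a single parameterized request (the parameter name is whatever the site's own dropdown puts in the URL):

import requests

html = requests.get("https://example.com/employees",
                    params={"per_page": 500}, timeout=15).text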

2. Hit the underlying JSON API. Open DevTools, switch to the Network tab, filter by Fetch/XHR, and click the next page. Almost every modern data table is backed by an endpoint that returns JSON. Calling it directly skips HTML parsing entirely:

import requests
url = "https://example.com/api/employees"
all_rows = []
for page in range(1, 20):
    payload = requests.get(url, params={"page": page, "size": 100}, timeout=15).json()
    if not payload["items"]:
        break
    all_rows.extend(payload["items"])

3. Loop through page query strings. When the URL contains the page number (?page=2, &start=20), iterate it explicitly and stop when the table comes back empty. This is more reliable than driving a browser because there's nothing to click and no animation to wait for.
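
A sketch of that loop, stopping at the first empty page (the URL and selector are assumptions):

import requests
from bs4 import BeautifulSoup

all_rows = []
for page in range(1, 200):
    html = requests.get("https://example.com/employees",
                        params={"page": page}, timeout=15).text
    trs = BeautifulSoup(html, "lxml").select("table#employees tbody tr")
    if not trs:
        break  # ran past the last page
    all_rows.extend(trs)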

A headless browser is your last resort, not your first. Save it for tables where the next-page link is bound to a JavaScript handler with no URL change.

Method 3: Playwright for JavaScript-Rendered Tables

When the table only appears after the page hydrates, you need something that runs JavaScript. Playwright is the modern choice: it ships official Python bindings, runs Chromium, Firefox, or WebKit, and has solid auto-wait behavior. Here's the full template for how to scrape HTML tables using Python that depend on JS:

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
from io import StringIO
import pandas as pd

URL = "https://example.com/dashboard"

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page(user_agent="Mozilla/5.0 ... Chrome/124.0 Safari/537.36")
    page.goto(URL, wait_until="domcontentloaded")

    # Wait for the actual data, not just the page load
    page.wait_for_selector("table#grid tbody tr", timeout=15000)

    html = page.content()
    browser.close()

# Hand the rendered HTML off to your existing parser
soup = BeautifulSoup(html, "lxml")
table = soup.find("table", id="grid")
# ... use the same row loop from earlier ...

# Or, when the table is well-formed, skip BeautifulSoup entirely
# (pandas 2.1+ expects raw HTML wrapped in StringIO):
df = pd.read_html(StringIO(html), match="Department")[0]
print(df.head())

The pattern is always the same: navigate, wait for data (not just load), grab page.content(), then feed that string into the same parsing code you'd use for static HTML. Refer to the Playwright for Python documentation for installation, async APIs, and tracing.
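
And when you truly must click a JS-bound next button, the same template extends to a loop. A hedged sketch, with a hypothetical button selector:

from playwright.sync_api import sync_playwright

pages_html = []
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/dashboard", wait_until="domcontentloaded")
    while True:
        page.wait_for_selector("table#grid tbody tr", timeout=15000)
        pages_html.append(page.content())       # parse each snapshot later
        next_btn = page.locator("button.next")  # hypothetical selector
        if next_btn.count() == 0 or not next_btn.is_enabled():
            break
        next_btn.click()
    browser.close()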

Selenium and Pyppeteer are valid alternatives. Selenium has the larger ecosystem and is the safe pick if your team already uses it for end-to-end tests, and our Selenium step-by-step tutorial covers the equivalent setup. Pyppeteer is leaner but less actively maintained. For a fuller comparison of headless tooling, see our Playwright web scraping guide. For new projects, Playwright tends to be the most ergonomic.

Choosing an HTML Parser and Handling Empty Cells

BeautifulSoup is a wrapper. The actual parsing is delegated to one of three backends, and the choice matters more than most tutorials admit.

| Parser | Speed | Tolerance for bad HTML | Install |
| --- | --- | --- | --- |
| html.parser | Slow | Medium (built into Python) | None |
| lxml | Fast | Strict-ish, but pragmatic | pip install lxml |
| html5lib | Slowest | Highest, follows WHATWG | pip install html5lib |

Default to lxml. Switch to html5lib only when lxml returns a partial tree on a page with broken markup (missing closing </td>, unclosed <tr>, stray < characters). You can verify quickly:

import time
from bs4 import BeautifulSoup

# `html` is the page source you fetched earlier
for parser in ["lxml", "html.parser", "html5lib"]:
    t0 = time.perf_counter()
    soup = BeautifulSoup(html, parser)
    rows = soup.select("tbody tr")
    print(f"{parser:10} {len(rows):4} rows in {time.perf_counter()-t0:.3f}s")

For empty cells, write a helper that returns a sensible default rather than crashing:

def cell_text(cell, default=""):
    if cell is None:
        return default
    text = cell.get_text(" ", strip=True)
    return text if text else default

Use it everywhere you index into a row. None checks at every call site clutter the loop and miss the case where the cell exists but contains only &nbsp;. This helper handles both.
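
Dropped into the header-aware row loop from earlier, it replaces the bare get_text call:

row = dict(zip(headers, (cell_text(td) for td in tr.find_all(["td", "th"]))))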

Avoiding Blocks: Headers, Sessions, and Proxies

A 200 status code means the request was accepted. Anything else (especially 403, 429, or 503) usually means the site spotted your scraper. Climb this ladder in order, stopping at the first rung that works.

  1. Realistic headers. Set User-Agent, Accept-Language, and Referer to values a real Chrome session would send. This alone fixes a surprising number of blocks.
  2. Persistent sessions. Use requests.Session() so cookies set by the home page are sent on subsequent calls. Many sites issue a session cookie on the first hit and reject requests that lack it.
  3. Exponential backoff on 429 and 503. Sleep 2 ** attempt seconds and retry up to five times. Honor Retry-After headers when the server provides them; see the sketch after this list.
  4. Datacenter proxies. Cheap, fast, and enough for most static sites. Rotate IPs across your worker pool.
  5. Residential proxies. Real residential IPs from 195 countries, used when datacenter ranges are already blocked. Slower but harder to detect.
  6. Managed scraping APIs. When you want to focus on parsing rather than infrastructure, services like our Scraper API at WebScrapingAPI handle proxy rotation, header generation, and retries behind a single endpoint, so the same BeautifulSoup or pandas code keeps working.
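
A minimal sketch of rung three; note that Retry-After can also be an HTTP date, which this version doesn't handle:

import time
import requests

def get_with_backoff(session, url, max_tries=5, **kwargs):
    # Retry 429/503 with exponential backoff, honoring a seconds-style Retry-After
    for attempt in range(max_tries):
        response = session.get(url, timeout=15, **kwargs)
        if response.status_code not in (429, 503):
            return response
        wait = int(response.headers.get("Retry-After", 2 ** attempt))
        time.sleep(wait)
    return response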

Most projects need rungs one through three. For a longer checklist of detection signals, our guide on why scrapers get blocked or IP-banned digs into TLS fingerprinting, header order, and rate-limit math. If you're getting blocked on a Wikipedia article, something else is wrong.

Cleaning, Type Coercion, and Export for Production

Scraped tables are almost never ready for analysis. Currency symbols, percent signs, footnote markers, and trailing whitespace all sneak in as strings. Fix them in one pass before saving:

import pandas as pd

df = pd.DataFrame(rows)

# 1. Strip whitespace on every text column
str_cols = df.select_dtypes(include="object").columns
df[str_cols] = df[str_cols].apply(lambda s: s.str.strip())

# 2. Coerce numeric columns (errors='coerce' turns junk into NaN)
df["salary"] = pd.to_numeric(df["salary"].str.replace(r"[^0-9.\-]", "", regex=True),
                             errors="coerce")
df["growth_pct"] = pd.to_numeric(df["growth_pct"].str.rstrip("%"), errors="coerce")

# 3. Coerce dates
df["hired_at"] = pd.to_datetime(df["hired_at"], errors="coerce")

# 4. Drop rows where the primary key failed to parse
df = df.dropna(subset=["employee_id"])

# 5. Persist (`engine` below is a SQLAlchemy engine, e.g. from create_engine(...))
df.to_parquet("employees.parquet", index=False)
df.to_sql("employees", con=engine, if_exists="replace", index=False)

The errors="coerce" flag is the underrated hero of this pipeline: bad cells become NaN instead of raising, and you can investigate them later with df[df["salary"].isna()]. For production pipelines, write Parquet for storage and use to_sql to land cleaned data in Postgres or your warehouse of choice.

Scraping Legally and Ethically

This is risk-reduction guidance, not legal advice. Talk to a lawyer before scraping anything sensitive.

  • Read robots.txt. It expresses the site owner's preference, not a legal rule, but ignoring it is a fast way to get blocked. The spec is documented in RFC 9309.
  • Read the Terms of Service. Logged-in scraping in particular often violates the ToS even when robots.txt is silent.
  • Rate limit yourself. One request per second is a reasonable default for small projects. Add jitter so you don't look like a clock.
  • Avoid personal data unless you have a lawful basis. GDPR and similar laws apply even when the data is technically public.
  • Attribute when republishing. Cite the source URL and the scrape date.

Knowing how to scrape HTML tables using Python is half technical, half ethical. The technical half breaks once; the ethical half can break your company.

Key Takeaways

  • Pick the simplest tool that works. pandas.read_html for clean static tables, Requests + BeautifulSoup for control, Playwright for JS-rendered or interaction-driven tables.
  • Headers, not indices. Zip header text with cell text so your scraper survives an added column. Hardcoded cells[0], cells[1] is technical debt.
  • Pagination has three layers. Try per_page=500, then a hidden JSON API, then page-number loops. A headless browser is the last resort.
  • Clean before you save. pd.to_numeric, pd.to_datetime, and errors="coerce" turn dirty scraped rows into a typed DataFrame ready for analysis.
  • Respect the site. Honor robots.txt, throttle requests, and avoid personal data unless you have a clear lawful basis.

FAQ

What is the difference between pandas.read_html and BeautifulSoup for scraping tables?

pandas.read_html is a high-level shortcut: it returns DataFrames directly but only handles tables already in the HTML response. BeautifulSoup is a lower-level HTML parser that gives you full control over which cells you keep, how you transform them, and how to navigate non-standard markup. Use read_html for analysis-ready data, BeautifulSoup when the rules you need can't be expressed as "give me table N".

How do I scrape an HTML table that only appears after JavaScript runs?

First confirm it's actually JavaScript-rendered: view page source (Ctrl+U), search for a word from the table, and if it's missing, the table is hydrated client-side. The fastest fix is finding the underlying JSON endpoint in DevTools' Network tab and calling it directly. If that's not viable, drive a headless browser like Playwright, wait on a row selector, then pass page.content() to your usual parser.

What should I do when a table has merged cells (rowspan or colspan)?

Treat the table as a 2D grid you fill in cell-by-cell, respecting colspan and rowspan attributes, instead of a list of rows. For each <th> or <td>, repeat its value across the slots its span covers, and skip slots already filled by an earlier rowspan. This produces a rectangular matrix you can hand to pd.DataFrame without column-count mismatches.

How do I keep numeric and date columns correctly typed after scraping a table?

Strip non-numeric characters with a regex (str.replace(r"[^0-9.\-]", "", regex=True)), then call pd.to_numeric(series, errors="coerce") so unparseable values become NaN instead of raising. For dates, pd.to_datetime(series, errors="coerce", format="%Y-%m-%d") is the equivalent. Adding the format argument makes parsing roughly 10x faster on large columns and prevents false positives from ambiguous strings.

Can I run pandas.read_html on a local HTML file or a raw HTML string?

Yes. pd.read_html accepts a URL, a path to a local file, or raw HTML. For a raw string, pandas 2.1+ expects it wrapped in io.StringIO, while pd.read_html("page.html") still works for a file path. This is useful for unit tests (commit a known-good HTML fixture) and for separating fetching from parsing in production scrapers.
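
For example, a unit test against a committed fixture might look like this (the fixture path is an assumption):

from io import StringIO
import pandas as pd

with open("tests/fixtures/page.html", encoding="utf-8") as f:
    tables = pd.read_html(StringIO(f.read()))
assert len(tables) >= 1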

Wrapping Up

Knowing how to scrape HTML tables using Python is mostly about matching the tool to the table. Reach for pandas.read_html first, graduate to Requests + BeautifulSoup when you need cell-level control, and only spin up Playwright when JavaScript renders the data. Layer on header-aware row loops, generic colspan/rowspan parsing, smart pagination, and a pandas cleaning pass, and you have a scraper that survives markup changes instead of breaking on the next deploy.

When you outgrow do-it-yourself proxy rotation and JavaScript rendering, WebScrapingAPI offers a Scraper API that handles the request layer behind a single endpoint, so your parsing code keeps working. From here, browse our deeper guides on JavaScript tables and on avoiding blocks.

About the Author
Andrei Ogiolan, Full Stack Developer @ WebScrapingAPI

Andrei Ogiolan is a Full Stack Developer at WebScrapingAPI, contributing across the product and helping build reliable tools and features for the platform.
